JSON to pandas DataFrame


What I am trying to do is extract elevation data from the Google Maps API along a path specified by latitude and longitude coordinates, as follows:

from urllib2 import Request, urlopen
import json

path1 = "42.974049,-81.205203|42.974298,-81.195755"
request=Request("http://maps.googleapis.com/maps/api/elevation/json?locations="+path1+"&sensor=false")
response = urlopen(request)
elevations = response.read()
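
(urllib2 is Python 2 only; on Python 3 the same names live in urllib.request. A minimal equivalent, assuming the same path1:)

# Python 3: urllib2's Request and urlopen moved to urllib.request
from urllib.request import Request, urlopen

request = Request("http://maps.googleapis.com/maps/api/elevation/json?locations=" + path1 + "&sensor=false")
response = urlopen(request)
elevations = response.read().decode("utf-8")  # read() returns bytes on Python 3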

This gives me data that looks like this:

elevations.splitlines()

['{',
 '   "results" : [',
 '      {',
 '         "elevation" : 243.3462677001953,',
 '         "location" : {',
 '            "lat" : 42.974049,',
 '            "lng" : -81.205203',
 '         },',
 '         "resolution" : 19.08790397644043',
 '      },',
 '      {',
 '         "elevation" : 244.1318664550781,',
 '         "location" : {',
 '            "lat" : 42.974298,',
 '            "lng" : -81.19575500000001',
 '         },',
 '         "resolution" : 19.08790397644043',
 '      }',
 '   ],',
 '   "status" : "OK"',
 '}']

When putting it into a DataFrame with

pd.read_json(elevations)

here is what I get:

[screenshot of the resulting DataFrame]

and here is what I want:

[screenshot of the desired DataFrame]

I'm not sure if this is possible, but mainly what I am looking for is a way to put the elevation, latitude, and longitude data together in a pandas DataFrame (it doesn't have to have fancy multiline headers).

If anyone can help or give some advice on working with this data, that would be great! If you can't tell, I haven't worked much with JSON data before...

EDIT:

This method isn't all that attractive but seems to work:

data = json.loads(elevations)
lat, lng, el = [], [], []
for result in data["results"]:          # one dict per sampled point
    lat.append(result[u"location"][u"lat"])
    lng.append(result[u"location"][u"lng"])
    el.append(result[u"elevation"])
df = pd.DataFrame([lat, lng, el]).T     # transpose so each list becomes a column

This ends up with a DataFrame whose columns hold the latitude, longitude, and elevation data; an equivalent sketch with named columns follows the screenshot below.

[screenshot of the resulting DataFrame]
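
The same extraction can also build named columns directly (the column names here are my own choice):

rows = [{"lat": r["location"]["lat"],
         "lng": r["location"]["lng"],
         "elevation": r["elevation"]}
        for r in data["results"]]
df = pd.DataFrame(rows)  # columns: lat, lng, elevation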

Answer rating: 242

I found a quick and easy solution to what I wanted using json_normalize() included in pandas 1.01.

from urllib2 import Request, urlopen
import json

import pandas as pd    

path1 = "42.974049,-81.205203|42.974298,-81.195755"
request=Request("http://maps.googleapis.com/maps/api/elevation/json?locations="+path1+"&sensor=false")
response = urlopen(request)
elevations = response.read()
data = json.loads(elevations)
df = pd.json_normalize(data["results"])  # flattens the nested location dict into dotted columns

This gives a nice flattened DataFrame with the JSON data that I got from the Google Maps API.
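
Because the nested location dict becomes dotted column names (location.lat and location.lng, alongside elevation and resolution), a short follow-up can select and rename them into plain columns; a sketch:

df = df[["location.lat", "location.lng", "elevation"]]
df.columns = ["lat", "lng", "elevation"]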





JSON to pandas DataFrame: StackOverflow Questions

Answer #1

According to Python's Methods of File Objects, the simplest way to convert a text file into a list is:

with open("file.txt") as f:
    my_list = list(f)
    # my_list = [x.rstrip() for x in f] # remove line breaks

If you just need to iterate over the text file lines, you can use:

with open("file.txt") as f:
    for line in f:
       ...

Old answer:

Using with and readlines():

with open("file.txt") as f:
    lines = f.readlines()

If you don't care about closing the file, this one-liner works:

lines = open("file.txt").readlines()

The traditional way:

f = open("file.txt") # Open file on read mode
lines = f.read().splitlines() # List with stripped line-breaks
f.close() # Close file

Answer #2

Given a text file with this content:

line 1
line 2
line 3

We can use this Python script in the same directory as the txt file above:

>>> with open("myfile.txt", encoding="utf-8") as file:
...     x = [l.rstrip("\n") for l in file]
>>> x
['line 1', 'line 2', 'line 3']

Using append:

x = []
with open("myfile.txt") as file:
    for l in file:
        x.append(l.strip())

Or:

>>> x = open("myfile.txt").read().splitlines()
>>> x
['line 1', 'line 2', 'line 3']

Or:

>>> x = open("myfile.txt").readlines()
>>> x
['line 1\n', 'line 2\n', 'line 3\n']

Or:

def print_output(lines_in_textfile):
    print("lines_in_textfile =", lines_in_textfile)

y = [x.rstrip() for x in open("001.txt")]
print_output(y)

with open("001.txt", "r", encoding="utf-8") as file:
    file = file.read().splitlines()
    print_output(file)

with open("001.txt", "r", encoding="utf-8") as file:
    file = [x.rstrip("\n") for x in file]
    print_output(file)

output:

lines_in_textfile = ['line 1', 'line 2', 'line 3']
lines_in_textfile = ['line 1', 'line 2', 'line 3']
lines_in_textfile = ['line 1', 'line 2', 'line 3']

Answer #3

Things have changed quite a bit since 2010 when this was posted. I haven't tried all the other answers, but I have tried a few, and I found this to work best for me using Python 3.6.

I was able to fetch about 150 unique domains per second running on AWS.

import concurrent.futures
import requests
import time

out = []
CONNECTIONS = 100
TIMEOUT = 5

tlds = open("../data/sample_1k.txt").read().splitlines()
urls = ["http://{}".format(x) for x in tlds[1:]]

def load_url(url, timeout):
    ans = requests.head(url, timeout=timeout)
    return ans.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
    future_to_url = (executor.submit(load_url, url, TIMEOUT) for url in urls)
    time1 = time.time()
    for future in concurrent.futures.as_completed(future_to_url):
        try:
            data = future.result()
        except Exception as exc:
            data = str(type(exc))
        finally:
            out.append(data)

            print(str(len(out)), end="\r")  # "\r" keeps the running count on one line

    time2 = time.time()

print(f"Took {time2-time1:.2f} s")

Answer #4

You can read the whole file and split lines using str.splitlines:

temp = file.read().splitlines()

Or you can strip the newline by hand:

temp = [line[:-1] for line in file]

Note: this last solution only works if the file ends with a newline, otherwise the last line will lose a character.

This assumption is true in most cases (especially for files created by text editors, which often do add an ending newline anyway).
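
A quick illustration of the pitfall, using io.StringIO to stand in for a file with no trailing newline:

import io

f = io.StringIO("abc\ndef")        # no newline after "def"
print([line[:-1] for line in f])   # ['abc', 'de'] -- the last character is lost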

If you want to avoid this, you can add a newline at the end of the file:

with open(the_file, "r+") as f:
    f.seek(-1, 2)  # go to the end of the file
    if f.read(1) != "\n":
        # add missing newline if not already present
        f.write("\n")
        f.flush()
        f.seek(0)
    lines = [line[:-1] for line in f]

Or a simpler alternative is to strip the newline instead:

[line.rstrip("\n") for line in file]

Or even, although pretty unreadable:

[line[:-(line[-1] == "\n") or len(line)+1] for line in file]

Which exploits the fact that the return value of or isn't a boolean, but the object that was evaluated as true or false.
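
A small sketch of how that slice behaves in both cases:

line = "abc\n"
print(repr(line[:-(line[-1] == "\n") or len(line)+1]))  # 'abc' -- -(True) == -1 strips the newline

line = "abc"
print(repr(line[:-(line[-1] == "\n") or len(line)+1]))  # 'abc' -- -(False) == 0 is falsy, so the slice end becomes len(line)+1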


The readlines method is actually equivalent to:

def readlines(self):
    lines = []
    for line in iter(self.readline, ""):
        lines.append(line)
    return lines

# or equivalently

def readlines(self):
    lines = []
    while True:
        line = self.readline()
        if not line:
            break
        lines.append(line)
    return lines

Since readline() keeps the newline, readlines() keeps it as well.

Note: for symmetry with readlines(), the writelines() method does not add ending newlines, so f2.writelines(f.readlines()) produces an exact copy of f in f2.
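
For example, a minimal copy sketch (the file names are hypothetical):

with open("source.txt") as f, open("copy.txt", "w") as f2:
    f2.writelines(f.readlines())  # newlines come through readlines() intact, so the copy is exact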

Answer #5

You probably want to line up with the """

def foo():
    string = """line one
             line two
             line three"""

Since the newlines and spaces are included in the string itself, you will have to postprocess it. If you don't want to do that and you have a whole lot of text, you might want to store it separately in a text file. If a text file does not work well for your application and you don't want to postprocess, I'd probably go with

def foo():
    string = ("this is an "
              "implicitly joined "
              "string")

If you want to postprocess a multiline string to trim out the parts you don't need, you should consider the textwrap module or the technique for postprocessing docstrings presented in PEP 257:

import sys

def trim(docstring):
    if not docstring:
        return ""
    # Convert tabs to spaces (following the normal Python rules)
    # and split into a list of lines:
    lines = docstring.expandtabs().splitlines()
    # Determine minimum indentation (first line doesn't count):
    indent = sys.maxint
    for line in lines[1:]:
        stripped = line.lstrip()
        if stripped:
            indent = min(indent, len(line) - len(stripped))
    # Remove indentation (first line is special):
    trimmed = [lines[0].strip()]
    if indent < sys.maxint:
        for line in lines[1:]:
            trimmed.append(line[indent:].rstrip())
    # Strip off trailing and leading blank lines:
    while trimmed and not trimmed[-1]:
        trimmed.pop()
    while trimmed and not trimmed[0]:
        trimmed.pop(0)
    # Return a single string:
    return "\n".join(trimmed)

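For example, applied to the indented string from foo above:

print(trim("""line one
             line two
             line three"""))
# line one
# line two
# line three

(On Python 3, inspect.cleandoc applies essentially the same algorithm.)
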
Answer #6

inputString.splitlines()

This will give you a list of the lines: the splitlines() method is designed to split a multiline string into a list with one element per line.

Answer #7

This should do what you want (file contents in a list, by line, without the trailing "\n"):

with open(filename) as f:
    mylist = f.read().splitlines() 

Answer #8

with open("C:/path/numbers.txt") as f:
    lines = f.read().splitlines()

this will give you a list of the values (strings) you had in your file, with newlines stripped.

Also, watch your backslashes in Windows path names, as those are also escape characters in strings. You can use forward slashes or doubled backslashes instead, as below.
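
For instance (raw strings are a third common option):

path = "C:\\path\\numbers.txt"  # doubled backslashes
path = "C:/path/numbers.txt"    # forward slashes
path = r"C:\path\numbers.txt"   # raw string: backslashes are not escape characters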

Answer #9

Question: I am using split("\n") to get lines in one string, and found that "".split() returns an empty list, [], while "".split("\n") returns [""].

The str.split() method has two algorithms. If no arguments are given, it splits on repeated runs of whitespace. However, if an argument is given, it is treated as a single delimiter with no repeated runs.

In the case of splitting an empty string, the first mode (no argument) will return an empty list because the whitespace is eaten and there are no values to put in the result list.

In contrast, the second mode (with an argument such as "\n") will produce the first empty field. Consider if you had written "\n".split("\n"): you would get two fields (one cut gives you two halves).
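
A quick side-by-side of the two modes:

>>> "  a  b  ".split()      # no argument: runs of whitespace collapse, ends are trimmed
['a', 'b']
>>> "  a  b  ".split(" ")   # explicit delimiter: every single space is a cut
['', '', 'a', '', 'b', '', '']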

Question: Is there any specific reason for such a difference?

This first mode is useful when data is aligned in columns with variable amounts of whitespace. For example:

>>> data = """Shasta      California     14,200
McKinley    Alaska         20,300
Fuji        Japan          12,400
"""
>>> for line in data.splitlines():
        print(line.split())

['Shasta', 'California', '14,200']
['McKinley', 'Alaska', '20,300']
['Fuji', 'Japan', '12,400']

The second mode is useful for delimited data such as CSV where repeated commas denote empty fields. For example:

>>> data = """Guido,BDFL,,Amsterdam
Barry,FLUFL,,USA
Tim,,,USA
"""
>>> for line in data.splitlines():
        print(line.split(","))

['Guido', 'BDFL', '', 'Amsterdam']
['Barry', 'FLUFL', '', 'USA']
['Tim', '', '', 'USA']

Note, the number of result fields is one greater than the number of delimiters. Think of cutting a rope: if you make no cuts, you have one piece; one cut gives two pieces; two cuts give three pieces. And so it is with Python's str.split(delimiter) method:

>>> "".split(",")       # No cuts
['']
>>> ",".split(",")      # One cut
['', '']
>>> ",,".split(",")     # Two cuts
['', '', '']

Question: And is there any more convenient way to count lines in a string?

Yes, there are a couple of easy ways. One uses str.count() and the other uses str.splitlines(). Both ways will give the same answer unless the final line is missing the "\n". If the final newline is missing, the str.splitlines approach will give the accurate answer. A faster technique that is also accurate uses the count method but then corrects it for the final newline:

>>> data = """Line 1
Line 2
Line 3
Line 4"""

>>> data.count("\n")                               # Inaccurate
3
>>> len(data.splitlines())                         # Accurate, but slow
4
>>> data.count("\n") + (not data.endswith("\n"))   # Accurate and fast
4

Question from @Kaz: Why the heck are two very different algorithms shoe-horned into a single function?

The signature for str.split is about 20 years old, and a number of the APIs from that era are strictly pragmatic. While not perfect, the method signature isn't "terrible" either. For the most part, Guido's API design choices have stood the test of time.

The current API is not without advantages. Consider strings such as:

ps_aux_header  = "USER               PID  %CPU %MEM      VSZ"
patient_header = "name,age,height,weight"

When asked to break these strings into fields, people tend to describe both using the same English word, "split". When asked to read code such as fields = line.split() or fields = line.split(","), people tend to correctly interpret the statements as "splits a line into fields".

Microsoft Excel's text-to-columns tool made a similar API choice and incorporates both splitting algorithms in the same tool. People seem to mentally model field-splitting as a single concept, even though more than one algorithm is involved.

Answer #10

The str.splitlines method should give you exactly that:

>>> data = """a,b,c
... d,e,f
... g,h,i
... j,k,l"""
>>> data.splitlines()
['a,b,c', 'd,e,f', 'g,h,i', 'j,k,l']
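
Unlike split("\n"), splitlines() also handles \r\n and \r line endings, and it does not produce a trailing empty string when the text ends with a newline:

>>> "a,b,c\r\nd,e,f\n".splitlines()
['a,b,c', 'd,e,f']
>>> "a,b,c\r\nd,e,f\n".split("\n")
['a,b,c\r', 'd,e,f', '']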
