Different Ways to Import CSV File into Pandas


Let's see how you can import a CSV file into Pandas.

Method #1: Using the read_csv() method.

# import the pandas module
import pandas as pd

# create a DataFrame from a CSV file hosted online
df = pd.read_csv("https://media.python.engineering/wp-content/uploads/nba.csv")

# display the first ten rows
df.head(10)


You can also pass a local file path instead of a URL:

# import the pandas module
import pandas as pd

# path to the local CSV file
filepath = r"C:\Gfg\datasets\nba.csv"

# read the CSV file into a DataFrame
df = pd.read_csv(filepath)

# print the first five rows
print(df.head())


Method #2: Using the csv module.

You can also read a CSV file directly using the csv module.

# import the csv module and pandas
import csv
import pandas as pd

# open the CSV file
with open(r"C:\Users\Admin\Downloads\nba.csv") as csv_file:

    # read the CSV file
    csv_reader = csv.reader(csv_file, delimiter=',')

    # build a pandas DataFrame from the parsed rows
    df = pd.DataFrame(list(csv_reader))

df.head()

# iterate over the values in column 1
for val in list(df[1]):
    print(val)






Different Ways to Import CSV File into Pandas: StackOverflow Questions

Python import csv to list

I have a CSV file with about 2000 records.

Each record has a string, and a category to it:

This is the first line,Line1
This is the second line,Line2
This is the third line,Line3

I need to read this file into a list that looks like this:

data = [("This is the first line", "Line1"),
        ("This is the second line", "Line2"),
        ("This is the third line", "Line3")]

How can I import this CSV into the list I need using Python?

Import CSV file as a pandas DataFrame

What"s the Python way to read in a CSV file into a pandas DataFrame (which I can then use for statistical operations, can have differently-typed columns, etc.)?

My CSV file "value.txt" has the following content:

Date,"price";"factor_1";"factor_2"
2012-06-11,1600.20,1.255,1.548
2012-06-12,1610.02,1.258,1.554
2012-06-13,1618.07,1.249,1.552
2012-06-14,1624.40,1.253,1.556
2012-06-15,1626.15,1.258,1.552
2012-06-16,1626.15,1.263,1.558
2012-06-17,1626.15,1.264,1.572

In R we would read this file in using:

price <- read.csv("value.txt")  

and that would return an R data.frame:

> price <- read.csv("value.txt")
> price
     Date   price factor_1 factor_2
1  2012-06-11 1600.20    1.255    1.548
2  2012-06-12 1610.02    1.258    1.554
3  2012-06-13 1618.07    1.249    1.552
4  2012-06-14 1624.40    1.253    1.556
5  2012-06-15 1626.15    1.258    1.552
6  2012-06-16 1626.15    1.263    1.558
7  2012-06-17 1626.15    1.264    1.572

Is there a Pythonic way to get the same functionality?
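
For reference, the pandas equivalent of R's read.csv is a one-liner. A minimal sketch, assuming the file uses one consistent delimiter (the sample header above mixes , and ;, which would additionally need the sep parameter or a cleanup pass):

import pandas as pd

# read the CSV into a DataFrame; pandas infers the column types
price = pd.read_csv("value.txt")
print(price.head())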

Answer #1

tl;dr / quick fix

  • Don't decode/encode willy-nilly
  • Don't assume your strings are UTF-8 encoded
  • Try to convert strings to Unicode strings as soon as possible in your code
  • Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
  • Don't be tempted to use quick reload hacks

Unicode Zen in Python 2.x - The Long Version

Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.

UnicodeDecodeError: "ascii" codec can"t decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode code points and therefore can hold any code point from across the entire spectrum. Strings contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5, etc. Strings are decoded to Unicode, and Unicode strings are encoded to strings. Files and text data are always transferred in encoded strings.
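
A minimal Python 2 sketch of the two types and the decode/encode round trip:

# Python 2.x: str holds encoded bytes, unicode holds code points
encoded = "caf\xc3\xa9"                    # UTF-8 bytes for "café"
decoded = encoded.decode("utf-8")          # str -> unicode
assert type(decoded) is unicode
assert decoded.encode("utf-8") == encoded  # unicode -> str round trip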

The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code; it will convert ASCII or re-wrap existing Unicode strings into a new Unicode string. The Markdown authors can't know the encoding of the incoming string, so they rely on you to decode strings to Unicode strings before passing them to Markdown.

Unicode strings can be declared in your code using the u prefix to strings. E.g.

>>> my_u = u"my ünicôdé strįng"
>>> type(my_u)
<type "unicode">

Unicode strings may also come from files, databases, and network modules. When this happens, you don't need to worry about the encoding.

Gotchas

Conversion from str to Unicode can happen even when you don't explicitly call unicode().

The following scenarios cause UnicodeDecodeError exceptions:

# Explicit conversion without encoding
unicode("€")

# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format("€")

# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: %s" % "€"

# Append string to Unicode
# Python will try to convert string to Unicode first
u"The currency is: " + "€"         

Examples

In the following diagram, you can see how the word café has been encoded in either "UTF-8" or "Cp1252" encoding, depending on the terminal type. In both examples, caf is just regular ASCII. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which also happens to be the Unicode code point value; that's no coincidence). The correct decode() is invoked, and conversion to a Python Unicode string is successful:

[Diagram: a string being converted to a Python Unicode string]

In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can't contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:

[Diagram: a string being converted to a Python Unicode string with the wrong encoding]
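
A minimal Python 2 sketch reproducing both diagrams:

# Python 2.x: the same word in two encodings
word = u"caf\xe9"                     # u"café"

utf8_bytes = word.encode("utf-8")     # "caf\xc3\xa9" - é takes two bytes
cp1252_bytes = word.encode("cp1252")  # "caf\xe9"     - é is the single byte 0xE9

# decoding with the correct encoding succeeds
assert utf8_bytes.decode("utf-8") == word
assert cp1252_bytes.decode("cp1252") == word

# decoding with ascii fails: ASCII has no bytes above 0x7F
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc3 ...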

The Unicode Sandwich

It"s good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.

Input / Decode

Source code

If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.

u"Zürich"

To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as "UTF-8", you would use:

# encoding: utf-8

This is only necessary when you have non-ASCII in your source code.

Files

Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file; it can't be easily guessed. For example, for a UTF-8 file:

import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
     my_unicode_string = my_file.read() 

my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.

CSV Files

The Python 2.7 csv module does not support non-ASCII characters. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.

Use it like above but pass the opened file to it:

from backports import csv
import io

# wrapped in a generator function so that yield is valid
def read_rows(path):
    with io.open(path, "r", encoding="utf-8") as my_file:
        for row in csv.reader(my_file):
            yield row

Databases

Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

MySQL

In the connection string add:

charset="utf8",
use_unicode=True

E.g.

>>> db = MySQLdb.connect(host="localhost", user="root", passwd="passwd", db="sandbox", use_unicode=True, charset="utf8")

PostgreSQL

Add:

psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

HTTP

Web pages can be encoded in just about any encoding. The Content-Type header should contain a charset field hinting at the encoding. The content can then be decoded manually against this value. Alternatively, Python Requests returns Unicode in response.text.
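
A minimal sketch with Requests (URL hypothetical):

import requests

r = requests.get("https://example.com/page")
print(r.encoding)    # charset taken from the Content-Type header
print(r.text[:100])  # body already decoded to a unicode string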

Manually

If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you've probably got the wrong encoding.

The meat of the sandwich

Work with Unicode strings as you would with normal strs.

Output

stdout / printing

print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicode strings are encoded to the console's encoding. For example, if a Linux shell's locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8-bit code page.

An incorrectly configured console, such as a corrupt locale, can lead to unexpected print errors. The PYTHONIOENCODING environment variable can force the encoding for stdout.
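
A quick sketch to see which encoding print will use:

import sys

# the encoding Python configured for the stdout stream
print(sys.stdout.encoding)  # e.g. UTF-8 with a correctly set Linux locale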

Files

Just like input, io.open can be used to transparently convert Unicode strings to encoded byte strings.
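
A minimal sketch for the output direction (file name hypothetical):

import io

# unicode in, UTF-8 encoded bytes written out
with io.open("out_utf8.txt", "w", encoding="utf-8") as f:
    f.write(u"Z\xfcrich")  # u"Zürich"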

Database

The same configuration for reading will allow Unicodes to be written directly.

Python 3

Python 3 is no more Unicode-capable than Python 2.x, but it is slightly less confused on the topic. E.g., the regular str is now a Unicode string and the old str is now bytes.

The default encoding is UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people's Unicode problems.
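
A quick Python 3 sketch:

# Python 3: .decode() defaults to UTF-8
print(b"caf\xc3\xa9".decode())  # café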

Further, open() operates in text mode by default, so it returns decoded str objects (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.

Why you shouldn"t use sys.setdefaultencoding("utf8")

It"s a nasty hack (there"s a reason you have to use reload) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen. See Why should we NOT use sys.setdefaultencoding("utf-8") in a py script? for further details

Answer #2

General way:

## text = list of strings to be written to file
with open("csvfile.csv", "w") as file:
    for line in text:
        file.write(line)
        file.write("\n")

OR

Using the CSV writer:

import csv
with open(<path to output_csv>, "w", newline="") as csv_file:
    writer = csv.writer(csv_file, delimiter=",")
    for line in data:
        writer.writerow(line)

OR

Simplest way:

f = open("csvfile.csv","w")
f.write("hi there
") #Give your csv text here.
## Python will convert 
 to os.linesep
f.close()

Answer #3

I prefer this solution using the csv module from the standard library and the with statement to avoid leaving the file open.

The key point is using "a" for appending when you open the file.

import csv   
fields=["first","second","third"]
with open(r"name", "a") as f:
    writer = csv.writer(f)
    writer.writerow(fields)

If you are using Python 2.7, you may experience superfluous new lines on Windows. You can try to avoid them by using "ab" instead of "a"; this will, however, cause TypeError: a bytes-like object is required, not 'str' in Python 3.6. Adding newline="", as Natacha suggests, will cause a backward incompatibility between Python 2 and 3.
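
For Python 3 only, a minimal sketch of the newline="" variant (same hypothetical file name as above):

import csv

fields = ["first", "second", "third"]

# newline="" lets the csv module control line endings itself,
# which avoids the superfluous blank lines on Windows
with open(r"name", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(fields)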

Answer #4

Another way of solving this is to use the DictReader class, which skips the header row and uses it to allow named indexing.

Given "foo.csv" as follows:

FirstColumn,SecondColumn
asdf,1234
qwer,5678

Use DictReader like this:

import csv
with open("foo.csv") as f:
    reader = csv.DictReader(f, delimiter=",")
    for row in reader:
        print(row["FirstColumn"])  # Access by column header instead of column number
        print(row["SecondColumn"])

Answer #5

The reason it is throwing that exception is because you have the argument rb, which opens the file in binary mode. Change that to r, which will by default open the file in text mode.

Your code:

import csv
ifile = open("sample.csv", "rb")
read = csv.reader(ifile)
for row in read:
    print(row)

New code:

import csv
ifile = open("sample.csv", "r")
read = csv.reader(ifile)
for row in read:
    print(row)

Answer #6

I timed the

from numpy import genfromtxt
genfromtxt(fname=dest_file, dtype=(<whatever options>))

versus

import csv
import numpy as np
with open(dest_file, "r") as dest_f:
    data_iter = csv.reader(dest_f,
                           delimiter=delimiter,
                           quotechar='"')
    data = [row for row in data_iter]
data_array = np.asarray(data, dtype=<whatever options>)

on 4.6 million rows with about 70 columns and found that the NumPy path took 2 min 16 secs and the csv-list comprehension method took 13 seconds.

I would recommend the csv-list comprehension method, as it most likely relies on pre-compiled libraries rather than the interpreter as much as NumPy does. I suspect the pandas method would have similar interpreter overhead.

Answer #7

2018-10-29 EDIT

Thank you for the comments.

I tested several kinds of code to get the number of lines in a csv file in terms of speed. The best method is below.

with open(filename) as f:
    sum(1 for line in f)

Here is the code tested.

import timeit
import csv
import pandas as pd

filename = "./sample_submission.csv"

def talktime(filename, funcname, func):
    print(f"# {funcname}")
    t = timeit.timeit(f'{funcname}("{filename}")', setup=f"from __main__ import {funcname}", number=100) / 100
    print("Elapsed time : ", t)
    print("n = ", func(filename))
    print("\n")

def sum1forline(filename):
    with open(filename) as f:
        return sum(1 for line in f)
talktime(filename, "sum1forline", sum1forline)

def lenopenreadlines(filename):
    with open(filename) as f:
        return len(f.readlines())
talktime(filename, "lenopenreadlines", lenopenreadlines)

def lenpd(filename):
    return len(pd.read_csv(filename)) + 1
talktime(filename, "lenpd", lenpd)

def csvreaderfor(filename):
    cnt = 0
    with open(filename) as f:
        cr = csv.reader(f)
        for row in cr:
            cnt += 1
    return cnt
talktime(filename, "csvreaderfor", csvreaderfor)

def openenum(filename):
    cnt = 0
    with open(filename) as f:
        for i, line in enumerate(f,1):
            cnt += 1
    return cnt
talktime(filename, "openenum", openenum)

The results are below.

# sum1forline
Elapsed time :  0.6327946722068599
n =  2528244


# lenopenreadlines
Elapsed time :  0.655304473598555
n =  2528244


# lenpd
Elapsed time :  0.7561274056295324
n =  2528244


# csvreaderfor
Elapsed time :  1.5571560935772661
n =  2528244


# openenum
Elapsed time :  0.773000013928679
n =  2528244

In conclusion, sum(1 for line in f) is the fastest, though there may be no significant difference from len(f.readlines()).

sample_submission.csv is 30.2MB and has 31 million characters.

Answer #8

Updated for Python 3:

import csv

with open("file.csv", newline="") as f:
    reader = csv.reader(f)
    your_list = list(reader)

print(your_list)

Output:

[["This is the first line", "Line1"], ["This is the second line", "Line2"], ["This is the third line", "Line3"]]

Answer #9

The CSV file might contain very large fields; therefore, increase the field_size_limit:

import sys
import csv

csv.field_size_limit(sys.maxsize)

sys.maxsize works for Python 2.x and 3.x. sys.maxint would only work with Python 2.x (SO: what-is-sys-maxint-in-python-3)

Update

As Geoff pointed out, the code above might result in the following error: OverflowError: Python int too large to convert to C long. To circumvent this, you could use the following quick and dirty code (which should work on every system with Python 2 and Python 3):

import sys
import csv
maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10 
    # as long as the OverflowError occurs.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)

Answer #10

Using the csv module:

import csv

with open("file.csv", newline="") as f:
    reader = csv.reader(f)
    data = list(reader)

print(data)

Output:

[["This is the first line", "Line1"], ["This is the second line", "Line2"], ["This is the third line", "Line3"]]

If you need tuples:

import csv

with open("file.csv", newline="") as f:
    reader = csv.reader(f)
    data = [tuple(row) for row in reader]

print(data)

Output:

[("This is the first line", "Line1"), ("This is the second line", "Line2"), ("This is the third line", "Line3")]

Old Python 2 answer, also using the csv module:

import csv
with open("file.csv", "rb") as f:
    reader = csv.reader(f)
    your_list = list(reader)

print your_list
# [['This is the first line', 'Line1'],
#  ['This is the second line', 'Line2'],
#  ['This is the third line', 'Line3']]
