How to import a text file on AWS S3 into pandas without writing to disk

StackOverflow

I have a tab-delimited table saved as a text file on S3. I want to load it into pandas, but I cannot save it to disk first because I am running on a Heroku server. Here is what I have so far.

import io
import boto3
import os
import pandas as pd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"

s3_client = boto3.client("s3")
response = s3_client.get_object(Bucket="my_bucket",Key="filename.txt")
file = response["Body"]


pd.read_csv(file, header=14, delimiter="\t", low_memory=False)

The error is:

OSError: Expected file path name or file-like object, got <class 'bytes'> type

How do I convert the response body into a format pandas will accept?

pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: initial_value must be str or None, not StreamingBody

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: 'StreamingBody' does not support the buffer interface

UPDATE - Using the following worked

file = response["Body"].read()

and

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
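
For anyone who wants to check the parsing step without touching S3, here is a minimal sketch. The helper name read_tsv_bytes and the sample bytes are made up for illustration; the bytes simply stand in for whatever response["Body"].read() returns.

```python
import io

import pandas as pd


def read_tsv_bytes(raw: bytes, header_row: int = 0) -> pd.DataFrame:
    """Parse tab-delimited bytes entirely in memory (hypothetical helper)."""
    # BytesIO wraps the raw bytes in a file-like object that pandas accepts.
    return pd.read_csv(
        io.BytesIO(raw), header=header_row, delimiter="\t", low_memory=False
    )


# Stand-in for response["Body"].read(): two preamble lines, then header and data.
sample = b"generated by tool\nversion 1\nname\tvalue\na\t1\nb\t2\n"
df = read_tsv_bytes(sample, header_row=2)
```

Nothing is written to disk at any point: the whole object is held in memory, so this suits files that fit comfortably in RAM.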

Answer rating: 130

pandas uses boto for read_csv, so you should be able to:

import boto
import pandas as pd

data = pd.read_csv("s3://bucket....csv")

If you need boto3 because you are on Python 3.4+, you can use:

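import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="bucket", Key="key")
# obj["Body"] is a StreamingBody; read() returns bytes, which BytesIO wraps as a file-like object
df = pd.read_csv(io.BytesIO(obj["Body"].read()))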

Since version 0.20.1, pandas uses s3fs for this; see the answer below.

Answer rating: 101

Now pandas can handle S3 URLs. You could simply do:

import pandas as pd
import s3fs

df = pd.read_csv("s3://bucket-name/file.csv")

You need to install s3fs if you don't have it:

pip install s3fs

Authentication

If your S3 bucket is private and requires authentication, you have two options:

1- Add access credentials to your ~/.aws/credentials config file

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Or

2- Set the following environment variables with their proper values (the names are uppercase):

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_SESSION_TOKEN (only needed for temporary credentials)
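
As a possible third route, recent pandas versions (1.2 and later, as far as I know) accept a storage_options dict on read_csv that is forwarded to s3fs, so credentials can be passed explicitly. A minimal sketch, assuming the uppercase environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set; the helper name s3_storage_options is hypothetical, and the read_csv call is left commented out because it needs a real bucket.

```python
import os


def s3_storage_options(env=None) -> dict:
    """Build an s3fs-style credentials dict from environment variables (hypothetical helper)."""
    if env is None:
        env = os.environ
    # "key", "secret" and "token" are the argument names s3fs uses for credentials.
    options = {
        "key": env["AWS_ACCESS_KEY_ID"],
        "secret": env["AWS_SECRET_ACCESS_KEY"],
    }
    token = env.get("AWS_SESSION_TOKEN")
    if token:  # only present when using temporary credentials
        options["token"] = token
    return options


# Usage (requires pandas, s3fs and a real bucket, so it is not executed here):
# import pandas as pd
# df = pd.read_csv("s3://bucket-name/file.csv", storage_options=s3_storage_options())
```

This keeps the credentials out of the URL and out of the code, at the cost of one extra argument on each read.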



