Python convert Set to Dict

Convert a set into a dictionary

In day-to-day coding we often need to convert one data structure into another; for example, we may want to build a dictionary from the items of a given set. Let's discuss some methods to convert a given set into a dictionary.




Convert Python Set to Dictionary

To convert a Python set to a dictionary, use the dict.fromkeys() method. fromkeys() is a built-in class method that creates a new dictionary whose keys are the given items, each mapped to a value provided by the user. A dictionary is a key-value data structure, so if we use the set elements as keys, we need to supply the values ourselves.

Syntax
dict.fromkeys(keys, value)

Parameters
The keys parameter is required; it is an iterable specifying the keys of the new dictionary. The value parameter is optional; it is the value assigned to every key and defaults to None.
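
A minimal sketch of both forms; with no value argument, fromkeys() fills in None:

s = {1, 2, 3}

print(dict.fromkeys(s))     # {1: None, 2: None, 3: None} -- value defaults to None
print(dict.fromkeys(s, 0))  # {1: 0, 2: 0, 3: 0}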

Set to Dict Using fromkeys()

# Python code to demonstrate
# converting a set into a dictionary
# using fromkeys()

# initializing set
ini_set = {1, 2, 3, 4, 5}

# printing the initialized set
print("initial set:", ini_set)
print(type(ini_set))

# converting the set to a dictionary
res = dict.fromkeys(ini_set, 0)

# printing the final result and its type
print("final dict:", res)
print(type(res))

Output:

initial set: {1, 2, 3, 4, 5}
<class 'set'>
final dict: {1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
<class 'dict'>

Set to Dict Using dict comprehension

# Python code to demonstrate
# converting a set into a dictionary
# using a dict comprehension

# initializing set
ini_set = {1, 2, 3, 4, 5}

# printing the initialized set
print("initial set:", ini_set)
print(type(ini_set))

suffix = 'fg'  # avoid shadowing the builtin str
# converting the set to a dict
res = {element: 'Geek' + suffix for element in ini_set}

# printing the final result and its type
print("final dict:", res)
print(type(res))

Output:

initial set: {1, 2, 3, 4, 5}
<class 'set'>
final dict: {1: 'Geekfg', 2: 'Geekfg', 3: 'Geekfg', 4: 'Geekfg', 5: 'Geekfg'}
<class 'dict'>
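
Unlike fromkeys(), a comprehension can derive a different value for each key. A small sketch (the squares here are just an arbitrary example):

ini_set = {1, 2, 3, 4, 5}

# compute a per-key value inside the comprehension
res = {element: element ** 2 for element in ini_set}
print(res)  # {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}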




Set to dict Python

StackOverflow question

Is there any pythonic way to convert a set into a dict?

I got the following set

s = {1,2,4,5,6}

and want the following dict

c = {1:0, 2:0, 4:0, 5:0, 6:0}

with a list you would do

a = [1,2,3,4,5,6]
b = []

while len(b) < len(a):
   b.append(0)

c = dict(zip(a, b))

Answer

Use dict.fromkeys():

c = dict.fromkeys(s, 0)

Demo:

>>> s = {1,2,4,5,6}
>>> dict.fromkeys(s, 0)
{1: 0, 2: 0, 4: 0, 5: 0, 6: 0}

This works for lists as well; it is the most efficient way to create a dictionary from a sequence. Note that all values reference the single default object you passed into dict.fromkeys(), so be careful when that default value is a mutable object.
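
A minimal sketch of that pitfall, and the usual workaround (a dict comprehension creates a fresh object per key):

s = {1, 2, 3}

shared = dict.fromkeys(s, [])   # every key shares ONE list object
shared[1].append("x")
print(shared)                   # {1: ['x'], 2: ['x'], 3: ['x']}

separate = {k: [] for k in s}   # a new list per key
separate[1].append("x")
print(separate)                 # {1: ['x'], 2: [], 3: []}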





Python convert Set to Dict: StackOverflow Questions

Convert JSON string to dict using Python

I'm a little bit confused with JSON in Python. To me, it seems like a dictionary, and for that reason I'm trying to do that:

{
    "glossary":
    {
        "title": "example glossary",
        "GlossDiv":
        {
            "title": "S",
            "GlossList":
            {
                "GlossEntry":
                {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef":
                    {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

But when I do print dict(json), it gives an error.

How can I transform this string into a structure and then call json["title"] to obtain "example glossary"?
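
A minimal sketch of one way to do this with the standard json module (the short string below stands in for the full glossary document above):

import json

json_string = '{"glossary": {"title": "example glossary"}}'

data = json.loads(json_string)      # parse the JSON text into Python objects
print(data["glossary"]["title"])    # example glossary ("title" is nested under "glossary")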

Convert Django Model object to dict with all of the fields intact

How does one convert a Django Model object to a dict with all of its fields? Ideally this would include foreign keys and fields with editable=False.

Let me elaborate. Let's say I have a Django model like the following:

from django.db import models

class OtherModel(models.Model): pass

class SomeModel(models.Model):
    normal_value = models.IntegerField()
    readonly_value = models.IntegerField(editable=False)
    auto_now_add = models.DateTimeField(auto_now_add=True)
    foreign_key = models.ForeignKey(OtherModel, related_name="ref1")
    many_to_many = models.ManyToManyField(OtherModel, related_name="ref2")

In the terminal, I have done the following:

other_model = OtherModel()
other_model.save()
instance = SomeModel()
instance.normal_value = 1
instance.readonly_value = 2
instance.foreign_key = other_model
instance.save()
instance.many_to_many.add(other_model)
instance.save()

I want to convert this to the following dictionary:

{"auto_now_add": datetime.datetime(2015, 3, 16, 21, 34, 14, 926738, tzinfo=<UTC>),
 "foreign_key": 1,
 "id": 1,
 "many_to_many": [1],
 "normal_value": 1,
 "readonly_value": 2}

Questions with unsatisfactory answers:

Django: Converting an entire set of a Model's objects into a single dictionary

How can I turn Django Model objects into a dictionary and still have their foreign keys?

Converting JSON String to Dictionary Not List

I am trying to pass in a JSON file and convert the data into a dictionary.

So far, this is what I have done:

import json
json1_file = open("json1")
json1_str = json1_file.read()
json1_data = json.loads(json1_str)

I'm expecting json1_data to be a dict type but it actually comes out as a list type when I check it with type(json1_data).

What am I missing? I need this to be a dictionary so I can access one of the keys.
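
What is likely happening: the top-level JSON value in the file is an array, and json.loads() maps a JSON array to a Python list. A small sketch with hypothetical file contents:

import json

json1_str = '[{"id": 1, "name": "first"}]'   # hypothetical contents of "json1"
json1_data = json.loads(json1_str)

print(type(json1_data))   # <class 'list'>
record = json1_data[0]    # index into the list to reach the dict
print(record["name"])     # first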

python tuple to dict

For the tuple t = ((1, "a"), (2, "b")), dict(t) returns {1: "a", 2: "b"}.

Is there a good way to get {"a": 1, "b": 2} (keys and vals swapped)?

Ultimately, I want to be able to return 1 given "a" or 2 given "b", perhaps converting to a dict is not the best way.
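
One way, sketched with a dict comprehension that swaps each pair while converting:

t = ((1, "a"), (2, "b"))

lookup = {value: key for key, value in t}
print(lookup)        # {'a': 1, 'b': 2}
print(lookup["a"])   # 1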

String to Dictionary in Python

So I've spent way too much time on this, and it seems to me like it should be a simple fix. I'm trying to use Facebook's Authentication to register users on my site, and I'm trying to do it server side. I've gotten to the point where I get my access token, and when I go to:

https://graph.facebook.com/me?access_token=MY_ACCESS_TOKEN

I get the information I'm looking for as a string that's like this:

{"id":"123456789";"name":"John Doe";"first_name":"John";"last_name":"Doe";"link":"http://www.facebook.com/jdoe";"gender":"male";"email":"jdoeu0040gmail.com";"timezone":-7,"locale":"en_US";"verified":true,"updated_time":"2011-01-12T02:43:35+0000"}

It seems like I should just be able to use dict(string) on this but I'm getting this error:

ValueError: dictionary update sequence element #0 has length 1; 2 is required

So I tried using Pickle, but got this error:

KeyError: "{"

I tried using django.serializers to de-serialize it but had similar results. Any thoughts? I feel like the answer has to be simple, and I'm just being stupid. Thanks for any help!

python pandas dataframe to dictionary

I have a two-column dataframe and intend to convert it to a Python dictionary: the first column will be the key and the second will be the value. Thank you in advance.

Dataframe:

    id    value
0    0     10.2
1    1      5.7
2    2      7.4
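
A minimal sketch of one way to do this, assuming the frame shown above:

import pandas as pd

df = pd.DataFrame({"id": [0, 1, 2], "value": [10.2, 5.7, 7.4]})

# make the key column the index, select the value column, then convert
result = df.set_index("id")["value"].to_dict()
print(result)   # {0: 10.2, 1: 5.7, 2: 7.4}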

How to convert list of key-value tuples into dictionary?

I have a list that looks like:

[("A", 1), ("B", 2), ("C", 3)]

I want to turn it into a dictionary that looks like:

{"A": 1, "B": 2, "C": 3}

What's the best way to go about this?

EDIT: My list of tuples is actually more like:

[(A, 12937012397), (BERA, 2034927830), (CE, 2349057340)]

I am getting the error ValueError: dictionary update sequence element #0 has length 1916; 2 is required
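
A sketch of the direct conversion, plus a guess at the error: dict() accepts any iterable of 2-item sequences, and "length 1916; 2 is required" usually means the elements are single long strings rather than pairs, so they must be split first:

pairs = [("A", 1), ("B", 2), ("C", 3)]
print(dict(pairs))   # {'A': 1, 'B': 2, 'C': 3}

# hypothetical raw data where each element is one long string, not a tuple
raw = ["A,12937012397", "BERA,2034927830"]
print(dict(item.split(",", 1) for item in raw))   # {'A': '12937012397', ...}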

URL query parameters to dict python

Is there a way to parse a URL (with some python library) and return a python dictionary with the keys and values of a query parameters part of the URL?

For example:

url = "http://www.example.org/default.html?ct=32&op=92&item=98"

expected return:

{"ct":32, "op":92, "item":98}
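
A minimal sketch with the standard library's urllib.parse (note parse_qs returns each value as a list of strings, so matching the expected output needs a flattening step):

from urllib.parse import urlsplit, parse_qs

url = "http://www.example.org/default.html?ct=32&op=92&item=98"

params = parse_qs(urlsplit(url).query)   # {'ct': ['32'], 'op': ['92'], 'item': ['98']}
result = {k: int(v[0]) for k, v in params.items()}
print(result)                            # {'ct': 32, 'op': 92, 'item': 98}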

List of tuples to dictionary

Here's how I'm currently converting a list of tuples to a dictionary in Python:

l = [("a",1),("b",2)]
h = {}
[h.update({k:v}) for k,v in l]
> [None, None]
h
> {"a": 1, "b": 2}

Is there a better way? It seems like there should be a one-liner to do this.

python pandas dataframe columns convert to dict key and value

I have a pandas data frame with multiple columns and I would like to construct a dict from two columns: one as the dict's keys and the other as the dict's values. How can I do that?

Dataframe:

           area  count
co tp
DE Lake      10      7
Forest       20      5
FR Lake      30      2
Forest       40      3

I need to define area as key, count as value in dict. Thank you in advance.
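
A minimal sketch, rebuilding a small stand-in for the frame above; note df["count"] needs bracket access because .count is a DataFrame method:

import pandas as pd

df = pd.DataFrame(
    {"area": [10, 20, 30, 40], "count": [7, 5, 2, 3]},
    index=pd.MultiIndex.from_tuples(
        [("DE", "Lake"), ("DE", "Forest"), ("FR", "Lake"), ("FR", "Forest")],
        names=["co", "tp"],
    ),
)

result = dict(zip(df["area"], df["count"]))
print(result)   # {10: 7, 20: 5, 30: 2, 40: 3}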

Answer #1

There are many ways to convert an instance to a dictionary, with varying degrees of corner case handling and closeness to the desired result.


1. instance.__dict__

instance.__dict__

which returns

{"_foreign_key_cache": <OtherModel: OtherModel object>,
 "_state": <django.db.models.base.ModelState at 0x7ff0993f6908>,
 "auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key_id": 2,
 "id": 1,
 "normal_value": 1,
 "readonly_value": 2}

This is by far the simplest, but is missing many_to_many, foreign_key is misnamed, and it has two unwanted extra things in it.


2. model_to_dict

from django.forms.models import model_to_dict
model_to_dict(instance)

which returns

{"foreign_key": 2,
 "id": 1,
 "many_to_many": [<OtherModel: OtherModel object>],
 "normal_value": 1}

This is the only one with many_to_many, but is missing the uneditable fields.


3. model_to_dict(..., fields=...)

from django.forms.models import model_to_dict
model_to_dict(instance, fields=[field.name for field in instance._meta.fields])

which returns

{"foreign_key": 2, "id": 1, "normal_value": 1}

This is strictly worse than the standard model_to_dict invocation.


4. query_set.values()

SomeModel.objects.filter(id=instance.id).values()[0]

which returns

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key_id": 2,
 "id": 1,
 "normal_value": 1,
 "readonly_value": 2}

This is the same output as instance.__dict__ but without the extra fields. foreign_key_id is still wrong and many_to_many is still missing.


5. Custom Function

The code for Django's model_to_dict had most of the answer. It explicitly removed non-editable fields, so removing that check and getting the ids of foreign keys for many-to-many fields results in the following code, which behaves as desired:

from itertools import chain

def to_dict(instance):
    opts = instance._meta
    data = {}
    for f in chain(opts.concrete_fields, opts.private_fields):
        data[f.name] = f.value_from_object(instance)
    for f in opts.many_to_many:
        data[f.name] = [i.id for i in f.value_from_object(instance)]
    return data

While this is the most complicated option, calling to_dict(instance) gives us exactly the desired result:

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

6. Use Serializers

Django REST Framework's ModelSerializer allows you to build a serializer automatically from a model.

from rest_framework import serializers
class SomeModelSerializer(serializers.ModelSerializer):
    class Meta:
        model = SomeModel
        fields = "__all__"

SomeModelSerializer(instance).data

returns

{"auto_now_add": "2018-12-20T21:34:29.494827Z",
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

This is almost as good as the custom function, but auto_now_add is a string instead of a datetime object.


Bonus Round: better model printing

If you want a Django model that has a better Python command-line display, have your models subclass the following:

from django.db import models
from itertools import chain

class PrintableModel(models.Model):
    def __repr__(self):
        return str(self.to_dict())

    def to_dict(instance):
        opts = instance._meta
        data = {}
        for f in chain(opts.concrete_fields, opts.private_fields):
            data[f.name] = f.value_from_object(instance)
        for f in opts.many_to_many:
            data[f.name] = [i.id for i in f.value_from_object(instance)]
        return data

    class Meta:
        abstract = True

So, for example, if we define our models as such:

class OtherModel(PrintableModel): pass

class SomeModel(PrintableModel):
    normal_value = models.IntegerField()
    readonly_value = models.IntegerField(editable=False)
    auto_now_add = models.DateTimeField(auto_now_add=True)
    foreign_key = models.ForeignKey(OtherModel, related_name="ref1")
    many_to_many = models.ManyToManyField(OtherModel, related_name="ref2")

Calling SomeModel.objects.first() now gives output like this:

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

Answer #2

Use df.to_dict("records"); it gives the output without having to transpose externally.

In [2]: df.to_dict("records")
Out[2]:
[{"customer": 1L, "item1": "apple", "item2": "milk", "item3": "tomato"},
 {"customer": 2L, "item1": "water", "item2": "orange", "item3": "potato"},
 {"customer": 3L, "item1": "juice", "item2": "mango", "item3": "chips"}]

Answer #3

Edit

As John Galt mentions in his answer, you should probably instead use df.to_dict("records"). It's faster than transposing manually.

In [20]: timeit df.T.to_dict().values()
1000 loops, best of 3: 395 µs per loop

In [21]: timeit df.to_dict("records")
10000 loops, best of 3: 53 µs per loop

Original answer

Use df.T.to_dict().values(), like below:

In [1]: df
Out[1]:
   customer  item1   item2   item3
0         1  apple    milk  tomato
1         2  water  orange  potato
2         3  juice   mango   chips

In [2]: df.T.to_dict().values()
Out[2]:
[{"customer": 1.0, "item1": "apple", "item2": "milk", "item3": "tomato"},
 {"customer": 2.0, "item1": "water", "item2": "orange", "item3": "potato"},
 {"customer": 3.0, "item1": "juice", "item2": "mango", "item3": "chips"}]

Answer #4

The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.

First, we need to modify the original DataFrame to add the row with data [3, 10].

df1 = pd.DataFrame(data = {"col1" : [1, 2, 3, 4, 5, 3], 
                           "col2" : [10, 11, 12, 13, 14, 10]}) 
df2 = pd.DataFrame(data = {"col1" : [1, 2, 3],
                           "col2" : [10, 11, 12]})

df1

   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
5     3    10

df2

   col1  col2
0     1    10
1     2    11
2     3    12

Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly one row of df2. Use the indicator parameter to return an extra column indicating which table the row came from.

df_all = df1.merge(df2.drop_duplicates(), on=["col1","col2"], 
                   how="left", indicator=True)
df_all

   col1  col2     _merge
0     1    10       both
1     2    11       both
2     3    12       both
3     4    13  left_only
4     5    14  left_only
5     3    10  left_only

Create a boolean condition:

df_all["_merge"] == "left_only"

0    False
1    False
2    False
3     True
4     True
5     True
Name: _merge, dtype: bool
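
To actually select the rows unique to df1, a short follow-up sketch using that mask (continuing from df_all above):

df1_only = df_all[df_all["_merge"] == "left_only"].drop(columns="_merge")
df1_only

   col1  col2
3     4    13
4     5    14
5     3    10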

Why other solutions are wrong

A few solutions make the same mistake: they only check that each value is independently in each column, not together in the same row. Adding the last row, which is unique but contains values found in both columns of df2, exposes the mistake:

common = df1.merge(df2,on=["col1","col2"])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

This solution gets the same wrong result:

df1.isin(df2.to_dict("l")).all(1)

Answer #5

How do I convert a list of dictionaries to a pandas DataFrame?

The other answers are correct, but not much has been explained in terms of the advantages and limitations of these methods. The aim of this post is to show examples of these methods under different situations, discuss when to use them (and when not to), and suggest alternatives.


DataFrame(), DataFrame.from_records(), and .from_dict()

Depending on the structure and format of your data, there are situations where either all three methods work, or some work better than others, or some don't work at all.

Consider a very contrived example.

np.random.seed(0)
data = pd.DataFrame(
    np.random.choice(10, (3, 4)), columns=list("ABCD")).to_dict("records")

print(data)
[{"A": 5, "B": 0, "C": 3, "D": 3},
 {"A": 7, "B": 9, "C": 3, "D": 5},
 {"A": 2, "B": 4, "C": 7, "D": 6}]

This list consists of "records" with every key present. This is the simplest case you could encounter.

# The following methods all produce the same output.
pd.DataFrame(data)
pd.DataFrame.from_dict(data)
pd.DataFrame.from_records(data)

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6

Word on Dictionary Orientations: orient="index"/"columns"

Before continuing, it is important to make the distinction between the different types of dictionary orientations and their support in pandas. There are two primary types: "columns" and "index".

orient="columns"
Dictionaries with the "columns" orientation will have their keys correspond to columns in the equivalent DataFrame.

For example, data above is in the "columns" orient.

data_c = [
 {"A": 5, "B": 0, "C": 3, "D": 3},
 {"A": 7, "B": 9, "C": 3, "D": 5},
 {"A": 2, "B": 4, "C": 7, "D": 6}]
pd.DataFrame.from_dict(data_c, orient="columns")

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6

Note: If you are using pd.DataFrame.from_records, the orientation is assumed to be "columns" (you cannot specify otherwise), and the dictionaries will be loaded accordingly.

orient="index"
With this orient, keys are assumed to correspond to index values. This kind of data is best suited for pd.DataFrame.from_dict.

data_i ={
 0: {"A": 5, "B": 0, "C": 3, "D": 3},
 1: {"A": 7, "B": 9, "C": 3, "D": 5},
 2: {"A": 2, "B": 4, "C": 7, "D": 6}}
pd.DataFrame.from_dict(data_i, orient="index")

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6

This case is not considered in the OP, but is still useful to know.

Setting Custom Index

If you need a custom index on the resultant DataFrame, you can set it using the index=... argument.

pd.DataFrame(data, index=["a", "b", "c"])
# pd.DataFrame.from_records(data, index=["a", "b", "c"])

   A  B  C  D
a  5  0  3  3
b  7  9  3  5
c  2  4  7  6

This is not supported by pd.DataFrame.from_dict.

Dealing with Missing Keys/Columns

All methods work out-of-the-box when handling dictionaries with missing keys/column values. For example,

data2 = [
     {"A": 5, "C": 3, "D": 3},
     {"A": 7, "B": 9, "F": 5},
     {"B": 4, "C": 7, "E": 6}]
# The methods below all produce the same output.
pd.DataFrame(data2)
pd.DataFrame.from_dict(data2)
pd.DataFrame.from_records(data2)

     A    B    C    D    E    F
0  5.0  NaN  3.0  3.0  NaN  NaN
1  7.0  9.0  NaN  NaN  NaN  5.0
2  NaN  4.0  7.0  NaN  6.0  NaN

Reading Subset of Columns

"What if I don't want to read in every single column?" You can easily specify this using the columns=... parameter.

For example, from the example dictionary of data2 above, if you wanted to read only columns "A", "D", and "F", you can do so by passing a list:

pd.DataFrame(data2, columns=["A", "D", "F"])
# pd.DataFrame.from_records(data2, columns=["A", "D", "F"])

     A    D    F
0  5.0  3.0  NaN
1  7.0  NaN  5.0
2  NaN  NaN  NaN

This is not supported by pd.DataFrame.from_dict with the default orient "columns".

pd.DataFrame.from_dict(data2, orient="columns", columns=["A", "B"])
ValueError: cannot use columns parameter with orient="columns"

Reading Subset of Rows

Not supported by any of these methods directly. You will have to iterate over your data and perform a reverse delete in-place as you iterate. For example, to extract only the 0th and 2nd rows from data2 above, you can use:

rows_to_select = {0, 2}
for i in reversed(range(len(data2))):
    if i not in rows_to_select:
        del data2[i]

pd.DataFrame(data2)
# pd.DataFrame.from_dict(data2)
# pd.DataFrame.from_records(data2)

     A    B  C    D    E
0  5.0  NaN  3  3.0  NaN
1  NaN  4.0  7  NaN  6.0

The Panacea: json_normalize for Nested Data

A strong, robust alternative to the methods outlined above is the json_normalize function which works with lists of dictionaries (records), and in addition can also handle nested dictionaries.

pd.json_normalize(data)

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6
pd.json_normalize(data2)

     A    B  C    D    E
0  5.0  NaN  3  3.0  NaN
1  NaN  4.0  7  NaN  6.0

Again, keep in mind that the data passed to json_normalize needs to be in the list-of-dictionaries (records) format.

As mentioned, json_normalize can also handle nested dictionaries. Here's an example taken from the documentation.

data_nested = [
  {"counties": [{"name": "Dade", "population": 12345},
                {"name": "Broward", "population": 40000},
                {"name": "Palm Beach", "population": 60000}],
   "info": {"governor": "Rick Scott"},
   "shortname": "FL",
   "state": "Florida"},
  {"counties": [{"name": "Summit", "population": 1234},
                {"name": "Cuyahoga", "population": 1337}],
   "info": {"governor": "John Kasich"},
   "shortname": "OH",
   "state": "Ohio"}
]
pd.json_normalize(data_nested, 
                          record_path="counties", 
                          meta=["state", "shortname", ["info", "governor"]])

         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

For more information on the meta and record_path arguments, check out the documentation.


Summarising

Here"s a table of all the methods discussed above, along with supported features/functionality.

[table image: the methods discussed above and the features each supports]

* Use orient="columns" and then transpose to get the same effect as orient="index".

Answer #6

TLDR; No, for loops are not blanket "bad", at least, not always. It is probably more accurate to say that some vectorized operations are slower than iterating, versus saying that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:

  1. When your data is small (...depending on what you're doing),
  2. When dealing with object/mixed dtypes
  3. When using the str/regex accessor functions

Let"s examine these situations individually.


Iteration v/s Vectorization on Small Data

Pandas follows a "Convention Over Configuration" approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must be handled internally by the function to ensure that it works:

  1. Index/axis alignment
  2. Handling mixed datatypes
  3. Handling missing data

Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, Series.add), while it is more pronounced for string functions (for example, Series.str.replace).

for loops, on the other hand, are faster than you think. What's even better: list comprehensions (which create lists through for loops) are even faster, as they are optimized iterative mechanisms for list creation.

List comprehensions follow the pattern

[f(x) for x in seq]

Where seq is a pandas series or DataFrame column. Or, when operating over multiple columns,

[f(x, y) for x, y in zip(seq1, seq2)]

Where seq1 and seq2 are columns.

Numeric Comparison
Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:

# Boolean indexing with Numeric value comparison.
df[df.A != df.B]                            # vectorized !=
df.query("A != B")                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

For simplicity, I have used the perfplot package to run all the timeit tests in this post. The timings for the operations above are below:

[timing plot: vectorized !=, query (numexpr), and list comp]

The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note
It is worth mentioning that much of the benefit of list comprehensions comes from not having to worry about index alignment, but this means that if your code depends on index alignment, it will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the "best of both worlds", allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as

df[df.A.values != df.B.values]

Which outperforms both the pandas and list comprehension equivalents:

NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.

Value Counts
Taking another example - this time, with another vanilla python construct that is faster than a for loop - collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is done with value_counts, np.unique, and Counter:

# Value Counts comparison.
ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter

[timing plot: value_counts, np.unique, and Counter]

The results are more pronounced: Counter wins out over both vectorized methods for a larger range of small N (~3500).

Note
More trivia (courtesy @user2357112). The Counter is implemented with a C accelerator, so while it still has to work with python objects instead of the underlying C datatypes, it is still faster than a for loop. Python power!

Of course, the takeaway here is that performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don't give you the performance you need, there is always cython and numba. Let's add this test into the mix.

from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]

    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba

[timing plot: the numba mask versus the previous methods]

Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.


Operations with Mixed/object dtypes

String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

# Boolean indexing with string value comparison.
df[df.A != df.B]                            # vectorized !=
df.query("A != B")                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

[timing plot: the same three methods on string data]

So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.

Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: map and the list comprehension. The setup is in the Appendix, under the heading "Code Snippets".

# Dictionary value extraction.
ser.map(operator.itemgetter("value"))     # map
pd.Series([x.get("value") for x in ser])  # list comprehension

[timing plot: map versus list comprehension for dictionary value extraction]

Positional List Indexing
Timings for three operations that extract the 0th element from a column of lists (handling exceptions): map, the str accessor method, and the list comprehension:

# List positional indexing. 
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th)                                          # map
ser.str[0]                                                # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])  # list comp
pd.Series([get_0th(x) for x in ser])                      # list comp safe

Note
If the index matters, you would want to do:

pd.Series([...], index=ser.index)

When reconstructing the series.

[timing plot: map, str accessor, and list comprehensions for positional indexing]

List Flattening
A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.

# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True)  # stack
pd.Series(list(chain.from_iterable(ser.tolist())))         # itertools.chain
pd.Series([y for x in ser for y in x])                     # nested list comp

[timing plot: stack, itertools.chain, and nested list comp for flattening]

Both itertools.chain.from_iterable and the nested list comprehension are pure python constructs, and scale much better than the stack solution.

These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.

Lastly, the applicability of these solutions depends widely on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed apply on these solutions, because it would skew the graph (yes, it's that slow).


Regex Operations, and .str Accessor Methods

Pandas can apply regex operations such as str.contains, str.extract, and str.extractall, as well as other "vectorized" string operations (such as str.split, str.find, str.translate, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be convenience functions more than anything else.

It is usually much faster to pre-compile a regex pattern and iterate over your data with re.compile (also see Is it worth using Python's re.compile?). The list comp equivalent to str.contains looks something like this:

p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])

Or,

ser2 = ser[[bool(p.search(x)) for x in ser]]

If you need to handle NaNs, you can do something like

ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]

The list comp equivalent to str.extract (without groups) will look something like:

df["col2"] = [p.search(x).group(0) for x in df["col"]]

If you need to handle no-matches and NaNs, you can use a custom function (still faster!):

def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df["col2"] = [matcher(x) for x in df["col"]]

The matcher function is very extensible. It can be fitted to return a list for each capture group, as needed; just query the group or groups attribute of the match object.

For str.extractall, change p.search to p.findall.

String Extraction
Consider a simple filtering operation. The idea is to extract 4 digits if it is preceded by an upper case letter.

# Extracting strings.
p = re.compile(r"(?<=[A-Z])(\d{4})")
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r"(?<=[A-Z])(\d{4})", expand=False)   #  str.extract
pd.Series([matcher(x) for x in ser])                  #  list comprehension

[timing plot: str.extract versus the list comprehension]



Conclusion

As shown from the examples above, iteration shines when working with small rows of DataFrames, mixed datatypes, and regular expressions.

The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.

The "vectorized" functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.

Another side note: certain string operations deal with constraints that favour the use of NumPy, and careful NumPy vectorization can outperform python there.

Additionally, sometimes just operating on the underlying arrays via .values as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the Note in the Numeric Comparison section above). So, for example df[df.A.values != df.B.values] would show instant performance boosts over df[df.A != df.B]. Using .values may not be appropriate in every situation, but it is a useful hack to know.

As mentioned above, it's up to you to decide whether these solutions are worth the trouble of implementing.


Appendix: Code Snippets

import perfplot  
import operator 
import pandas as pd
import numpy as np
import re

from collections import Counter
from itertools import chain

# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A","B"]),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query("A != B"),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=["vectorized !=", "query (numexpr)", "list comp", "numba"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N"
)

# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=["value_counts", "np.unique", "Counter"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=lambda x, y: dict(x) == dict(y)
)

# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A","B"], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query("A != B"),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=["vectorized !=", "query (numexpr)", "list comp"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)

# Dictionary value extraction.
ser1 = pd.Series([{"key": "abc", "value": 123}, {"key": "xyz", "value": 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter("value")),
        lambda ser: pd.Series([x.get("value") for x in ser]),
    ],
    labels=["map", "list comprehension"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)

# List positional indexing. 
ser2 = pd.Series([["a", "b", "c"], [1, 2], []])        
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=["map", "str accessor", "list comprehension", "list comp safe"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)

# Nested list flattening.
ser3 = pd.Series([["a", "b", "c"], ["d", "e"], ["f", "g"]])
perfplot.show(
    setup=lambda n: pd.concat([ser3] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=["stack", "itertools.chain", "nested list comp"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",    
    equality_check=None
)

# Extracting strings.
ser4 = pd.Series(["foo xyz", "test A1234", "D3345 xtz"])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r"(?<=[A-Z])(\d{4})", expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=["str.extract", "list comprehension"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)

Answer #7

I'd like to shed a little bit more light on the interplay of iter, __iter__ and __getitem__ and what happens behind the curtains. Armed with that knowledge, you will be able to understand why the best you can do is

try:
    iter(maybe_iterable)
    print("iteration will probably work")
except TypeError:
    print("not iterable")

I will list the facts first and then follow up with a quick reminder of what happens when you employ a for loop in python, followed by a discussion to illustrate the facts.

Facts

  1. You can get an iterator from any object o by calling iter(o) if at least one of the following conditions holds true:

    a) o has an __iter__ method which returns an iterator object. An iterator is any object with an __iter__ and a __next__ (Python 2: next) method.

    b) o has a __getitem__ method.

  2. Checking for an instance of Iterable or Sequence, or checking for the attribute __iter__ is not enough.

  3. If an object o implements only __getitem__, but not __iter__, iter(o) will construct an iterator that tries to fetch items from o by integer index, starting at index 0. The iterator will catch any IndexError (but no other errors) that is raised and then raises StopIteration itself.

  4. In the most general sense, there's no way to check whether the iterator returned by iter is sane other than to try it out.

  5. If an object o implements __iter__, the iter function will make sure that the object returned by __iter__ is an iterator. There is no sanity check if an object only implements __getitem__.

  6. __iter__ wins. If an object o implements both __iter__ and __getitem__, iter(o) will call __iter__.

  7. If you want to make your own objects iterable, always implement the __iter__ method.

for loops

In order to follow along, you need an understanding of what happens when you employ a for loop in Python. Feel free to skip right to the next section if you already know.

When you use for item in o for some iterable object o, Python calls iter(o) and expects an iterator object as the return value. An iterator is any object which implements a __next__ (or next in Python 2) method and an __iter__ method.

By convention, the __iter__ method of an iterator should return the object itself (i.e. return self). Python then calls next on the iterator until StopIteration is raised. All of this happens implicitly, but the following demonstration makes it visible:

import random

class DemoIterable(object):
    def __iter__(self):
        print("__iter__ called")
        return DemoIterator()

class DemoIterator(object):
    def __iter__(self):
        return self

    def __next__(self):
        print("__next__ called")
        r = random.randint(1, 10)
        if r == 5:
            print("raising StopIteration")
            raise StopIteration
        return r

Iteration over a DemoIterable:

>>> di = DemoIterable()
>>> for x in di:
...     print(x)
...
__iter__ called
__next__ called
9
__next__ called
8
__next__ called
10
__next__ called
3
__next__ called
10
__next__ called
raising StopIteration

Discussion and illustrations

On point 1 and 2: getting an iterator and unreliable checks

Consider the following class:

class BasicIterable(object):
    def __getitem__(self, item):
        if item == 3:
            raise IndexError
        return item

Calling iter with an instance of BasicIterable will return an iterator without any problems because BasicIterable implements __getitem__.

>>> b = BasicIterable()
>>> iter(b)
<iterator object at 0x7f1ab216e320>

However, it is important to note that b does not have the __iter__ attribute and is not considered an instance of Iterable or Sequence:

>>> from collections.abc import Iterable, Sequence
>>> hasattr(b, "__iter__")
False
>>> isinstance(b, Iterable)
False
>>> isinstance(b, Sequence)
False

This is why Fluent Python by Luciano Ramalho recommends calling iter and handling the potential TypeError as the most accurate way to check whether an object is iterable. Quoting directly from the book:

As of Python 3.4, the most accurate way to check whether an object x is iterable is to call iter(x) and handle a TypeError exception if it isn’t. This is more accurate than using isinstance(x, abc.Iterable) , because iter(x) also considers the legacy __getitem__ method, while the Iterable ABC does not.

On point 3: Iterating over objects which only provide __getitem__, but not __iter__

Iterating over an instance of BasicIterable works as expected: Python constructs an iterator that tries to fetch items by index, starting at zero, until an IndexError is raised. The demo object's __getitem__ method simply returns the item which was supplied as the argument to __getitem__(self, item) by the iterator returned by iter.

>>> b = BasicIterable()
>>> it = iter(b)
>>> next(it)
0
>>> next(it)
1
>>> next(it)
2
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Note that the iterator raises StopIteration when it cannot return the next item and that the IndexError which is raised for item == 3 is handled internally. This is why looping over a BasicIterable with a for loop works as expected:

>>> for x in b:
...     print(x)
...
0
1
2

Here's another example in order to drive home the concept of how the iterator returned by iter tries to access items by index. WrappedDict does not inherit from dict, which means instances won't have an __iter__ method.

class WrappedDict(object): # note: no inheritance from dict!
    def __init__(self, dic):
        self._dict = dic

    def __getitem__(self, item):
        try:
            return self._dict[item] # delegate to dict.__getitem__
        except KeyError:
            raise IndexError

Note that calls to __getitem__ are delegated to dict.__getitem__ for which the square bracket notation is simply a shorthand.

>>> w = WrappedDict({-1: "not printed",
...                   0: "hi", 1: "StackOverflow", 2: "!",
...                   4: "not printed", 
...                   "x": "not printed"})
>>> for x in w:
...     print(x)
... 
hi
StackOverflow
!

On point 4 and 5: iter checks for an iterator when it calls __iter__:

When iter(o) is called for an object o, iter will make sure that the return value of __iter__, if the method is present, is an iterator. This means that the returned object must implement __next__ (or next in Python 2) and __iter__. iter cannot perform any sanity checks for objects which only provide __getitem__, because it has no way to check whether the items of the object are accessible by integer index.

class FailIterIterable(object):
    def __iter__(self):
        return object() # not an iterator

class FailGetitemIterable(object):
    def __getitem__(self, item):
        raise Exception

Note that constructing an iterator from FailIterIterable instances fails immediately, while constructing an iterator from FailGetitemIterable succeeds, but will throw an Exception on the first call to __next__.

>>> fii = FailIterIterable()
>>> iter(fii)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: iter() returned non-iterator of type "object"
>>>
>>> fgi = FailGetitemIterable()
>>> it = iter(fgi)
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/iterdemo.py", line 42, in __getitem__
    raise Exception
Exception

On point 6: __iter__ wins

This one is straightforward. If an object implements __iter__ and __getitem__, iter will call __iter__. Consider the following class

class IterWinsDemo(object):
    def __iter__(self):
        return iter(["__iter__", "wins"])

    def __getitem__(self, item):
        return ["__getitem__", "wins"][item]

and the output when looping over an instance:

>>> iwd = IterWinsDemo()
>>> for x in iwd:
...     print(x)
...
__iter__
wins

On point 7: your iterable classes should implement __iter__

You might ask yourself why most builtin sequences like list implement an __iter__ method when __getitem__ would be sufficient.

class WrappedList(object): # note: no inheritance from list!
    def __init__(self, lst):
        self._list = lst

    def __getitem__(self, item):
        return self._list[item]

After all, iteration over instances of the class above, which delegates calls to __getitem__ to list.__getitem__ (using the square bracket notation), will work fine:

>>> wl = WrappedList(["A", "B", "C"])
>>> for x in wl:
...     print(x)
... 
A
B
C

The reasons your custom iterables should implement __iter__ are as follows:

  1. If you implement __iter__, instances will be considered iterables, and isinstance(o, collections.abc.Iterable) will return True.
  2. If the object returned by __iter__ is not an iterator, iter will fail immediately and raise a TypeError.
  3. The special handling of __getitem__ exists for backwards compatibility reasons. Quoting again from Fluent Python:

That is why any Python sequence is iterable: they all implement __getitem__ . In fact, the standard sequences also implement __iter__, and yours should too, because the special handling of __getitem__ exists for backward compatibility reasons and may be gone in the future (although it is not deprecated as I write this).

Answer #8

  • Parquet format is designed for long-term storage, where Arrow is more intended for short term or ephemeral storage (Arrow may be more suitable for long-term storage after the 1.0.0 release happens, since the binary format will be stable then)

  • Parquet is more expensive to write than Feather as it features more layers of encoding and compression. Feather is unmodified raw columnar Arrow memory. We will probably add simple compression to Feather in the future.

  • Due to dictionary encoding, RLE encoding, and data page compression, Parquet files will often be much smaller than Feather files

  • Parquet is a standard storage format for analytics that's supported by many different systems: Spark, Hive, Impala, various AWS services, in the future by BigQuery, etc. So if you are doing analytics, Parquet is a good option as a reference storage format for query by multiple systems

The benchmarks you showed are going to be very noisy since the data you read and wrote is very small. You should try compressing at least 100MB or upwards of 1GB of data to get some more informative benchmarks; see e.g. http://wesmckinney.com/blog/python-parquet-multithreading/

Hope this helps

Answer #9

There are at least six ways. The preferred way depends on what your use case is.

Option 1:

Simply add an asdict() method.

Based on the problem description I would very much consider the asdict way of doing things suggested by other answers. This is because it does not appear that your object is really much of a collection:

class Wharrgarbl(object):

    ...

    def asdict(self):
        return {"a": self.a, "b": self.b, "c": self.c}

Using the other options below could be confusing for others unless it is very obvious exactly which object members would and would not be iterated or specified as key-value pairs.

Option 1a:

Inherit your class from typing.NamedTuple (or the mostly equivalent collections.namedtuple), and use the _asdict method provided for you.

from typing import NamedTuple

class Wharrgarbl(NamedTuple):
    a: str
    b: str
    c: str
    sum: int = 6
    version: str = "old"

Using a named tuple is a very convenient way to add lots of functionality to your class with a minimum of effort, including an _asdict method. However, a limitation is that, as shown above, the NT will include all the members in its _asdict.

If there are members you don't want to include in your dictionary, you'll need to modify the _asdict result:

from typing import NamedTuple

class Wharrgarbl(NamedTuple):
    a: str
    b: str
    c: str
    sum: int = 6
    version: str = "old"

    def _asdict(self):
        d = super()._asdict()
        del d["sum"]
        del d["version"]
        return d

Another limitation is that NT is read-only. This may or may not be desirable.

Option 2:

Implement __iter__.

Like this, for example:

def __iter__(self):
    yield "a", self.a
    yield "b", self.b
    yield "c", self.c

Now you can just do:

dict(my_object)

This works because the dict() constructor accepts an iterable of (key, value) pairs to construct a dictionary. Before doing this, ask yourself the question whether iterating the object as a series of key,value pairs in this manner- while convenient for creating a dict- might actually be surprising behavior in other contexts. E.g., ask yourself the question "what should the behavior of list(my_object) be...?"

Additionally, note that accessing values directly using the get-item syntax (obj["a"]) will not work, and keyword argument unpacking won't work. For those, you'd need to implement the mapping protocol.

Option 3:

Implement the mapping protocol. This allows access-by-key behavior, casting to a dict without using __iter__, and also provides unpacking behavior ({**my_obj}) and keyword unpacking behavior if all the keys are strings (dict(**my_obj)).

The mapping protocol requires that you provide (at minimum) two methods together: keys() and __getitem__.

class MyKwargUnpackable:
    def keys(self):
        return list("abc")
    def __getitem__(self, key):
        return dict(zip("abc", "one two three".split()))[key]

Now you can do things like:

>>> m=MyKwargUnpackable()
>>> m["a"]
"one"
>>> dict(m)  # cast to dict directly
{"a": "one", "b": "two", "c": "three"}
>>> dict(**m)  # unpack as kwargs
{"a": "one", "b": "two", "c": "three"}

As mentioned above, if you are using a new enough version of python you can also unpack your mapping-protocol object into a dictionary literal like so (and in this case it is not required that your keys be strings):

>>> {**m}
{"a": "one", "b": "two", "c": "three"}

Note that the mapping protocol takes precedence over the __iter__ method when casting an object to a dict directly (without using kwarg unpacking, i.e. dict(m)). So it is possible- and sometimes convenient- to cause the object to have different behavior when used as an iterable (e.g., list(m)) vs. when cast to a dict (dict(m)).

EMPHASIZED: Just because you CAN use the mapping protocol, does NOT mean that you SHOULD do so. Does it actually make sense for your object to be passed around as a set of key-value pairs, or as keyword arguments and values? Does accessing it by key- just like a dictionary- really make sense?

If the answer to these questions is yes, it's probably a good idea to consider the next option.

Option 4:

Look into using the collections.abc module.

Inheriting your class from collections.abc.Mapping or collections.abc.MutableMapping signals to other users that, for all intents and purposes, your class is a mapping* and can be expected to behave that way.

You can still cast your object to a dict just as you require, but there would probably be little reason to do so. Because of duck typing, bothering to cast your mapping object to a dict would just be an additional unnecessary step the majority of the time.

This answer might also be helpful.

As noted in the comments below: it's worth mentioning that doing this the abc way essentially turns your object class into a dict-like class (assuming you use MutableMapping and not the read-only Mapping base class). Everything you would be able to do with dict, you could do with your own class object. This may or may not be desirable.

Also consider looking at the numerical abcs in the numbers module:

https://docs.python.org/3/library/numbers.html

Since you're also casting your object to an int, it might make more sense to essentially turn your class into a full-fledged int so that casting isn't necessary.

Option 5:

Look into using the dataclasses module (Python 3.7 only), which includes a convenient asdict() utility method.

from dataclasses import dataclass, asdict, field, InitVar

@dataclass
class Wharrgarbl(object):
    a: int
    b: int
    c: int
    sum: InitVar[int]  # note: InitVar will exclude this from the dict
    version: InitVar[str] = "old"

    def __post_init__(self, sum, version):
        self.sum = 6  # this looks like an OP mistake?
        self.version = str(version)

Now you can do this:

    >>> asdict(Wharrgarbl(1,2,3,4,"X"))
    {"a": 1, "b": 2, "c": 3}

Option 6:

Use typing.TypedDict, which was added in Python 3.8.

NOTE: option 6 is likely NOT what the OP, or other readers based on the title of this question, are looking for. See additional comments below.

class Wharrgarbl(TypedDict):
    a: str
    b: str
    c: str

Using this option, the resulting object is a dict (emphasis: it is not a Wharrgarbl). There is no reason at all to "cast" it to a dict (unless you are making a copy).

And since the object is a dict, the initialization signature is identical to that of dict and as such it only accepts keyword arguments or another dictionary.

    >>> w = Wharrgarbl(a=1, b=2, c=3)
    >>> w
    {"a": 1, "b": 2, "c": 3}
    >>> type(w)
    <class "dict">

Emphasized: the above "class" Wharrgarbl isn't actually a new class at all. It is simply syntactic sugar for creating typed dict objects with fields of different types for the type checker.

As such this option can be pretty convenient for signaling to readers of your code (and also to a type checker such as mypy) that such a dict object is expected to have specific keys with specific value types.

But this means you cannot, for example, add other methods, although you can try:

class MyDict(TypedDict):
    def my_fancy_method(self):
        return "world changing result"

...but it won't work:

>>> MyDict().my_fancy_method()
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
AttributeError: "dict" object has no attribute "my_fancy_method"

* "Mapping" has become the standard "name" of the dict-like duck type

Answer #10

Or if you are already using pandas, You can do it with json_normalize() like so:

import pandas as pd

d = {"a": 1,
     "c": {"a": 2, "b": {"x": 5, "y" : 10}},
     "d": [1, 2, 3]}

df = pd.json_normalize(d, sep="_")

print(df.to_dict(orient="records")[0])

Output:

{"a": 1, "c_a": 2, "c_b_x": 5, "c_b_y": 10, "d": [1, 2, 3]}
