python pandas dataframe columns convert to dict key and value


I have a pandas data frame with multiple columns and I would like to construct a dict from two columns: one as the dict's keys and the other as the dict's values. How can I do that?

Dataframe:

           area  count
co tp
DE Lake      10      7
Forest       20      5
FR Lake      30      2
Forest       40      3

I need area as the key and count as the value in the dict. Thank you in advance.

Answer rating: 288

If lakes is your DataFrame, you can do something like

area_dict = dict(zip(lakes.area, lakes["count"]))   # bracket access, since .count is also a DataFrame method
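
For reference, pandas can also build the dict directly; a minimal sketch (assuming the frame is named lakes as above):

# Make "area" the index, take "count" as a Series, and convert it to a dict
area_dict = lakes.set_index("area")["count"].to_dict()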




python pandas dataframe columns convert to dict key and value: StackOverflow Questions

How do I merge two dictionaries in a single expression (taking union of dictionaries)?

Question by Carl Meyer

I have two Python dictionaries, and I want to write a single expression that returns these two dictionaries, merged (i.e. taking the union). The update() method would be what I need, if it returned its result instead of modifying a dictionary in-place.

>>> x = {"a": 1, "b": 2}
>>> y = {"b": 10, "c": 11}
>>> z = x.update(y)
>>> print(z)
None
>>> x
{"a": 1, "b": 10, "c": 11}

How can I get that final merged dictionary in z, not x?

(To be extra-clear, the last-one-wins conflict-handling of dict.update() is what I'm looking for as well.)
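
For reference, a minimal sketch of the usual last-one-wins merges, depending on your Python version:

x = {"a": 1, "b": 2}
y = {"b": 10, "c": 11}

z = {**x, **y}   # Python 3.5+: unpack both dicts into a new one
# z = x | y      # Python 3.9+: dict union operator
print(z)         # {'a': 1, 'b': 10, 'c': 11}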

Iterating over dictionaries using "for" loops

I am a bit puzzled by the following code:

d = {"x": 1, "y": 2, "z": 3} 
for key in d:
    print (key, "corresponds to", d[key])

What I don"t understand is the key portion. How does Python recognize that it needs only to read the key from the dictionary? Is key a special word in Python? Or is it simply a variable?

How do I sort a dictionary by value?

Question by FKCoder

I have a dictionary of values read from two fields in a database: a string field and a numeric field. The string field is unique, so that is the key of the dictionary.

I can sort on the keys, but how can I sort based on the values?

Note: I have read Stack Overflow question here How do I sort a list of dictionaries by a value of the dictionary? and probably could change my code to have a list of dictionaries, but since I do not really need a list of dictionaries I wanted to know if there is a simpler solution to sort either in ascending or descending order.
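
For reference, a minimal sketch of sorting by value with sorted() and a key function (in Python 3.7+ the rebuilt dict keeps that order):

d = {"apple": 3, "pear": 1, "plum": 2}

by_value = sorted(d.items(), key=lambda kv: kv[1])                  # ascending
# by_value = sorted(d.items(), key=lambda kv: kv[1], reverse=True)  # descending
sorted_dict = dict(by_value)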

How can I add new keys to a dictionary?

Is it possible to add a key to a Python dictionary after it has been created?

It doesn"t seem to have an .add() method.

Check if a given key already exists in a dictionary

I wanted to test if a key exists in a dictionary before updating the value for the key. I wrote the following code:

if "key1" in dict.keys():
  print "blah"
else:
  print "boo"

I think this is not the best way to accomplish this task. Is there a better way to test for a key in the dictionary?
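
For reference, the idiomatic membership test goes straight against the dictionary (no .keys() call, and no shadowing of the built-in name dict); a small sketch:

my_dict = {"key1": 1}

if "key1" in my_dict:
    print("blah")
else:
    print("boo")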

How do I sort a list of dictionaries by a value of the dictionary?

I have a list of dictionaries and want each item to be sorted by a specific value.

Take into consideration the list:

[{"name":"Homer", "age":39}, {"name":"Bart", "age":10}]

When sorted by name, it should become:

[{"name":"Bart", "age":10}, {"name":"Homer", "age":39}]

How can I remove a key from a Python dictionary?

When deleting a key from a dictionary, I use:

if "key" in my_dict:
    del my_dict["key"]

Is there a one line way of doing this?
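
For reference, dict.pop() with a default does this in one line and does not raise if the key is absent; a small sketch:

my_dict = {"key": 1}
my_dict.pop("key", None)   # removes "key" if present, returns None otherwise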

Delete an element from a dictionary

Is there a way to delete an item from a dictionary in Python?

Additionally, how can I delete an item from a dictionary to return a copy (i.e., not modifying the original)?

How do I convert two lists into a dictionary?

Question by Guido García

Imagine that you have the following list.

keys = ["name", "age", "food"]
values = ["Monty", 42, "spam"]

What is the simplest way to produce the following dictionary?

a_dict = {"name": "Monty", "age": 42, "food": "spam"}

Create a dictionary with list comprehension

I like the Python list comprehension syntax.

Can it be used to create dictionaries too? For example, by iterating over pairs of keys and values:

mydict = {(k,v) for (k,v) in blah blah blah}  # doesn't work
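
For reference, the curly-brace comprehension needs an explicit key: value pair (otherwise it builds a set); a small sketch:

pairs = [("a", 1), ("b", 2)]
mydict = {k: v for k, v in pairs}   # {'a': 1, 'b': 2}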

Answer #1

In Python, what is the purpose of __slots__, and what are the cases in which one should avoid it?

TLDR:

The special attribute __slots__ allows you to explicitly state which instance attributes you expect your object instances to have, with the expected results:

  1. faster attribute access.
  2. space savings in memory.

The space savings is from

  1. Storing value references in slots instead of __dict__.
  2. Denying __dict__ and __weakref__ creation if parent classes deny them and you declare __slots__.

Quick Caveats

Small caveat, you should only declare a particular slot one time in an inheritance tree. For example:

class Base:
    __slots__ = "foo", "bar"

class Right(Base):
    __slots__ = "baz", 

class Wrong(Base):
    __slots__ = "foo", "bar", "baz"        # redundant foo and bar

Python doesn"t object when you get this wrong (it probably should), problems might not otherwise manifest, but your objects will take up more space than they otherwise should. Python 3.8:

>>> from sys import getsizeof
>>> getsizeof(Right()), getsizeof(Wrong())
(56, 72)

This is because the Base"s slot descriptor has a slot separate from the Wrong"s. This shouldn"t usually come up, but it could:

>>> w = Wrong()
>>> w.foo = "foo"
>>> Base.foo.__get__(w)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: foo
>>> Wrong.foo.__get__(w)
"foo"

The biggest caveat is for multiple inheritance - multiple "parent classes with nonempty slots" cannot be combined.

To accommodate this restriction, follow best practices: factor out all but one (or all) of the parents' abstractions; their concrete classes respectively, and your new concrete class collectively, will inherit from those abstractions, which get empty slots (just like abstract base classes in the standard library).

See section on multiple inheritance below for an example.

Requirements:

  • For attributes named in __slots__ to actually be stored in slots instead of a __dict__, a class must inherit from object (automatic in Python 3, but this must be explicit in Python 2).

  • To prevent the creation of a __dict__, you must inherit from object, all classes in the inheritance tree must declare __slots__, and none of them can have a "__dict__" entry.

There are a lot of details if you wish to keep reading.

Why use __slots__: Faster attribute access.

The creator of Python, Guido van Rossum, states that he actually created __slots__ for faster attribute access.

It is trivial to demonstrate measurably faster attribute access:

import timeit

class Foo(object): __slots__ = "foo",

class Bar(object): pass

slotted = Foo()
not_slotted = Bar()

def get_set_delete_fn(obj):
    def get_set_delete():
        obj.foo = "foo"
        obj.foo
        del obj.foo
    return get_set_delete

and

>>> min(timeit.repeat(get_set_delete_fn(slotted)))
0.2846834529991611
>>> min(timeit.repeat(get_set_delete_fn(not_slotted)))
0.3664822799983085

The slotted access is almost 30% faster in Python 3.5 on Ubuntu.

>>> 0.3664822799983085 / 0.2846834529991611
1.2873325658284342

In Python 2 on Windows I have measured it about 15% faster.

Why use __slots__: Memory Savings

Another purpose of __slots__ is to reduce the space in memory that each object instance takes up.

My own contribution to the documentation clearly states the reasons behind this:

The space saved over using __dict__ can be significant.

SQLAlchemy attributes a lot of memory savings to __slots__.

To verify this, using the Anaconda distribution of Python 2.7 on Ubuntu Linux, with guppy.hpy (aka heapy) and sys.getsizeof, the size of a class instance without __slots__ declared, and nothing else, is 64 bytes. That does not include the __dict__. Thank you Python for lazy evaluation again, the __dict__ is apparently not called into existence until it is referenced, but classes without data are usually useless. When called into existence, the __dict__ attribute is a minimum of 280 bytes additionally.

In contrast, a class instance with __slots__ declared to be () (no data) is only 16 bytes, and 56 total bytes with one item in slots, 64 with two.

For 64 bit Python, I illustrate the memory consumption in bytes in Python 2.7 and 3.6, for __slots__ and __dict__ (no slots defined) for each point where the dict grows in 3.6 (except for 0, 1, and 2 attributes):

       Python 2.7             Python 3.6
attrs  __slots__  __dict__*   __slots__  __dict__* | *(no slots defined)
none   16         56 + 272†   16         56 + 112† | †if __dict__ referenced
one    48         56 + 272    48         56 + 112
two    56         56 + 272    56         56 + 112
six    88         56 + 1040   88         56 + 152
11     128        56 + 1040   128        56 + 240
22     216        56 + 3344   216        56 + 408     
43     384        56 + 3344   384        56 + 752

So, in spite of smaller dicts in Python 3, we see how nicely __slots__ scale for instances to save us memory, and that is a major reason you would want to use __slots__.

Just for completeness of my notes, note that there is a one-time cost per slot in the class's namespace of 64 bytes in Python 2, and 72 bytes in Python 3, because slots use data descriptors like properties, called "members".

>>> Foo.foo
<member "foo" of "Foo" objects>
>>> type(Foo.foo)
<class "member_descriptor">
>>> getsizeof(Foo.foo)
72

Demonstration of __slots__:

To deny the creation of a __dict__, you must subclass object. Everything subclasses object in Python 3, but in Python 2 you had to be explicit:

class Base(object): 
    __slots__ = ()

now:

>>> b = Base()
>>> b.a = "a"
Traceback (most recent call last):
  File "<pyshell#38>", line 1, in <module>
    b.a = "a"
AttributeError: "Base" object has no attribute "a"

Or subclass another class that defines __slots__

class Child(Base):
    __slots__ = ("a",)

and now:

c = Child()
c.a = "a"

but:

>>> c.b = "b"
Traceback (most recent call last):
  File "<pyshell#42>", line 1, in <module>
    c.b = "b"
AttributeError: "Child" object has no attribute "b"

To allow __dict__ creation while subclassing slotted objects, just add "__dict__" to the __slots__ (note that slots are ordered, and you shouldn't repeat slots that are already in parent classes):

class SlottedWithDict(Child): 
    __slots__ = ("__dict__", "b")

swd = SlottedWithDict()
swd.a = "a"
swd.b = "b"
swd.c = "c"

and

>>> swd.__dict__
{"c": "c"}

Or you don"t even need to declare __slots__ in your subclass, and you will still use slots from the parents, but not restrict the creation of a __dict__:

class NoSlots(Child): pass
ns = NoSlots()
ns.a = "a"
ns.b = "b"

And:

>>> ns.__dict__
{"b": "b"}

However, __slots__ may cause problems for multiple inheritance:

class BaseA(object): 
    __slots__ = ("a",)

class BaseB(object): 
    __slots__ = ("b",)

Because creating a child class from parents with both non-empty slots fails:

>>> class Child(BaseA, BaseB): __slots__ = ()
Traceback (most recent call last):
  File "<pyshell#68>", line 1, in <module>
    class Child(BaseA, BaseB): __slots__ = ()
TypeError: Error when calling the metaclass bases
    multiple bases have instance lay-out conflict

If you run into this problem, you could just remove __slots__ from the parents, or if you have control of the parents, give them empty slots, or refactor to abstractions:

from abc import ABC

class AbstractA(ABC):
    __slots__ = ()

class BaseA(AbstractA): 
    __slots__ = ("a",)

class AbstractB(ABC):
    __slots__ = ()

class BaseB(AbstractB): 
    __slots__ = ("b",)

class Child(AbstractA, AbstractB): 
    __slots__ = ("a", "b")

c = Child() # no problem!

Add "__dict__" to __slots__ to get dynamic assignment:

class Foo(object):
    __slots__ = "bar", "baz", "__dict__"

and now:

>>> foo = Foo()
>>> foo.boink = "boink"

So with "__dict__" in slots we lose some of the size benefits with the upside of having dynamic assignment and still having slots for the names we do expect.

When you inherit from an object that isn't slotted, you get the same sort of semantics when you use __slots__ - names that are in __slots__ point to slotted values, while any other values are put in the instance's __dict__.

Avoiding __slots__ because you want to be able to add attributes on the fly is actually not a good reason - just add "__dict__" to your __slots__ if this is required.

You can similarly add __weakref__ to __slots__ explicitly if you need that feature.

Set to empty tuple when subclassing a namedtuple:

The namedtuple builtin makes immutable instances that are very lightweight (essentially, the size of tuples), but to get the benefits, you need to do it yourself if you subclass them:

from collections import namedtuple
class MyNT(namedtuple("MyNT", "bar baz")):
    """MyNT is an immutable and lightweight object"""
    __slots__ = ()

usage:

>>> nt = MyNT("bar", "baz")
>>> nt.bar
"bar"
>>> nt.baz
"baz"

And trying to assign an unexpected attribute raises an AttributeError because we have prevented the creation of __dict__:

>>> nt.quux = "quux"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: "MyNT" object has no attribute "quux"

You can allow __dict__ creation by leaving off __slots__ = (), but you can't use non-empty __slots__ with subtypes of tuple.

Biggest Caveat: Multiple inheritance

Even when non-empty slots are the same for multiple parents, they cannot be used together:

class Foo(object): 
    __slots__ = "foo", "bar"
class Bar(object):
    __slots__ = "foo", "bar" # alas, would work if empty, i.e. ()

>>> class Baz(Foo, Bar): pass
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Error when calling the metaclass bases
    multiple bases have instance lay-out conflict

Using an empty __slots__ in the parent seems to provide the most flexibility, allowing the child to choose to prevent or allow (by adding "__dict__" to get dynamic assignment, see section above) the creation of a __dict__:

class Foo(object): __slots__ = ()
class Bar(object): __slots__ = ()
class Baz(Foo, Bar): __slots__ = ("foo", "bar")
b = Baz()
b.foo, b.bar = "foo", "bar"

You don"t have to have slots - so if you add them, and remove them later, it shouldn"t cause any problems.

Going out on a limb here: If you're composing mixins or using abstract base classes, which aren't intended to be instantiated, an empty __slots__ in those parents seems to be the best way to go in terms of flexibility for subclassers.

To demonstrate, first, let's create a class with code we'd like to use under multiple inheritance:

class AbstractBase:
    __slots__ = ()
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def __repr__(self):
        return f"{type(self).__name__}({repr(self.a)}, {repr(self.b)})"

We could use the above directly by inheriting and declaring the expected slots:

class Foo(AbstractBase):
    __slots__ = "a", "b"

But we don"t care about that, that"s trivial single inheritance, we need another class we might also inherit from, maybe with a noisy attribute:

class AbstractBaseC:
    __slots__ = ()
    @property
    def c(self):
        print("getting c!")
        return self._c
    @c.setter
    def c(self, arg):
        print("setting c!")
        self._c = arg

Now if both bases had nonempty slots, we couldn't do the below. (In fact, if we wanted, we could have given AbstractBase nonempty slots a and b, and left them out of the below declaration - leaving them in would be wrong):

class Concretion(AbstractBase, AbstractBaseC):
    __slots__ = "a b _c".split()

And now we have functionality from both via multiple inheritance, and can still deny __dict__ and __weakref__ instantiation:

>>> c = Concretion("a", "b")
>>> c.c = c
setting c!
>>> c.c
getting c!
Concretion("a", "b")
>>> c.d = "d"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: "Concretion" object has no attribute "d"

Other cases to avoid slots:

  • Avoid them when you want to perform __class__ assignment with another class that doesn"t have them (and you can"t add them) unless the slot layouts are identical. (I am very interested in learning who is doing this and why.)
  • Avoid them if you want to subclass variable length builtins like long, tuple, or str, and you want to add attributes to them.
  • Avoid them if you insist on providing default values via class attributes for instance variables.

You may be able to tease out further caveats from the rest of the __slots__ documentation (the 3.7 dev docs are the most current), which I have made significant recent contributions to.

Critiques of other answers

The current top answers cite outdated information and are quite hand-wavy and miss the mark in some important ways.

Do not "only use __slots__ when instantiating lots of objects"

I quote:

"You would want to use __slots__ if you are going to instantiate a lot (hundreds, thousands) of objects of the same class."

Abstract Base Classes, for example, from the collections module, are not instantiated, yet __slots__ are declared for them.

Why?

If a user wishes to deny __dict__ or __weakref__ creation, those things must not be available in the parent classes.

__slots__ contributes to reusability when creating interfaces or mixins.

It is true that many Python users aren't writing for reusability, but when you are, having the option to deny unnecessary space usage is valuable.

__slots__ doesn"t break pickling

When pickling a slotted object, you may find it complains with a misleading TypeError:

>>> pickle.loads(pickle.dumps(f))
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

This is actually incorrect. This message comes from the oldest protocol, which is the default. You can select the latest protocol with the -1 argument. In Python 2.7 this would be 2 (which was introduced in 2.3), and in 3.6 it is 4.

>>> pickle.loads(pickle.dumps(f, -1))
<__main__.Foo object at 0x1129C770>

in Python 2.7:

>>> pickle.loads(pickle.dumps(f, 2))
<__main__.Foo object at 0x1129C770>

in Python 3.6

>>> pickle.loads(pickle.dumps(f, 4))
<__main__.Foo object at 0x1129C770>

So I would keep this in mind, as it is a solved problem.

Critique of the (until Oct 2, 2016) accepted answer

The first paragraph is half short explanation, half predictive. Here's the only part that actually answers the question:

The proper use of __slots__ is to save space in objects. Instead of having a dynamic dict that allows adding attributes to objects at any time, there is a static structure which does not allow additions after creation. This saves the overhead of one dict for every object that uses slots.

The second half is wishful thinking, and off the mark:

While this is sometimes a useful optimization, it would be completely unnecessary if the Python interpreter was dynamic enough so that it would only require the dict when there actually were additions to the object.

Python actually does something similar to this, only creating the __dict__ when it is accessed, but creating lots of objects with no data is fairly ridiculous.

The second paragraph oversimplifies and misses actual reasons to avoid __slots__. The below is not a real reason to avoid slots (for actual reasons, see the rest of my answer above.):

They change the behavior of the objects that have slots in a way that can be abused by control freaks and static typing weenies.

It then goes on to discuss other ways of accomplishing that perverse goal with Python, not discussing anything to do with __slots__.

The third paragraph is more wishful thinking. Together it is mostly off-the-mark content that the answerer didn't even author, and it contributes ammunition for critics of the site.

Memory usage evidence

Create some normal objects and slotted objects:

>>> class Foo(object): pass
>>> class Bar(object): __slots__ = ()

Instantiate a million of them:

>>> foos = [Foo() for f in xrange(1000000)]
>>> bars = [Bar() for b in xrange(1000000)]

Inspect with guppy.hpy().heap():

>>> guppy.hpy().heap()
Partition of a set of 2028259 objects. Total size = 99763360 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 1000000  49 64000000  64  64000000  64 __main__.Foo
     1     169   0 16281480  16  80281480  80 list
     2 1000000  49 16000000  16  96281480  97 __main__.Bar
     3   12284   1   987472   1  97268952  97 str
...

Access the regular objects and their __dict__ and inspect again:

>>> for f in foos:
...     f.__dict__
>>> guppy.hpy().heap()
Partition of a set of 3028258 objects. Total size = 379763480 bytes.
 Index  Count   %      Size    % Cumulative  % Kind (class / dict of class)
     0 1000000  33 280000000  74 280000000  74 dict of __main__.Foo
     1 1000000  33  64000000  17 344000000  91 __main__.Foo
     2     169   0  16281480   4 360281480  95 list
     3 1000000  33  16000000   4 376281480  99 __main__.Bar
     4   12284   0    987472   0 377268952  99 str
...

This is consistent with the history of Python, from Unifying types and classes in Python 2.2

If you subclass a built-in type, extra space is automatically added to the instances to accommodate __dict__ and __weakrefs__. (The __dict__ is not initialized until you use it though, so you shouldn't worry about the space occupied by an empty dictionary for each instance you create.) If you don't need this extra space, you can add the phrase "__slots__ = []" to your class.

Answer #2

os.listdir() - list in the current directory

With listdir in os module you get the files and the folders in the current dir

 import os
 arr = os.listdir()
 print(arr)
 
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

Looking in a directory

arr = os.listdir("c:\files")

glob from glob

with glob you can specify a type of file to list like this

import glob

txtfiles = []
for file in glob.glob("*.txt"):
    txtfiles.append(file)

glob in a list comprehension

mylist = [f for f in glob.glob("*.txt")]

get the full path of only files in the current directory

import os
from os import listdir
from os.path import isfile, join

cwd = os.getcwd()
onlyfiles = [os.path.join(cwd, f) for f in os.listdir(cwd) if 
os.path.isfile(os.path.join(cwd, f))]
print(onlyfiles) 

["G:\getfilesname\getfilesname.py", "G:\getfilesname\example.txt"]

Getting the full path name with os.path.abspath

You get the full path in return

 import os
 files_path = [os.path.abspath(x) for x in os.listdir()]
 print(files_path)
 
 ["F:\documentiapplications.txt", "F:\documenticollections.txt"]

Walk: going through sub directories

os.walk returns the root, the directories list and the files list; that is why I unpacked them into r, d, f in the for loop. It then looks for other files and directories in the subfolders of the root, and so on until there are no subfolders.

import os

# Getting the current work directory (cwd)
thisdir = os.getcwd()

# r=root, d=directories, f = files
for r, d, f in os.walk(thisdir):
    for file in f:
        if file.endswith(".docx"):
            print(os.path.join(r, file))

os.listdir(): get files in the current directory (Python 2)

In Python 2, if you want the list of the files in the current directory, you have to give the argument as "." or os.getcwd() in the os.listdir method.

 import os
 arr = os.listdir(".")
 print(arr)
 
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

To go up in the directory tree

# Method 1
x = os.listdir("..")

# Method 2
x= os.listdir("/")

Get files: os.listdir() in a particular directory (Python 2 and 3)

 import os
 arr = os.listdir("F:\python")
 print(arr)
 
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

Get files of a particular subdirectory with os.listdir()

import os

x = os.listdir("./content")

os.walk(".") - current directory

 import os
 arr = next(os.walk("."))[2]
 print(arr)
 
 >>> ["5bs_Turismo1.pdf", "5bs_Turismo1.pptx", "esperienza.txt"]

next(os.walk(".")) and os.path.join("dir", "file")

 import os
 arr = []
 r, d, f = next(os.walk("F:\\_python"))   # (root, dirs, files) of the top folder only
 for file in f:
     arr.append(os.path.join(r, file))

 for f in arr:
     print(f)

>>> F:\_python\dict_class.py
>>> F:\_python\programmi.txt

next(os.walk("F:\") - get the full path - list comprehension

 [os.path.join(r,file) for r,d,f in next(os.walk("F:\_python")) for file in f]
 
 >>> ["F:\_python\dict_class.py", "F:\_python\programmi.txt"]

os.walk - get full path - all files in sub dirs

x = [os.path.join(r,file) for r,d,f in os.walk("F:\\_python") for file in f]
print(x)

>>> ["F:\_python\dict.py", "F:\_python\progr.txt", "F:\_python\readl.py"]

os.listdir() - get only txt files

 arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
 print(arr_txt)
 
 >>> ["work.txt", "3ebooks.txt"]

Using glob to get the full path of the files

If I should need the absolute path of the files:

from path import path
from glob import glob
x = [path(f).abspath() for f in glob("F:\\*.txt")]
for f in x:
    print(f)

>>> F:acquistionline.txt
>>> F:acquisti_2018.txt
>>> F:ootstrap_jquery_ecc.txt

Using os.path.isfile to avoid directories in the list

import os.path
listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]
print(listOfFiles)

>>> ["a simple game.py", "data.txt", "decorator.py"]

Using pathlib from Python 3.4

import pathlib

flist = []
for p in pathlib.Path(".").iterdir():
    if p.is_file():
        print(p)
        flist.append(p)

 >>> error.PNG
 >>> exemaker.bat
 >>> guiprova.mp3
 >>> setup.py
 >>> speak_gui2.py
 >>> thumb.PNG

With list comprehension:

flist = [p for p in pathlib.Path(".").iterdir() if p.is_file()]

Alternatively, use pathlib.Path() instead of pathlib.Path(".")

Use glob method in pathlib.Path()

import pathlib

py = pathlib.Path().glob("*.py")
for file in py:
    print(file)

>>> stack_overflow_list.py
>>> stack_overflow_list_tkinter.py

Get all and only files with os.walk

import os
x = [i[2] for i in os.walk(".")]
y=[]
for t in x:
    for f in t:
        y.append(f)
print(y)

>>> ["append_to_list.py", "data.txt", "data1.txt", "data2.txt", "data_180617", "os_walk.py", "READ2.py", "read_data.py", "somma_defaltdic.py", "substitute_words.py", "sum_data.py", "data.txt", "data1.txt", "data_180617"]

Get only files with next and walk in a directory

 import os
 x = next(os.walk("F://python"))[2]
 print(x)
 
 >>> ["calculator.bat","calculator.py"]

Get only directories with next and walk in a directory

 import os
 next(os.walk("F://python"))[1] # for the current dir use (".")
 
 >>> ["python3","others"]

Get all the subdir names with walk

for r,d,f in os.walk("F:\_python"):
    for dirs in d:
        print(dirs)

>>> .vscode
>>> pyexcel
>>> pyschool.py
>>> subtitles
>>> _metaprogramming
>>> .ipynb_checkpoints

os.scandir() from Python 3.5 and greater

import os
x = [f.name for f in os.scandir() if f.is_file()]
print(x)

>>> ["calculator.bat","calculator.py"]

# Another example with scandir (a little variation from docs.python.org)
# This one is more efficient than os.listdir.
# In this case, it shows the files only in the current directory
# where the script is executed.

import os
with os.scandir() as i:
    for entry in i:
        if entry.is_file():
            print(entry.name)

>>> ebookmaker.py
>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speakgui4.py
>>> speak_gui2.py
>>> speak_gui3.py
>>> thumb.PNG

Examples:

Ex. 1: How many files are there in the subdirectories?

In this example, we look for the number of files that are included in a directory and all of its subdirectories.

import os

def count(dir, counter=0):
    "returns number of files in dir and subdirs"
    for pack in os.walk(dir):
        for f in pack[2]:
            counter += 1
    return dir + " : " + str(counter) + " files"

print(count("F:\\python"))

>>> F:\python : 12057 files

Ex.2: How to copy all files from a directory to another?

A script to bring order to your computer by finding all files of a given type (default: pptx) and copying them into a new folder.

import os
import shutil
from path import path

destination = "F:\file_copied"
# os.makedirs(destination)

def copyfile(dir, filetype="pptx", counter=0):
    "Searches for pptx (or other - pptx is the default) files and copies them"
    for pack in os.walk(dir):
        for f in pack[2]:
            if f.endswith(filetype):
                fullpath = pack[0] + "\\" + f
                print(fullpath)
                shutil.copy(fullpath, destination)
                counter += 1
    if counter > 0:
        print("-" * 30)
        print("\t==> Found in: `" + dir + "` : " + str(counter) + " files\n")

for dir in os.listdir():
    # searches for folders that start with "_"
    if dir[0] == "_":
        # copyfile(dir, filetype="pdf")
        copyfile(dir, filetype="txt")


>>> _compiti18Compito Contabilità 1conti.txt
>>> _compiti18Compito Contabilità 1modula4.txt
>>> _compiti18Compito Contabilità 1moduloa4.txt
>>> ------------------------
>>> ==> Found in: `_compiti18` : 3 files

Ex. 3: How to get all the files in a txt file

In case you want to create a txt file with all the file names:

import os
mylist = ""
with open("filelist.txt", "w", encoding="utf-8") as file:
    for eachfile in os.listdir():
        mylist += eachfile + "
"
    file.write(mylist)

Example: txt with all the files of a hard drive

"""
We are going to save a txt file with all the files in your directory.
We will use the function walk()
"""

import os

# see all the methods of os
# print(*dir(os), sep=", ")
listafile = []
percorso = []
with open("lista_file.txt", "w", encoding="utf-8") as testo:
    for root, dirs, files in os.walk("D:\"):
        for file in files:
            listafile.append(file)
            percorso.append(root + "\" + file)
            testo.write(file + "
")
listafile.sort()
print("N. of files", len(listafile))
with open("lista_file_ordinata.txt", "w", encoding="utf-8") as testo_ordinato:
    for file in listafile:
        testo_ordinato.write(file + "
")

with open("percorso.txt", "w", encoding="utf-8") as file_percorso:
    for file in percorso:
        file_percorso.write(file + "
")

os.system("lista_file.txt")
os.system("lista_file_ordinata.txt")
os.system("percorso.txt")

All the files of C: in one text file

This is a shorter version of the previous code. Change the starting folder if you need to begin from another position. On my computer this code generates a roughly 50 MB text file with a little under 500,000 lines, each containing a file with its complete path.

import os

with open("file.txt", "w", encoding="utf-8") as filewrite:
    for r, d, f in os.walk("C:\"):
        for file in f:
            filewrite.write(f"{r + file}
")

How to write a file with all paths in a folder of a type

With this function you can create a txt file that will have the name of the type of file that you look for (e.g. pngfile.txt) with the full path of all the files of that type. It can be useful sometimes, I think.

import os

def searchfiles(extension=".ttf", folder="H:\"):
    "Create a txt file with all the file of a type"
    with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite:
        for r, d, f in os.walk(folder):
            for file in f:
                if file.endswith(extension):
                    filewrite.write(f"{r + file}
")

# looking for png file (fonts) in the hard disk H:
searchfiles(".png", "H:\")

>>> H:4bs_18Dolphins5.png
>>> H:4bs_18Dolphins6.png
>>> H:4bs_18Dolphins7.png
>>> H:5_18marketing htmlassetsimageslogo2.png
>>> H:7z001.png
>>> H:7z002.png

(New) Find all files and open them with tkinter GUI

I just wanted to add in 2019 a little app to search for all files in a dir and be able to open them by double-clicking on the name of the file in the list.

import tkinter as tk
import os

def searchfiles(extension=".txt", folder="H:\"):
    "insert all files in the listbox"
    for r, d, f in os.walk(folder):
        for file in f:
            if file.endswith(extension):
                lb.insert(0, r + "\" + file)

def open_file():
    os.startfile(lb.get(lb.curselection()[0]))

root = tk.Tk()
root.geometry("400x400")
bt = tk.Button(root, text="Search", command=lambda: searchfiles(".png", "H:\\"))
bt.pack()
lb = tk.Listbox(root)
lb.pack(fill="both", expand=1)
lb.bind("<Double-Button>", lambda x: open_file())
root.mainloop()

Answer #3

TL;DR: If you are using Python 3.10 or later, it just works. As of today (2019), in 3.7+ you must turn this feature on using a future statement (from __future__ import annotations). In Python 3.6 or below, use a string.

I guess you got this exception:

NameError: name "Position" is not defined

This is because Position must be defined before you can use it in an annotation unless you are using Python 3.10 or later.

Python 3.7+: from __future__ import annotations

Python 3.7 introduces PEP 563: postponed evaluation of annotations. A module that uses the future statement from __future__ import annotations will store annotations as strings automatically:

from __future__ import annotations

class Position:
    def __add__(self, other: Position) -> Position:
        ...

This is scheduled to become the default in Python 3.10. Since Python is still a dynamically typed language, no type checking is done at runtime, so typing annotations should have no performance impact, right? Wrong! Before Python 3.7 the typing module used to be one of the slowest Python modules in core, so if you import typing you may see up to a 7x performance improvement when you upgrade to 3.7.

Python <3.7: use a string

According to PEP 484, you should use a string instead of the class itself:

class Position:
    ...
    def __add__(self, other: "Position") -> "Position":
       ...

If you use the Django framework, this may be familiar, as Django models also use strings for forward references (foreign key definitions where the foreign model is self or is not declared yet). This should work with PyCharm and other tools.

Sources

The relevant parts of PEP 484 and PEP 563, to spare you the trip:

Forward references

When a type hint contains names that have not been defined yet, that definition may be expressed as a string literal, to be resolved later.

A situation where this occurs commonly is the definition of a container class, where the class being defined occurs in the signature of some of the methods. For example, the following code (the start of a simple binary tree implementation) does not work:

class Tree:
    def __init__(self, left: Tree, right: Tree):
        self.left = left
        self.right = right

To address this, we write:

class Tree:
    def __init__(self, left: "Tree", right: "Tree"):
        self.left = left
        self.right = right

The string literal should contain a valid Python expression (i.e., compile(lit, "", "eval") should be a valid code object) and it should evaluate without errors once the module has been fully loaded. The local and global namespace in which it is evaluated should be the same namespaces in which default arguments to the same function would be evaluated.

and PEP 563:

Implementation

In Python 3.10, function and variable annotations will no longer be evaluated at definition time. Instead, a string form will be preserved in the respective __annotations__ dictionary. Static type checkers will see no difference in behavior, whereas tools using annotations at runtime will have to perform postponed evaluation.

...

Enabling the future behavior in Python 3.7

The functionality described above can be enabled starting from Python 3.7 using the following special import:

from __future__ import annotations

Things that you may be tempted to do instead

A. Define a dummy Position

Before the class definition, place a dummy definition:

class Position(object):
    pass


class Position(object):
    ...

This will get rid of the NameError and may even look OK:

>>> Position.__add__.__annotations__
{"other": __main__.Position, "return": __main__.Position}

But is it?

>>> for k, v in Position.__add__.__annotations__.items():
...     print(k, "is Position:", v is Position)                                                                                                                                                                                                                  
return is Position: False
other is Position: False

B. Monkey-patch in order to add the annotations:

You may want to try some Python meta programming magic and write a decorator to monkey-patch the class definition in order to add annotations:

class Position:
    ...
    def __add__(self, other):
        return self.__class__(self.x + other.x, self.y + other.y)

The decorator should be responsible for the equivalent of this:

Position.__add__.__annotations__["return"] = Position
Position.__add__.__annotations__["other"] = Position

At least it seems right:

>>> for k, v in Position.__add__.__annotations__.items():
...     print(k, "is Position:", v is Position)                                                                                                                                                                                                                  
return is Position: True
other is Position: True

Probably too much trouble.

Answer #4

As you are on Python 3, use dict.items() instead of dict.iteritems()

iteritems() was removed in Python 3, so you can't use this method anymore.

Take a look at Python 3.0 Wiki Built-in Changes section, where it is stated:

Removed dict.iteritems(), dict.iterkeys(), and dict.itervalues().

Instead: use dict.items(), dict.keys(), and dict.values() respectively.
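
For reference, a tiny sketch of the Python 3 replacement:

d = {"a": 1, "b": 2}
for key, value in d.items():   # items() replaces iteritems() in Python 3
    print(key, value)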

Answer #5

My quick & dirty JSON dump that eats dates and everything:

json.dumps(my_dictionary, indent=4, sort_keys=True, default=str)

default is a function applied to objects that aren't serializable.
In this case it's str, so it just converts everything it doesn't know to strings. Which is great for serialization but not so great when deserializing (hence the "quick & dirty") as anything might have been string-ified without warning, e.g. a function or numpy array.

Answer #6

a = numpy.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])
unique, counts = numpy.unique(a, return_counts=True)
dict(zip(unique, counts))

# {0: 7, 1: 4, 2: 1, 3: 2, 4: 1}

Non-numpy way:

Use collections.Counter;

import collections, numpy
a = numpy.array([0, 3, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 1, 3, 4])
collections.Counter(a)

# Counter({0: 7, 1: 4, 3: 2, 2: 1, 4: 1})

Answer #7

Because [] and {} are literal syntax. Python can create bytecode just to create the list or dictionary objects:

>>> import dis
>>> dis.dis(compile("[]", "", "eval"))
  1           0 BUILD_LIST               0
              3 RETURN_VALUE        
>>> dis.dis(compile("{}", "", "eval"))
  1           0 BUILD_MAP                0
              3 RETURN_VALUE        

list() and dict() are separate objects. Their names need to be resolved, the stack has to be involved to push the arguments, the frame has to be stored to retrieve later, and a call has to be made. That all takes more time.

For the empty case, that means you have at the very least a LOAD_NAME (which has to search through the global namespace as well as the builtins module) followed by a CALL_FUNCTION, which has to preserve the current frame:

>>> dis.dis(compile("list()", "", "eval"))
  1           0 LOAD_NAME                0 (list)
              3 CALL_FUNCTION            0
              6 RETURN_VALUE        
>>> dis.dis(compile("dict()", "", "eval"))
  1           0 LOAD_NAME                0 (dict)
              3 CALL_FUNCTION            0
              6 RETURN_VALUE        

You can time the name lookup separately with timeit:

>>> import timeit
>>> timeit.timeit("list", number=10**7)
0.30749011039733887
>>> timeit.timeit("dict", number=10**7)
0.4215109348297119

The time discrepancy there is probably a dictionary hash collision. Subtract those times from the times for calling those objects, and compare the result against the times for using literals:

>>> timeit.timeit("[]", number=10**7)
0.30478692054748535
>>> timeit.timeit("{}", number=10**7)
0.31482696533203125
>>> timeit.timeit("list()", number=10**7)
0.9991960525512695
>>> timeit.timeit("dict()", number=10**7)
1.0200958251953125

So having to call the object takes an additional 1.00 - 0.31 - 0.30 == 0.39 seconds per 10 million calls.

You can avoid the global lookup cost by aliasing the global names as locals (using a timeit setup, everything you bind to a name is a local):

>>> timeit.timeit("_list", "_list = list", number=10**7)
0.1866450309753418
>>> timeit.timeit("_dict", "_dict = dict", number=10**7)
0.19016098976135254
>>> timeit.timeit("_list()", "_list = list", number=10**7)
0.841480016708374
>>> timeit.timeit("_dict()", "_dict = dict", number=10**7)
0.7233691215515137

but you never can overcome that CALL_FUNCTION cost.

Answer #8

There are many ways to convert an instance to a dictionary, with varying degrees of corner case handling and closeness to the desired result.


1. instance.__dict__

instance.__dict__

which returns

{"_foreign_key_cache": <OtherModel: OtherModel object>,
 "_state": <django.db.models.base.ModelState at 0x7ff0993f6908>,
 "auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key_id": 2,
 "id": 1,
 "normal_value": 1,
 "readonly_value": 2}

This is by far the simplest, but is missing many_to_many, foreign_key is misnamed, and it has two unwanted extra things in it.


2. model_to_dict

from django.forms.models import model_to_dict
model_to_dict(instance)

which returns

{"foreign_key": 2,
 "id": 1,
 "many_to_many": [<OtherModel: OtherModel object>],
 "normal_value": 1}

This is the only one with many_to_many, but is missing the uneditable fields.


3. model_to_dict(..., fields=...)

from django.forms.models import model_to_dict
model_to_dict(instance, fields=[field.name for field in instance._meta.fields])

which returns

{"foreign_key": 2, "id": 1, "normal_value": 1}

This is strictly worse than the standard model_to_dict invocation.


4. query_set.values()

SomeModel.objects.filter(id=instance.id).values()[0]

which returns

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key_id": 2,
 "id": 1,
 "normal_value": 1,
 "readonly_value": 2}

This is the same output as instance.__dict__ but without the extra fields. foreign_key_id is still wrong and many_to_many is still missing.


5. Custom Function

The code for django"s model_to_dict had most of the answer. It explicitly removed non-editable fields, so removing that check and getting the ids of foreign keys for many to many fields results in the following code which behaves as desired:

from itertools import chain

def to_dict(instance):
    opts = instance._meta
    data = {}
    for f in chain(opts.concrete_fields, opts.private_fields):
        data[f.name] = f.value_from_object(instance)
    for f in opts.many_to_many:
        data[f.name] = [i.id for i in f.value_from_object(instance)]
    return data

While this is the most complicated option, calling to_dict(instance) gives us exactly the desired result:

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

6. Use Serializers

Django Rest Framework"s ModelSerialzer allows you to build a serializer automatically from a model.

from rest_framework import serializers
class SomeModelSerializer(serializers.ModelSerializer):
    class Meta:
        model = SomeModel
        fields = "__all__"

SomeModelSerializer(instance).data

returns

{"auto_now_add": "2018-12-20T21:34:29.494827Z",
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

This is almost as good as the custom function, but auto_now_add is a string instead of a datetime object.


Bonus Round: better model printing

If you want a Django model that has a better Python command-line display, have your models subclass the following:

from django.db import models
from itertools import chain

class PrintableModel(models.Model):
    def __repr__(self):
        return str(self.to_dict())

    def to_dict(instance):
        opts = instance._meta
        data = {}
        for f in chain(opts.concrete_fields, opts.private_fields):
            data[f.name] = f.value_from_object(instance)
        for f in opts.many_to_many:
            data[f.name] = [i.id for i in f.value_from_object(instance)]
        return data

    class Meta:
        abstract = True

So, for example, if we define our models as such:

class OtherModel(PrintableModel): pass

class SomeModel(PrintableModel):
    normal_value = models.IntegerField()
    readonly_value = models.IntegerField(editable=False)
    auto_now_add = models.DateTimeField(auto_now_add=True)
    foreign_key = models.ForeignKey(OtherModel, related_name="ref1")
    many_to_many = models.ManyToManyField(OtherModel, related_name="ref2")

Calling SomeModel.objects.first() now gives output like this:

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

Answer #9

Are dictionaries ordered in Python 3.6+?

They are insertion ordered[1]. As of Python 3.6, for the CPython implementation of Python, dictionaries remember the order of items inserted. This is considered an implementation detail in Python 3.6; you need to use OrderedDict if you want insertion ordering that's guaranteed across other implementations of Python (and other ordered behavior[1]).

As of Python 3.7, this is no longer an implementation detail and instead becomes a language feature. From a python-dev message by GvR:

Make it so. "Dict keeps insertion order" is the ruling. Thanks!

This simply means that you can depend on it. Other implementations of Python must also offer an insertion ordered dictionary if they wish to be a conforming implementation of Python 3.7.


How does the Python 3.6 dictionary implementation perform better[2] than the older one while preserving element order?

Essentially, by keeping two arrays.

  • The first array, dk_entries, holds the entries (of type PyDictKeyEntry) for the dictionary in the order that they were inserted. Preserving order is achieved by this being an append only array where new items are always inserted at the end (insertion order).

  • The second, dk_indices, holds the indices for the dk_entries array (that is, values that indicate the position of the corresponding entry in dk_entries). This array acts as the hash table. When a key is hashed it leads to one of the indices stored in dk_indices and the corresponding entry is fetched by indexing dk_entries. Since only indices are kept, the type of this array depends on the overall size of the dictionary (ranging from type int8_t(1 byte) to int32_t/int64_t (4/8 bytes) on 32/64 bit builds)

In the previous implementation, a sparse array of type PyDictKeyEntry and size dk_size had to be allocated; unfortunately, it also resulted in a lot of empty space since that array was not allowed to be more than 2/3 * dk_size full for performance reasons. (and the empty space still had PyDictKeyEntry size!).

This is not the case now since only the required entries are stored (those that have been inserted) and only a sparse array of type intX_t (X depending on dict size), 2/3 * dk_size full, is kept. The empty space changed from type PyDictKeyEntry to intX_t.

So, obviously, creating a sparse array of type PyDictKeyEntry is much more memory demanding than a sparse array for storing ints.

You can see the full conversation on Python-Dev regarding this feature if interested; it is a good read.


In the original proposal made by Raymond Hettinger, a visualization of the data structures used can be seen which captures the gist of the idea.

For example, the dictionary:

d = {"timmy": "red", "barry": "green", "guido": "blue"}

is currently stored as [keyhash, key, value]:

entries = [["--", "--", "--"],
           [-8522787127447073495, "barry", "green"],
           ["--", "--", "--"],
           ["--", "--", "--"],
           ["--", "--", "--"],
           [-9092791511155847987, "timmy", "red"],
           ["--", "--", "--"],
           [-6480567542315338377, "guido", "blue"]]

Instead, the data should be organized as follows:

indices =  [None, 1, None, None, None, 0, None, 2]
entries =  [[-9092791511155847987, "timmy", "red"],
            [-8522787127447073495, "barry", "green"],
            [-6480567542315338377, "guido", "blue"]]

As you can visually now see, in the original proposal, a lot of space is essentially empty to reduce collisions and make look-ups faster. With the new approach, you reduce the memory required by moving the sparseness where it's really required, in the indices.


[1]: I say "insertion ordered" and not "ordered" since, with the existence of OrderedDict, "ordered" suggests further behavior that the `dict` object *doesn"t provide*. OrderedDicts are reversible, provide order sensitive methods and, mainly, provide an order-sensive equality tests (`==`, `!=`). `dict`s currently don"t offer any of those behaviors/methods.
[2]: The new dictionary implementations performs better **memory wise** by being designed more compactly; that"s the main benefit here. Speed wise, the difference isn"t so drastic, there"s places where the new dict might introduce slight regressions ([key-lookups, for example][10]) while in others (iteration and resizing come to mind) a performance boost should be present. Overall, the performance of the dictionary, especially in real-life situations, improves due to the compactness introduced.

Answer #10

json.dumps() converts a dictionary to a str object, not to a json (dict) object! So you have to load your str back into a dict in order to use it, by using the json.loads() method.

See json.dumps() as a save method and json.loads() as a retrieve method.

This is the code sample which might help you understand it more:

import json

r = {"is_claimed": "True", "rating": 3.5}
r = json.dumps(r)
loaded_r = json.loads(r)
loaded_r["rating"] #Output 3.5
type(r) #Output str
type(loaded_r) #Output dict

python pandas dataframe columns convert to dict key and value: StackOverflow Questions

Convert JSON string to dict using Python

I"m a little bit confused with JSON in Python. To me, it seems like a dictionary, and for that reason I"m trying to do that:

{
    "glossary":
    {
        "title": "example glossary",
        "GlossDiv":
        {
            "title": "S",
            "GlossList":
            {
                "GlossEntry":
                {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef":
                    {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

But when I do print dict(json), it gives an error.

How can I transform this string into a structure and then call json["title"] to obtain "example glossary"?
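
For reference, json.loads() is the usual route from a JSON string to nested dicts; a minimal sketch (json_string stands in for the text above):

import json

json_string = '{"glossary": {"title": "example glossary"}}'   # shortened stand-in
data = json.loads(json_string)
print(data["glossary"]["title"])   # example glossary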

Convert Django Model object to dict with all of the fields intact

How does one convert a django Model object to a dict with all of its fields? All ideally includes foreign keys and fields with editable=False.

Let me elaborate. Let's say I have a django model like the following:

from django.db import models

class OtherModel(models.Model): pass

class SomeModel(models.Model):
    normal_value = models.IntegerField()
    readonly_value = models.IntegerField(editable=False)
    auto_now_add = models.DateTimeField(auto_now_add=True)
    foreign_key = models.ForeignKey(OtherModel, related_name="ref1")
    many_to_many = models.ManyToManyField(OtherModel, related_name="ref2")

In the terminal, I have done the following:

other_model = OtherModel()
other_model.save()
instance = SomeModel()
instance.normal_value = 1
instance.readonly_value = 2
instance.foreign_key = other_model
instance.save()
instance.many_to_many.add(other_model)
instance.save()

I want to convert this to the following dictionary:

{"auto_now_add": datetime.datetime(2015, 3, 16, 21, 34, 14, 926738, tzinfo=<UTC>),
 "foreign_key": 1,
 "id": 1,
 "many_to_many": [1],
 "normal_value": 1,
 "readonly_value": 2}

Questions with unsatisfactory answers:

Django: Converting an entire set of a Model's objects into a single dictionary

How can I turn Django Model objects into a dictionary and still have their foreign keys?

Converting JSON String to Dictionary Not List

I am trying to pass in a JSON file and convert the data into a dictionary.

So far, this is what I have done:

import json
json1_file = open("json1")
json1_str = json1_file.read()
json1_data = json.loads(json1_str)

I"m expecting json1_data to be a dict type but it actually comes out as a list type when I check it with type(json1_data).

What am I missing? I need this to be a dictionary so I can access one of the keys.

python tuple to dict

For the tuple t = ((1, "a"), (2, "b")), dict(t) returns {1: "a", 2: "b"}.

Is there a good way to get {"a": 1, "b": 2} (keys and vals swapped)?

Ultimately, I want to be able to return 1 given "a" or 2 given "b", perhaps converting to a dict is not the best way.
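
For reference, swapping keys and values is a one-line dict comprehension; a small sketch:

t = ((1, "a"), (2, "b"))
swapped = {value: key for key, value in t}   # {'a': 1, 'b': 2}
print(swapped["a"])                          # 1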

String to Dictionary in Python

So I"ve spent way to much time on this, and it seems to me like it should be a simple fix. I"m trying to use Facebook"s Authentication to register users on my site, and I"m trying to do it server side. I"ve gotten to the point where I get my access token, and when I go to:

https://graph.facebook.com/me?access_token=MY_ACCESS_TOKEN

I get the information I"m looking for as a string that"s like this:

{"id":"123456789";"name":"John Doe";"first_name":"John";"last_name":"Doe";"link":"http://www.facebook.com/jdoe";"gender":"male";"email":"jdoeu0040gmail.com";"timezone":-7,"locale":"en_US";"verified":true,"updated_time":"2011-01-12T02:43:35+0000"}

It seems like I should just be able to use dict(string) on this but I'm getting this error:

ValueError: dictionary update sequence element #0 has length 1; 2 is required

So I tried using Pickle, but got this error:

KeyError: "{"

I tried using django.serializers to de-serialize it but had similar results. Any thoughts? I feel like the answer has to be simple, and I'm just being stupid. Thanks for any help!

python pandas dataframe to dictionary

I"ve a two columns dataframe, and intend to convert it to python dictionary - the first column will be the key and the second will be the value. Thank you in advance.

Dataframe:

    id    value
0    0     10.2
1    1      5.7
2    2      7.4
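
For reference, a minimal sketch (assuming the frame above is named df); the same zip trick, or Series.to_dict, applies:

value_dict = dict(zip(df["id"], df["value"]))
# or: df.set_index("id")["value"].to_dict()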

How to convert list of key-value tuples into dictionary?

I have a list that looks like:

[("A", 1), ("B", 2), ("C", 3)]

I want to turn it into a dictionary that looks like:

{"A": 1, "B": 2, "C": 3}

What"s the best way to go about this?

EDIT: My list of tuples is actually more like:

[(A, 12937012397), (BERA, 2034927830), (CE, 2349057340)]

I am getting the error ValueError: dictionary update sequence element #0 has length 1916; 2 is required
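
For reference, dict() accepts an iterable of 2-item pairs directly; the quoted error usually means one element is not a 2-item pair (for example, a long string). A small sketch:

pairs = [("A", 1), ("B", 2), ("C", 3)]
d = dict(pairs)   # {'A': 1, 'B': 2, 'C': 3}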

URL query parameters to dict python

Is there a way to parse a URL (with some python library) and return a python dictionary with the keys and values of a query parameters part of the URL?

For example:

url = "http://www.example.org/default.html?ct=32&op=92&item=98"

expected return:

{"ct":32, "op":92, "item":98}

List of tuples to dictionary

Here"s how I"m currently converting a list of tuples to dictionary in Python:

l = [("a",1),("b",2)]
h = {}
[h.update({k:v}) for k,v in l]
> [None, None]
h
> {"a": 1, "b": 2}

Is there a better way? It seems like there should be a one-liner to do this.


Answer #1

There are many ways to convert an instance to a dictionary, with varying degrees of corner case handling and closeness to the desired result.


1. instance.__dict__

instance.__dict__

which returns

{"_foreign_key_cache": <OtherModel: OtherModel object>,
 "_state": <django.db.models.base.ModelState at 0x7ff0993f6908>,
 "auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key_id": 2,
 "id": 1,
 "normal_value": 1,
 "readonly_value": 2}

This is by far the simplest, but it is missing many_to_many, the foreign key is misnamed (it appears as foreign_key_id), and it includes two unwanted extras (_state and _foreign_key_cache).


2. model_to_dict

from django.forms.models import model_to_dict
model_to_dict(instance)

which returns

{"foreign_key": 2,
 "id": 1,
 "many_to_many": [<OtherModel: OtherModel object>],
 "normal_value": 1}

This is the only one with many_to_many, but is missing the uneditable fields.


3. model_to_dict(..., fields=...)

from django.forms.models import model_to_dict
model_to_dict(instance, fields=[field.name for field in instance._meta.fields])

which returns

{"foreign_key": 2, "id": 1, "normal_value": 1}

This is strictly worse than the standard model_to_dict invocation.


4. query_set.values()

SomeModel.objects.filter(id=instance.id).values()[0]

which returns

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key_id": 2,
 "id": 1,
 "normal_value": 1,
 "readonly_value": 2}

This is the same output as instance.__dict__ but without the extra fields. foreign_key_id is still wrong and many_to_many is still missing.


5. Custom Function

The code for django"s model_to_dict had most of the answer. It explicitly removed non-editable fields, so removing that check and getting the ids of foreign keys for many to many fields results in the following code which behaves as desired:

from itertools import chain

def to_dict(instance):
    opts = instance._meta
    data = {}
    for f in chain(opts.concrete_fields, opts.private_fields):
        data[f.name] = f.value_from_object(instance)
    for f in opts.many_to_many:
        data[f.name] = [i.id for i in f.value_from_object(instance)]
    return data

While this is the most complicated option, calling to_dict(instance) gives us exactly the desired result:

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

6. Use Serializers

Django Rest Framework"s ModelSerialzer allows you to build a serializer automatically from a model.

from rest_framework import serializers
class SomeModelSerializer(serializers.ModelSerializer):
    class Meta:
        model = SomeModel
        fields = "__all__"

SomeModelSerializer(instance).data

returns

{"auto_now_add": "2018-12-20T21:34:29.494827Z",
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

This is almost as good as the custom function, but auto_now_add is a string instead of a datetime object.


Bonus Round: better model printing

If you want a Django model that has a better Python command-line display, have your models subclass the following:

from django.db import models
from itertools import chain

class PrintableModel(models.Model):
    def __repr__(self):
        return str(self.to_dict())

    def to_dict(self):
        opts = self._meta
        data = {}
        for f in chain(opts.concrete_fields, opts.private_fields):
            data[f.name] = f.value_from_object(self)
        for f in opts.many_to_many:
            data[f.name] = [i.id for i in f.value_from_object(self)]
        return data

    class Meta:
        abstract = True

So, for example, if we define our models as such:

class OtherModel(PrintableModel): pass

class SomeModel(PrintableModel):
    normal_value = models.IntegerField()
    readonly_value = models.IntegerField(editable=False)
    auto_now_add = models.DateTimeField(auto_now_add=True)
    foreign_key = models.ForeignKey(OtherModel, related_name="ref1", on_delete=models.CASCADE)
    many_to_many = models.ManyToManyField(OtherModel, related_name="ref2")

Calling SomeModel.objects.first() now gives output like this:

{"auto_now_add": datetime.datetime(2018, 12, 20, 21, 34, 29, 494827, tzinfo=<UTC>),
 "foreign_key": 2,
 "id": 1,
 "many_to_many": [2],
 "normal_value": 1,
 "readonly_value": 2}

Answer #2

Use df.to_dict("records") -- gives the output without having to transpose externally.

In [2]: df.to_dict("records")
Out[2]:
[{"customer": 1L, "item1": "apple", "item2": "milk", "item3": "tomato"},
 {"customer": 2L, "item1": "water", "item2": "orange", "item3": "potato"},
 {"customer": 3L, "item1": "juice", "item2": "mango", "item3": "chips"}]

Answer #3

Edit

As John Galt mentions in his answer, you should probably use df.to_dict("records") instead. It's faster than transposing manually.

In [20]: timeit df.T.to_dict().values()
1000 loops, best of 3: 395 µs per loop

In [21]: timeit df.to_dict("records")
10000 loops, best of 3: 53 µs per loop

Original answer

Use df.T.to_dict().values(), like below:

In [1]: df
Out[1]:
   customer  item1   item2   item3
0         1  apple    milk  tomato
1         2  water  orange  potato
2         3  juice   mango   chips

In [2]: df.T.to_dict().values()
Out[2]:
[{"customer": 1.0, "item1": "apple", "item2": "milk", "item3": "tomato"},
 {"customer": 2.0, "item1": "water", "item2": "orange", "item3": "potato"},
 {"customer": 3.0, "item1": "juice", "item2": "mango", "item3": "chips"}]

Answer #4

The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.

First, we need to modify the original DataFrame to add the row with data [3, 10].

df1 = pd.DataFrame(data = {"col1" : [1, 2, 3, 4, 5, 3], 
                           "col2" : [10, 11, 12, 13, 14, 10]}) 
df2 = pd.DataFrame(data = {"col1" : [1, 2, 3],
                           "col2" : [10, 11, 12]})

df1

   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
5     3    10

df2

   col1  col2
0     1    10
1     2    11
2     3    12

Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.

df_all = df1.merge(df2.drop_duplicates(), on=["col1","col2"], 
                   how="left", indicator=True)
df_all

   col1  col2     _merge
0     1    10       both
1     2    11       both
2     3    12       both
3     4    13  left_only
4     5    14  left_only
5     3    10  left_only

Create a boolean condition:

df_all["_merge"] == "left_only"

0    False
1    False
2    False
3     True
4     True
5     True
Name: _merge, dtype: bool
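Continuing with df_all from the merge above, this condition selects the rows that exist only in df1; dropping the helper column gives a clean result (a short sketch of the final step):

# Keep only the rows of df1 that have no match in df2, then drop the indicator.
only_in_df1 = df_all[df_all["_merge"] == "left_only"].drop(columns=["_merge"])
print(only_in_df1)
#    col1  col2
# 3     4    13
# 4     5    14
# 5     3    10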

Why other solutions are wrong

A few solutions make the same mistake - they only check that each value is independently in each column, not that the pair appears together in the same row. Adding the last row, which is unique but contains values drawn from both columns of df2, exposes the mistake:

common = df1.merge(df2,on=["col1","col2"])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

This solution gets the same wrong result:

df1.isin(df2.to_dict("l")).all(1)

Answer #5

How do I convert a list of dictionaries to a pandas DataFrame?

The other answers are correct, but not much has been explained in terms of advantages and limitations of these methods. The aim of this post will be to show examples of these methods under different situations, discuss when to use (and when not to use), and suggest alternatives.


DataFrame(), DataFrame.from_records(), and .from_dict()

Depending on the structure and format of your data, there are situations where either all three methods work, or some work better than others, or some don't work at all.

Consider a very contrived example.

np.random.seed(0)
data = pd.DataFrame(
    np.random.choice(10, (3, 4)), columns=list("ABCD")).to_dict("records")

print(data)
[{"A": 5, "B": 0, "C": 3, "D": 3},
 {"A": 7, "B": 9, "C": 3, "D": 5},
 {"A": 2, "B": 4, "C": 7, "D": 6}]

This list consists of "records" with every key present. This is the simplest case you could encounter.

# The following methods all produce the same output.
pd.DataFrame(data)
pd.DataFrame.from_dict(data)
pd.DataFrame.from_records(data)

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6

Word on Dictionary Orientations: orient="index"/"columns"

Before continuing, it is important to make the distinction between the different types of dictionary orientations, and support with pandas. There are two primary types: "columns", and "index".

orient="columns"
Dictionaries with the "columns" orientation will have their keys correspond to columns in the equivalent DataFrame.

For example, data above is in the "columns" orient.

data_c = [
 {"A": 5, "B": 0, "C": 3, "D": 3},
 {"A": 7, "B": 9, "C": 3, "D": 5},
 {"A": 2, "B": 4, "C": 7, "D": 6}]
pd.DataFrame.from_dict(data_c, orient="columns")

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6

Note: If you are using pd.DataFrame.from_records, the orientation is assumed to be "columns" (you cannot specify otherwise), and the dictionaries will be loaded accordingly.

orient="index"
With this orient, keys are assumed to correspond to index values. This kind of data is best suited for pd.DataFrame.from_dict.

data_i ={
 0: {"A": 5, "B": 0, "C": 3, "D": 3},
 1: {"A": 7, "B": 9, "C": 3, "D": 5},
 2: {"A": 2, "B": 4, "C": 7, "D": 6}}
pd.DataFrame.from_dict(data_i, orient="index")

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6

This case is not considered in the OP, but is still useful to know.

Setting Custom Index

If you need a custom index on the resultant DataFrame, you can set it using the index=... argument.

pd.DataFrame(data, index=["a", "b", "c"])
# pd.DataFrame.from_records(data, index=["a", "b", "c"])

   A  B  C  D
a  5  0  3  3
b  7  9  3  5
c  2  4  7  6

This is not supported by pd.DataFrame.from_dict.

Dealing with Missing Keys/Columns

All methods work out-of-the-box when handling dictionaries with missing keys/column values. For example,

data2 = [
     {"A": 5, "C": 3, "D": 3},
     {"A": 7, "B": 9, "F": 5},
     {"B": 4, "C": 7, "E": 6}]
# The methods below all produce the same output.
pd.DataFrame(data2)
pd.DataFrame.from_dict(data2)
pd.DataFrame.from_records(data2)

     A    B    C    D    E    F
0  5.0  NaN  3.0  3.0  NaN  NaN
1  7.0  9.0  NaN  NaN  NaN  5.0
2  NaN  4.0  7.0  NaN  6.0  NaN

Reading Subset of Columns

"What if I don"t want to read in every single column"? You can easily specify this using the columns=... parameter.

For example, from the example dictionary of data2 above, if you wanted to read only columns "A", "D", and "F", you can do so by passing a list:

pd.DataFrame(data2, columns=["A", "D", "F"])
# pd.DataFrame.from_records(data2, columns=["A", "D", "F"])

     A    D    F
0  5.0  3.0  NaN
1  7.0  NaN  5.0
2  NaN  NaN  NaN

This is not supported by pd.DataFrame.from_dict with the default orient "columns".

pd.DataFrame.from_dict(data2, orient="columns", columns=["A", "B"])
ValueError: cannot use columns parameter with orient="columns"

Reading Subset of Rows

Not supported by any of these methods directly. You will have to iterate over your data and perform a reverse delete in-place as you iterate. For example, to extract only the 0th and 2nd rows from data2 above, you can use:

rows_to_select = {0, 2}
for i in reversed(range(len(data2))):
    if i not in rows_to_select:
        del data2[i]

pd.DataFrame(data2)
# pd.DataFrame.from_dict(data2)
# pd.DataFrame.from_records(data2)

     A    B  C    D    E
0  5.0  NaN  3  3.0  NaN
1  NaN  4.0  7  NaN  6.0

The Panacea: json_normalize for Nested Data

A strong, robust alternative to the methods outlined above is the json_normalize function which works with lists of dictionaries (records), and in addition can also handle nested dictionaries.

pd.json_normalize(data)

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6
pd.json_normalize(data2)

     A    B  C    D    E
0  5.0  NaN  3  3.0  NaN
1  NaN  4.0  7  NaN  6.0

Again, keep in mind that the data passed to json_normalize needs to be in the list-of-dictionaries (records) format.

As mentioned, json_normalize can also handle nested dictionaries. Here's an example taken from the documentation.

data_nested = [
  {"counties": [{"name": "Dade", "population": 12345},
                {"name": "Broward", "population": 40000},
                {"name": "Palm Beach", "population": 60000}],
   "info": {"governor": "Rick Scott"},
   "shortname": "FL",
   "state": "Florida"},
  {"counties": [{"name": "Summit", "population": 1234},
                {"name": "Cuyahoga", "population": 1337}],
   "info": {"governor": "John Kasich"},
   "shortname": "OH",
   "state": "Ohio"}
]
pd.json_normalize(data_nested, 
                          record_path="counties", 
                          meta=["state", "shortname", ["info", "governor"]])

         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

For more information on the meta and record_path arguments, check out the documentation.


Summarising

Here"s a table of all the methods discussed above, along with supported features/functionality.

enter image description here

* Use orient="columns" and then transpose to get the same effect as orient="index".

Answer #6

TLDR; No, for loops are not blanket "bad", at least not always. It is probably more accurate to say that some vectorized operations are slower than iterating, rather than to say that iteration is faster than some vectorized operations. Knowing when and why is key to getting the most performance out of your code. In a nutshell, these are the situations where it is worth considering an alternative to vectorized pandas functions:

  1. When your data is small (...depending on what you're doing),
  2. When dealing with object/mixed dtypes
  3. When using the str/regex accessor functions

Let"s examine these situations individually.


Iteration v/s Vectorization on Small Data

Pandas follows a "Convention Over Configuration" approach in its API design. This means that the same API has been fitted to cater to a broad range of data and use cases.

When a pandas function is called, the following things (among others) must be handled internally by the function to ensure that it works correctly:

  1. Index/axis alignment
  2. Handling mixed datatypes
  3. Handling missing data

Almost every function will have to deal with these to varying extents, and this presents an overhead. The overhead is less for numeric functions (for example, Series.add), while it is more pronounced for string functions (for example, Series.str.replace).

for loops, on the other hand, are faster than you think. What's even better: list comprehensions (which create lists through for loops) are faster still, as they are optimized iterative mechanisms for list creation.

List comprehensions follow the pattern

[f(x) for x in seq]

Where seq is a pandas series or DataFrame column. Or, when operating over multiple columns,

[f(x, y) for x, y in zip(seq1, seq2)]

Where seq1 and seq2 are columns.

Numeric Comparison
Consider a simple boolean indexing operation. The list comprehension method has been timed against Series.ne (!=) and query. Here are the functions:

# Boolean indexing with Numeric value comparison.
df[df.A != df.B]                            # vectorized !=
df.query("A != B")                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

For simplicity, I have used the perfplot package to run all the timeit tests in this post. The timings for the operations above are below:

[Timing plot: vectorized !=, query (numexpr), list comp]

The list comprehension outperforms query for moderately sized N, and even outperforms the vectorized not equals comparison for tiny N. Unfortunately, the list comprehension scales linearly, so it does not offer much performance gain for larger N.

Note
It is worth mentioning that much of the benefit of list comprehensions comes from not having to worry about index alignment, but this means that if your code depends on index alignment, this approach will break. In some cases, vectorised operations over the underlying NumPy arrays can be considered as bringing in the "best of both worlds", allowing for vectorisation without all the unneeded overhead of the pandas functions. This means that you can rewrite the operation above as

df[df.A.values != df.B.values]

Which outperforms both the pandas and list comprehension equivalents:

NumPy vectorization is out of the scope of this post, but it is definitely worth considering, if performance matters.

Value Counts
Taking another example - this time, with another vanilla python construct that is faster than a for loop - collections.Counter. A common requirement is to compute the value counts and return the result as a dictionary. This is done with value_counts, np.unique, and Counter:

# Value Counts comparison.
ser.value_counts(sort=False).to_dict()           # value_counts
dict(zip(*np.unique(ser, return_counts=True)))   # np.unique
Counter(ser)                                     # Counter

[Timing plot: value_counts, np.unique, Counter]

The results are more pronounced here: Counter wins out over both vectorized methods across a larger range of small N (up to ~3500).

Note
More trivia (courtesy @user2357112). The Counter is implemented with a C accelerator, so while it still has to work with python objects instead of the underlying C datatypes, it is still faster than a for loop. Python power!

Of course, the takeaway here is that performance depends on your data and use case. The point of these examples is to convince you not to rule out these solutions as legitimate options. If these still don't give you the performance you need, there is always cython and numba. Let's add this test into the mix.

from numba import njit, prange

@njit(parallel=True)
def get_mask(x, y):
    result = [False] * len(x)
    for i in prange(len(x)):
        result[i] = x[i] != y[i]

    return np.array(result)

df[get_mask(df.A.values, df.B.values)] # numba

[Timing plot: numba added to the comparison]

Numba offers JIT compilation of loopy python code to very powerful vectorized code. Understanding how to make numba work involves a learning curve.


Operations with Mixed/object dtypes

String-based Comparison
Revisiting the filtering example from the first section, what if the columns being compared are strings? Consider the same 3 functions above, but with the input DataFrame cast to string.

# Boolean indexing with string value comparison.
df[df.A != df.B]                            # vectorized !=
df.query("A != B")                          # query (numexpr)
df[[x != y for x, y in zip(df.A, df.B)]]    # list comp

[Timing plot: string comparison with vectorized !=, query, list comp]

So, what changed? The thing to note here is that string operations are inherently difficult to vectorize. Pandas treats strings as objects, and all operations on objects fall back to a slow, loopy implementation.

Now, because this loopy implementation is surrounded by all the overhead mentioned above, there is a constant magnitude difference between these solutions, even though they scale the same.

When it comes to operations on mutable/complex objects, there is no comparison. List comprehension outperforms all operations involving dicts and lists.

Accessing Dictionary Value(s) by Key
Here are timings for two operations that extract a value from a column of dictionaries: map and the list comprehension. The setup is in the Appendix, under the heading "Code Snippets".

# Dictionary value extraction.
ser.map(operator.itemgetter("value"))     # map
pd.Series([x.get("value") for x in ser])  # list comprehension

[Timing plot: map vs. list comprehension]

Positional List Indexing
Timings for operations that extract the 0th element from a column of lists (handling exceptions): map, the str accessor, and list comprehensions:

# List positional indexing. 
def get_0th(lst):
    try:
        return lst[0]
    # Handle empty lists and NaNs gracefully.
    except (IndexError, TypeError):
        return np.nan

ser.map(get_0th)                                          # map
ser.str[0]                                                # str accessor
pd.Series([x[0] if len(x) > 0 else np.nan for x in ser])  # list comp
pd.Series([get_0th(x) for x in ser])                      # list comp safe

Note
If the index matters, you would want to do:

pd.Series([...], index=ser.index)

When reconstructing the series.

[Timing plot: map, str accessor, list comprehension, list comp safe]

List Flattening
A final example is flattening lists. This is another common problem, and demonstrates just how powerful pure python is here.

# Nested list flattening.
pd.DataFrame(ser.tolist()).stack().reset_index(drop=True)  # stack
pd.Series(list(chain.from_iterable(ser.tolist())))         # itertools.chain
pd.Series([y for x in ser for y in x])                     # nested list comp

[Timing plot: stack, itertools.chain, nested list comp]

Both itertools.chain.from_iterable and the nested list comprehension are pure python constructs, and scale much better than the stack solution.

These timings are a strong indication of the fact that pandas is not equipped to work with mixed dtypes, and that you should probably refrain from using it to do so. Wherever possible, data should be present as scalar values (ints/floats/strings) in separate columns.

Lastly, the applicability of these solutions depends widely on your data. So, the best thing to do would be to test these operations on your data before deciding what to go with. Notice how I have not timed apply on these solutions, because it would skew the graph (yes, it's that slow).


Regex Operations, and .str Accessor Methods

Pandas can apply regex operations such as str.contains, str.extract, and str.extractall, as well as other "vectorized" string operations (such as str.split, str.find, str.translate, and so on) on string columns. These functions are slower than list comprehensions, and are meant to be convenience functions more than anything else.

It is usually much faster to pre-compile a regex pattern with re.compile and iterate over your data yourself (also see Is it worth using Python's re.compile?). The list comp equivalent to str.contains looks something like this:

p = re.compile(...)
ser2 = pd.Series([x for x in ser if p.search(x)])

Or,

ser2 = ser[[bool(p.search(x)) for x in ser]]

If you need to handle NaNs, you can do something like

ser[[bool(p.search(x)) if pd.notnull(x) else False for x in ser]]

The list comp equivalent to str.extract (without groups) will look something like:

df["col2"] = [p.search(x).group(0) for x in df["col"]]

If you need to handle no-matches and NaNs, you can use a custom function (still faster!):

def matcher(x):
    m = p.search(str(x))
    if m:
        return m.group(0)
    return np.nan

df["col2"] = [matcher(x) for x in df["col"]]

The matcher function is very extensible. It can be fitted to return a list for each capture group, as needed: just query the group or groups attribute of the match object.

For str.extractall, change p.search to p.findall.

String Extraction
Consider a simple extraction operation. The idea is to extract 4 digits if they are preceded by an upper-case letter.

# Extracting strings.
p = re.compile(r"(?<=[A-Z])(d{4})")
def matcher(x):
    m = p.search(x)
    if m:
        return m.group(0)
    return np.nan

ser.str.extract(r"(?<=[A-Z])(d{4})", expand=False)   #  str.extract
pd.Series([matcher(x) for x in ser])                  #  list comprehension

[Timing plot: str.extract vs. list comprehension]

Full disclosure - I am the author (in part or whole) of several related posts on these topics.


Conclusion

As shown in the examples above, iteration shines when working with small DataFrames, mixed datatypes, and regular expressions.

The speedup you get depends on your data and your problem, so your mileage may vary. The best thing to do is to carefully run tests and see if the payout is worth the effort.

The "vectorized" functions shine in their simplicity and readability, so if performance is not critical, you should definitely prefer those.

Another side note: certain string operations deal with constraints that favour the use of NumPy, and careful NumPy vectorization can outperform pure Python in those cases.

Additionally, sometimes just operating on the underlying arrays via .values as opposed to on the Series or DataFrames can offer a healthy enough speedup for most usual scenarios (see the Note in the Numeric Comparison section above). So, for example df[df.A.values != df.B.values] would show instant performance boosts over df[df.A != df.B]. Using .values may not be appropriate in every situation, but it is a useful hack to know.

As mentioned above, it"s up to you to decide whether these solutions are worth the trouble of implementing.


Appendix: Code Snippets

import perfplot  
import operator 
import pandas as pd
import numpy as np
import re

from collections import Counter
from itertools import chain

# Boolean indexing with Numeric value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A","B"]),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query("A != B"),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: df[get_mask(df.A.values, df.B.values)]
    ],
    labels=["vectorized !=", "query (numexpr)", "list comp", "numba"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N"
)

# Value Counts comparison.
perfplot.show(
    setup=lambda n: pd.Series(np.random.choice(1000, n)),
    kernels=[
        lambda ser: ser.value_counts(sort=False).to_dict(),
        lambda ser: dict(zip(*np.unique(ser, return_counts=True))),
        lambda ser: Counter(ser),
    ],
    labels=["value_counts", "np.unique", "Counter"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=lambda x, y: dict(x) == dict(y)
)

# Boolean indexing with string value comparison.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A","B"], dtype=str),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df.query("A != B"),
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
    ],
    labels=["vectorized !=", "query (numexpr)", "list comp"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)

# Dictionary value extraction.
ser1 = pd.Series([{"key": "abc", "value": 123}, {"key": "xyz", "value": 456}])
perfplot.show(
    setup=lambda n: pd.concat([ser1] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(operator.itemgetter("value")),
        lambda ser: pd.Series([x.get("value") for x in ser]),
    ],
    labels=["map", "list comprehension"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)

# List positional indexing. 
ser2 = pd.Series([["a", "b", "c"], [1, 2], []])        
perfplot.show(
    setup=lambda n: pd.concat([ser2] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.map(get_0th),
        lambda ser: ser.str[0],
        lambda ser: pd.Series([x[0] if len(x) > 0 else np.nan for x in ser]),
        lambda ser: pd.Series([get_0th(x) for x in ser]),
    ],
    labels=["map", "str accessor", "list comprehension", "list comp safe"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)

# Nested list flattening.
ser3 = pd.Series([["a", "b", "c"], ["d", "e"], ["f", "g"]])
perfplot.show(
    setup=lambda n: pd.concat([ser3] * n, ignore_index=True),
    kernels=[
        lambda ser: pd.DataFrame(ser.tolist()).stack().reset_index(drop=True),
        lambda ser: pd.Series(list(chain.from_iterable(ser.tolist()))),
        lambda ser: pd.Series([y for x in ser for y in x]),
    ],
    labels=["stack", "itertools.chain", "nested list comp"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",    
    equality_check=None

)

# Extracting strings.
ser4 = pd.Series(["foo xyz", "test A1234", "D3345 xtz"])
perfplot.show(
    setup=lambda n: pd.concat([ser4] * n, ignore_index=True),
    kernels=[
        lambda ser: ser.str.extract(r"(?<=[A-Z])(d{4})", expand=False),
        lambda ser: pd.Series([matcher(x) for x in ser])
    ],
    labels=["str.extract", "list comprehension"],
    n_range=[2**k for k in range(0, 15)],
    xlabel="N",
    equality_check=None
)

Answer #7

I"d like to shed a little bit more light on the interplay of iter, __iter__ and __getitem__ and what happens behind the curtains. Armed with that knowledge, you will be able to understand why the best you can do is

try:
    iter(maybe_iterable)
    print("iteration will probably work")
except TypeError:
    print("not iterable")

I will list the facts first and then follow up with a quick reminder of what happens when you employ a for loop in python, followed by a discussion to illustrate the facts.

Facts

  1. You can get an iterator from any object o by calling iter(o) if at least one of the following conditions holds true:

    a) o has an __iter__ method which returns an iterator object. An iterator is any object with an __iter__ and a __next__ (Python 2: next) method.

    b) o has a __getitem__ method.

  2. Checking for an instance of Iterable or Sequence, or checking for the attribute __iter__ is not enough.

  3. If an object o implements only __getitem__, but not __iter__, iter(o) will construct an iterator that tries to fetch items from o by integer index, starting at index 0. The iterator will catch any IndexError (but no other errors) that is raised and then raises StopIteration itself.

  4. In the most general sense, there"s no way to check whether the iterator returned by iter is sane other than to try it out.

  5. If an object o implements __iter__, the iter function will make sure that the object returned by __iter__ is an iterator. There is no sanity check if an object only implements __getitem__.

  6. __iter__ wins. If an object o implements both __iter__ and __getitem__, iter(o) will call __iter__.

  7. If you want to make your own objects iterable, always implement the __iter__ method.

for loops

In order to follow along, you need an understanding of what happens when you employ a for loop in Python. Feel free to skip right to the next section if you already know.

When you use for item in o for some iterable object o, Python calls iter(o) and expects an iterator object as the return value. An iterator is any object which implements a __next__ (or next in Python 2) method and an __iter__ method.

By convention, the __iter__ method of an iterator should return the object itself (i.e. return self). Python then calls next on the iterator until StopIteration is raised. All of this happens implicitly, but the following demonstration makes it visible:

import random

class DemoIterable(object):
    def __iter__(self):
        print("__iter__ called")
        return DemoIterator()

class DemoIterator(object):
    def __iter__(self):
        return self

    def __next__(self):
        print("__next__ called")
        r = random.randint(1, 10)
        if r == 5:
            print("raising StopIteration")
            raise StopIteration
        return r

Iteration over a DemoIterable:

>>> di = DemoIterable()
>>> for x in di:
...     print(x)
...
__iter__ called
__next__ called
9
__next__ called
8
__next__ called
10
__next__ called
3
__next__ called
10
__next__ called
raising StopIteration

Discussion and illustrations

On point 1 and 2: getting an iterator and unreliable checks

Consider the following class:

class BasicIterable(object):
    def __getitem__(self, item):
        if item == 3:
            raise IndexError
        return item

Calling iter with an instance of BasicIterable will return an iterator without any problems because BasicIterable implements __getitem__.

>>> b = BasicIterable()
>>> iter(b)
<iterator object at 0x7f1ab216e320>

However, it is important to note that b does not have the __iter__ attribute and is not considered an instance of Iterable or Sequence:

>>> from collections.abc import Iterable, Sequence
>>> hasattr(b, "__iter__")
False
>>> isinstance(b, Iterable)
False
>>> isinstance(b, Sequence)
False

This is why Fluent Python by Luciano Ramalho recommends calling iter and handling the potential TypeError as the most accurate way to check whether an object is iterable. Quoting directly from the book:

As of Python 3.4, the most accurate way to check whether an object x is iterable is to call iter(x) and handle a TypeError exception if it isn’t. This is more accurate than using isinstance(x, abc.Iterable) , because iter(x) also considers the legacy __getitem__ method, while the Iterable ABC does not.

On point 3: Iterating over objects which only provide __getitem__, but not __iter__

Iterating over an instance of BasicIterable works as expected: Python constructs an iterator that tries to fetch items by index, starting at zero, until an IndexError is raised. The demo object's __getitem__ method simply returns the item which was supplied as the argument to __getitem__(self, item) by the iterator returned by iter.

>>> b = BasicIterable()
>>> it = iter(b)
>>> next(it)
0
>>> next(it)
1
>>> next(it)
2
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Note that the iterator raises StopIteration when it cannot return the next item and that the IndexError which is raised for item == 3 is handled internally. This is why looping over a BasicIterable with a for loop works as expected:

>>> for x in b:
...     print(x)
...
0
1
2

Here"s another example in order to drive home the concept of how the iterator returned by iter tries to access items by index. WrappedDict does not inherit from dict, which means instances won"t have an __iter__ method.

class WrappedDict(object): # note: no inheritance from dict!
    def __init__(self, dic):
        self._dict = dic

    def __getitem__(self, item):
        try:
            return self._dict[item] # delegate to dict.__getitem__
        except KeyError:
            raise IndexError

Note that calls to __getitem__ are delegated to dict.__getitem__ for which the square bracket notation is simply a shorthand.

>>> w = WrappedDict({-1: "not printed",
...                   0: "hi", 1: "StackOverflow", 2: "!",
...                   4: "not printed", 
...                   "x": "not printed"})
>>> for x in w:
...     print(x)
... 
hi
StackOverflow
!

On point 4 and 5: iter checks for an iterator when it calls __iter__:

When iter(o) is called for an object o, iter will make sure that the return value of __iter__, if the method is present, is an iterator. This means that the returned object must implement __next__ (or next in Python 2) and __iter__. iter cannot perform any sanity checks for objects which only provide __getitem__, because it has no way to check whether the items of the object are accessible by integer index.

class FailIterIterable(object):
    def __iter__(self):
        return object() # not an iterator

class FailGetitemIterable(object):
    def __getitem__(self, item):
        raise Exception

Note that constructing an iterator from FailIterIterable instances fails immediately, while constructing an iterator from FailGetitemIterable succeeds, but will throw an Exception on the first call to __next__.

>>> fii = FailIterIterable()
>>> iter(fii)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: iter() returned non-iterator of type "object"
>>>
>>> fgi = FailGetitemIterable()
>>> it = iter(fgi)
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/iterdemo.py", line 42, in __getitem__
    raise Exception
Exception

On point 6: __iter__ wins

This one is straightforward. If an object implements __iter__ and __getitem__, iter will call __iter__. Consider the following class

class IterWinsDemo(object):
    def __iter__(self):
        return iter(["__iter__", "wins"])

    def __getitem__(self, item):
        return ["__getitem__", "wins"][item]

and the output when looping over an instance:

>>> iwd = IterWinsDemo()
>>> for x in iwd:
...     print(x)
...
__iter__
wins

On point 7: your iterable classes should implement __iter__

You might ask yourself why most builtin sequences like list implement an __iter__ method when __getitem__ would be sufficient.

class WrappedList(object): # note: no inheritance from list!
    def __init__(self, lst):
        self._list = lst

    def __getitem__(self, item):
        return self._list[item]

After all, iteration over instances of the class above, which delegates calls to __getitem__ to list.__getitem__ (using the square bracket notation), will work fine:

>>> wl = WrappedList(["A", "B", "C"])
>>> for x in wl:
...     print(x)
... 
A
B
C

The reasons your custom iterables should implement __iter__ are as follows:

  1. If you implement __iter__, instances will be considered iterables, and isinstance(o, collections.abc.Iterable) will return True.
  2. If the object returned by __iter__ is not an iterator, iter will fail immediately and raise a TypeError.
  3. The special handling of __getitem__ exists for backwards compatibility reasons. Quoting again from Fluent Python:

That is why any Python sequence is iterable: they all implement __getitem__ . In fact, the standard sequences also implement __iter__, and yours should too, because the special handling of __getitem__ exists for backward compatibility reasons and may be gone in the future (although it is not deprecated as I write this).

Answer #8

  • Parquet format is designed for long-term storage, whereas Arrow is more intended for short-term or ephemeral storage (Arrow may be more suitable for long-term storage once the 1.0.0 release happens, since the binary format will be stable then)

  • Parquet is more expensive to write than Feather as it features more layers of encoding and compression. Feather is unmodified raw columnar Arrow memory. We will probably add simple compression to Feather in the future.

  • Due to dictionary encoding, RLE encoding, and data page compression, Parquet files will often be much smaller than Feather files

  • Parquet is a standard storage format for analytics that's supported by many different systems: Spark, Hive, Impala, various AWS services, in future by BigQuery, etc. So if you are doing analytics, Parquet is a good option as a reference storage format for query by multiple systems

The benchmarks you showed are going to be very noisy since the data you read and wrote is very small. You should try compressing at least 100MB or upwards of 1GB of data to get more informative benchmarks; see e.g. http://wesmckinney.com/blog/python-parquet-multithreading/
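For reference, both formats are a one-line round trip from pandas once pyarrow is installed (a minimal sketch; the file names are arbitrary):

import pandas as pd

df = pd.DataFrame({"col1": range(5), "col2": list("abcde")})

# Parquet: encoded and compressed, good for long-term storage and other engines.
df.to_parquet("data.parquet")          # uses pyarrow (or fastparquet) if available
back_p = pd.read_parquet("data.parquet")

# Feather: raw Arrow columnar memory, cheap to write and read back.
df.to_feather("data.feather")          # requires pyarrow
back_f = pd.read_feather("data.feather")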

Hope this helps

Answer #9

There are at least six ways. The preferred way depends on what your use case is.

Option 1:

Simply add an asdict() method.

Based on the problem description I would very much consider the asdict way of doing things suggested by other answers. This is because it does not appear that your object is really much of a collection:

class Wharrgarbl(object):

    ...

    def asdict(self):
        return {"a": self.a, "b": self.b, "c": self.c}

Using the other options below could be confusing for others unless it is very obvious exactly which object members would and would not be iterated or specified as key-value pairs.

Option 1a:

Inherit your class from "typing.NamedTuple" (or the mostly equivalent "collections.namedtuple"), and use the _asdict method provided for you.

from typing import NamedTuple

class Wharrgarbl(NamedTuple):
    a: str
    b: str
    c: str
    sum: int = 6
    version: str = "old"

Using a named tuple is a very convenient way to add lots of functionality to your class with a minimum of effort, including an _asdict method. However, a limitation is that, as shown above, the NT will include all the members in its _asdict.

If there are members you don"t want to include in your dictionary, you"ll need to modify the _asdict result:

from typing import NamedTuple

class Wharrgarbl(NamedTuple):
    a: str
    b: str
    c: str
    sum: int = 6
    version: str = "old"

    def _asdict(self):
        d = super()._asdict()
        del d["sum"]
        del d["version"]
        return d

Another limitation is that NT is read-only. This may or may not be desirable.

Option 2:

Implement __iter__.

Like this, for example:

def __iter__(self):
    yield "a", self.a
    yield "b", self.b
    yield "c", self.c

Now you can just do:

dict(my_object)

This works because the dict() constructor accepts an iterable of (key, value) pairs to construct a dictionary. Before doing this, ask yourself whether iterating the object as a series of key-value pairs in this manner - while convenient for creating a dict - might actually be surprising behavior in other contexts. E.g., ask yourself "what should the behavior of list(my_object) be...?"

Additionally, note that accessing values directly using the get item obj["a"] syntax will not work, and keyword argument unpacking won"t work. For those, you"d need to implement the mapping protocol.

Option 3:

Implement the mapping protocol. This allows access-by-key behavior, casting to a dict without using __iter__, and also provides unpacking behavior ({**my_obj}) and keyword unpacking behavior if all the keys are strings (dict(**my_obj)).

The mapping protocol requires that you provide (at minimum) two methods together: keys() and __getitem__.

class MyKwargUnpackable:
    def keys(self):
        return list("abc")
    def __getitem__(self, key):
        return dict(zip("abc", "one two three".split()))[key]

Now you can do things like:

>>> m=MyKwargUnpackable()
>>> m["a"]
"one"
>>> dict(m)  # cast to dict directly
{"a": "one", "b": "two", "c": "three"}
>>> dict(**m)  # unpack as kwargs
{"a": "one", "b": "two", "c": "three"}

As mentioned above, if you are using a new enough version of python you can also unpack your mapping-protocol object into a dictionary comprehension like so (and in this case it is not required that your keys be strings):

>>> {**m}
{"a": "one", "b": "two", "c": "three"}

Note that the mapping protocol takes precedence over the __iter__ method when casting an object to a dict directly (without using kwarg unpacking, i.e. dict(m)). So it is possible- and sometimes convenient- to cause the object to have different behavior when used as an iterable (e.g., list(m)) vs. when cast to a dict (dict(m)).

EMPHASIZED: Just because you CAN use the mapping protocol does NOT mean that you SHOULD do so. Does it actually make sense for your object to be passed around as a set of key-value pairs, or as keyword arguments and values? Does accessing it by key - just like a dictionary - really make sense?

If the answer to these questions is yes, it's probably a good idea to consider the next option.

Option 4:

Look into using the "collections.abc" module.

Inheriting your class from collections.abc.Mapping or collections.abc.MutableMapping signals to other users that, for all intents and purposes, your class is a mapping * and can be expected to behave that way.

You can still cast your object to a dict just as you require, but there would probably be little reason to do so. Because of duck typing, bothering to cast your mapping object to a dict would just be an additional unnecessary step the majority of the time.

This answer might also be helpful.

As noted in the comments below: it's worth mentioning that doing this the abc way essentially turns your object class into a dict-like class (assuming you use MutableMapping and not the read-only Mapping base class). Everything you would be able to do with dict, you could do with your own class object. This may or may not be desirable.
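A minimal sketch of this option (names are illustrative, reusing the three fields from earlier): inherit from Mapping, implement the three required methods, and dict(), iteration, and unpacking come for free.

from collections.abc import Mapping

class Wharrgarbl(Mapping):
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

    def __getitem__(self, key):
        # Expose only the three "public" fields as mapping keys.
        if key in ("a", "b", "c"):
            return getattr(self, key)
        raise KeyError(key)

    def __iter__(self):
        return iter(("a", "b", "c"))

    def __len__(self):
        return 3

w = Wharrgarbl(1, 2, 3)
print(dict(w))   # {'a': 1, 'b': 2, 'c': 3}
print({**w})     # same result, via mapping unpacking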

Also consider looking at the numerical abcs in the numbers module:

https://docs.python.org/3/library/numbers.html

Since you"re also casting your object to an int, it might make more sense to essentially turn your class into a full fledged int so that casting isn"t necessary.

Option 5:

Look into using the dataclasses module (Python 3.7 and later), which includes a convenient asdict() utility method.

from dataclasses import dataclass, asdict, field, InitVar

@dataclass
class Wharrgarbl(object):
    a: int
    b: int
    c: int
    sum: InitVar[int]  # note: InitVar will exclude this from the dict
    version: InitVar[str] = "old"

    def __post_init__(self, sum, version):
        self.sum = 6  # this looks like an OP mistake?
        self.version = str(version)

Now you can do this:

    >>> asdict(Wharrgarbl(1,2,3,4,"X"))
    {"a": 1, "b": 2, "c": 3}

Option 6:

Use typing.TypedDict, which was added in Python 3.8.

NOTE: option 6 is likely NOT what the OP, or other readers based on the title of this question, are looking for. See additional comments below.

from typing import TypedDict

class Wharrgarbl(TypedDict):
    a: str
    b: str
    c: str

Using this option, the resulting object is a dict (emphasis: it is not a Wharrgarbl). There is no reason at all to "cast" it to a dict (unless you are making a copy).

And since the object is a dict, the initialization signature is identical to that of dict and as such it only accepts keyword arguments or another dictionary.

    >>> w = Wharrgarbl(a=1, b=2, c=3)
    >>> w
    {"a": 1, "b": 2, "c": 3}
    >>> type(w)
    <class "dict">

Emphasized: the above "class" Wharrgarbl isn"t actually a new class at all. It is simply syntactic sugar for creating typed dict objects with fields of different types for the type checker.

As such this option can be pretty convenient for signaling to readers of your code (and also to a type checker such as mypy) that such a dict object is expected to have specific keys with specific value types.

But this means you cannot, for example, add other methods, although you can try:

class MyDict(TypedDict):
    def my_fancy_method(self):
        return "world changing result"

...but it won"t work:

>>> MyDict().my_fancy_method()
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
AttributeError: "dict" object has no attribute "my_fancy_method"

* "Mapping" has become the standard "name" of the dict-like duck type

Answer #10

Or if you are already using pandas, You can do it with json_normalize() like so:

import pandas as pd

d = {"a": 1,
     "c": {"a": 2, "b": {"x": 5, "y" : 10}},
     "d": [1, 2, 3]}

df = pd.json_normalize(d, sep="_")

print(df.to_dict(orient="records")[0])

Output:

{"a": 1, "c_a": 2, "c_b_x": 5, "c_b_y": 10, "d": [1, 2, 3]}
