psycopg2: insert multiple rows with one query


I need to insert multiple rows with one query (the number of rows is not constant), so I need to execute a query like this one:

INSERT INTO t (a, b) VALUES (1, 2), (3, 4), (5, 6);

The only way I know is

args = [(1,2), (3,4), (5,6)]
args_str = ",".join(cursor.mogrify("%s", (x, )) for x in args)
cursor.execute("INSERT INTO t (a, b) VALUES "+args_str)

but I want some simpler way.

Answer rating: 258

I built a program that inserts multiple rows to a server located in another city.

I found out that using this method was about 10 times faster than executemany(). In my case, tup is a tuple containing about 2000 rows. It took about 10 seconds when using this method:

args_str = ",".join(cur.mogrify("(%s,%s,%s,%s,%s,%s,%s,%s,%s)", x) for x in tup)
cur.execute("INSERT INTO table VALUES " + args_str) 

and 2 minutes when using this method:

cur.executemany("INSERT INTO table VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s)", tup)

Answer rating: 196

New execute_values method in Psycopg 2.7:

import psycopg2.extras

data = [(1,"x"), (2,"y")]
insert_query = "insert into t (a, b) values %s"
psycopg2.extras.execute_values(
    cursor, insert_query, data, template=None, page_size=100
)

The pythonic way of doing it in Psycopg 2.6:

data = [(1,"x"), (2,"y")]
records_list_template = ",".join(["%s"] * len(data))
insert_query = "insert into t (a, b) values {}".format(records_list_template)
cursor.execute(insert_query, data)

Explanation: If the data to be inserted is given as a list of tuples like in

data = [(1,"x"), (2,"y")]

then it is already in the exact required format as

  1. the values syntax of the insert clause expects a list of records as in

    insert into t (a, b) values (1, 'x'),(2, 'y')

  2. Psycopg adapts a Python tuple to a Postgresql record.

The only necessary work is to provide a records list template to be filled by psycopg

# We use the data list to be sure of the template length
records_list_template = ",".join(["%s"] * len(data))

and place it in the insert query

insert_query = "insert into t (a, b) values {}".format(records_list_template)

Printing the insert_query outputs

insert into t (a, b) values %s,%s

Now to the usual Psycopg arguments substitution

cursor.execute(insert_query, data)

Or just testing what will be sent to the server

print (cursor.mogrify(insert_query, data).decode("utf8"))


insert into t (a, b) values (1, 'x'),(2, 'y')

Answer rating: 88

Update with psycopg2 2.7:

The classic executemany() is about 60 times slower than @ant32's implementation (called "folded"), as explained in this thread.

This implementation was added to psycopg2 in version 2.7 and is called execute_values():

from psycopg2.extras import execute_values

execute_values(cur,
    "INSERT INTO test (id, v1, v2) VALUES %s",
    [(1, 2, 3), (4, 5, 6), (7, 8, 9)])

Previous Answer:

To insert multiple rows, using the multirow VALUES syntax with execute() is about 10x faster than using psycopg2 executemany(). Indeed, executemany() just runs many individual INSERT statements.

@ant32 "s code works perfectly in Python 2. But in Python 3, cursor.mogrify() returns bytes, cursor.execute() takes either bytes or strings, and ",".join() expects str instance.

So in Python 3 you may need to modify @ant32's code by adding .decode("utf-8"):

args_str = ",".join(cur.mogrify("(%s,%s,%s,%s,%s,%s,%s,%s,%s)", x).decode("utf-8") for x in tup)
cur.execute("INSERT INTO table VALUES " + args_str)

Or by using bytes (with b"..." literals) only:

args_bytes = b",".join(cur.mogrify("(%s,%s,%s,%s,%s,%s,%s,%s,%s)", x) for x in tup)
cur.execute(b"INSERT INTO table VALUES " + args_bytes) 

Related StackOverflow Questions

How to insert newlines on argparse help text?

I"m using argparse in Python 2.7 for parsing input options. One of my options is a multiple choice. I want to make a list in its help text, e.g.

from argparse import ArgumentParser

parser = ArgumentParser(description="test")

parser.add_argument("-g", choices=["a", "b", "g", "d", "e"], default="a",
    help="Some option, where\n"
         " a = alpha\n"
         " b = beta\n"
         " g = gamma\n"
         " d = delta\n"
         " e = epsilon")


However, argparse strips all newlines and consecutive spaces. The result looks like

~/Downloads:52$ python2.7 -h
usage: [-h] [-g {a,b,g,d,e}]


optional arguments:
  -h, --help      show this help message and exit
  -g {a,b,g,d,e}  Some option, where a = alpha b = beta g = gamma d = delta e
                  = epsilon

How to insert newlines in the help text?
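
One commonly suggested fix (not part of the question itself) is to pass formatter_class=argparse.RawTextHelpFormatter, which stops argparse from re-wrapping help text, so embedded newlines survive; a minimal sketch:

from argparse import ArgumentParser, RawTextHelpFormatter

parser = ArgumentParser(description="test", formatter_class=RawTextHelpFormatter)
parser.add_argument("-g", choices=["a", "b", "g", "d", "e"], default="a",
    help="Some option, where\n"
         " a = alpha\n"
         " b = beta\n"
         " g = gamma\n"
         " d = delta\n"
         " e = epsilon")
parser.print_help()

The trade-off is that you then have to wrap all help text manually.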

Is a Python list guaranteed to have its elements stay in the order they are inserted in?

If I have the following Python code

>>> x = []
>>> x = x + [1]
>>> x = x + [2]
>>> x = x + [3]
>>> x
[1, 2, 3]

Will x be guaranteed to always be [1,2,3], or are other orderings of the interim elements possible?

Inserting image into IPython notebook markdown

I am starting to depend heavily on the IPython notebook app to develop and document algorithms. It is awesome, but there is something that seems like it should be possible, yet I can't figure out how to do it:

I would like to insert a local image into my (local) IPython notebook markdown to aid in documenting an algorithm. I know enough to add something like <img src="image.png"> to the markdown, but that is about as far as my knowledge goes. I assume I could put the image in the directory the notebook is served from (or some subdirectory) to be able to access it, but I can't figure out where that directory is. (I'm working on a Mac.) So, is it possible to do what I'm trying to do without too much trouble?

how do I insert a column at a specific column index in pandas?

Can I insert a column at a specific column index in pandas?

import pandas as pd
df = pd.DataFrame({"l":["a","b","c","d"], "v":[1,2,1,2]})
df["n"] = 0

This will put column n as the last column of df, but isn't there a way to tell df to put n at the beginning?
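
For reference, pandas exposes exactly this through DataFrame.insert(loc, column, value), which inserts the column in place at the given position; a minimal sketch:

import pandas as pd

df = pd.DataFrame({"l": ["a", "b", "c", "d"], "v": [1, 2, 1, 2]})
df.insert(0, "n", 0)        # insert column "n" at position 0, filled with 0
print(list(df.columns))     # ['n', 'l', 'v']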

What is the syntax to insert one list into another list in python?

Given two lists:

x = [1,2,3]
y = [4,5,6]

What is the syntax to:

  1. Insert x into y such that y now looks like [1, 2, 3, [4, 5, 6]]?
  2. Insert all the items of x into y such that y now looks like [1, 2, 3, 4, 5, 6]?
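
A short sketch of both variants (using copies so the original lists stay intact):

x = [1, 2, 3]
y = [4, 5, 6]

# 1. nested: [1, 2, 3, [4, 5, 6]]
a = x + [y]          # or: a = x[:]; a.append(y)

# 2. flat: [1, 2, 3, 4, 5, 6]
b = y[:]
b[:0] = x            # slice assignment at the front; x + y also works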

How to retrieve inserted id after inserting row in SQLite using Python?

How to retrieve inserted id after inserting row in SQLite using Python? I have table like this:

username VARCHAR(50),
password VARCHAR(50)

I insert a new row with example data username="test" and password="test". How do I retrieve the generated id in a transaction-safe way? This is for a website solution, where two people may be inserting data at the same time. I know I can get the last-read row, but I don't think that is transaction safe. Can somebody give me some advice?
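
The usual answer here is cursor.lastrowid, which is scoped to the cursor's own connection and is therefore safe under concurrent inserts; a minimal sketch against a hypothetical users table:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username VARCHAR(50), password VARCHAR(50))")
cur = conn.cursor()
cur.execute("INSERT INTO users (username, password) VALUES (?, ?)", ("test", "test"))
print(cur.lastrowid)   # id generated by this cursor's last insert -> 1
conn.commit()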

How do I get the "id" after INSERT into MySQL database with Python?

I execute an INSERT INTO statement

cursor.execute("INSERT INTO mytable(height) VALUES(%s)",(height))

and I want to get the primary key.

My table has 2 columns:

id      primary, auto increment
height  this is the other column.

How do I get the "id", after I just inserted this?


Insert at first position of a list in Python

How can I insert an element at the first index of a list? If I use list.insert(0, elem), does elem modify the content of the first index? Or do I have to create a new list with the first elem and then copy the old list inside this new one?
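
For the record, list.insert mutates the list in place and shifts the existing items right; nothing is overwritten and no new list is created (though the shift costs O(n)):

elems = ["b", "c"]
elems.insert(0, "a")   # in-place; "a" does not overwrite "b"
print(elems)           # ['a', 'b', 'c']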

mongodb: insert if not exists

Every day, I receive a stock of documents (an update). What I want to do is insert each item that does not already exist.

  • I also want to keep track of the first time I inserted them, and the last time I saw them in an update.
  • I don"t want to have duplicate documents.
  • I don"t want to remove a document which has previously been saved, but is not in my update.
  • 95% (estimated) of the records are unmodified from day to day.

I am using the Python driver (pymongo).

What I currently do is (pseudo-code):

for each document in update:
      existing_document = collection.find_one(document)
      if not existing_document:
           document["insertion_date"] = now
      else:
           document = existing_document
      document["last_update_date"] = now
      collection.save(document)

My problem is that it is very slow (40 minutes for less than 100,000 records, and I have millions of them in the update). I am pretty sure there is something built in for doing this, but the documentation for update() is, mmmhhh, a bit terse.

Can someone advise how to do it faster?
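
One built-in route is pymongo's update_one with upsert=True: it inserts the document when no match exists, and $setOnInsert writes insertion_date only on that first insert. A sketch, assuming the incoming documents carry a unique business field (hypothetically called "key") and no _id; wrapping these operations in bulk_write would cut round trips further:

from datetime import datetime, timezone

now = datetime.now(timezone.utc)
for document in update:
    collection.update_one(
        {"key": document["key"]},                     # hypothetical unique field
        {
            "$set": dict(document, last_update_date=now),
            "$setOnInsert": {"insertion_date": now},  # written only when inserting
        },
        upsert=True,
    )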

Answer #1

os.listdir() - list in the current directory

With listdir in the os module, you get the files and the folders in the current directory.

 import os
 arr = os.listdir()
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

Looking in a directory

arr = os.listdir("c:\files")

glob from glob

with glob you can specify a type of file to list like this

import glob

txtfiles = []
for file in glob.glob("*.txt"):
    txtfiles.append(file)

glob in a list comprehension

mylist = [f for f in glob.glob("*.txt")]

get the full path of only files in the current directory

import os

cwd = os.getcwd()
onlyfiles = [os.path.join(cwd, f) for f in os.listdir(cwd)
             if os.path.isfile(os.path.join(cwd, f))]

["G:\\getfilesname\\", "G:\\getfilesname\\example.txt"]

Getting the full path name with os.path.abspath

You get the full path in return

 import os
 files_path = [os.path.abspath(x) for x in os.listdir()]
 ["F:\documentiapplications.txt", "F:\documenticollections.txt"]

Walk: going through sub directories

os.walk returns the root, the directories list, and the files list; that is why I unpacked them into r, d, f in the for loop. It then looks for other files and directories in the subfolders of the root, and so on, until there are no subfolders.

import os

# Getting the current work directory (cwd)
thisdir = os.getcwd()

# r=root, d=directories, f = files
for r, d, f in os.walk(thisdir):
    for file in f:
        if file.endswith(".docx"):
            print(os.path.join(r, file))

os.listdir(): get files in the current directory (Python 2)

In Python 2, if you want the list of the files in the current directory, you have to give the argument as "." or os.getcwd() in the os.listdir method.

 import os
 arr = os.listdir(".")
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

To go up in the directory tree

# Method 1
x = os.listdir("..")

# Method 2
x = os.listdir("/")

Get files: os.listdir() in a particular directory (Python 2 and 3)

 import os
 arr = os.listdir("F:\python")
 >>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]

Get files of a particular subdirectory with os.listdir()

import os

x = os.listdir("./content")

os.walk(".") - current directory

 import os
 arr = next(os.walk("."))[2]
 >>> ["5bs_Turismo1.pdf", "5bs_Turismo1.pptx", "esperienza.txt"]

next(os.walk(".")) and os.path.join("dir", "file")

 import os
 arr = []
 d, r, f = next(os.walk(r"F:\_python"))  # (dirpath, dirnames, filenames)
 for file in f:
     arr.append(os.path.join(d, file))

 for f in arr:
     print(f)

>>> F:\_python\
>>> F:\_python\programmi.txt

next(os.walk("F:\") - get the full path - list comprehension

 [os.path.join(r,file) for r,d,f in next(os.walk("F:\_python")) for file in f]
 >>> ["F:\_python\", "F:\_python\programmi.txt"]

os.walk - get full path - all files in sub dirs

x = [os.path.join(r, file) for r, d, f in os.walk(r"F:\_python") for file in f]

>>> ["F:\\_python\\", "F:\\_python\\progr.txt", "F:\\_python\\"]

os.listdir() - get only txt files

 arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
 >>> ["work.txt", "3ebooks.txt"]

Using glob to get the full path of the files

If I should need the absolute path of the files:

from path import path  # third-party path.py package
from glob import glob
x = [path(f).abspath() for f in glob(r"F:\*.txt")]
for f in x:
    print(f)

>>> F:\acquistionline.txt
>>> F:\acquisti_2018.txt
>>> F:\bootstrap_jquery_ecc.txt

Using os.path.isfile to avoid directories in the list

import os.path
listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]

>>> ["a simple", "data.txt", ""]

Using pathlib from Python 3.4

import pathlib

flist = []
for p in pathlib.Path(".").iterdir():
    if p.is_file():
        print(p)
        flist.append(p)

 >>> error.PNG
 >>> exemaker.bat
 >>> guiprova.mp3
 >>> thumb.PNG

With list comprehension:

flist = [p for p in pathlib.Path(".").iterdir() if p.is_file()]

Alternatively, use pathlib.Path() instead of pathlib.Path(".")

Use glob method in pathlib.Path()

import pathlib

py = pathlib.Path().glob("*.py")
for file in py:
    print(file)


Get all and only files with os.walk

import os
x = [i[2] for i in os.walk(".")]
for t in x:
    for f in t:
        print(f)

>>> ["", "data.txt", "data1.txt", "data2.txt", "data_180617", "", "", "", "", "", "", "data.txt", "data1.txt", "data_180617"]

Get only files with next and walk in a directory

 import os
 x = next(os.walk("F://python"))[2]
 >>> ["calculator.bat",""]

Get only directories with next and walk in a directory

 import os
 next(os.walk("F://python"))[1] # for the current dir use (".")
 >>> ["python3","others"]

Get all the subdir names with walk

for r, d, f in os.walk(r"F:\_python"):
    for dirs in d:
        print(dirs)

>>> .vscode
>>> pyexcel
>>> subtitles
>>> _metaprogramming
>>> .ipynb_checkpoints

os.scandir() from Python 3.5 and greater

import os
x = [f.name for f in os.scandir() if f.is_file()]

>>> ["calculator.bat",""]

# Another example with scandir (a little variation from the previous one).
# This one is more efficient than os.listdir.
# In this case, it shows the files only in the current directory
# where the script is executed.

import os
with os.scandir() as i:
    for entry in i:
        if entry.is_file():
            print(entry.name)
>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> thumb.PNG


Ex. 1: How many files are there in the subdirectories?

In this example, we look for the number of files that are included in all the directory and its subdirectories.

import os

def count(dir, counter=0):
    "returns number of files in dir and subdirs"
    for pack in os.walk(dir):
        for f in pack[2]:
            counter += 1
    return dir + " : " + str(counter) + " files"

print(count(r"F:\python"))

>>> F:\python : 12057 files

Ex.2: How to copy all files from a directory to another?

A script to make order in your computer finding all files of a type (default: pptx) and copying them in a new folder.

import os
import shutil

destination = "F:\file_copied"
# os.makedirs(destination)

def copyfile(dir, filetype="pptx", counter=0):
    "Searches for pptx (or other - pptx is the default) files and copies them"
    for pack in os.walk(dir):
        for f in pack[2]:
            if f.endswith(filetype):
                fullpath = os.path.join(pack[0], f)
                shutil.copy(fullpath, destination)
                counter += 1
    if counter > 0:
        print("-" * 30)
        print("==> Found in: `" + dir + "` : " + str(counter) + " files")

for dir in os.listdir():
    "searches for folders that starts with `_`"
    if dir[0] == "_":
        # copyfile(dir, filetype="pdf")
        copyfile(dir, filetype="txt")

>>> _compiti18\Compito Contabilità 1\conti.txt
>>> _compiti18\Compito Contabilità 1\modula4.txt
>>> _compiti18\Compito Contabilità 1\moduloa4.txt
>>> ------------------------
>>> ==> Found in: `_compiti18` : 3 files

Ex. 3: How to get all the files in a txt file

In case you want to create a txt file with all the file names:

import os
mylist = ""
with open("filelist.txt", "w", encoding="utf-8") as file:
    for eachfile in os.listdir():
        mylist += eachfile + "\n"
    file.write(mylist)

Example: txt with all the files of a hard drive

We are going to save a txt file with all the files in your directory.
We will use the function walk()

import os

# see all the methods of os
# print(*dir(os), sep=", ")
listafile = []
percorso = []
with open("lista_file.txt", "w", encoding="utf-8") as testo:
    for root, dirs, files in os.walk("D:\\"):
        for file in files:
            listafile.append(file)
            percorso.append(root + "\\" + file)
            testo.write(file + "\n")
print("N. of files", len(listafile))

listafile.sort()
with open("lista_file_ordinata.txt", "w", encoding="utf-8") as testo_ordinato:
    for file in listafile:
        testo_ordinato.write(file + "\n")

with open("percorso.txt", "w", encoding="utf-8") as file_percorso:
    for file in percorso:
        file_percorso.write(file + "\n")


All the files of C: in one text file

This is a shorter version of the previous code. Change the folder where to start finding the files if you need to start from another position. On my computer, this code generated a 50 MB text file with a little less than 500,000 lines, each containing a file with its complete path.

import os

with open("file.txt", "w", encoding="utf-8") as filewrite:
    for r, d, f in os.walk("C:\"):
        for file in f:
            filewrite.write(f"{r + file}

How to write a file with all paths in a folder of a type

With this function you can create a txt file that will have the name of a type of file that you look for (ex. pngfile.txt) with all the full path of all the files of that type. It can be useful sometimes, I think.

import os

def searchfiles(extension=".ttf", folder="H:\"):
    "Create a txt file with all the file of a type"
    with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite:
        for r, d, f in os.walk(folder):
            for file in f:
                if file.endswith(extension):
                    filewrite.write(f"{r + file}

# looking for png file (fonts) in the hard disk H:
searchfiles(".png", "H:\")

>>> H:4bs_18Dolphins5.png
>>> H:4bs_18Dolphins6.png
>>> H:4bs_18Dolphins7.png
>>> H:5_18marketing htmlassetsimageslogo2.png
>>> H:7z001.png
>>> H:7z002.png

(New) Find all files and open them with tkinter GUI

I just wanted to add, in this 2019, a little app to search for all files in a dir and be able to open them by double-clicking on the name of the file in the list.

import tkinter as tk
import os

def searchfiles(extension=".txt", folder="H:\"):
    "insert all files in the listbox"
    for r, d, f in os.walk(folder):
        for file in f:
            if file.endswith(extension):
                lb.insert(0, r + "\" + file)

def open_file():

root = tk.Tk()
bt = tk.Button(root, text="Search", command=lambda:searchfiles(".png", "H:\"))
lb = tk.Listbox(root)
lb.pack(fill="both", expand=1)
lb.bind("<Double-Button>", lambda x: open_file())

Answer #2

Is there any reason for a class declaration to inherit from object?

In Python 3, apart from compatibility between Python 2 and 3, no reason. In Python 2, many reasons.

Python 2.x story:

In Python 2.x (from 2.2 onwards) there are two styles of classes, depending on the presence or absence of object as a base class:

  1. "classic" style classes: they don't have object as a base class:

    >>> class ClassicSpam:      # no base class
    ...     pass
    >>> ClassicSpam.__bases__
    ()
  2. "new" style classes: they have, directly or indirectly (e.g. inherit from a built-in type), object as a base class:

    >>> class NewSpam(object):           # directly inherit from object
    ...    pass
    >>> NewSpam.__bases__
    (<type "object">,)
    >>> class IntSpam(int):              # indirectly inherit from object...
    ...    pass
    >>> IntSpam.__bases__
    (<type "int">,) 
    >>> IntSpam.__bases__[0].__bases__   # ... because int inherits from object  
    (<type "object">,)

Without a doubt, when writing a class you'll always want to go for new-style classes. The perks of doing so are numerous; to list some of them:

  • Support for descriptors. Specifically, the following constructs are made possible with descriptors:

    1. classmethod: A method that receives the class as an implicit argument instead of the instance.
    2. staticmethod: A method that does not receive the implicit argument self as a first argument.
    3. properties with property: Create functions for managing the getting, setting and deleting of an attribute.
    4. __slots__: Saves memory consumption of a class and also results in faster attribute access. Of course, it does impose limitations.
  • The __new__ static method: lets you customize how new class instances are created.

  • Method resolution order (MRO): in what order the base classes of a class will be searched when trying to resolve which method to call.

  • Related to MRO, super calls. Also see, super() considered super.

If you don"t inherit from object, forget these. A more exhaustive description of the previous bullet points along with other perks of "new" style classes can be found here.

One of the downsides of new-style classes is that the class itself is more memory demanding. Unless you're creating many class objects, though, I doubt this would be an issue, and it's a negative sinking in a sea of positives.

Python 3.x story:

In Python 3, things are simplified. Only new-style classes exist (referred to plainly as classes), so the only difference in adding object is requiring you to type in 8 more characters. This:

class ClassicSpam:

is completely equivalent (apart from their name :-) to this:

class NewSpam(object):

and to this:

class Spam():

All have object in their __bases__.

>>> [object in cls.__bases__ for cls in {Spam, NewSpam, ClassicSpam}]
[True, True, True]

So, what should you do?

In Python 2: always inherit from object explicitly. Get the perks.

In Python 3: inherit from object if you are writing code that tries to be Python agnostic, that is, it needs to work both in Python 2 and in Python 3. Otherwise don't; it really makes no difference, since Python inserts it for you behind the scenes.

Answer #3

This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.

In particular, here's what this post will go through:

  • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)

    • merging with different column names
    • merging with multiple columns
    • avoiding duplicate merge key column in output

What this post (and other posts by me on this thread) will not go through:

  • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
  • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

Note Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.

Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.

Enough talk - just show me how to use merge!

Setup & Basics

import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
right = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": np.random.randn(4)})


left

  key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893


right

  key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357

For the sake of simplicity, the key column has the same name (for now).

An INNER JOIN is represented by

Note This, along with the forthcoming figures all follow this convention:

  • blue indicates rows that are present in the merge result
  • red indicates rows that are excluded from the result (i.e., removed)
  • green indicates missing values that are replaced with NaNs in the result

To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.

left.merge(right, on="key")
# Or, if you want to be explicit
# left.merge(right, on="key", how="inner")

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

This returns only rows from left and right which share a common key (in this example, "B" and "D").

A LEFT OUTER JOIN, or LEFT JOIN is represented by

This can be performed by specifying how="left".

left.merge(right, on="key", how="left")

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278

Carefully note the placement of NaNs here. If you specify how="left", then only keys from left are used, and missing data from right is replaced by NaN.

And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...

...specify how="right":

left.merge(right, on="key", how="right")

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357

Here, keys from right are used, and missing data from left is replaced by NaN.

Finally, for the FULL OUTER JOIN, given by

specify how="outer".

left.merge(right, on="key", how="outer")

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357

This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarizes these various merges nicely:


Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.

For LEFT-Excluding JOIN, represented as

Start by performing a LEFT OUTER JOIN and then filtering (excluding!) rows coming from left only,

(left.merge(right, on="key", how="left", indicator=True)
     .query("_merge == "left_only"")
     .drop("_merge", 1))

  key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN


left.merge(right, on="key", how="left", indicator=True)

  key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both

And similarly, for a RIGHT-Excluding JOIN,

(left.merge(right, on="key", how="right", indicator=True)
     .query("_merge == "right_only"")
     .drop("_merge", 1))

  key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357

Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),

You can do this in similar fashion—

(left.merge(right, on="key", how="outer", indicator=True)
     .query("_merge != "both"")
     .drop("_merge", 1))

  key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357

Different names for key columns

If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:

left2 = left.rename({"key":"keyLeft"}, axis=1)
right2 = right.rename({"key":"keyRight"}, axis=1)


left2

  keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893


right2

  keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357

left2.merge(right2, left_on="keyLeft", right_on="keyRight", how="inner")

  keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278

Avoiding duplicate key column in output

When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.

left3 = left2.set_index("keyLeft")
left3.merge(right2, left_index=True, right_on="keyRight")

    value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278

Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on="keyLeft", right_on="keyRight", how="inner")); you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.

Merging only a single column from one of the DataFrames

For example, consider

right3 = right.assign(newcol=np.arange(len(right)))
right3

  key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3

If you are required to merge only newcol (without any of the other columns), you can usually just subset columns before merging:

left.merge(right3[["key", "newcol"]], on="key")

  key     value  newcol
0   B  0.400157       0
1   D  2.240893       1

If you"re doing a LEFT OUTER JOIN, a more performant solution would involve map:

# left["newcol"] = left["key"].map(right3.set_index("key")["newcol"]))

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

As mentioned, this is similar to, but faster than

left.merge(right3[["key", "newcol"]], on="key", how="left")

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

Merging on multiple columns

To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).

left.merge(right, on=["key1", "key2"] ...)

Or, in the event the names are different,

left.merge(right, left_on=["lkey1", "lkey2"], right_on=["rkey1", "rkey2"])

Other useful merge* operations and functions

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.


Answer #4

Are dictionaries ordered in Python 3.6+?

They are insertion ordered[1]. As of Python 3.6, for the CPython implementation of Python, dictionaries remember the order of items inserted. This is considered an implementation detail in Python 3.6; you need to use OrderedDict if you want insertion ordering that's guaranteed across other implementations of Python (and other ordered behavior[1]).

As of Python 3.7, this is no longer an implementation detail and instead becomes a language feature. From a python-dev message by GvR:

Make it so. "Dict keeps insertion order" is the ruling. Thanks!

This simply means that you can depend on it. Other implementations of Python must also offer an insertion ordered dictionary if they wish to be a conforming implementation of Python 3.7.

How does the Python 3.6 dictionary implementation perform better[2] than the older one while preserving element order?

Essentially, by keeping two arrays.

  • The first array, dk_entries, holds the entries (of type PyDictKeyEntry) for the dictionary in the order that they were inserted. Preserving order is achieved by this being an append only array where new items are always inserted at the end (insertion order).

  • The second, dk_indices, holds the indices for the dk_entries array (that is, values that indicate the position of the corresponding entry in dk_entries). This array acts as the hash table. When a key is hashed it leads to one of the indices stored in dk_indices and the corresponding entry is fetched by indexing dk_entries. Since only indices are kept, the type of this array depends on the overall size of the dictionary (ranging from type int8_t(1 byte) to int32_t/int64_t (4/8 bytes) on 32/64 bit builds)

In the previous implementation, a sparse array of type PyDictKeyEntry and size dk_size had to be allocated; unfortunately, it also resulted in a lot of empty space, since that array was not allowed to be more than 2/3 * dk_size full for performance reasons (and the empty space still had PyDictKeyEntry size!).

This is not the case now, since only the required entries are stored (those that have been inserted) and a sparse array of type intX_t (X depending on dict size), kept 2/3 * dk_size full, is used. The empty space changed from type PyDictKeyEntry to intX_t.

So, obviously, creating a sparse array of type PyDictKeyEntry is much more memory demanding than a sparse array for storing ints.

You can see the full conversation on Python-Dev regarding this feature if interested, it is a good read.

In the original proposal made by Raymond Hettinger, a visualization of the data structures used can be seen which captures the gist of the idea.

For example, the dictionary:

d = {"timmy": "red", "barry": "green", "guido": "blue"}

is currently stored as [keyhash, key, value]:

entries = [["--", "--", "--"],
           [-8522787127447073495, "barry", "green"],
           ["--", "--", "--"],
           ["--", "--", "--"],
           ["--", "--", "--"],
           [-9092791511155847987, "timmy", "red"],
           ["--", "--", "--"],
           [-6480567542315338377, "guido", "blue"]]

Instead, the data should be organized as follows:

indices =  [None, 1, None, None, None, 0, None, 2]
entries =  [[-9092791511155847987, "timmy", "red"],
            [-8522787127447073495, "barry", "green"],
            [-6480567542315338377, "guido", "blue"]]

As you can now see visually, in the original proposal, a lot of space is essentially empty to reduce collisions and make look-ups faster. With the new approach, you reduce the memory required by moving the sparseness to where it's really required, in the indices.
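
A toy model of that layout (illustration only; CPython does this in C with open addressing, not the naive linear probing shown here):

entries = []             # [hash, key, value] in insertion order
indices = [None] * 8     # sparse table of small integers

def toy_insert(key, value):
    slot = hash(key) % len(indices)
    while indices[slot] is not None:      # probe for a free slot
        slot = (slot + 1) % len(indices)
    indices[slot] = len(entries)          # store index into the dense array
    entries.append([hash(key), key, value])

def toy_lookup(key):
    slot = hash(key) % len(indices)
    while indices[slot] is not None:
        h, k, v = entries[indices[slot]]
        if k == key:
            return v
        slot = (slot + 1) % len(indices)  # collision: keep probing
    raise KeyError(key)

for k, v in [("timmy", "red"), ("barry", "green"), ("guido", "blue")]:
    toy_insert(k, v)

print([e[1] for e in entries])   # ['timmy', 'barry', 'guido'] - insertion order
print(toy_lookup("barry"))       # 'green'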

[1]: I say "insertion ordered" and not "ordered" since, with the existence of OrderedDict, "ordered" suggests further behavior that the `dict` object *doesn't provide*. OrderedDicts are reversible, provide order-sensitive methods and, mainly, provide order-sensitive equality tests (`==`, `!=`). `dict`s currently don't offer any of those behaviors/methods.
[2]: The new dictionary implementation performs better **memory-wise** by being designed more compactly; that's the main benefit here. Speed-wise, the difference isn't so drastic; there are places where the new dict might introduce slight regressions (key-lookups, for example) while in others (iteration and resizing come to mind) a performance boost should be present. Overall, the performance of the dictionary, especially in real-life situations, improves due to the compactness introduced.

Answer #5

Calculate timestamps within your DB, not your client

For sanity, you probably want to have all datetimes calculated by your DB server, rather than the application server. Calculating the timestamp in the application can lead to problems because network latency is variable, clients experience slightly different clock drift, and different programming languages occasionally calculate time slightly differently.

SQLAlchemy allows you to do this by passing func.now() or func.current_timestamp() (they are aliases of each other), which tells the DB to calculate the timestamp itself.

Use SQLALchemy"s server_default

Additionally, for a default where you're already telling the DB to calculate the value, it's generally better to use server_default instead of default. This tells SQLAlchemy to pass the default value as part of the CREATE TABLE statement.

For example, if you write an ad hoc script against this table, using server_default means you won"t need to worry about manually adding a timestamp call to your script--the database will set it automatically.

Understanding SQLAlchemy"s onupdate/server_onupdate

SQLAlchemy also supports onupdate so that anytime the row is updated it inserts a new timestamp. Again, best to tell the DB to calculate the timestamp itself:

from sqlalchemy.sql import func

time_created = Column(DateTime(timezone=True), server_default=func.now())
time_updated = Column(DateTime(timezone=True), onupdate=func.now())

There is a server_onupdate parameter, but unlike server_default, it doesn't actually set anything server-side. It just tells SQLAlchemy that your database will change the column when an update happens (perhaps you created a trigger on the column), so SQLAlchemy will ask for the return value so it can update the corresponding object.

One other potential gotcha:

You might be surprised to notice that if you make a bunch of changes within a single transaction, they all have the same timestamp. That's because the SQL standard specifies that CURRENT_TIMESTAMP returns values based on the start of the transaction.

PostgreSQL provides the non-SQL-standard statement_timestamp() and clock_timestamp(), which do change within a transaction; see the PostgreSQL docs for details.

UTC timestamp

If you want to use UTC timestamps, a stub of an implementation for func.utcnow() is provided in the SQLAlchemy documentation. You need to provide appropriate driver-specific functions on your own, though.
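
The stub from the SQLAlchemy documentation looks roughly like this (PostgreSQL shown; each dialect you target needs its own @compiles implementation):

from sqlalchemy.sql import expression
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.types import DateTime

class utcnow(expression.FunctionElement):
    type = DateTime()

@compiles(utcnow, "postgresql")
def pg_utcnow(element, compiler, **kw):
    # evaluate CURRENT_TIMESTAMP in UTC on the server
    return "TIMEZONE('utc', CURRENT_TIMESTAMP)"

You would then use server_default=utcnow() on a column.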

Answer #6

Simply put, numpy.newaxis is used to increase the dimension of the existing array by one more dimension, when used once. Thus,

  • 1D array will become 2D array

  • 2D array will become 3D array

  • 3D array will become 4D array

  • 4D array will become 5D array

and so on..

Here is a visual illustration which depicts promotion of 1D array to 2D arrays.


Scenario-1: np.newaxis might come in handy when you want to explicitly convert a 1D array to either a row vector or a column vector, as depicted in the above picture.


# 1D array
In [7]: arr = np.arange(4)
In [8]: arr.shape
Out[8]: (4,)

# make it as row vector by inserting an axis along first dimension
In [9]: row_vec = arr[np.newaxis, :]     # arr[None, :]
In [10]: row_vec.shape
Out[10]: (1, 4)

# make it as column vector by inserting an axis along second dimension
In [11]: col_vec = arr[:, np.newaxis]     # arr[:, None]
In [12]: col_vec.shape
Out[12]: (4, 1)

Scenario-2: When we want to make use of numpy broadcasting as part of some operation, for instance while doing addition of some arrays.


Let"s say you want to add the following two arrays:

 x1 = np.array([1, 2, 3, 4, 5])
 x2 = np.array([5, 4, 3])

If you try to add these just like that, NumPy will raise the following ValueError:

ValueError: operands could not be broadcast together with shapes (5,) (3,)

In this situation, you can use np.newaxis to increase the dimension of one of the arrays so that NumPy can broadcast.

In [2]: x1_new = x1[:, np.newaxis]    # x1[:, None]
# now, the shape of x1_new is (5, 1)
# array([[1],
#        [2],
#        [3],
#        [4],
#        [5]])

Now, add:

In [3]: x1_new + x2
array([[ 6,  5,  4],
       [ 7,  6,  5],
       [ 8,  7,  6],
       [ 9,  8,  7],
       [10,  9,  8]])

Alternatively, you can also add new axis to the array x2:

In [6]: x2_new = x2[:, np.newaxis]    # x2[:, None]
In [7]: x2_new     # shape is (3, 1)

Now, add:

In [8]: x1 + x2_new
array([[ 6,  7,  8,  9, 10],
       [ 5,  6,  7,  8,  9],
       [ 4,  5,  6,  7,  8]])

Note: Observe that we get the same result in both cases (but one being the transpose of the other).

Scenario-3: This is similar to scenario-1. But, you can use np.newaxis more than once to promote the array to higher dimensions. Such an operation is sometimes needed for higher order arrays (i.e. Tensors).


In [124]: arr = np.arange(5*5).reshape(5,5)

In [125]: arr.shape
Out[125]: (5, 5)

# promoting 2D array to a 5D array
In [126]: arr_5D = arr[np.newaxis, ..., np.newaxis, np.newaxis]    # arr[None, ..., None, None]

In [127]: arr_5D.shape
Out[127]: (1, 5, 5, 1, 1)

As an alternative, you can use numpy.expand_dims that has an intuitive axis kwarg.

# adding new axes at 1st, 4th, and last dimension of the resulting array
In [131]: newaxes = (0, 3, -1)
In [132]: arr_5D = np.expand_dims(arr, axis=newaxes)
In [133]: arr_5D.shape
Out[133]: (1, 5, 5, 1, 1)

More background on np.newaxis vs np.reshape

newaxis is also called a pseudo-index that allows the temporary addition of an axis into a multiarray.

np.newaxis uses the slicing operator to recreate the array, while numpy.reshape reshapes the array to the desired layout (assuming that the dimensions match, which is a must for a reshape to happen).


In [13]: A = np.ones((3,4,5,6))
In [14]: B = np.ones((4,6))
In [15]: (A + B[:, np.newaxis, :]).shape     # B[:, None, :]
Out[15]: (3, 4, 5, 6)

In the above example, we inserted a temporary axis between the first and second axes of B (to use broadcasting). A missing axis is filled-in here using np.newaxis to make the broadcasting operation work.

General Tip: You can also use None in place of np.newaxis; these are in fact the same object.

In [13]: np.newaxis is None
Out[13]: True

P.S. Also see this great answer: newaxis vs reshape to add dimensions

Answer #7

We start by answering the first question:

Question 1

Why do I get ValueError: Index contains duplicate entries, cannot reshape

This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. There are varying methods that can perform a pivot. Some of them are not well suited when there are duplicates of the keys on which it is being asked to pivot. For example, consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

df.duplicated(["row", "col"]).any()


So when I pivot using

df.pivot(index="row", columns="col", values="val0")

I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

df.set_index(["row", "col"])["val0"].unstack()

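For reference, the frame df used above and throughout the rest of this answer is a long-format frame along these lines (a hypothetical stand-in for the setup in the original post; the random values will not match the printed tables):

import numpy as np
import pandas as pd

np.random.seed(4)   # arbitrary seed
df = pd.DataFrame({
    "key":  np.random.choice(["key0", "key1", "key2"], 20),
    "row":  np.random.choice(["row%d" % i for i in range(5)], 20),
    "item": np.random.choice(["item%d" % i for i in range(3)], 20),
    "col":  np.random.choice(["col%d" % i for i in range(5)], 20),
    "val0": np.random.rand(20).round(2),
    "val1": np.random.rand(20).round(2),
})
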
Here is a list of idioms we can use to pivot

  1. pd.DataFrame.groupby + pd.DataFrame.unstack

    • Good general approach for doing just about any type of pivot
    • You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you unstack the levels that you want to be in the column index.
  2. pd.DataFrame.pivot_table

    • A glorified version of groupby with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers.
    • Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
  3. pd.DataFrame.set_index + pd.DataFrame.unstack

    • Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
    • Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
  4. pd.DataFrame.pivot

    • Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index, columns, values.
    • Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
  5. pd.crosstab

    • This is a specialized version of pivot_table, and in its purest form it is the most intuitive way to perform several tasks.
  6. pd.factorize + np.bincount

    • This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
  7. pd.get_dummies + pd.DataFrame.dot

    • I use this for cleverly performing cross tabulation.


What I"m going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table. Then I"ll provide alternatives to perform the same task.

Question 3

How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

  • pd.DataFrame.pivot_table

    • fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0. Notice I skipped Question 2, as it's the same as this answer without the fill_value.

    • aggfunc="mean" is the default and I didn't have to set it. I included it to be explicit.

          df.pivot_table(
              values="val0", index="row", columns="col",
              fill_value=0, aggfunc="mean")
          col   col0   col1   col2   col3  col4
          row0  0.77  0.605  0.000  0.860  0.65
          row2  0.13  0.000  0.395  0.500  0.25
          row3  0.00  0.310  0.000  0.545  0.00
          row4  0.00  0.100  0.395  0.760  0.24
  • pd.DataFrame.groupby

      df.groupby(["row", "col"])["val0"].mean().unstack(fill_value=0)
  • pd.crosstab

          index=df["row"], columns=df["col"],
          values=df["val0"], aggfunc="mean").fillna(0)

Question 4

Can I get something other than mean, like maybe sum?

  • pd.DataFrame.pivot_table

          values="val0", index="row", columns="col",
          fill_value=0, aggfunc="sum")
      col   col0  col1  col2  col3  col4
      row0  0.77  1.21  0.00  0.86  0.65
      row2  0.13  0.00  0.79  0.50  0.50
      row3  0.00  0.31  0.00  1.09  0.00
      row4  0.00  0.10  0.79  1.52  0.24
  • pd.DataFrame.groupby

      df.groupby(["row", "col"])["val0"].sum().unstack(fill_value=0)
  • pd.crosstab

          index=df["row"], columns=df["col"],
          values=df["val0"], aggfunc="sum").fillna(0)

Question 5

Can I do more than one aggregation at a time?

Notice that for pivot_table and crosstab I needed to pass a list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names, as there are efficiencies to be gained.

  • pd.DataFrame.pivot_table

          values="val0", index="row", columns="col",
          fill_value=0, aggfunc=[np.size, np.mean])
           size                      mean
      col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
      row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
      row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
      row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
      row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24
  • pd.DataFrame.groupby

      df.groupby(["row", "col"])["val0"].agg(["size", "mean"]).unstack(fill_value=0)
  • pd.crosstab

          index=df["row"], columns=df["col"],
          values=df["val0"], aggfunc=[np.size, np.mean]).fillna(0, downcast="infer")

Question 6

Can I aggregate over multiple value columns?

  • pd.DataFrame.pivot_table: we pass values=["val0", "val1"], but we could've left that off completely

      df.pivot_table(
          values=["val0", "val1"], index="row", columns="col",
          fill_value=0, aggfunc="mean")
            val0                             val1
      col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
      row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
      row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
      row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
      row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
  • pd.DataFrame.groupby

      df.groupby(["row", "col"])["val0", "val1"].mean().unstack(fill_value=0)

Question 7

Can I subdivide by multiple columns?

  • pd.DataFrame.pivot_table

          values="val0", index="row", columns=["item", "col"],
          fill_value=0, aggfunc="mean")
      item item0             item1                         item2
      col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
      row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
      row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
      row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
      row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
  • pd.DataFrame.groupby

          ["row", "item", "col"]
      )["val0"].mean().unstack(["item", "col"]).fillna(0).sort_index(1)

Question 8

Can I subdivide by multiple columns?

  • pd.DataFrame.pivot_table

          values="val0", index=["key", "row"], columns=["item", "col"],
          fill_value=0, aggfunc="mean")
      item      item0             item1                         item2
      col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
      key  row
      key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
           row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
           row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
           row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
      key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
           row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
           row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
           row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
      key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
           row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
           row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
  • pd.DataFrame.groupby

          ["key", "row", "item", "col"]
      )["val0"].mean().unstack(["item", "col"]).fillna(0).sort_index(1)
  • pd.DataFrame.set_index, because the set of keys is unique for both rows and columns

      df.set_index(
          ["key", "row", "item", "col"]
      )["val0"].unstack(["item", "col"]).fillna(0).sort_index(1)

Question 9

Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

  • pd.DataFrame.pivot_table

      df.pivot_table(index="row", columns="col", fill_value=0, aggfunc="size")
          col   col0  col1  col2  col3  col4
      row0     1     2     0     1     1
      row2     1     0     2     1     2
      row3     0     1     0     2     0
      row4     0     1     2     2     1
  • pd.DataFrame.groupby

      df.groupby(["row", "col"])["val0"].size().unstack(fill_value=0)
  • pd.crosstab

      pd.crosstab(df["row"], df["col"])
  • pd.factorize + np.bincount

      # get integer factorization `i` and unique values `r`
      # for column `"row"`
      i, r = pd.factorize(df["row"].values)
      # get integer factorization `j` and unique values `c`
      # for column `"col"`
      j, c = pd.factorize(df["col"].values)
      # `n` will be the number of rows
      # `m` will be the number of columns
      n, m = r.size, c.size
      # `i * m + j` is a clever way of counting the
      # factorization bins assuming a flat array of length
      # `n * m`.  Which is why we subsequently reshape as `(n, m)`
      b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
      # BTW, whenever I read this, I think "Bean, Rice, and Cheese"
      pd.DataFrame(b, r, c)
            col3  col2  col0  col1  col4
      row3     2     0     0     1     0
      row2     1     2     1     0     2
      row0     1     0     1     2     1
      row4     2     2     0     1     1
  • pd.get_dummies + pd.DataFrame.dot

      pd.get_dummies(df["row"]).T.dot(pd.get_dummies(df["col"]))
            col0  col1  col2  col3  col4
      row0     1     2     0     1     1
      row2     1     0     2     1     2
      row3     0     1     0     2     0
      row4     0     1     2     2     1

Question 10

How do I convert a DataFrame from long to wide by pivoting on ONLY two columns?

  • DataFrame.pivot

    The first step is to assign a number to each row - this number will be the row index of that value in the pivoted result. This is done using GroupBy.cumcount:

      df2.insert(0, "count", df2.groupby("A").cumcount())
         count  A   B
      0      0  a   0
      1      1  a  11
      2      2  a   2
      3      3  a  11
      4      0  b  10
      5      1  b  10
      6      2  b  14
      7      0  c   7

    The second step is to use the newly created column as the index to call DataFrame.pivot.

      # df2.pivot(index="count", columns="A", values="B")
      A         a     b    c
      0       0.0  10.0  7.0
      1      11.0  10.0  NaN
      2       2.0  14.0  NaN
      3      11.0   NaN  NaN
  • DataFrame.pivot_table

    Whereas DataFrame.pivot only accepts columns, DataFrame.pivot_table also accepts arrays, so the GroupBy.cumcount can be passed directly as the index without creating an explicit column.

      df2.pivot_table(index=df2.groupby("A").cumcount(), columns="A", values="B")
      A         a     b    c
      0       0.0  10.0  7.0
      1      11.0  10.0  NaN
      2       2.0  14.0  NaN
      3      11.0   NaN  NaN

Question 11

How do I flatten the MultiIndex to a single index after pivoting?

If the column levels are strings, join them with a separator:

df.columns = df.columns.map("|".join)

else, format them:

df.columns = df.columns.map("{0[0]}|{0[1]}".format)

Answer #8

From Python 3.6 onwards, the standard dict type maintains insertion order by default.


d = {"ac":33, "gw":20, "ap":102, "za":321, "bs":10}

will result in a dictionary with the keys in the order listed in the source code.

This was achieved by using a simple array with integers for the sparse hash table, where those integers index into another array that stores the key-value pairs (plus the calculated hash). That latter array just happens to store the items in insertion order, and the whole combination actually uses less memory than the implementation used in Python 3.5 and before. See the original idea post by Raymond Hettinger for details.

In 3.6 this was still considered an implementation detail; see the What's New in Python 3.6 documentation:

The order-preserving aspect of this new implementation is considered an implementation detail and should not be relied upon (this may change in the future, but it is desired to have this new dict implementation in the language for a few releases before changing the language spec to mandate order-preserving semantics for all current and future Python implementations; this also helps preserve backwards-compatibility with older versions of the language where random iteration order is still in effect, e.g. Python 3.5).

Python 3.7 elevates this implementation detail to a language specification, so it is now mandatory that dict preserves order in all Python implementations compatible with that version or newer. See the pronouncement by the BDFL. As of Python 3.8, dictionaries also support iteration in reverse.

You may still want to use the collections.OrderedDict() class in certain cases, as it offers some additional functionality on top of the standard dict type, such as being reversible (this extends to the view objects) and supporting reordering (via the move_to_end() method).
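
A minimal sketch of those two extras:

from collections import OrderedDict

od = OrderedDict(ac=33, gw=20, ap=102)
od.move_to_end("ac")            # reorder: send "ac" to the last position
print(list(od))                 # ['gw', 'ap', 'ac']
print(list(reversed(od)))       # ['ac', 'ap', 'gw'] - reverse iteration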

Answer #9

You can use PyPDF2's PdfFileMerger class.

File Concatenation

You can simply concatenate files by using the append method.

from PyPDF2 import PdfFileMerger

pdfs = ["file1.pdf", "file2.pdf", "file3.pdf", "file4.pdf"]

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("result.pdf")
merger.close()

You can pass file handles instead of file paths if you want.

File Merging

If you want more fine-grained control of merging, there is a merge method of the PdfFileMerger, which allows you to specify an insertion point in the output file, meaning you can insert the pages anywhere in the file. The append method can be thought of as a merge where the insertion point is the end of the file.


merger.merge(2, pdf)

Here we insert the whole pdf into the output, but at page 2.

Page Ranges

If you wish to control which pages are appended from a particular file, you can use the pages keyword argument of append and merge, passing a tuple in the form (start, stop[, step]) (like the regular range function).


merger.append(pdf, pages=(0, 3))    # first 3 pages
merger.append(pdf, pages=(0, 6, 2)) # pages 1, 3, 5

If you specify an invalid range you will get an IndexError.

Note also that, to avoid files being left open, the PdfFileMerger's close method should be called when the merged file has been written. This ensures all files are closed (input and output) in a timely manner. It's a shame that PdfFileMerger isn't implemented as a context manager, so we could use the with keyword, avoid the explicit close call and get some easy exception safety.

You might also want to look at the pdfcat script provided as part of PyPDF2. You can potentially avoid the need to write code altogether.

The PyPdf2 github also includes some example code demonstrating merging.


Another library perhaps worth a look is PyMuPDF, which seems to be actively maintained. Merging is equally simple.

From command line:

python -m fitz join -o result.pdf file1.pdf file2.pdf file3.pdf

and from code

import fitz

result = fitz.open()

for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    with fitz.open(pdf) as mfile:
        result.insert_pdf(mfile)

result.save("result.pdf")
With plenty of options, detailed in the project's wiki.

Answer #10

As far as I can tell, this was caused by a conflict with the version of Python 3.7 that was recently added into the Windows Store. It looks like this added two "stubs" called python.exe and python3.exe into the %USERPROFILE%\AppData\Local\Microsoft\WindowsApps folder, and in my case, this was inserted before my existing Python executable's entry in the PATH.

Moving this entry below the correct Python folder (partially) corrected the issue.

The second part of correcting it is to type manage app execution aliases into the Windows search prompt and disable the store versions of Python altogether.


It"s possible that you"ll only need to do the second part, but on my system I made both changes and everything is back to normal now.