# isdisjoint() function in Python


Example:

Let set A = {2, 4, 5, 6} and set B = {7, 8, 9, 10}. Set A and set B are said to be disjoint sets because their intersection is empty: they have no elements in common.

Syntax:

set1.isdisjoint(set2)

Parameters:

The isdisjoint() method takes exactly one argument. The argument may also be an iterable (list, tuple, dictionary, or string); isdisjoint() automatically converts the iterable to a set and checks whether the two sets are disjoint.
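As a brief sketch of the iterable-argument behavior described above:

```python
a = {2, 4, 5, 6}

# The argument can be any iterable; it is converted to a set internally
print(a.isdisjoint([7, 8, 9]))         # True: no common elements
print(a.isdisjoint((4, 10)))           # False: 4 is shared
print(a.isdisjoint({7: "x", 8: "y"}))  # True: only the dict's keys are considered
```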

Return value:

Returns True if the two sets are disjoint.
Returns False if the two sets are not disjoint.

Below is an implementation in Python 3:

``````# Python3 program for isdisjoint() function
set1 = {2, 4, 5, 6}
set2 = {7, 8, 9, 10}
set3 = {1, 2}

# check whether two sets are disjoint
print("set1 and set2 are disjoint?", set1.isdisjoint(set2))
print("set1 and set3 are disjoint?", set1.isdisjoint(set3))
``````

Output:

``````set1 and set2 are disjoint? True
set1 and set3 are disjoint? False
``````

## Why is it string.join(list) instead of list.join(string)?

### Question by Evan Fosmark

This has always confused me. It seems like this would be nicer:

``````my_list = ["Hello", "world"]
print(my_list.join("-"))
# Produce: "Hello-world"
``````

Than this:

``````my_list = ["Hello", "world"]
print("-".join(my_list))
# Produce: "Hello-world"
``````

Is there a specific reason it is like this?
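One commonly given explanation (a sketch, not an official rationale): `join` is defined on the separator string because it then works on *any* iterable of strings, not just lists, so the logic does not have to be duplicated on every sequence type:

```python
# join lives on the separator, so any iterable of strings works
print("-".join(["Hello", "world"]))         # list
print("-".join(("Hello", "world")))         # tuple
print("-".join(w for w in ["a", "b"]))      # generator expression
```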

## join list of lists in python

### Question by Kozyarchuk

Is there a short syntax for joining a list of lists into a single list (or iterator) in Python?

For example, I have a list as follows, and I want to iterate over a, b, and c.

``````x = [["a", "b"], ["c"]]
``````

The best I can come up with is as follows.

``````result = []
[result.extend(el) for el in x]

for el in result:
    print(el)
``````
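A common answer to this (a sketch): `itertools.chain.from_iterable` flattens one level lazily, and a nested list comprehension does the same eagerly:

```python
import itertools

x = [["a", "b"], ["c"]]

# Lazy iterator over the flattened elements
for el in itertools.chain.from_iterable(x):
    print(el)

# Eager single list
flat = [el for sub in x for el in sub]
print(flat)
```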

## Why doesn"t os.path.join() work in this case?

The below code will not join, when debugged the command does not store the whole path but just the last entry.

``````os.path.join("/home/build/test/sandboxes/", todaystr, "/new_sandbox/")
``````

When I test this it only stores the `/new_sandbox/` part of the code.
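The usual explanation: when `os.path.join` encounters a component that is itself an absolute path (it starts with `/`), everything before it is discarded. A sketch of the behavior on POSIX, and the fix of dropping the leading slash:

```python
import os.path

# An absolute component resets the join (POSIX behavior)
print(os.path.join("/home/build/test/sandboxes/", "todaystr", "/new_sandbox/"))
# -> "/new_sandbox/"

# Drop the leading slash so the component is treated as relative
print(os.path.join("/home/build/test/sandboxes/", "todaystr", "new_sandbox/"))
# -> "/home/build/test/sandboxes/todaystr/new_sandbox/"
```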

## Join a list of items with different types as string in Python

I need to join a list of items. Many of the items in the list are integer values returned from a function; i.e.,

``````myList.append(myfunc())
``````

How should I convert the returned result to a string in order to join it with the list?

Do I need to do the following for every integer value:

``````myList.append(str(myfunc()))
``````

Is there a more Pythonic way to solve casting problems?
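A common answer (sketch, with a hypothetical mixed-type list): convert at join time with `map` or a generator expression instead of converting each value as you append it:

```python
my_list = ["foo", 2, "bar", 7]  # hypothetical mix of strings and ints

# Convert every element to str only when joining
print(", ".join(map(str, my_list)))
print(", ".join(str(x) for x in my_list))
```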

## Python string.join(list) on object array rather than string array

### Question by Mat

In Python, I can do:

``````>>> list = ["a", "b", "c"]
>>> ", ".join(list)
"a, b, c"
``````

Is there any easy way to do the same when I have a list of objects?

``````>>> class Obj:
...     def __str__(self):
...         return "name"
...
>>> list = [Obj(), Obj(), Obj()]
>>> ", ".join(list)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected string, instance found
``````

Or do I have to resort to a for loop?
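A common resolution (sketch): `join` only accepts strings, so convert each object inline with a generator expression rather than a full `for` loop:

```python
class Obj:
    def __str__(self):
        return "name"

objs = [Obj(), Obj(), Obj()]

# str() each element on the fly; no explicit loop needed
print(", ".join(str(o) for o in objs))
```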

## What is the difference between join and merge in Pandas?

Suppose I have two DataFrames like so:

``````left = pd.DataFrame({"key1": ["foo", "bar"], "lval": [1, 2]})

right = pd.DataFrame({"key2": ["foo", "bar"], "rval": [4, 5]})
``````

I want to merge them, so I try something like this:

``````pd.merge(left, right, left_on="key1", right_on="key2")
``````

And I"m happy

``````    key1    lval    key2    rval
0   foo     1       foo     4
1   bar     2       bar     5
``````

But I"m trying to use the join method, which I"ve been lead to believe is pretty similar.

``````left.join(right, on=["key1", "key2"])
``````

And I get this:

``````//anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in _validate_specification(self)
406             if self.right_index:
407                 if not ((len(self.left_on) == self.right.index.nlevels)):
--> 408                     raise AssertionError()
409                 self.right_on = [None] * n
410         elif self.right_on is not None:

AssertionError:
``````

What am I missing?
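The usual resolution (a sketch using the question's frames): `DataFrame.join` matches against the *index* of the right frame, and its `on` parameter refers only to columns of the left frame, so set the right frame's key column as its index first:

```python
import pandas as pd

left = pd.DataFrame({"key1": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key2": ["foo", "bar"], "rval": [4, 5]})

# join matches left's "key1" column against right's index
result = left.join(right.set_index("key2"), on="key1")
print(result)
```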

## pandas: merge (join) two data frames on multiple columns

I am trying to join two pandas data frames using two columns:

``````new_df = pd.merge(A_df, B_df,  how="left", left_on="[A_c1,c2]", right_on = "[B_c1,c2]")
``````

but got the following error:

``````pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4164)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4028)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13166)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13120)()

KeyError: "[B_1, c2]"
``````

Any idea what the right way to do this would be? Thanks!
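The usual fix (a sketch, with made-up data mirroring the question's hypothetical column names): pass Python lists of column names to `left_on`/`right_on`, not a single string:

```python
import pandas as pd

# Hypothetical frames standing in for the question's A_df and B_df
A_df = pd.DataFrame({"A_c1": ["x", "y"], "c2": [1, 2], "a_val": [10, 20]})
B_df = pd.DataFrame({"B_c1": ["x", "y"], "c2": [1, 3], "b_val": [7, 8]})

new_df = pd.merge(A_df, B_df, how="left",
                  left_on=["A_c1", "c2"], right_on=["B_c1", "c2"])
print(new_df)
```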

## What is the use of join() in Python threading?

I was studying Python threading and came across `join()`.

The author says that if a thread is in daemon mode, then I need to use `join()` so that the thread can finish itself before the main thread terminates.

But I have also seen him use `t.join()` even though `t` was not a `daemon`.

The example code is this:

``````import threading
import time
import logging

logging.basicConfig(level=logging.DEBUG)

def daemon():
    logging.debug("Starting")
    time.sleep(2)
    logging.debug("Exiting")

d = threading.Thread(name="daemon", target=daemon)
d.setDaemon(True)

def non_daemon():
    logging.debug("Starting")
    logging.debug("Exiting")

t = threading.Thread(name="non-daemon", target=non_daemon)

d.start()
t.start()

d.join()
t.join()
``````

i don"t know what is use of `t.join()` as it is not daemon and i can see no change even if i remove it

## pandas three-way joining multiple dataframes on columns

I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.

How can I "join" together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person"s string name?

The `join()` function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.
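One common approach (a sketch with made-up data standing in for the three CSVs): chain `merge` calls, or fold a list of frames with `functools.reduce`, merging on the shared name column:

```python
import functools
import pandas as pd

# Hypothetical frames standing in for the three CSV files
df1 = pd.DataFrame({"name": ["ann", "bob"], "age": [30, 25]})
df2 = pd.DataFrame({"name": ["ann", "bob"], "city": ["Rome", "Oslo"]})
df3 = pd.DataFrame({"name": ["ann", "bob"], "job": ["dev", "ops"]})

# Fold all frames into one, joining on "name" each time
merged = functools.reduce(
    lambda a, b: pd.merge(a, b, on="name"), [df1, df2, df3])
print(merged)
```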

## What exactly does the .join() method do?

I"m pretty new to Python and am completely confused by `.join()` which I have read is the preferred method for concatenating strings.

I tried:

``````strid = repr(595)
print array.array("c", random.sample(string.ascii_letters,
                  20 - len(strid))).tostring().join(strid)
``````

and got something like:

``````5wlfgALGbXOahekxSs9wlfgALGbXOahekxSs5
``````

Why does it work like this? Shouldn't the `595` just be automatically appended?
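What actually happens (a sketch): `sep.join(iterable)` treats the *argument* as the sequence of pieces and the string it is called on as the glue, so joining on the string `"595"` inserts the long random string between the characters `5`, `9`, and `5`:

```python
# The separator is the string join is called on;
# the argument supplies the pieces.
print("-".join("595"))        # -> "5-9-5"
print("XX".join(["a", "b"]))  # -> "aXXb"
```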

Since this question was asked in 2010, there has been real simplification in how to do simple multithreading with Python with map and pool.

The code below comes from an article/blog post that you should definitely check out (no affiliation) - Parallelism in one line: A Better Model for Day to Day Threading Tasks. I'll summarize below - it ends up being just a few lines of code:

``````from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(4)  # the pool must be created before calling map
results = pool.map(my_function, my_array)
``````

Which is the multithreaded version of:

``````results = []
for item in my_array:
    results.append(my_function(item))
``````

Description

Map is a cool little function, and the key to easily injecting parallelism into your Python code. For those unfamiliar, map is something lifted from functional languages like Lisp. It is a function which maps another function over a sequence.

Map handles the iteration over the sequence for us, applies the function, and stores all of the results in a handy list at the end.

Implementation

Parallel versions of the map function are provided by two libraries: `multiprocessing`, and its little-known but equally fantastic stepchild, `multiprocessing.dummy`.

`multiprocessing.dummy` is exactly the same as the `multiprocessing` module, but uses threads instead (an important distinction - use multiple processes for CPU-intensive tasks, and threads for (and during) I/O):

multiprocessing.dummy replicates the API of multiprocessing, but is no more than a wrapper around the threading module.

``````import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
"http://www.python.org",
"http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html",
"http://www.python.org/doc/",
"http://www.python.org/getit/",
"http://www.python.org/community/",
"https://wiki.python.org/moin/",
]

# Make the Pool of workers
pool = ThreadPool(4)

# Open the URLs in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)

# Close the pool and wait for the work to finish
pool.close()
pool.join()
``````

And the timing results:

``````Single thread:   14.4 seconds
4 Pool:   3.1 seconds
8 Pool:   1.4 seconds
13 Pool:   1.3 seconds
``````

Passing multiple arguments (works like this only in Python 3.3 and later):

To pass multiple arrays:

``````results = pool.starmap(function, zip(list_a, list_b))
``````

Or to pass a constant and an array:

``````results = pool.starmap(function, zip(itertools.repeat(constant), list_a))
``````

If you are using an earlier version of Python, you can pass multiple arguments via this workaround.

(Thanks to user136036 for the helpful comment.)

## How to iterate over rows in a DataFrame in Pandas?

Iteration in Pandas is an anti-pattern and is something you should only do when you have exhausted every other option. You should not use any function with "`iter`" in its name for more than a few thousand rows or you will have to get used to a lot of waiting.

Do you want to print a DataFrame? Use `DataFrame.to_string()`.

Do you want to compute something? In that case, search for methods in this order (list modified from here):

1. Vectorization
2. Cython routines
3. List Comprehensions (vanilla `for` loop)
4. `DataFrame.apply()`: i) Reductions that can be performed in Cython, ii) Iteration in Python space
5. `DataFrame.itertuples()` and `iteritems()`
6. `DataFrame.iterrows()`

`iterrows` and `itertuples` (both receiving many votes in answers to this question) should be used in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.

Appeal to Authority

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

* It"s actually a little more complicated than "don"t". `df.iterrows()` is the correct answer to this question, but "vectorize your ops" is the better one. I will concede that there are circumstances where iteration cannot be avoided (for example, some operations where the result depends on the value computed for the previous row). However, it takes some familiarity with the library to know when. If you"re not sure whether you need an iterative solution, you probably don"t. PS: To know more about my rationale for writing this answer, skip to the very bottom.

## Faster than Looping: Vectorization, Cython

A good number of basic operations and computations are "vectorised" by pandas (either through NumPy, or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorised method for your problem.

If none exists, feel free to write your own using custom Cython extensions.

## Next Best Thing: List Comprehensions*

List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you're trying to perform an elementwise transformation on your data. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and even sometimes faster) for many common Pandas tasks.

The formula is simple,

``````# Iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df["col"]]
# Iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]
# Iterating over multiple columns - same data type
result = [f(row[0], ..., row[n]) for row in df[["col1", ...,"coln"]].to_numpy()]
# Iterating over multiple columns - differing data type
result = [f(row[0], ..., row[n]) for row in zip(df["col1"], ..., df["coln"])]
``````

If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python code.

Caveats

List comprehensions assume that your data is easy to work with - what that means is your data types are consistent and you don't have NaNs - but this cannot always be guaranteed.

1. The first one is more obvious, but when dealing with NaNs, prefer in-built pandas methods if they exist (because they have much better corner-case handling logic), or ensure your business logic includes appropriate NaN handling logic.
2. When dealing with mixed data types you should iterate over `zip(df["A"], df["B"], ...)` instead of `df[["A", "B"]].to_numpy()` as the latter implicitly upcasts data to the most common type. As an example if A is numeric and B is string, `to_numpy()` will cast the entire array to string, which may not be what you want. Fortunately `zip`ping your columns together is the most straightforward workaround to this.

*Your mileage may vary for the reasons outlined in the Caveats section above.

## An Obvious Example

Let"s demonstrate the difference with a simple example of adding two pandas columns `A + B`. This is a vectorizable operaton, so it will be easy to contrast the performance of the methods discussed above. Benchmarking code, for your reference. The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you"re doing. Stick to the API where you can (i.e., prefer `vec` over `vec_numpy`).

I should mention, however, that it isn't always this cut and dried. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test out different approaches on your data before settling on one.
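As a rough sketch of that comparison (with toy data; timings omitted), the three approaches below compute the same `A + B` result, with vectorization generally fastest:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

vec = df["A"] + df["B"]                                  # vectorized
lc = [a + b for a, b in zip(df["A"], df["B"])]           # list comprehension
ap = df.apply(lambda row: row["A"] + row["B"], axis=1)   # apply (slowest)

print(vec.tolist(), lc, ap.tolist())
```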

* Pandas string methods are "vectorized" in the sense that they are specified on the series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.

## Why I Wrote this Answer

A common trend I notice from new users is to ask questions of the form "How can I iterate over my df to do X?", showing code that calls `iterrows()` while doing something inside a `for` loop. Here is why. A new user to the library who has not been introduced to the concept of vectorization will likely envision the code that solves their problem as iterating over their data to do something. Not knowing how to iterate over a DataFrame, the first thing they do is Google it and end up here, at this question. They then see the accepted answer telling them how to, and they close their eyes and run this code without ever first questioning whether iteration is the right thing to do.

The aim of this answer is to help new users understand that iteration is not necessarily the solution to every problem, and that better, faster, and more idiomatic solutions could exist, and that it is worth investing time in exploring them. I'm not trying to start a war of iteration vs. vectorization, but I want new users to be informed when developing solutions to their problems with this library.

# `os.listdir()` - list in the current directory

With `listdir` in the `os` module you get the files and the folders in the current directory.

``````import os
arr = os.listdir()
print(arr)

>>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
``````

## Looking in a directory

``````arr = os.listdir("c:\\files")
``````

# `glob` from glob

With glob you can specify a type of file to list, like this:

``````import glob

txtfiles = []
for file in glob.glob("*.txt"):
    txtfiles.append(file)
``````

## `glob` in a list comprehension

``````mylist = [f for f in glob.glob("*.txt")]
``````

## get the full path of only files in the current directory

``````import os
from os import listdir
from os.path import isfile, join

cwd = os.getcwd()
onlyfiles = [os.path.join(cwd, f) for f in os.listdir(cwd) if
             os.path.isfile(os.path.join(cwd, f))]
print(onlyfiles)

["G:\\getfilesname\\getfilesname.py", "G:\\getfilesname\\example.txt"]
``````

## Getting the full path name with `os.path.abspath`

You get the full path in return

``````import os
files_path = [os.path.abspath(x) for x in os.listdir()]
print(files_path)

["F:\\documenti\\applications.txt", "F:\\documenti\\collections.txt"]
``````

## Walk: going through sub directories

os.walk returns the root, a list of directories, and a list of files; that is why I unpacked them into r, d, f in the for loop. It then looks for other files and directories in the subfolders of the root, and so on, until there are no more subfolders.

``````import os

# Getting the current work directory (cwd)
thisdir = os.getcwd()

# r=root, d=directories, f = files
for r, d, f in os.walk(thisdir):
    for file in f:
        if file.endswith(".docx"):
            print(os.path.join(r, file))
``````

### `os.listdir()`: get files in the current directory (Python 2)

In Python 2, if you want the list of the files in the current directory, you have to pass "." or os.getcwd() as the argument to the os.listdir method.

``````import os
arr = os.listdir(".")
print(arr)

>>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
``````

### To go up in the directory tree

``````# Method 1
x = os.listdir("..")

# Method 2
x = os.listdir("/")
``````

### Get files: `os.listdir()` in a particular directory (Python 2 and 3)

``````import os
arr = os.listdir("F:\\python")
print(arr)

>>> ["$RECYCLE.BIN", "work.txt", "3ebooks.txt", "documents"]
``````

### Get files of a particular subdirectory with `os.listdir()`

``````import os

x = os.listdir("./content")
``````

### `os.walk(".")` - current directory

``````import os
arr = next(os.walk("."))[2]  # index 2 of the tuple is the list of files
print(arr)

>>> ["5bs_Turismo1.pdf", "5bs_Turismo1.pptx", "esperienza.txt"]
``````

### `next(os.walk("."))` and `os.path.join("dir", "file")`

``````import os
arr = []
root, dirs, files = next(os.walk("F:\\_python"))
for file in files:
    arr.append(os.path.join(root, file))

for f in arr:
    print(f)

>>> F:\_python\dict_class.py
>>> F:\_python\programmi.txt
``````

### `next(os.walk("F:\\_python"))` - get the full path - list comprehension

``````import os
root, dirs, files = next(os.walk("F:\\_python"))
x = [os.path.join(root, file) for file in files]

>>> ["F:\\_python\\dict_class.py", "F:\\_python\\programmi.txt"]
``````

### `os.walk` - get full path - all files in sub dirs

x = [os.path.join(r, file) for r, d, f in os.walk("F:\\_python") for file in f]
print(x)

``````

### `os.listdir()` - get only txt files

`````` arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
print(arr_txt)

>>> ["work.txt", "3ebooks.txt"]
``````

## Using `glob` to get the full path of the files

If I should need the absolute path of the files:

``````from path import path
from glob import glob

x = [path(f).abspath() for f in glob("F:\\*.txt")]
for f in x:
    print(f)

>>> F:\acquistionline.txt
>>> F:\acquisti_2018.txt
>>> F:\bootstrap_jquery_ecc.txt
``````

## Using `os.path.isfile` to avoid directories in the list

``````import os.path
listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]
print(listOfFiles)

>>> ["a simple game.py", "data.txt", "decorator.py"]
``````

## Using `pathlib` from Python 3.4

``````import pathlib

flist = []
for p in pathlib.Path(".").iterdir():
    if p.is_file():
        print(p)
        flist.append(p)

>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speak_gui2.py
>>> thumb.PNG
``````

With a list comprehension:

``````flist = [p for p in pathlib.Path(".").iterdir() if p.is_file()]
``````

Alternatively, use `pathlib.Path()` instead of `pathlib.Path(".")`

## Use glob method in pathlib.Path()

``````import pathlib

py = pathlib.Path().glob("*.py")
for file in py:
    print(file)

>>> stack_overflow_list.py
>>> stack_overflow_list_tkinter.py
``````

## Get all and only files with os.walk

``````import os
y = []
for r, d, f in os.walk("."):
    for file in f:
        y.append(file)
print(y)

>>> ["append_to_list.py", "data.txt", "data1.txt", "data2.txt", "data_180617", "os_walk.py", "READ2.py", "read_data.py", "somma_defaltdic.py", "substitute_words.py", "sum_data.py", "data.txt", "data1.txt", "data_180617"]
``````

## Get only files with next and walk in a directory

``````import os
x = next(os.walk("F://python"))[2]  # index 2 of the tuple is the list of files
print(x)

>>> ["calculator.bat","calculator.py"]
``````

## Get only directories with next and walk in a directory

``````import os
next(os.walk("F://python"))[1]  # index 1 is the directories; for the current dir use (".")

>>> ["python3","others"]
``````

## Get all the subdir names with `walk`

``````for r, d, f in os.walk("F:\\_python"):
    for dirs in d:
        print(dirs)

>>> .vscode
>>> pyexcel
>>> pyschool.py
>>> subtitles
>>> _metaprogramming
>>> .ipynb_checkpoints
``````

## `os.scandir()` from Python 3.5 and greater

``````import os
x = [f.name for f in os.scandir() if f.is_file()]
print(x)

>>> ["calculator.bat","calculator.py"]

# Another example with scandir (a little variation from docs.python.org)
# This one is more efficient than os.listdir.
# In this case, it shows the files only in the current directory
# where the script is executed.

import os
with os.scandir() as i:
    for entry in i:
        if entry.is_file():
            print(entry.name)

>>> ebookmaker.py
>>> error.PNG
>>> exemaker.bat
>>> guiprova.mp3
>>> setup.py
>>> speakgui4.py
>>> speak_gui2.py
>>> speak_gui3.py
>>> thumb.PNG
``````

# Examples:

## Ex. 1: How many files are there in the subdirectories?

In this example, we count the number of files contained in a directory and all of its subdirectories.

``````import os

def count(dir, counter=0):
    "returns number of files in dir and subdirs"
    for r, d, f in os.walk(dir):
        counter += len(f)
    return dir + " : " + str(counter) + " files"

print(count("F:\\python"))

>>> F:\python : 12057 files
``````

## Ex.2: How to copy all files from a directory to another?

A script to tidy up your computer: it finds all files of a type (default: pptx) and copies them into a new folder.

``````import os
import shutil

destination = "F:\\file_copied"
# os.makedirs(destination)

def copyfile(dir, filetype="pptx", counter=0):
    "Searches for pptx (or other - pptx is the default) files and copies them"
    for r, d, f in os.walk(dir):
        for file in f:
            if file.endswith(filetype):
                fullpath = os.path.join(r, file)
                print(fullpath)
                shutil.copy(fullpath, destination)
                counter += 1
    if counter > 0:
        print("-" * 30)
        print("\t==> Found in: `" + dir + "` : " + str(counter) + " files\n")

# searches for folders that start with `_`
for dir in os.listdir():
    if dir.startswith("_"):
        # copyfile(dir, filetype="pdf")
        copyfile(dir, filetype="txt")

>>> _compiti18\Compito Contabilità 1\conti.txt
>>> _compiti18\Compito Contabilità 1\modula4.txt
>>> _compiti18\Compito Contabilità 1\moduloa4.txt
>>> ------------------------------
>>> ==> Found in: `_compiti18` : 3 files
``````

## Ex. 3: How to get all the files in a txt file

In case you want to create a txt file with all the file names:

``````import os
mylist = ""
with open("filelist.txt", "w", encoding="utf-8") as file:
    for eachfile in os.listdir():
        mylist += eachfile + "\n"
    file.write(mylist)
``````

## Example: txt with all the files of a hard drive

``````"""
We are going to save a txt file with all the files in your directory.
We will use the function walk()
"""

import os

# see all the methods of os
# print(*dir(os), sep=", ")
listafile = []
percorso = []
with open("lista_file.txt", "w", encoding="utf-8") as testo:
    for root, dirs, files in os.walk("D:\\"):
        for file in files:
            listafile.append(file)
            percorso.append(os.path.join(root, file))
            testo.write(file + "\n")

listafile.sort()
print("N. of files", len(listafile))
with open("lista_file_ordinata.txt", "w", encoding="utf-8") as testo_ordinato:
    for file in listafile:
        testo_ordinato.write(file + "\n")

with open("percorso.txt", "w", encoding="utf-8") as file_percorso:
    for file in percorso:
        file_percorso.write(file + "\n")

os.system("lista_file.txt")
os.system("lista_file_ordinata.txt")
os.system("percorso.txt")
``````

## All the files of C: in one text file

This is a shorter version of the previous code. Change the starting folder if you need to begin from another position. On my computer this code generated a 50 MB text file with a little under 500,000 lines, each containing a file's complete path.

``````import os

with open("file.txt", "w", encoding="utf-8") as filewrite:
for r, d, f in os.walk("C:\"):
for file in f:
filewrite.write(f"{r + file}
")
``````

## How to write a file with all paths in a folder of a type

With this function you can create a txt file, named after the type of file you look for (e.g. pngfile.txt), containing the full path of all files of that type. It can be useful sometimes, I think.

``````import os

def searchfiles(extension=".ttf", folder="H:\"):
"Create a txt file with all the file of a type"
with open(extension[1:] + "file.txt", "w", encoding="utf-8") as filewrite:
for r, d, f in os.walk(folder):
for file in f:
if file.endswith(extension):
filewrite.write(f"{r + file}
")

# looking for png file (fonts) in the hard disk H:
searchfiles(".png", "H:\")

>>> H:4bs_18Dolphins5.png
>>> H:4bs_18Dolphins6.png
>>> H:4bs_18Dolphins7.png
>>> H:5_18marketing htmlassetsimageslogo2.png
>>> H:7z001.png
>>> H:7z002.png
``````

## (New) Find all files and open them with tkinter GUI

I just wanted to add, in 2019, a little app to search for all files in a dir and to open them by double-clicking on the name of the file in the list.

``````import tkinter as tk
import os

def searchfiles(extension=".txt", folder="H:\"):
"insert all files in the listbox"
for r, d, f in os.walk(folder):
for file in f:
if file.endswith(extension):
lb.insert(0, r + "\" + file)

def open_file():
os.startfile(lb.get(lb.curselection()))

root = tk.Tk()
root.geometry("400x400")
bt = tk.Button(root, text="Search", command=lambda:searchfiles(".png", "H:\"))
bt.pack()
lb = tk.Listbox(root)
lb.pack(fill="both", expand=1)
lb.bind("<Double-Button>", lambda x: open_file())
root.mainloop()
``````

The simplest way to get row counts per group is by calling `.size()`, which returns a `Series`:

``````df.groupby(["col1","col2"]).size()
``````

Usually you want this result as a `DataFrame` (instead of a `Series`) so you can do:

``````df.groupby(["col1", "col2"]).size().reset_index(name="counts")
``````

If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.

## Detailed example:

Consider the following example dataframe:

``````In : df
Out:
col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17
``````

First let"s use `.size()` to get the row counts:

``````In : df.groupby(["col1", "col2"]).size()
Out:
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64
``````

Then let"s use `.size().reset_index(name="counts")` to get the row counts:

``````In : df.groupby(["col1", "col2"]).size().reset_index(name="counts")
Out:
col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1
``````

### Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

``````In : (df
...: .groupby(["col1", "col2"])
...: .agg({
...:     "col3": ["mean", "count"],
...:     "col4": ["median", "min", "count"]
...: }))
Out:
col4                  col3
median   min count      mean count
col1 col2
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1
``````

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using `join`. It looks like this:

``````In : gb = df.groupby(["col1", "col2"])
...: counts = gb.size().to_frame(name="counts")
...: (counts
...:  .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
...:  .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
...:  .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
...:  .reset_index()
...: )
...:
Out:
col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
``````

### Footnotes

The code used to generate the test data is shown below:

``````In : import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["A", "B"],
...:         ["C", "D"],
...:         ["C", "D"],
...:         ["C", "D"],
...:         ["E", "F"],
...:         ["E", "F"],
...:         ["G", "H"]
...:         ])
...:
...: df = pd.DataFrame(
...:     np.hstack([keys,np.random.randn(10,4).round(2)]),
...:     columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
...: )
...:
...: df[["col3", "col4", "col5", "col6"]] =
...:     df[["col3", "col4", "col5", "col6"]].astype(float)
...:
``````

Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop `NaN` entries in the mean calculation without telling you about it.

This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.

In particular, here"s what this post will go through:

• The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
• merging with different column names
• merging with multiple columns
• avoiding duplicate merge key column in output

What this post (and other posts by me on this thread) will not go through:

• Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
• Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with those, so check them out!

Note: Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.

Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.

# Enough talk - just show me how to use `merge`!

### Setup & Basics

``````np.random.seed(0)
left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
right = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": np.random.randn(4)})

left

key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893

right

key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357
``````

For the sake of simplicity, the key column has the same name (for now).

An INNER JOIN keeps only the keys common to both frames (figure omitted). Note: this, along with the forthcoming figures, follows this convention:

• blue indicates rows that are present in the merge result
• red indicates rows that are excluded from the result (i.e., removed)
• green indicates missing values that are replaced with `NaN`s in the result

To perform an INNER JOIN, call `merge` on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.

``````left.merge(right, on="key")
# Or, if you want to be explicit
# left.merge(right, on="key", how="inner")

key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
``````

This returns only rows from `left` and `right` which share a common key (in this example, "B" and "D").

A LEFT OUTER JOIN, or LEFT JOIN, keeps every key from the left frame (figure omitted). This can be performed by specifying `how="left"`.

``````left.merge(right, on="key", how="left")

key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
``````

Carefully note the placement of NaNs here. If you specify `how="left"`, then only keys from `left` are used, and missing data from `right` is replaced by NaN.

And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN, which keeps every key from the right frame (figure omitted), specify `how="right"`:

``````left.merge(right, on="key", how="right")

key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357
``````

Here, keys from `right` are used, and missing data from `left` is replaced by NaN.

Finally, for the FULL OUTER JOIN, which keeps keys from both frames (figure omitted), specify `how="outer"`.

``````left.merge(right, on="key", how="outer")

key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357
``````

This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarizes these various merges nicely in a comparison figure (omitted here).

### Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs, you can perform them in two steps.

For a LEFT-Excluding JOIN (figure omitted), start by performing a LEFT OUTER JOIN and then filtering (excluding!) down to the rows coming from `left` only,

``````(left.merge(right, on="key", how="left", indicator=True)
.query('_merge == "left_only"')
.drop(columns="_merge"))

key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN
``````

Where,

``````left.merge(right, on="key", how="left", indicator=True)

key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both``````

And similarly, for a RIGHT-Excluding JOIN (figure omitted),

``````(left.merge(right, on="key", how="right", indicator=True)
.query('_merge == "right_only"')
.drop(columns="_merge"))

key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357``````

Lastly, if you are required to do a merge that only retains keys from the left or the right, but not both (in other words, performing an ANTI-JOIN), you can do this in similar fashion:

``````(left.merge(right, on="key", how="outer", indicator=True)
.query('_merge != "both"')
.drop(columns="_merge"))

key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357
``````
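As an aside (not part of the original answer), a LEFT-Excluding JOIN that needs no columns from `right` can also be sketched with `isin`, skipping the merge entirely:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
right = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": np.random.randn(4)})

# keep only rows of `left` whose key never appears in `right`
left_only = left[~left["key"].isin(right["key"])]
print(left_only)
# keys A and C remain
```

This is often faster when you only need filtering, but unlike the `indicator=True` recipe it cannot bring along any columns from `right`.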

### Different names for key columns

If the key columns are named differently (for example, `left` has `keyLeft`, and `right` has `keyRight` instead of `key`), then you will have to specify `left_on` and `right_on` as arguments instead of `on`:

``````left2 = left.rename({"key":"keyLeft"}, axis=1)
right2 = right.rename({"key":"keyRight"}, axis=1)

left2

keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893

right2

keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357
``````
``````left2.merge(right2, left_on="keyLeft", right_on="keyRight", how="inner")

keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278
``````

### Avoiding duplicate key column in output

When merging on `keyLeft` from `left` and `keyRight` from `right`, if you only want either of the `keyLeft` or `keyRight` (but not both) in the output, you can start by setting the index as a preliminary step.

``````left3 = left2.set_index("keyLeft")
left3.merge(right2, left_index=True, right_on="keyRight")

value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278
``````

Contrast this with the output of the command just before (that is, the output of `left2.merge(right2, left_on="keyLeft", right_on="keyRight", how="inner")`), and you'll notice `keyLeft` is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
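As an alternative sketch (not from the original answer), you can simply drop the redundant key column after the merge:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
left2 = pd.DataFrame({"keyLeft": ["A", "B", "C", "D"], "value": np.random.randn(4)})
right2 = pd.DataFrame({"keyRight": ["B", "D", "E", "F"], "value": np.random.randn(4)})

# merge on the differently named keys, then drop one copy of the key
out = (left2.merge(right2, left_on="keyLeft", right_on="keyRight")
            .drop(columns="keyRight"))
print(out)
# only keyLeft survives alongside value_x and value_y
```

This is arguably easier to read, at the cost of carrying the duplicate column briefly through the merge.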

### Merging only a single column from one of the `DataFrames`

For example, consider

``````right3 = right.assign(newcol=np.arange(len(right)))
right3
key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3
``````

If you are required to merge only `newcol` (without any of the other columns), you can usually just subset columns before merging:

``````left.merge(right3[["key", "newcol"]], on="key")

key     value  newcol
0   B  0.400157       0
1   D  2.240893       1
``````

If you're doing a LEFT OUTER JOIN, a more performant solution would involve `map`:

``````# left["newcol"] = left["key"].map(right3.set_index("key")["newcol"])
left.assign(newcol=left["key"].map(right3.set_index("key")["newcol"]))

key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0
``````

As mentioned, this is similar to, but faster than

``````left.merge(right3[["key", "newcol"]], on="key", how="left")

key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0
``````

### Merging on multiple columns

To join on more than one column, specify a list for `on` (or `left_on` and `right_on`, as appropriate).

``````left.merge(right, on=["key1", "key2"] ...)
``````

Or, in the event the names are different,

``````left.merge(right, left_on=["lkey1", "lkey2"], right_on=["rkey1", "rkey2"])
``````
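To make the multi-column case concrete, here is a small runnable sketch (toy data, names invented for illustration):

```python
import pandas as pd

left = pd.DataFrame({"key1": ["A", "A", "B"], "key2": [1, 2, 1], "lval": [10, 20, 30]})
right = pd.DataFrame({"key1": ["A", "B", "B"], "key2": [2, 1, 2], "rval": [100, 200, 300]})

# rows match only when BOTH key1 and key2 agree
out = left.merge(right, on=["key1", "key2"])
print(out)
# matches: (A, 2) and (B, 1)
```

Note that a row matching on `key1` alone, such as `(A, 1)`, is excluded: every listed column must agree.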

### Other useful `merge*` operations and functions

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on `merge`, `join`, and `concat` as well as the links to the function specifications.

*You are here.

# TL;DR version:

For the simple case of:

• I have a text column with a delimiter and I want two columns

The simplest solution is:

``````df[["A", "B"]] = df["AB"].str.split(" ", n=1, expand=True)
``````

You must use `expand=True` if your strings have a non-uniform number of splits and you want `None` to replace the missing values.

Notice how, in either case, the `.tolist()` method is not necessary. Neither is `zip()`.

# In detail:

Andy Hayden's solution is most excellent in demonstrating the power of the `str.extract()` method.

But for a simple split over a known separator (like splitting on dashes, or splitting on whitespace), the `.str.split()` method is enough1. It operates on a column (Series) of strings, and returns a column (Series) of lists:

``````>>> import pandas as pd
>>> df = pd.DataFrame({"AB": ["A1-B1", "A2-B2"]})
>>> df

AB
0  A1-B1
1  A2-B2
>>> df["AB_split"] = df["AB"].str.split("-")
>>> df

AB  AB_split
0  A1-B1  [A1, B1]
1  A2-B2  [A2, B2]
``````

1: If you're unsure what the first two parameters of `.str.split()` do, I recommend the docs for the plain Python version of the method.

But how do you go from:

• a column containing two-element lists

to:

• two columns, each containing the respective element of the lists?

Well, we need to take a closer look at the `.str` attribute of a column.

It's a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method to each element as efficiently as possible:

``````>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df

U
0  A
1  B
2  C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df

U  L
0  A  a
1  B  b
2  C  c
``````

But it also has an "indexing" interface for getting each element of a string by its index:

``````>>> df["AB"].str[0]

0    A
1    A
Name: AB, dtype: object

>>> df["AB"].str[1]

0    1
1    2
Name: AB, dtype: object
``````

Of course, this indexing interface of `.str` doesn't really care if each element it's indexing is actually a string, as long as it can be indexed, so:

``````>>> df["AB"].str.split("-", n=1).str[0]

0    A1
1    A2
Name: AB, dtype: object

>>> df["AB"].str.split("-", n=1).str[1]

0    B1
1    B2
Name: AB, dtype: object
``````

Then, it"s a simple matter of taking advantage of the Python tuple unpacking of iterables to do

``````>>> df["A"], df["B"] = df["AB"].str.split("-", n=1).str
>>> df

AB  AB_split   A   B
0  A1-B1  [A1, B1]  A1  B1
1  A2-B2  [A2, B2]  A2  B2
``````

Of course, getting a DataFrame out of splitting a column of strings is so useful that the `.str.split()` method can do it for you with the `expand=True` parameter:

``````>>> df["AB"].str.split("-", n=1, expand=True)

0   1
0  A1  B1
1  A2  B2
``````

So, another way of accomplishing what we wanted is to do:

``````>>> df = df[["AB"]]
>>> df

AB
0  A1-B1
1  A2-B2

>>> df.join(df["AB"].str.split("-", n=1, expand=True).rename(columns={0:"A", 1:"B"}))

AB   A   B
0  A1-B1  A1  B1
1  A2-B2  A2  B2
``````

The `expand=True` version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn't deal well with splits of different lengths:

``````>>> df = pd.DataFrame({"AB": ["A1-B1", "A2-B2", "A3-B3-C3"]})
>>> df
AB
0     A1-B1
1     A2-B2
2  A3-B3-C3
>>> df["A"], df["B"], df["C"] = df["AB"].str.split("-")
Traceback (most recent call last):
[...]
ValueError: Length of values does not match length of index
>>>
``````

But `expand=True` handles it nicely by placing `None` in the columns for which there aren't enough "splits":

``````>>> df.join(
...     df["AB"].str.split("-", expand=True).rename(
...         columns={0:"A", 1:"B", 2:"C"}
...     )
... )
AB   A   B     C
0     A1-B1  A1  B1  None
1     A2-B2  A2  B2  None
2  A3-B3-C3  A3  B3    C3
``````

Your understanding is mostly correct. You use `select_related` when the object that you're going to be selecting is a single object, so `OneToOneField` or a `ForeignKey`. You use `prefetch_related` when you're going to get a "set" of things, so `ManyToManyField`s as you stated or reverse `ForeignKey`s. Just to clarify what I mean by "reverse `ForeignKey`s", here's an example:

``````class ModelA(models.Model):
    pass

class ModelB(models.Model):
    a = models.ForeignKey(ModelA, on_delete=models.CASCADE)

ModelB.objects.select_related("a").all()  # Forward ForeignKey relationship
ModelA.objects.prefetch_related("modelb_set").all()  # Reverse ForeignKey relationship
``````

The difference is that `select_related` does an SQL join and therefore gets the results back as part of the table from the SQL server. `prefetch_related` on the other hand executes another query and therefore reduces the redundant columns in the original object (`ModelA` in the above example). You may use `prefetch_related` for anything that you can use `select_related` for.

The tradeoffs are that `prefetch_related` has to create and send a list of IDs to select back to the server; this can take a while. I'm not sure if there's a nice way of doing this in a transaction, but my understanding is that Django always just sends a list and says `SELECT ... WHERE pk IN (...,...,...)` basically. In this case if the prefetched data is sparse (let's say U.S. State objects linked to people's addresses) this can be very good; however, if it's closer to one-to-one, this can waste a lot of communications. If in doubt, try both and see which performs better.

Everything discussed above is basically about the communications with the database. On the Python side however `prefetch_related` has the extra benefit that a single object is used to represent each object in the database. With `select_related` duplicate objects will be created in Python for each "parent" object. Since objects in Python have a decent bit of memory overhead this can also be a consideration.

Use `merge`, which is an inner join by default:

``````pd.merge(df1, df2, left_index=True, right_index=True)
``````

Or `join`, which is a left join by default:

``````df1.join(df2)
``````

Or `concat`, which is an outer join by default:

``````pd.concat([df1, df2], axis=1)
``````

Samples:

``````df1 = pd.DataFrame({"a": range(6),
                    "b": [5, 3, 6, 9, 2, 4]}, index=list("abcdef"))

print(df1)
a  b
a  0  5
b  1  3
c  2  6
d  3  9
e  4  2
f  5  4

df2 = pd.DataFrame({"c": range(4),
                    "d": [10, 20, 30, 40]}, index=list("abhi"))

print(df2)
c   d
a  0  10
b  1  20
h  2  30
i  3  40
``````

``````# Default inner join
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
print(df3)
a  b  c   d
a  0  5  0  10
b  1  3  1  20

# Default left join
df4 = df1.join(df2)
print(df4)
a  b    c     d
a  0  5  0.0  10.0
b  1  3  1.0  20.0
c  2  6  NaN   NaN
d  3  9  NaN   NaN
e  4  2  NaN   NaN
f  5  4  NaN   NaN

# Default outer join
df5 = pd.concat([df1, df2], axis=1)
print(df5)
a    b    c     d
a  0.0  5.0  0.0  10.0
b  1.0  3.0  1.0  20.0
c  2.0  6.0  NaN   NaN
d  3.0  9.0  NaN   NaN
e  4.0  2.0  NaN   NaN
f  5.0  4.0  NaN   NaN
h  NaN  NaN  2.0  30.0
i  NaN  NaN  3.0  40.0
``````

Try this

``````new_df = pd.merge(A_df, B_df, how="left", left_on=["A_c1", "c2"], right_on=["B_c1", "c2"])
``````

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

left_on : label or list, or array-like Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns

right_on : label or list, or array-like Field names to join on in right DataFrame or vector/list of vectors per left_on docs
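The "vector of the length of the DataFrame" wording means you can pass an array-like directly as a join key. A hedged sketch (toy data; note that a plain Python list would be interpreted as a list of column labels, so a NumPy array is used):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"val": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["x", "y"], "other": [10, 20]})

# an external array, aligned with df1's rows, acts as the left join key
left_keys = np.array(["x", "y", "x"])
out = pd.merge(df1, df2, left_on=left_keys, right_on="key")
print(out)
```

This is handy when the key is derived on the fly (say, a normalized version of a column) and you don't want to add it as a column first.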

`pandas.merge()` is the underlying function used for all merge/join behavior.

DataFrames provide the `pandas.DataFrame.merge()` and `pandas.DataFrame.join()` methods as a convenient way to access the capabilities of `pandas.merge()`. For example, `df1.merge(right=df2, ...)` is equivalent to `pandas.merge(left=df1, right=df2, ...)`.

These are the main differences between `df.join()` and `df.merge()`:

1. lookup on right table: `df1.join(df2)` always joins via the index of `df2`, but `df1.merge(df2)` can join to one or more columns of `df2` (default) or to the index of `df2` (with `right_index=True`).
2. lookup on left table: by default, `df1.join(df2)` uses the index of `df1` and `df1.merge(df2)` uses column(s) of `df1`. That can be overridden by specifying `df1.join(df2, on=key_or_keys)` or `df1.merge(df2, left_index=True)`.
3. left vs inner join: `df1.join(df2)` does a left join by default (keeps all rows of `df1`), but `df.merge` does an inner join by default (returns only matching rows of `df1` and `df2`).

So, the generic approach is to use `pandas.merge(df1, df2)` or `df1.merge(df2)`. But for a number of common situations (keeping all rows of `df1` and joining to an index in `df2`), you can save some typing by using `df1.join(df2)` instead.
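To make the typing-saver concrete, a minimal sketch (toy data) under the assumption that `df2` is indexed by the join key:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3]}, index=list("xyz"))
df2 = pd.DataFrame({"b": [10, 20]}, index=list("xy"))

# two spellings of the same left join on the indexes
joined = df1.join(df2)
merged = df1.merge(df2, left_index=True, right_index=True, how="left")

assert joined.equals(merged)
print(joined)
# row "z" gets NaN in column b
```

The `join` call is the shorthand; the `merge` call spells out both the index lookups and the left-join default explicitly.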

Some notes on these issues from the documentation at http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging:

`merge` is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.

The related `DataFrame.join` method, uses `merge` internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for `merge`). If you are joining on index, you may wish to use `DataFrame.join` to save yourself some typing.

...

These two function calls are completely equivalent:

``````left.join(right, on=key_or_keys)
pd.merge(left, right, left_on=key_or_keys, right_index=True, how="left", sort=False)
``````