Pandas Merging 101

| | |

👻 Check our latest review to choose the best laptop for Machine Learning engineers and Deep learning tasks!

  • How can I perform a (INNER| (LEFT|RIGHT|FULL) OUTER) JOIN with pandas?
  • How do I add NaNs for missing rows after a merge?
  • How do I get rid of NaNs after merging?
  • Can I merge on the index?
  • How do I merge multiple DataFrames?
  • Cross join with pandas
  • merge? join? concat? update? Who? What? Why?!

... and more. I"ve seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).

Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.


Table of Contents

For ease of access.

👻 Read also: what is the best laptop for engineering students?

Pandas Merging 101 join: Questions

Why is it string.join(list) instead of list.join(string)?

5 answers

Evan Fosmark By Evan Fosmark

This has always confused me. It seems like this would be nicer:

my_list = ["Hello", "world"]
print(my_list.join("-"))
# Produce: "Hello-world"

Than this:

my_list = ["Hello", "world"]
print("-".join(my_list))
# Produce: "Hello-world"

Is there a specific reason it is like this?

1906

Answer #1

It"s because any iterable can be joined (e.g, list, tuple, dict, set), but its contents and the "joiner" must be strings.

For example:

"_".join(["welcome", "to", "stack", "overflow"])
"_".join(("welcome", "to", "stack", "overflow"))
"welcome_to_stack_overflow"

Using something other than strings will raise the following error:

TypeError: sequence item 0: expected str instance, int found

1906

Answer #2

This was discussed in the String methods... finally thread in the Python-Dev achive, and was accepted by Guido. This thread began in Jun 1999, and str.join was included in Python 1.6 which was released in Sep 2000 (and supported Unicode). Python 2.0 (supported str methods including join) was released in Oct 2000.

  • There were four options proposed in this thread:
    • str.join(seq)
    • seq.join(str)
    • seq.reduce(str)
    • join as a built-in function
  • Guido wanted to support not only lists and tuples, but all sequences/iterables.
  • seq.reduce(str) is difficult for newcomers.
  • seq.join(str) introduces unexpected dependency from sequences to str/unicode.
  • join() as a built-in function would support only specific data types. So using a built-in namespace is not good. If join() supports many datatypes, creating an optimized implementation would be difficult, if implemented using the __add__ method then it would ve O(n¬≤).
  • The separator string (sep) should not be omitted. Explicit is better than implicit.

Here are some additional thoughts (my own, and my friend"s):

  • Unicode support was coming, but it was not final. At that time UTF-8 was the most likely about to replace UCS2/4. To calculate total buffer length of UTF-8 strings it needs to know character coding rule.
  • At that time, Python had already decided on a common sequence interface rule where a user could create a sequence-like (iterable) class. But Python didn"t support extending built-in types until 2.2. At that time it was difficult to provide basic iterable class (which is mentioned in another comment).

Guido"s decision is recorded in a historical mail, deciding on str.join(seq):

Funny, but it does seem right! Barry, go for it...
Guido van Rossum

1906

Answer #3

Because the join() method is in the string class, instead of the list class?

I agree it looks funny.

See http://www.faqs.org/docs/diveintopython/odbchelper_join.html:

Historical note. When I first learned Python, I expected join to be a method of a list, which would take the delimiter as an argument. Lots of people feel the same way, and there’s a story behind the join method. Prior to Python 1.6, strings didn’t have all these useful methods. There was a separate string module which contained all the string functions; each function took a string as its first argument. The functions were deemed important enough to put onto the strings themselves, which made sense for functions like lower, upper, and split. But many hard-core Python programmers objected to the new join method, arguing that it should be a method of the list instead, or that it shouldn’t move at all but simply stay a part of the old string module (which still has lots of useful stuff in it). I use the new join method exclusively, but you will see code written either way, and if it really bothers you, you can use the old string.join function instead.

--- Mark Pilgrim, Dive into Python

Meaning of @classmethod and @staticmethod for beginner?

5 answers

user1632861 By user1632861

Could someone explain to me the meaning of @classmethod and @staticmethod in python? I need to know the difference and the meaning.

As far as I understand, @classmethod tells a class that it"s a method which should be inherited into subclasses, or... something. However, what"s the point of that? Why not just define the class method without adding @classmethod or @staticmethod or any @ definitions?

tl;dr: when should I use them, why should I use them, and how should I use them?

1726

Answer #1

Though classmethod and staticmethod are quite similar, there"s a slight difference in usage for both entities: classmethod must have a reference to a class object as the first parameter, whereas staticmethod can have no parameters at all.

Example

class Date(object):

    def __init__(self, day=0, month=0, year=0):
        self.day = day
        self.month = month
        self.year = year

    @classmethod
    def from_string(cls, date_as_string):
        day, month, year = map(int, date_as_string.split("-"))
        date1 = cls(day, month, year)
        return date1

    @staticmethod
    def is_date_valid(date_as_string):
        day, month, year = map(int, date_as_string.split("-"))
        return day <= 31 and month <= 12 and year <= 3999

date2 = Date.from_string("11-09-2012")
is_date = Date.is_date_valid("11-09-2012")

Explanation

Let"s assume an example of a class, dealing with date information (this will be our boilerplate):

class Date(object):

    def __init__(self, day=0, month=0, year=0):
        self.day = day
        self.month = month
        self.year = year

This class obviously could be used to store information about certain dates (without timezone information; let"s assume all dates are presented in UTC).

Here we have __init__, a typical initializer of Python class instances, which receives arguments as a typical instancemethod, having the first non-optional argument (self) that holds a reference to a newly created instance.

Class Method

We have some tasks that can be nicely done using classmethods.

Let"s assume that we want to create a lot of Date class instances having date information coming from an outer source encoded as a string with format "dd-mm-yyyy". Suppose we have to do this in different places in the source code of our project.

So what we must do here is:

  1. Parse a string to receive day, month and year as three integer variables or a 3-item tuple consisting of that variable.
  2. Instantiate Date by passing those values to the initialization call.

This will look like:

day, month, year = map(int, string_date.split("-"))
date1 = Date(day, month, year)

For this purpose, C++ can implement such a feature with overloading, but Python lacks this overloading. Instead, we can use classmethod. Let"s create another "constructor".

    @classmethod
    def from_string(cls, date_as_string):
        day, month, year = map(int, date_as_string.split("-"))
        date1 = cls(day, month, year)
        return date1

date2 = Date.from_string("11-09-2012")

Let"s look more carefully at the above implementation, and review what advantages we have here:

  1. We"ve implemented date string parsing in one place and it"s reusable now.
  2. Encapsulation works fine here (if you think that you could implement string parsing as a single function elsewhere, this solution fits the OOP paradigm far better).
  3. cls is an object that holds the class itself, not an instance of the class. It"s pretty cool because if we inherit our Date class, all children will have from_string defined also.

Static method

What about staticmethod? It"s pretty similar to classmethod but doesn"t take any obligatory parameters (like a class method or instance method does).

Let"s look at the next use case.

We have a date string that we want to validate somehow. This task is also logically bound to the Date class we"ve used so far, but doesn"t require instantiation of it.

Here is where staticmethod can be useful. Let"s look at the next piece of code:

    @staticmethod
    def is_date_valid(date_as_string):
        day, month, year = map(int, date_as_string.split("-"))
        return day <= 31 and month <= 12 and year <= 3999

    # usage:
    is_date = Date.is_date_valid("11-09-2012")

So, as we can see from usage of staticmethod, we don"t have any access to what the class is---it"s basically just a function, called syntactically like a method, but without access to the object and its internals (fields and another methods), while classmethod does.

1726

Answer #2

Rostyslav Dzinko"s answer is very appropriate. I thought I could highlight one other reason you should choose @classmethod over @staticmethod when you are creating an additional constructor.

In the example above, Rostyslav used the @classmethod from_string as a Factory to create Date objects from otherwise unacceptable parameters. The same can be done with @staticmethod as is shown in the code below:

class Date:
  def __init__(self, month, day, year):
    self.month = month
    self.day   = day
    self.year  = year


  def display(self):
    return "{0}-{1}-{2}".format(self.month, self.day, self.year)


  @staticmethod
  def millenium(month, day):
    return Date(month, day, 2000)

new_year = Date(1, 1, 2013)               # Creates a new Date object
millenium_new_year = Date.millenium(1, 1) # also creates a Date object. 

# Proof:
new_year.display()           # "1-1-2013"
millenium_new_year.display() # "1-1-2000"

isinstance(new_year, Date) # True
isinstance(millenium_new_year, Date) # True

Thus both new_year and millenium_new_year are instances of the Date class.

But, if you observe closely, the Factory process is hard-coded to create Date objects no matter what. What this means is that even if the Date class is subclassed, the subclasses will still create plain Date objects (without any properties of the subclass). See that in the example below:

class DateTime(Date):
  def display(self):
      return "{0}-{1}-{2} - 00:00:00PM".format(self.month, self.day, self.year)


datetime1 = DateTime(10, 10, 1990)
datetime2 = DateTime.millenium(10, 10)

isinstance(datetime1, DateTime) # True
isinstance(datetime2, DateTime) # False

datetime1.display() # returns "10-10-1990 - 00:00:00PM"
datetime2.display() # returns "10-10-2000" because it"s not a DateTime object but a Date object. Check the implementation of the millenium method on the Date class for more details.

datetime2 is not an instance of DateTime? WTF? Well, that"s because of the @staticmethod decorator used.

In most cases, this is undesired. If what you want is a Factory method that is aware of the class that called it, then @classmethod is what you need.

Rewriting Date.millenium as (that"s the only part of the above code that changes):

@classmethod
def millenium(cls, month, day):
    return cls(month, day, 2000)

ensures that the class is not hard-coded but rather learnt. cls can be any subclass. The resulting object will rightly be an instance of cls.
Let"s test that out:

datetime1 = DateTime(10, 10, 1990)
datetime2 = DateTime.millenium(10, 10)

isinstance(datetime1, DateTime) # True
isinstance(datetime2, DateTime) # True


datetime1.display() # "10-10-1990 - 00:00:00PM"
datetime2.display() # "10-10-2000 - 00:00:00PM"

The reason is, as you know by now, that @classmethod was used instead of @staticmethod

1726

Answer #3

@classmethod means: when this method is called, we pass the class as the first argument instead of the instance of that class (as we normally do with methods). This means you can use the class and its properties inside that method rather than a particular instance.

@staticmethod means: when this method is called, we don"t pass an instance of the class to it (as we normally do with methods). This means you can put a function inside a class but you can"t access the instance of that class (this is useful when your method does not use the instance).

We hope this article has helped you to resolve the problem. Apart from Pandas Merging 101, check other join-related topics.

Want to excel in Python? See our review of the best Python online courses 2022. If you are interested in Data Science, check also how to learn programming in R.

By the way, this material is also available in other languages:



Xu Zelotti

Moscow | 2022-11-30

I was preparing for my coding interview, thanks for clarifying this - Pandas Merging 101 in Python is not the simplest one. Checked yesterday, it works!

Anna Gonzalez

Berlin | 2022-11-30

Simply put and clear. Thank you for sharing. Pandas Merging 101 and other issues with sin was always my weak point 😁. Will get back tomorrow with feedback

Boris Lehnman

Milan | 2022-11-30

mean is always a bit confusing 😭 Pandas Merging 101 is not the only problem I encountered. Checked yesterday, it works!

Shop

Learn programming in R: courses

$

Best Python online courses for 2022

$

Best laptop for Fortnite

$

Best laptop for Excel

$

Best laptop for Solidworks

$

Best laptop for Roblox

$

Best computer for crypto mining

$

Best laptop for Sims 4

$

Latest questions

NUMPYNUMPY

Common xlabel/ylabel for matplotlib subplots

12 answers

NUMPYNUMPY

How to specify multiple return types using type-hints

12 answers

NUMPYNUMPY

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

12 answers

NUMPYNUMPY

Flake8: Ignore specific warning for entire file

12 answers

NUMPYNUMPY

glob exclude pattern

12 answers

NUMPYNUMPY

How to avoid HTTP error 429 (Too Many Requests) python

12 answers

NUMPYNUMPY

Python CSV error: line contains NULL byte

12 answers

NUMPYNUMPY

csv.Error: iterator should return strings, not bytes

12 answers

News


Wiki

Python | How to copy data from one Excel sheet to another

Common xlabel/ylabel for matplotlib subplots

Check if one list is a subset of another in Python

sin

How to specify multiple return types using type-hints

exp

Printing words vertically in Python

exp

Python Extract words from a given string

Cyclic redundancy check in Python

Finding mean, median, mode in Python without libraries

cos

Python add suffix / add prefix to strings in a list

Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

Python - Move item to the end of the list

Python - Print list vertically