Python | Pandas Series.median()


Pandas Series.median() returns the median of the underlying data in the given Series object.

Syntax: Series.median(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

axis: Axis for the function to be applied on.
skipna: Exclude NA / null values when computing the result.
level: If the axis is a MultiIndex (hierarchical), compute the median along a particular level, collapsing into a Series.
numeric_only: Include only float, int, and boolean columns.
**kwargs: Additional keyword arguments to be passed to the function.

Returns: median: scalar or Series (if level specified)

Example #1: Use Series.median() to find the median of the underlying data in a given Series object.

# import pandas
import pandas as pd

# Create the Series
sr = pd.Series([10, 25, 3, 25, 24, 6])

# Create the index
index_ = ['Coca Cola', 'Sprite', 'Coke', 'Fanta', 'Dew', 'ThumbsUp']

# Set the index
sr.index = index_

# Print the Series
print(sr)


We will now use Series.median() to find the median of this Series object.

# Compute the median
result = sr.median()

# Print the result
print(result)


Series.median() returns 17.0 here: the Series has six values, so the median is the average of the two middle values of the sorted data (10 and 24).

Example #2: Use Series.median() to find the median of the underlying data in a given Series object that contains some missing values.

# import pandas
import pandas as pd

# Create the Series
sr = pd.Series([19.5, 16.8, None, 22.78, 16.8, 20.124, None, 18.1002, 19.5])

# Print the Series
print(sr)


We will now use Series.median() to find the median of the given Series object, skipping the missing values when calculating it.

# Compute the median, skipping missing values
result = sr.median(skipna=True)

# Print the result
print(result)


Series.median() returns 19.5 here: the two missing values are skipped, and 19.5 is the middle value of the seven remaining sorted values.
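To make the effect of skipna concrete, the following minimal sketch compares both settings on the same data. The variable names median_skipping_na and median_keeping_na are illustrative only and do not come from the original example.

```python
import math

import pandas as pd

sr = pd.Series([19.5, 16.8, None, 22.78, 16.8, 20.124, None, 18.1002, 19.5])

# With skipna=True the missing entries are ignored
median_skipping_na = sr.median(skipna=True)

# With skipna=False a single missing entry makes the result NaN
median_keeping_na = sr.median(skipna=False)

print(median_skipping_na)
print(math.isnan(median_keeping_na))
```

If a NaN result is undesirable, dropping or filling the missing values explicitly before calling median() makes the intent visible in the code.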

Python | Pandas Series.median(): StackOverflow Questions

Finding median of list in Python

How do you find the median of a list in Python? The list can be of any size and the numbers are not guaranteed to be in any particular order.

If the list contains an even number of elements, the function should return the average of the middle two.

Here are some examples (sorted for display purposes):

median([1]) == 1
median([1, 1]) == 1
median([1, 1, 2, 4]) == 1.5
median([0, 2, 5, 6, 8, 9, 9]) == 6
median([0, 0, 0, 0, 4, 4, 6, 8]) == 2
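One possible sketch of such a function, using only built-ins, is shown below. It is an illustration that satisfies the examples above, not the implementation from any particular answer; raising ValueError for an empty list is an assumption, since the question does not say what should happen in that case.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)  # the input order is not guaranteed
    n = len(ordered)
    if n == 0:
        # Assumption: an empty list has no median, so signal an error
        raise ValueError("median() of an empty list is undefined")
    mid = n // 2
    if n % 2:
        # Odd length: the single middle element
        return ordered[mid]
    # Even length: the average of the two middle elements
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 1, 2, 4]))
```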

Answer #1

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(["col1", "col2"]).size()

Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(["col1", "col2"]).size().reset_index(name="counts")

If you want to find out how to calculate the row counts and other statistics for each group continue reading below.

Detailed example:

Consider the following example dataframe:

In [2]: df
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let's use .size() to get the row counts:

In [3]: df.groupby(["col1", "col2"]).size()
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let's use .size().reset_index(name="counts") to get the row counts as a DataFrame:

In [4]: df.groupby(["col1", "col2"]).size().reset_index(name="counts")
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1

Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(["col1", "col2"])
   ...: .agg({
   ...:     "col3": ["mean", "count"], 
   ...:     "col4": ["median", "min", "count"]
   ...: }))
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(["col1", "col2"])
   ...: counts = gb.size().to_frame(name="counts")
   ...: (counts
   ...:  .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
   ...:  .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
   ...:  .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
   ...:  .reset_index()
   ...: )
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
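In pandas 0.25 and later, named aggregation offers another way to get flat column labels in a single agg call. The sketch below uses a small made-up frame rather than the test data above, and the output column names simply mirror the ones used in the join version.

```python
import pandas as pd

# Small illustrative frame (not the test data from this answer)
df = pd.DataFrame({
    "col1": ["A", "A", "C"],
    "col2": ["B", "B", "D"],
    "col3": [0.2, -1.0, 0.5],
    "col4": [-0.5, 1.5, 2.0],
})

summary = (
    df.groupby(["col1", "col2"])
      .agg(
          counts=("col3", "size"),          # rows per group
          col3_mean=("col3", "mean"),
          col4_median=("col4", "median"),
          col4_min=("col4", "min"),
      )
      .reset_index()
)
print(summary)
```

Each keyword names an output column and maps it to a (source column, aggregation) pair, so no renaming or joining is needed afterwards.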


The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: keys = np.array([
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["A", "B"],
   ...:         ["C", "D"],
   ...:         ["C", "D"],
   ...:         ["C", "D"],
   ...:         ["E", "F"],
   ...:         ["E", "F"],
   ...:         ["G", "H"] 
   ...:         ])
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
   ...: )
   ...: df[["col3", "col4", "col5", "col6"]] = (
   ...:     df[["col3", "col4", "col5", "col6"]].astype(float)
   ...: )


If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.
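The difference between the group row count and the per-column count can be seen in a minimal sketch like the one below. The frame and the column names group and value are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B"],
    "value": [1.0, np.nan, 3.0, 4.0, np.nan],
})
gb = df.groupby("group")

rows_per_group = gb.size()                # counts rows, NaN included
non_null_per_group = gb["value"].count()  # counts non-null values only
mean_per_group = gb["value"].mean()       # NaN entries are skipped

print(rows_per_group)
print(non_null_per_group)
print(mean_per_group)
```

Group A has three rows, but its mean is computed from only two values, which is exactly the situation the paragraph above warns about.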

Answer #2

To begin, note that quantiles is just the most general term for things like percentiles, quartiles, and medians. You specified five bins in your example, so you are asking qcut for quintiles.

So, when you ask for quintiles with qcut, the bins will be chosen so that you have the same number of records in each bin. You have 30 records, so should have 6 in each bin (your output should look like this, although the breakpoints will differ due to the random draw):

pd.qcut(factors, 5).value_counts()

[-2.578, -0.829]    6
(-0.829, -0.36]     6
(-0.36, 0.366]      6
(0.366, 0.868]      6
(0.868, 2.617]      6

Conversely, for cut you will see something more uneven:

pd.cut(factors, 5).value_counts()

(-2.583, -1.539]    5
(-1.539, -0.5]      5
(-0.5, 0.539]       9
(0.539, 1.578]      9
(1.578, 2.617]      2

That's because cut will choose the bins to be evenly spaced according to the values themselves and not the frequency of those values. Hence, because you drew from a random normal, you'll see higher frequencies in the inner bins and fewer in the outer. This is essentially going to be a tabular form of a histogram (which you would expect to be fairly bell shaped with 30 records).
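A minimal sketch for reproducing the comparison is shown below. It assumes factors is a Series of 30 draws from a standard normal; the seed is arbitrary, so the breakpoints and the cut counts will differ from the ones printed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
factors = pd.Series(rng.standard_normal(30))

# qcut: bin edges are quantiles, so each bin holds the same number of records
quantile_counts = pd.qcut(factors, 5).value_counts()

# cut: bin edges are evenly spaced over the value range
width_counts = pd.cut(factors, 5).value_counts()

print(quantile_counts)
print(width_counts)
```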

Answer #3

Here are some benchmarks for the various answers to this question. There were some surprising results, including wildly different performance depending on the string being tested.

Some functions were modified to work with Python 3 (mainly by replacing / with // to ensure integer division). If you see something wrong, want to add your function, or want to add another test string, ping @ZeroPiraeus in the Python chatroom.

In summary: there's about a 50x difference between the best- and worst-performing solutions for the large set of example data supplied by OP here (via this comment). David Zhang's solution is the clear winner, outperforming all others by around 5x for the large example set.

A couple of the answers are very slow in extremely large "no match" cases. Otherwise, the functions seem to be equally matched or clear winners depending on the test.

Here are the results, including plots made using matplotlib and seaborn to show the different distributions:

Corpus 1 (supplied examples - small set)

mean performance:
 0.0003  david_zhang
 0.0009  zero
 0.0013  antti
 0.0013  tigerhawk_2
 0.0015  carpetpython
 0.0029  tigerhawk_1
 0.0031  davidism
 0.0035  saksham
 0.0046  shashank
 0.0052  riad
 0.0056  piotr

median performance:
 0.0003  david_zhang
 0.0008  zero
 0.0013  antti
 0.0013  tigerhawk_2
 0.0014  carpetpython
 0.0027  tigerhawk_1
 0.0031  davidism
 0.0038  saksham
 0.0044  shashank
 0.0054  riad
 0.0058  piotr

Corpus 1 graph

Corpus 2 (supplied examples - large set)

mean performance:
 0.0006  david_zhang
 0.0036  tigerhawk_2
 0.0036  antti
 0.0037  zero
 0.0039  carpetpython
 0.0052  shashank
 0.0056  piotr
 0.0066  davidism
 0.0120  tigerhawk_1
 0.0177  riad
 0.0283  saksham

median performance:
 0.0004  david_zhang
 0.0018  zero
 0.0022  tigerhawk_2
 0.0022  antti
 0.0024  carpetpython
 0.0043  davidism
 0.0049  shashank
 0.0055  piotr
 0.0061  tigerhawk_1
 0.0077  riad
 0.0109  saksham

Corpus 2 graph

Corpus 3 (edge cases)

mean performance:
 0.0123  shashank
 0.0375  david_zhang
 0.0376  piotr
 0.0394  carpetpython
 0.0479  antti
 0.0488  tigerhawk_2
 0.2269  tigerhawk_1
 0.2336  davidism
 0.7239  saksham
 3.6265  zero
 6.0111  riad

median performance:
 0.0107  tigerhawk_2
 0.0108  antti
 0.0109  carpetpython
 0.0135  david_zhang
 0.0137  tigerhawk_1
 0.0150  shashank
 0.0229  saksham
 0.0255  piotr
 0.0721  davidism
 0.1080  zero
 1.8539  riad

Corpus 3 graph

The tests and raw results are available here.

Answer #4

To understand what yield does, you must understand what generators are. And before you can understand generators, you must understand iterables.


When you create a list, you can read its items one by one. Reading its items one by one is called iteration:

>>> mylist = [1, 2, 3]
>>> for i in mylist:
...    print(i)

mylist is an iterable. When you use a list comprehension, you create a list, and so an iterable:

>>> mylist = [x*x for x in range(3)]
>>> for i in mylist:
...    print(i)

Everything you can use "for... in..." on is an iterable; lists, strings, files...

These iterables are handy because you can read them as much as you wish, but you store all the values in memory and this is not always what you want when you have a lot of values.


Generators are iterators, a kind of iterable you can only iterate over once. Generators do not store all the values in memory, they generate the values on the fly:

>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...    print(i)

It is just the same except you used () instead of []. BUT, you cannot perform for i in mygenerator a second time since generators can only be used once: they calculate 0, then forget about it and calculate 1, and end calculating 4, one by one.
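The one-pass behaviour is easy to check with a short sketch. The variable names here are illustrative only.

```python
squares_list = [x * x for x in range(3)]   # a list: can be read many times
squares_gen = (x * x for x in range(3))    # a generator: can be read once

first_pass = list(squares_gen)    # consumes the generator
second_pass = list(squares_gen)   # nothing left to produce

list_first = list(squares_list)
list_second = list(squares_list)  # the list is still fully available

print(first_pass, second_pass)
print(list_first, list_second)
```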


yield is a keyword that is used like return, except the function will return a generator.

>>> def create_generator():
...    mylist = range(3)
...    for i in mylist:
...        yield i*i
>>> mygenerator = create_generator() # create a generator
>>> print(mygenerator) # mygenerator is an object!
<generator object create_generator at 0xb7555c34>
>>> for i in mygenerator:
...     print(i)

Here it's a useless example, but it's handy when you know your function will return a huge set of values that you will only need to read once.

To master yield, you must understand that when you call the function, the code you have written in the function body does not run. The function only returns the generator object. This is a bit tricky.

Then, your code will continue from where it left off each time for uses the generator.

Now the hard part:

The first time the for calls the generator object created from your function, it will run the code in your function from the beginning until it hits yield, then it'll return the first value of the loop. Then, each subsequent call will run another iteration of the loop you have written in the function and return the next value. This will continue until the generator is considered empty, which happens when the function runs without hitting yield. That can be because the loop has come to an end, or because you no longer satisfy an "if/else".
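The sequence described above can be traced with a small sketch that records when each part of the function body actually runs. The events list is a hypothetical helper added purely for illustration.

```python
events = []  # records what the function body does, in order


def create_generator():
    events.append("body started")
    for i in range(3):
        events.append(f"about to yield {i * i}")
        yield i * i
    events.append("body finished")


gen = create_generator()
events_after_call = list(events)    # calling the function ran nothing

first = next(gen)                   # runs the body up to the first yield
events_after_first = list(events)

rest = list(gen)                    # resumes after each yield until the end

print(events_after_call)
print(events_after_first)
print(rest)
```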

Your code explained


# Here you create the method of the node object that will return the generator
def _get_child_candidates(self, distance, min_dist, max_dist):

    # Here is the code that will be called each time you use the generator object:

    # If there is still a child of the node object on its left
    # AND if the distance is ok, return the next child
    if self._leftchild and distance - max_dist < self._median:
        yield self._leftchild

    # If there is still a child of the node object on its right
    # AND if the distance is ok, return the next child
    if self._rightchild and distance + max_dist >= self._median:
        yield self._rightchild

    # If the function arrives here, the generator will be considered empty
    # there is no more than two values: the left and the right children


# Create an empty list and a list with the current object reference
result, candidates = list(), [self]

# Loop on candidates (they contain only one element at the beginning)
while candidates:

    # Get the last candidate and remove it from the list
    node = candidates.pop()

    # Get the distance between obj and the candidate
    distance = node._get_dist(obj)

    # If distance is ok, then you can fill the result
    # (node._values is assumed to be the attribute holding the node's values)
    if distance <= max_dist and distance >= min_dist:
        result.extend(node._values)

    # Add the children of the candidate in the candidate's list
    # so the loop will keep running until it will have looked
    # at all the children of the children of the children, etc. of the candidate
    candidates.extend(node._get_child_candidates(distance, min_dist, max_dist))

return result

This code contains several smart parts:

  • The loop iterates on a list, but the list expands while the loop is being iterated. It's a concise way to go through all these nested data even if it's a bit dangerous since you can end up with an infinite loop. In this case, candidates.extend(node._get_child_candidates(distance, min_dist, max_dist)) exhausts all the values of the generator, but while keeps creating new generator objects which will produce different values from the previous ones since it's not applied on the same node.

  • The extend() method is a list object method that expects an iterable and adds its values to the list.

Usually we pass a list to it:

>>> a = [1, 2]
>>> b = [3, 4]
>>> a.extend(b)
>>> print(a)
[1, 2, 3, 4]

But in your code, it gets a generator, which is good because:

  1. You don't need to read the values twice.
  2. You may have a lot of children and you don't want them all stored in memory.

And it works because Python does not care if the argument of a method is a list or not. Python expects iterables so it will work with strings, lists, tuples, and generators! This is called duck typing and is one of the reasons why Python is so cool. But this is another story, for another question...
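As a small illustration of that point, extend() accepts any iterable. The countdown generator below is a made-up helper, not part of the code being explained.

```python
def countdown(n):
    """Yield n, n-1, ..., 1."""
    while n > 0:
        yield n
        n -= 1


items = [0]
items.extend([1, 2])        # a list
items.extend((3, 4))        # a tuple
items.extend("ab")          # a string, added character by character
items.extend(countdown(2))  # a generator, consumed as it is read

print(items)
```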

You can stop here, or read a little bit to see an advanced use of a generator:

Controlling a generator exhaustion

>>> class Bank():  # Let's create a bank, building ATMs
...    crisis = False
...    def create_atm(self):
...        while not self.crisis:
...            yield "$100"
>>> hsbc = Bank()  # When everything's ok the ATM gives you as much as you want
>>> corner_street_atm = hsbc.create_atm()
>>> print(next(corner_street_atm))
$100
>>> print(next(corner_street_atm))
$100
>>> print([next(corner_street_atm) for cash in range(5)])
['$100', '$100', '$100', '$100', '$100']
>>> hsbc.crisis = True  # Crisis is coming, no more money!
>>> print(next(corner_street_atm))
Traceback (most recent call last):
  ...
StopIteration
>>> wall_street_atm = hsbc.create_atm()  # It's even true for new ATMs
>>> print(next(wall_street_atm))
Traceback (most recent call last):
  ...
StopIteration
>>> hsbc.crisis = False  # The trouble is, even post-crisis the ATM remains empty
>>> print(next(corner_street_atm))
Traceback (most recent call last):
  ...
StopIteration
>>> brand_new_atm = hsbc.create_atm()  # Build a new one to get back in business
>>> for cash in brand_new_atm:
...    print(cash)

Note: the built-in next() works in Python 2.6+ and Python 3. In Python 3, next(corner_street_atm) is equivalent to corner_street_atm.__next__().

It can be useful for various things like controlling access to a resource.

Itertools, your best friend

The itertools module contains special functions to manipulate iterables. Ever wish to duplicate a generator? Chain two generators? Group values in a nested list with a one-liner? Map / Zip without creating another list?

Then just import itertools.

An example? Let's see the possible orders of arrival for a four-horse race:

>>> import itertools
>>> horses = [1, 2, 3, 4]
>>> races = itertools.permutations(horses)
>>> print(races)
<itertools.permutations object at 0xb754f1dc>
>>> print(list(itertools.permutations(horses)))
[(1, 2, 3, 4),
 (1, 2, 4, 3),
 (1, 3, 2, 4),
 (1, 3, 4, 2),
 (1, 4, 2, 3),
 (1, 4, 3, 2),
 (2, 1, 3, 4),
 (2, 1, 4, 3),
 (2, 3, 1, 4),
 (2, 3, 4, 1),
 (2, 4, 1, 3),
 (2, 4, 3, 1),
 (3, 1, 2, 4),
 (3, 1, 4, 2),
 (3, 2, 1, 4),
 (3, 2, 4, 1),
 (3, 4, 1, 2),
 (3, 4, 2, 1),
 (4, 1, 2, 3),
 (4, 1, 3, 2),
 (4, 2, 1, 3),
 (4, 2, 3, 1),
 (4, 3, 1, 2),
 (4, 3, 2, 1)]
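Two of the other uses mentioned above, duplicating a generator and chaining two generators, can be sketched with itertools.tee and itertools.chain. The squares generator is a made-up helper for illustration.

```python
import itertools


def squares(n):
    """Yield the squares of 0..n-1."""
    for i in range(n):
        yield i * i


# Duplicate one generator into two independent iterators
copy_a, copy_b = itertools.tee(squares(3))
first_copy = list(copy_a)
second_copy = list(copy_b)

# Chain two generators without building an intermediate list
chained = list(itertools.chain(squares(2), squares(3)))

print(first_copy, second_copy)
print(chained)
```

Note that after calling tee, the original generator should no longer be used directly, because advancing it would also affect the copies.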

Understanding the inner mechanisms of iteration

Iteration is a process implying iterables (implementing the __iter__() method) and iterators (implementing the __next__() method). Iterables are any objects you can get an iterator from. Iterators are objects that let you iterate on iterables.

There is more about it in this article about how for loops work.
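A minimal sketch of the two roles is shown below, assuming Python 3 method names. The classes Countdown and CountdownIterator and the function manual_for are hypothetical names; manual_for only approximates what a for loop does internally.

```python
class Countdown:
    """An iterable: each __iter__ call hands out a fresh iterator."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return CountdownIterator(self.start)


class CountdownIterator:
    """An iterator: produces the values one by one via __next__."""

    def __init__(self, current):
        self.current = current

    def __iter__(self):
        return self  # iterators are themselves iterable

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # signals that the iterator is exhausted
        value = self.current
        self.current -= 1
        return value


def manual_for(iterable):
    """Collect items the way a for loop would: iter(), then next() until StopIteration."""
    collected = []
    iterator = iter(iterable)
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            break
        collected.append(item)
    return collected


print(manual_for(Countdown(3)))
```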

Answer #5

You might be interested in the SciPy Stats package. It has the percentile function you're after and many other statistical goodies.

percentile() is available in numpy too.

import numpy as np
a = np.array([1,2,3,4,5])
p = np.percentile(a, 50)  # return the 50th percentile, i.e. the median
print(p)

This ticket leads me to believe they won't be integrating percentile() into numpy anytime soon.

Answer #6

Python 3.4 has statistics.median:

Return the median (middle value) of numeric data.

When the number of data points is odd, return the middle data point. When the number of data points is even, the median is interpolated by taking the average of the two middle values:

>>> median([1, 3, 5])
3
>>> median([1, 3, 5, 7])
4.0

Usage:

import statistics

items = [6, 1, 8, 2, 3]

statistics.median(items)
#>>> 3

It's pretty careful with types, too:

statistics.median(map(float, items))
#>>> 3.0

from decimal import Decimal
statistics.median(map(Decimal, items))
#>>> Decimal('3')

Answer #7

Something important when dealing with outliers is that one should try to use estimators as robust as possible. The mean of a distribution will be biased by outliers, but the median, for example, will be much less so.

Building on eumiro's answer:

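import numpy as np

def reject_outliers(data, m=2.):
    d = np.abs(data - np.median(data))  # absolute distance to the median
    mdev = np.median(d)                 # median of those distances
    # scale by mdev; if mdev is 0, treat every point as an inlier
    s = d / mdev if mdev else np.zeros(len(d))
    return data[s < m]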

Here I have replaced the mean with the more robust median and the standard deviation with the median absolute distance to the median. I then scaled the distances by their (again) median value so that m is on a reasonable relative scale.

Note that for the data[s<m] syntax to work, data must be a numpy array.
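A short usage sketch follows, with made-up sample data. The function is repeated so the block is self-contained, and the zero-deviation case returns an array of zeros, which is an assumption about how that edge case should behave.

```python
import numpy as np


def reject_outliers(data, m=2.0):
    d = np.abs(data - np.median(data))  # distance of each point to the median
    mdev = np.median(d)                 # median of those distances
    # Assumption: when mdev is 0, keep every point
    s = d / mdev if mdev else np.zeros(len(d))
    return data[s < m]


data = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 100.0])
cleaned = reject_outliers(data)
print(cleaned)

# All values identical: the median deviation is 0, nothing is rejected
constant = reject_outliers(np.array([5.0, 5.0, 5.0]))
print(constant)
```

Here the median is 10.25 and the median distance is 0.75, so only the value 100.0 has a scaled distance above m=2 and is dropped.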

Answer #8

>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> import itertools
>>> k.sort()
>>> list(k for k,_ in itertools.groupby(k))
[[1, 2], [3], [4], [5, 6, 2]]

itertools often offers the fastest and most powerful solutions to this kind of problem, and is well worth getting intimately familiar with!-)

Edit: as I mention in a comment, normal optimization efforts are focused on large inputs (the big-O approach) because it's so much easier that it offers good returns on efforts. But sometimes (essentially for "tragically crucial bottlenecks" in deep inner loops of code that's pushing the boundaries of performance limits) one may need to go into much more detail, providing probability distributions, deciding which performance measures to optimize (maybe the upper bound or the 90th centile is more important than an average or median, depending on one's apps), performing possibly-heuristic checks at the start to pick different algorithms depending on input data characteristics, and so forth.

Careful measurements of "point" performance (code A vs code B for a specific input) are a part of this extremely costly process, and standard library module timeit helps here. However, it's easier to use it at a shell prompt. For example, here's a short module to showcase the general approach for this problem; save it as nodup.py:

import itertools

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

def doset(k, map=map, list=list, set=set, tuple=tuple):
  return list(map(list, set(map(tuple, k))))

def dosort(k, sorted=sorted, range=range, len=len):
  ks = sorted(k)
  return [ks[i] for i in range(len(ks)) if i == 0 or ks[i] != ks[i-1]]

def dogroupby(k, sorted=sorted, groupby=itertools.groupby, list=list):
  ks = sorted(k)
  return [i for i, _ in groupby(ks)]

def donewk(k):
  newk = []
  for i in k:
    if i not in newk:
      newk.append(i)
  return newk

# sanity check that all functions compute the same result and don't alter k
if __name__ == "__main__":
  savek = list(k)
  for f in doset, dosort, dogroupby, donewk:
    resk = f(k)
    assert k == savek
    print("%10s %s" % (f.__name__, sorted(resk)))

Note the sanity check (performed when you just do python nodup.py) and the basic hoisting technique (make constant global names local to each function for speed) to put things on equal footing.

Now we can run checks on the tiny example list:

$ python -mtimeit -s"import nodup" "nodup.doset(nodup.k)"
100000 loops, best of 3: 11.7 usec per loop
$ python -mtimeit -s"import nodup" "nodup.dosort(nodup.k)"
100000 loops, best of 3: 9.68 usec per loop
$ python -mtimeit -s"import nodup" "nodup.dogroupby(nodup.k)"
100000 loops, best of 3: 8.74 usec per loop
$ python -mtimeit -s"import nodup" "nodup.donewk(nodup.k)"
100000 loops, best of 3: 4.44 usec per loop

confirming that the quadratic approach has small-enough constants to make it attractive for tiny lists with few duplicated values. With a short list without duplicates:

$ python -mtimeit -s"import nodup" "nodup.donewk([[i] for i in range(12)])"
10000 loops, best of 3: 25.4 usec per loop
$ python -mtimeit -s"import nodup" "nodup.dogroupby([[i] for i in range(12)])"
10000 loops, best of 3: 23.7 usec per loop
$ python -mtimeit -s"import nodup" "nodup.doset([[i] for i in range(12)])"
10000 loops, best of 3: 31.3 usec per loop
$ python -mtimeit -s"import nodup" "nodup.dosort([[i] for i in range(12)])"
10000 loops, best of 3: 25 usec per loop

the quadratic approach isn't bad, but the sort and groupby ones are better. Etc, etc.

If (as the obsession with performance suggests) this operation is at a core inner loop of your pushing-the-boundaries application, it's worth trying the same set of tests on other representative input samples, possibly detecting some simple measure that could heuristically let you pick one or the other approach (but the measure must be fast, of course).

It's also well worth considering keeping a different representation for k -- why does it have to be a list of lists rather than a set of tuples in the first place? If the duplicate removal task is frequent, and profiling shows it to be the program's performance bottleneck, keeping a set of tuples all the time and getting a list of lists from it only if and where needed, might be faster overall, for example.
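A minimal sketch of that alternative representation is shown below. The name unique_rows is illustrative, and whether this is actually faster depends on the workload, so it would need the same kind of measurement as above.

```python
k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

# Keep the data as a set of tuples: duplicates can never accumulate
unique_rows = {tuple(row) for row in k}

unique_rows.add((1, 2))  # already present, so the set is unchanged
unique_rows.add((7,))    # a new row is stored once

# Convert back to a list of lists only where one is really needed
as_lists = sorted(list(row) for row in unique_rows)
print(as_lists)
```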

Answer #9

(Works with Python 2.x):

def median(lst):
    n = len(lst)
    s = sorted(lst)
    return (sum(s[n//2-1:n//2+1])/2.0, s[n//2])[n % 2] if n else None

>>> median([-5, -5, -3, -4, 0, -1])
-3.5

numpy.median():

>>> from numpy import median
>>> median([1, -4, -1, -1, 1, -3])
-1.0

For Python 3.4+, use statistics.median:

>>> from statistics import median
>>> median([5, 2, 3, 8, 9, -2])
4.0

Answer #10

Levenshtein Python extension and C library.

The Levenshtein Python C extension module contains functions for fast computation of:

  • Levenshtein (edit) distance, and edit operations
  • string similarity
  • approximate median strings, and generally string averaging
  • string sequence and set similarity

It supports both normal and Unicode strings.

$ pip install python-levenshtein
$ python
>>> import Levenshtein
>>> help(Levenshtein.ratio)
    Compute similarity of two strings.

    ratio(string1, string2)

    The similarity is a number between 0 and 1, it's usually equal or
    somewhat higher than difflib.SequenceMatcher.ratio(), because it's
    based on real minimal edit distance.

    >>> ratio("Hello world!", "Holly grail!")
    >>> ratio("Brian", "Jesus")

>>> help(Levenshtein.distance)
    Compute absolute Levenshtein distance of two strings.

    distance(string1, string2)

    Examples (it's hard to spell Levenshtein correctly):
    >>> distance("Levenshtein", "Lenvinsten")
    >>> distance("Levenshtein", "Levensthein")
    >>> distance("Levenshtein", "Levenshten")
    >>> distance("Levenshtein", "Levenshtein")
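For readers who want to see what the edit distance measures, here is a pure-Python sketch of the classic dynamic-programming computation. It is an illustration only, not the library's C implementation, and the name edit_distance is hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(edit_distance("Levenshtein", "Levenshten"))
```

This version runs in time proportional to the product of the two string lengths, which is why a C extension is attractive when many long strings must be compared.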
