Pandas aggregate count distinct

nunique | StackOverflow

Let's say I have a log of user activity and I want to generate a report of total duration and the number of unique users per day.

import numpy as np
import pandas as pd
df = pd.DataFrame({"date": ["2013-04-01","2013-04-01","2013-04-01","2013-04-02", "2013-04-02"],
    "user_id": ["0001", "0001", "0002", "0002", "0002"],
    "duration": [30, 15, 20, 15, 30]})

Aggregating duration is pretty straightforward:

group = df.groupby("date")
agg = group.aggregate({"duration": np.sum})
agg
            duration
date
2013-04-01        65
2013-04-02        45

What I'd like to do is sum the duration and count distincts at the same time, but I can't seem to find an equivalent for count_distinct:

agg = group.aggregate({ "duration": np.sum, "user_id": count_distinct})

The following works, but surely there's a better way, no?

group = df.groupby("date")
agg = group.aggregate({"duration": np.sum})
agg["uv"] = df.groupby("date").user_id.nunique()
agg
            duration  uv
date
2013-04-01        65   2
2013-04-02        45   1

I'm thinking I just need to provide a function that returns the count of distinct items of a Series object to the aggregate function, but I don't have a lot of exposure to the various libraries at my disposal. Also, it seems that the groupby object already knows this information, so wouldn't I just be duplicating effort?
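
For what it's worth, any callable that maps a Series to a scalar can be passed to aggregate, so the hypothetical count_distinct from the question can be written in two lines (a sketch; as the answers below show, nunique already does this job):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2013-04-01", "2013-04-01", "2013-04-01",
                            "2013-04-02", "2013-04-02"],
                   "user_id": ["0001", "0001", "0002", "0002", "0002"],
                   "duration": [30, 15, 20, 15, 30]})

# A hand-rolled count_distinct: aggregate accepts any Series -> scalar callable
def count_distinct(s):
    return s.nunique()

agg = df.groupby("date").aggregate({"duration": "sum", "user_id": count_distinct})
print(agg)
```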

Answer rating: 183

How about either of:

>>> df
         date  duration user_id
0  2013-04-01        30    0001
1  2013-04-01        15    0001
2  2013-04-01        20    0002
3  2013-04-02        15    0002
4  2013-04-02        30    0002
>>> df.groupby("date").agg({"duration": np.sum, "user_id": pd.Series.nunique})
            duration  user_id
date                         
2013-04-01        65        2
2013-04-02        45        1
>>> df.groupby("date").agg({"duration": np.sum, "user_id": lambda x: x.nunique()})
            duration  user_id
date                         
2013-04-01        65        2
2013-04-02        45        1

Answer rating: 84

"nunique" is an option for .agg() since pandas 0.20.0, so:

df.groupby("date").agg({"duration": "sum", "user_id": "nunique"})
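
Since pandas 0.25 there is also named aggregation, which additionally lets you choose the output column names (e.g. the uv column from the question) in the same call:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2013-04-01", "2013-04-01", "2013-04-01",
                            "2013-04-02", "2013-04-02"],
                   "user_id": ["0001", "0001", "0002", "0002", "0002"],
                   "duration": [30, 15, 20, 15, 30]})

# Keyword is the output column name; value is (input column, aggregation)
agg = df.groupby("date").agg(
    duration=("duration", "sum"),
    uv=("user_id", "nunique"),
)
print(agg)
```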




Pandas aggregate count distinct: StackOverflow Questions

Answer #1

You need nunique:

df = df.groupby("domain")["ID"].nunique()

print (df)
domain
"facebook.com"    1
"google.com"      1
"twitter.com"     2
"vk.com"          3
Name: ID, dtype: int64

If you need to strip the " characters first:

df = df.ID.groupby([df.domain.str.strip('"')]).nunique()
print (df)
domain
facebook.com    1
google.com      1
twitter.com     2
vk.com          3
Name: ID, dtype: int64

Or as Jon Clements commented:

df.groupby(df.domain.str.strip('"'))["ID"].nunique()

You can retain the column name like this:

df = df.groupby(by="domain", as_index=False).agg({"ID": pd.Series.nunique})
print(df)
    domain  ID
0       fb   1
1      ggl   1
2  twitter   2
3       vk   3

The difference is that nunique() returns a Series and agg() returns a DataFrame.
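
That type difference is easy to check directly; the toy data below is assumed (values made up to match the counts above):

```python
import pandas as pd

# Assumed toy data: 3 distinct IDs for vk.com, 2 for twitter.com, 1 each for the rest
df = pd.DataFrame({"domain": ["vk.com", "vk.com", "vk.com", "twitter.com",
                              "twitter.com", "facebook.com", "google.com"],
                   "ID": [1, 2, 3, 4, 5, 6, 7]})

s = df.groupby("domain")["ID"].nunique()  # Series: domain becomes the index
out = df.groupby("domain", as_index=False).agg({"ID": pd.Series.nunique})  # DataFrame: domain stays a column

print(type(s).__name__, type(out).__name__)
```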

Answer #2

Generally to count distinct values in single column, you can use Series.value_counts:

df.domain.value_counts()

#"vk.com"          5
#"twitter.com"     2
#"facebook.com"    1
#"google.com"      1
#Name: domain, dtype: int64

To see how many unique values in a column, use Series.nunique:

df.domain.nunique()
# 4

To get all these distinct values, you can use unique or drop_duplicates; the slight difference between the two functions is that unique returns a numpy.ndarray while drop_duplicates returns a pandas.Series:

df.domain.unique()
# array(['"vk.com"', '"twitter.com"', '"facebook.com"', '"google.com"'], dtype=object)

df.domain.drop_duplicates()
#0          "vk.com"
#2     "twitter.com"
#4    "facebook.com"
#6      "google.com"
#Name: domain, dtype: object

As for this specific problem, since you'd like to count distinct values with respect to another variable, besides the groupby method provided by the other answers here, you can also simply drop duplicates first and then call value_counts():

import pandas as pd
df.drop_duplicates().domain.value_counts()

# "vk.com"          3
# "twitter.com"     2
# "facebook.com"    1
# "google.com"      1
# Name: domain, dtype: int64

Answer #3

To count distinct values, use nunique:

df["hID"].nunique()
5

To count only non-null values, use count:

df["hID"].count()
8

To count total values including null values, use the size attribute:

df["hID"].size
8
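
The answer's df isn't shown, so here is a small assumed column with a repeated value and a NaN, which makes the three numbers differ:

```python
import numpy as np
import pandas as pd

# Assumed toy column: one duplicate (2) and one missing value
s = pd.Series([1, 2, 2, 3, np.nan], name="hID")

print(s.nunique())  # 3 distinct non-null values (NaN is excluded by default)
print(s.count())    # 4 non-null values
print(s.size)       # 5 entries, NaN included
```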

Edit to add condition

Use boolean indexing:

df.loc[df["mID"]=="A","hID"].agg(["nunique","count","size"])

OR using query:

df.query("mID == 'A'")["hID"].agg(["nunique","count","size"])

Output:

nunique    5
count      5
size       5
Name: hID, dtype: int64

Answer #4

I believe this is what you want:

table.groupby("YEARMONTH").CLIENTCODE.nunique()

Example:

In [2]: table
Out[2]: 
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby("YEARMONTH").CLIENTCODE.nunique()
Out[3]: 
YEARMONTH
201301       2
201302       3
