Python | Pandas Series.nunique()

When analyzing data, you often need to know how many distinct values a particular column contains. The Pandas Series.nunique() method returns that count.

The examples below read a CSV file named employees.csv.

Syntax: Series.nunique(dropna=True)

Parameters:
dropna: If True (the default), NULL (NaN) values are excluded from the count.

Return Type: int, the number of unique values in the Series.
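As a quick illustration of the dropna parameter, here is a minimal sketch on a small hand-made Series. The team names are made up for illustration and do not come from the CSV file used in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three distinct team names plus one missing value
teams = pd.Series(["Sales", "Legal", "Sales", np.nan, "Finance"])

# Default behaviour (dropna=True): the missing value is not counted
without_nan = teams.nunique()

# dropna=False counts NaN as one additional distinct value
with_nan = teams.nunique(dropna=False)

print(without_nan, with_nan)  # 3 4
```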

Example #1: Using nunique()
This example uses the nunique() method to get the number of unique values in the Team column.

# import the pandas package
import pandas as pd

# create a DataFrame from the CSV file
data = pd.read_csv("employees.csv")

# store the number of unique values in a variable
unique_value = data["Team"].nunique()

# print the value
print(unique_value)

Output:
The number of unique values in the Team column is printed.

10

Example #2: Handling NULL
This example compares the length of the array returned by the unique() method to the integer returned by the nunique() method.

 

# import the pandas package
import pandas as pd

# create a DataFrame from the CSV file
data = pd.read_csv("employees.csv")

# store the array of unique values (NULL is included)
arr = data["Team"].unique()

# store the number of unique values (NULL is excluded)
unique_value = data["Team"].nunique(dropna=True)

# print both values
print(len(arr), unique_value)

Output:
The two numbers differ because dropna is True, so NULL values were excluded from the nunique() count, while the array returned by unique() still includes the NULL value.

11 10
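If the CSV file is not at hand, the same comparison can be sketched with an in-memory DataFrame. The Team values below are hypothetical and chosen only so that the column contains missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Team column, with two missing entries
data = pd.DataFrame(
    {"Team": ["Marketing", "Finance", np.nan, "Marketing", "Legal", np.nan]}
)

# unique() keeps the missing value as one entry of the result
arr = data["Team"].unique()

# nunique() skips it by default
unique_value = data["Team"].nunique(dropna=True)

print(len(arr), unique_value)  # 4 3

# With dropna=False the two counts agree
print(data["Team"].nunique(dropna=False) == len(arr))  # True
```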




Python | Pandas Series.nunique(): StackOverflow Questions

Answer #1

You need nunique:

df = df.groupby("domain")["ID"].nunique()

print (df)
domain
"facebook.com"    1
"google.com"      1
"twitter.com"     2
"vk.com"          3
Name: ID, dtype: int64

If you need to strip " characters:

df = df.ID.groupby([df.domain.str.strip('"')]).nunique()
print (df)
domain
facebook.com    1
google.com      1
twitter.com     2
vk.com          3
Name: ID, dtype: int64

Or as Jon Clements commented:

df.groupby(df.domain.str.strip('"'))["ID"].nunique()

You can retain the column name like this:

df = df.groupby(by="domain", as_index=False).agg({"ID": pd.Series.nunique})
print(df)
    domain  ID
0       fb   1
1      ggl   1
2  twitter   2
3       vk   3

The difference is that nunique() returns a Series and agg() returns a DataFrame.
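To make the Series versus DataFrame distinction concrete, here is a small self-contained sketch. The domain values follow the outputs shown in this answer, but the ID values are assumptions, since the original input data is not reproduced here.

```python
import pandas as pd

# Assumed input: domains as in the answer, ID values invented for illustration
df = pd.DataFrame({
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com",
               "vk.com", "google.com", "twitter.com", "vk.com"],
    "ID": [123, 123, 123, 456, 123, 789, 123, 456, 789],
})

# Selecting the column and calling nunique() gives a Series indexed by domain
as_series = df.groupby("domain")["ID"].nunique()

# as_index=False with agg() keeps "domain" as a regular column of a DataFrame
as_frame = df.groupby("domain", as_index=False).agg({"ID": "nunique"})

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(list(as_frame.columns))    # ['domain', 'ID']
```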

Answer #2

Generally, to count how often each distinct value occurs in a single column, you can use Series.value_counts:

df.domain.value_counts()

#"vk.com"          5
#"twitter.com"     2
#"facebook.com"    1
#"google.com"      1
#Name: domain, dtype: int64

To see how many unique values a column contains, use Series.nunique:

df.domain.nunique()
# 4

To get the distinct values themselves, use unique or drop_duplicates. The slight difference between the two is that unique returns a numpy.array, while drop_duplicates returns a pandas.Series:

df.domain.unique()
# array(['"vk.com"', '"twitter.com"', '"facebook.com"', '"google.com"'], dtype=object)

df.domain.drop_duplicates()
#0          "vk.com"
#2     "twitter.com"
#4    "facebook.com"
#6      "google.com"
#Name: domain, dtype: object
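The return types can be checked with a short sketch like the one below. The Series is a hypothetical stand-in for the domain column, and dtype=object is set explicitly (an assumption) so that it matches the object dtype shown in the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df.domain
domain = pd.Series(
    ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com"],
    dtype=object,
)

as_array = domain.unique()            # NumPy array of distinct values
as_series = domain.drop_duplicates()  # Series that keeps the original index labels

print(isinstance(as_array, np.ndarray))  # True
print(type(as_series).__name__)          # Series
print(list(as_series.index))             # [0, 2, 4]
```

Because drop_duplicates keeps the index of each first occurrence, it is the handier choice when you need to trace values back to their rows.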

As for this specific problem, since you'd like to count distinct values with respect to another variable, besides the groupby method provided by other answers here, you can also simply drop duplicates first and then call value_counts():

import pandas as pd
df.drop_duplicates().domain.value_counts()

# "vk.com"          3
# "twitter.com"     2
# "facebook.com"    1
# "google.com"      1
# Name: domain, dtype: int64
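A minimal sketch comparing this approach with groupby is shown below. The ID values are assumptions made for illustration. Note that drop_duplicates() compares whole rows, so the shortcut only matches the groupby result when the frame contains just the domain and ID columns.

```python
import pandas as pd

# Assumed input: only the two relevant columns, ID values invented
df = pd.DataFrame({
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com",
               "vk.com", "google.com", "twitter.com", "vk.com"],
    "ID": [123, 123, 123, 456, 123, 789, 123, 456, 789],
})

# Remove repeated (domain, ID) pairs, then count the rows left per domain
via_dedup = df.drop_duplicates().domain.value_counts()

# Count distinct IDs per domain directly
via_groupby = df.groupby("domain")["ID"].nunique()

# Sort both by domain so the counts line up, then compare
same = via_dedup.sort_index().tolist() == via_groupby.sort_index().tolist()
print(same)  # True
```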

Answer #3

Count distinct values, use nunique:

df["hID"].nunique()
5

Count only non-null values, use count:

df["hID"].count()
8

Count total values including null values, use the size attribute:

df["hID"].size
8

Edit to add condition

Use boolean indexing:

df.loc[df["mID"]=="A","hID"].agg(["nunique","count","size"])

OR using query:

df.query("mID == 'A'")["hID"].agg(["nunique","count","size"])

Output:

nunique    5
count      5
size       5
Name: hID, dtype: int64
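The three counts only diverge when the column has duplicates or missing values. Here is a minimal sketch with made-up data (the mID and hID values are assumptions, not the data from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one repeated hID and one missing hID
df = pd.DataFrame({
    "mID": ["A", "A", "A", "B", "B"],
    "hID": [101, 102, 101, np.nan, 103],
})

n_distinct = df["hID"].nunique()  # distinct non-null values
n_non_null = df["hID"].count()    # non-null values
n_total = df["hID"].size          # all values, including nulls

print(n_distinct, n_non_null, n_total)  # 3 4 5

# The same three measures restricted to rows where mID is "A"
summary_a = df.loc[df["mID"] == "A", "hID"].agg(["nunique", "count", "size"])
print(summary_a.tolist())
```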

Answer #4

"nunique" is an option for .agg() since pandas 0.20.0, so:

df.groupby("date").agg({"duration": "sum", "user_id": "nunique"})
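For a runnable version, the sketch below rebuilds the small date/duration/user_id table shown in Answer #6 on this page and applies the string aliases to it. Building the frame from a dict is just one convenient way to do it.

```python
import pandas as pd

# Data copied from the table printed in Answer #6
df = pd.DataFrame({
    "date": ["2013-04-01", "2013-04-01", "2013-04-01", "2013-04-02", "2013-04-02"],
    "duration": [30, 15, 20, 15, 30],
    "user_id": ["0001", "0001", "0002", "0002", "0002"],
})

# Total duration and number of distinct users per date
result = df.groupby("date").agg({"duration": "sum", "user_id": "nunique"})
print(result)
```

With this data, 2013-04-01 has a total duration of 65 across 2 distinct users, and 2013-04-02 has 45 from 1 user.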

Answer #5

I believe this is what you want:

table.groupby("YEARMONTH").CLIENTCODE.nunique()

Example:

In [2]: table
Out[2]: 
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby("YEARMONTH").CLIENTCODE.nunique()
Out[3]: 
YEARMONTH
201301       2
201302       3

Answer #6

How about either of:

>>> df
         date  duration user_id
0  2013-04-01        30    0001
1  2013-04-01        15    0001
2  2013-04-01        20    0002
3  2013-04-02        15    0002
4  2013-04-02        30    0002
>>> df.groupby("date").agg({"duration": np.sum, "user_id": pd.Series.nunique})
            duration  user_id
date                         
2013-04-01        65        2
2013-04-02        45        1
>>> df.groupby("date").agg({"duration": np.sum, "user_id": lambda x: x.nunique()})
            duration  user_id
date                         
2013-04-01        65        2
2013-04-02        45        1
