pandas DataFrame nunique





pandas DataFrame.nunique function

DataFrame.nunique(axis=0, dropna=True) counts the number of distinct elements along the specified axis. It returns a Series with the number of distinct elements and can ignore NaN values.

Name | Description | Type / Default Value | Required / Optional
axis | The axis to use: 0 or 'index' for row-wise, 1 or 'columns' for column-wise. | {0 or 'index', 1 or 'columns'}, default 0 | Optional
dropna | Don't include NaN in the counts. | bool, default True | Optional
Python is a great language for data analysis, mainly because of its ecosystem of data-centric packages. pandas is one of those packages, and it makes importing and analyzing data much easier. The DataFrame.nunique() function returns a Series with the number of distinct observations along the requested axis. With axis=0 (the default), it counts the unique values in each column, working down the index. With axis=1, it counts the unique values in each row, working across the columns. By default NaN values are excluded from the count; pass dropna=False to include them.
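As a minimal sketch of the two parameters described above, the following uses a small made-up frame (the column names x and y and their values are assumptions chosen purely for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the original page.
df = pd.DataFrame({
    "x": [1, 1, 2, np.nan],
    "y": [1, 3, 3, np.nan],
})

per_column = df.nunique()            # axis=0 (default): one count per column
per_row = df.nunique(axis=1)         # axis=1: one count per row
with_nan = df.nunique(dropna=False)  # NaN is counted as its own value

print(per_column.tolist())  # [2, 2]
print(per_row.tolist())     # [1, 2, 2, 0]
print(with_nan.tolist())    # [3, 3]
```

The last row contains only NaN, so its row-wise count is 0 under the default dropna=True.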

pandas DataFrame nunique Example #1


def get_nunique(self, colname):
        """
        Looks up or caches the number of unique (distinct) values in a column,
        or calculates and caches it.
        """
        return self.get_cached_value('nunique', colname, self.calc_nunique) 

pandas DataFrame nunique Example #2


def get_database_nunique(self, tablename, colname):
        colname = self.quoted(colname)
        sql = ('SELECT COUNT(DISTINCT %s) FROM %s WHERE %s IS NOT NULL'
               % (colname, tablename, colname))
        return self.execute_scalar(sql) 
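The SQL in Example #2 excludes NULLs before counting, which appears to mirror what nunique() does with dropna=True. A rough sketch of that comparison, assuming an in-memory SQLite database and a hypothetical table t with a single column c (neither is part of the original example):

```python
import sqlite3

import pandas as pd

# Hypothetical table and values, used only to compare the two counts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), ("b",), ("a",), (None,)])

# Same query shape as in Example #2.
sql_count = conn.execute(
    "SELECT COUNT(DISTINCT c) FROM t WHERE c IS NOT NULL"
).fetchone()[0]

# Load the same column into pandas and count distinct non-null values.
df = pd.read_sql_query("SELECT c FROM t", conn)
pandas_count = df["c"].nunique()
conn.close()

print(sql_count, pandas_count)  # 2 2
```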

pandas DataFrame nunique Example #3

Use the nunique() function to find the number of unique values along the column axis (axis=1), which gives one count per row.

# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.DataFrame({"A": [14, 4, 5, 4, 1],
                   "B": [5, 2, 54, 3, 2],
                   "C": [20, 20, 7, 3, 8],
                   "D": [14, 3, 6, 2, 6]})

# Print the dataframe
df

# Count the unique values in each row
df.nunique(axis=1)

pandas DataFrame nunique Example #4

Use the nunique() function to find the number of unique values along the index axis (axis=0), which gives one count per column. The dataframe contains NaN values.

# importing pandas as pd and numpy as np (np.nan is used below)
import numpy as np
import pandas as pd

# Creating the dataframe
df = pd.DataFrame({"A": ["Sandy", "alex", "brook", "kelly", np.nan],
                   "B": [np.nan, "olivia", "olivia", "", "amanda"],
                   "C": [20 + 5j, 20 + 5j, 7, None, 8],
                   "D": [14.8, 3, None, 6, 6]})

# apply the nunique() function
df.nunique(axis=0, dropna=True)




Archived version

The pandas function DataFrame.nunique() returns a Series with the number of distinct observations along the requested axis. With axis=0, it counts the unique values in each column (along the index axis). With axis=1, it counts the unique values in each row (along the column axis). It can also exclude NaN values from the count.

Syntax: DataFrame.nunique(axis=0, dropna=True)

Parameters:
axis: {0 or 'index', 1 or 'columns'}, default 0
dropna: bool, default True. Don't include NaN in the counts.

Returns: nunique: Series

Example #1: Use nunique() to find the number of unique values along the column axis.

# importing pandas as pd
import pandas as pd

# Create the data frame
df = pd.DataFrame({"A": [14, 4, 5, 4, 1],
                   "B": [5, 2, 54, 3, 2],
                   "C": [20, 20, 7, 3, 8],
                   "D": [14, 3, 6, 2, 6]})

# Print the data frame
df

Let's use the DataFrame.nunique() function to count the unique values along the column axis.

# count the unique values in each row
df.nunique(axis=1)

The result is a Series holding the number of unique values in each row.
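The row-wise counts for this data frame can be reproduced with the short sketch below. It reuses the data from Example #1; the variable names per_row and per_column are introduced here only for illustration.

```python
import pandas as pd

# Same data as in Example #1.
df = pd.DataFrame({"A": [14, 4, 5, 4, 1],
                   "B": [5, 2, 54, 3, 2],
                   "C": [20, 20, 7, 3, 8],
                   "D": [14, 3, 6, 2, 6]})

per_row = df.nunique(axis=1)     # one count per row
per_column = df.nunique(axis=0)  # one count per column, for comparison

print(per_row.tolist())     # [3, 4, 4, 3, 4]
print(per_column.tolist())  # [4, 4, 4, 4]
```

Row 0 holds 14, 5, 20, 14, so it has three distinct values; row 3 holds 4, 3, 3, 2, which also gives three.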

Example #2: Use nunique() to find the number of unique values along the index axis in a data frame. The data frame contains NaN values.

# importing pandas as pd and numpy as np (np.nan is used below)
import numpy as np
import pandas as pd

# Create the data frame
df = pd.DataFrame({"A": ["Sandy", "alex", "brook", "kelly", np.nan],
                   "B": [np.nan, "olivia", "olivia", "", "amanda"],
                   "C": [20 + 5j, 20 + 5j, 7, None, 8],
                   "D": [14.8, 3, None, 6, 6]})

# apply the nunique() function
df.nunique(axis=0, dropna=True)

In the result, the empty string in column B is counted as a distinct value, while NaN and None are excluded because dropna=True.
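To see how the empty string and the missing values are treated, here is a small sketch that reuses columns A, B and D from Example #2 (the complex-valued column C is left out to keep the example simple) and compares dropna=True with dropna=False:

```python
import numpy as np
import pandas as pd

# Columns A, B and D from Example #2; column C is omitted in this sketch.
df = pd.DataFrame({"A": ["Sandy", "alex", "brook", "kelly", np.nan],
                   "B": [np.nan, "olivia", "olivia", "", "amanda"],
                   "D": [14.8, 3, None, 6, 6]})

without_nan = df.nunique(dropna=True)   # missing values ignored
with_nan = df.nunique(dropna=False)     # missing values counted once per column

print(without_nan.tolist())  # [4, 3, 3]
print(with_nan.tolist())     # [5, 4, 4]
```

Column B yields 3 with dropna=True because "olivia", "" and "amanda" are all distinct; the empty string is an ordinary value, not a missing one.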





pandas DataFrame nunique: StackOverflow Questions

Answer #1

You need nunique:

df = df.groupby("domain")["ID"].nunique()

print (df)
domain
"facebook.com"    1
"google.com"      1
"twitter.com"     2
"vk.com"          3
Name: ID, dtype: int64

If you need to strip " characters:

df = df.ID.groupby([df.domain.str.strip('"')]).nunique()
print (df)
domain
facebook.com    1
google.com      1
twitter.com     2
vk.com          3
Name: ID, dtype: int64

Or as Jon Clements commented:

df.groupby(df.domain.str.strip('"'))["ID"].nunique()

You can retain the column name like this:

df = df.groupby(by="domain", as_index=False).agg({"ID": pd.Series.nunique})
print(df)
    domain  ID
0       fb   1
1      ggl   1
2  twitter   2
3       vk   3

The difference is that nunique() returns a Series and agg() returns a DataFrame.
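A minimal sketch of that difference, using made-up domain and ID values (assumptions for illustration only). It uses reset_index() as one alternative way to turn the Series result into a DataFrame; this is a swapped-in route, not the agg() call from the answer.

```python
import pandas as pd

# Hypothetical data, not the asker's original frame.
df = pd.DataFrame({
    "domain": ["vk.com", "vk.com", "twitter.com", "twitter.com", "google.com"],
    "ID": [1, 2, 3, 3, 4],
})

# Selecting a single column and calling nunique() gives a Series
# indexed by the group key.
as_series = df.groupby("domain")["ID"].nunique()

# reset_index() turns the group key back into a regular column,
# producing a DataFrame with columns "domain" and "ID".
as_frame = as_series.reset_index()

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame)
```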

Answer #2

Generally, to count how often each distinct value occurs in a single column, you can use Series.value_counts:

df.domain.value_counts()

#"vk.com"          5
#"twitter.com"     2
#"facebook.com"    1
#"google.com"      1
#Name: domain, dtype: int64

To see how many unique values a column has, use Series.nunique:

df.domain.nunique()
# 4

To get all these distinct values, you can use unique or drop_duplicates. The slight difference between the two is that unique returns a numpy.array while drop_duplicates returns a pandas.Series:

df.domain.unique()
# array(['"vk.com"', '"twitter.com"', '"facebook.com"', '"google.com"'], dtype=object)

df.domain.drop_duplicates()
#0          "vk.com"
#2     "twitter.com"
#4    "facebook.com"
#6      "google.com"
#Name: domain, dtype: object

As for this specific problem, since you'd like to count distinct values with respect to another variable, besides the groupby method provided by other answers here, you can also simply drop duplicates first and then call value_counts():

import pandas as pd
df.drop_duplicates().domain.value_counts()

# "vk.com"          3
# "twitter.com"     2
# "facebook.com"    1
# "google.com"      1
# Name: domain, dtype: int64
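One caveat worth sketching: drop_duplicates() compares whole rows, so this shortcut matches groupby(...).nunique() only when no other column makes the rows differ. The frame below is hypothetical (the visit column is an assumption added to show the effect), and passing subset restricts the comparison to the two columns of interest.

```python
import pandas as pd

# Hypothetical data with an extra column that is unique per row.
df = pd.DataFrame({
    "ID": [1, 1, 2, 3, 3],
    "domain": ["vk.com", "vk.com", "vk.com", "twitter.com", "twitter.com"],
    "visit": [10, 11, 12, 13, 14],
})

# Whole-row comparison removes nothing here, so this counts rows, not IDs.
naive = df.drop_duplicates()["domain"].value_counts()

# Restricting the comparison to ID and domain counts distinct IDs per domain.
by_subset = df.drop_duplicates(subset=["ID", "domain"])["domain"].value_counts()

# The groupby route gives the same numbers regardless of extra columns.
by_groupby = df.groupby("domain")["ID"].nunique()

print(naive.to_dict())       # {'vk.com': 3, 'twitter.com': 2}
print(by_subset.to_dict())   # {'vk.com': 2, 'twitter.com': 1}
print(by_groupby.to_dict())  # {'twitter.com': 1, 'vk.com': 2}
```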

Answer #3

Count distinct values, use nunique:

df["hID"].nunique()
5

Count only non-null values, use count:

df["hID"].count()
8

Count total values including null values, use the size attribute:

df["hID"].size
8

Edit to add condition

Use boolean indexing:

df.loc[df["mID"]=="A","hID"].agg(["nunique","count","size"])

OR using query:

df.query("mID == 'A'")["hID"].agg(["nunique","count","size"])

Output:

nunique    5
count      5
size       5
Name: hID, dtype: int64
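The three measures only differ when duplicates or missing values are present. A small sketch on a made-up hID column (the values are assumptions for illustration) shows each one:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one duplicate and one missing value.
h_id = pd.Series([101, 102, 102, np.nan, 103], name="hID")

summary = {
    "nunique": int(h_id.nunique()),  # distinct non-null values
    "count": int(h_id.count()),      # non-null values, duplicates included
    "size": int(h_id.size),          # all values, nulls included
}

print(summary)  # {'nunique': 3, 'count': 4, 'size': 5}
```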

Answer #4

"nunique" is an option for .agg() since pandas 0.20.0, so:

df.groupby("date").agg({"duration": "sum", "user_id": "nunique"})
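As a runnable sketch of this call, the frame below copies the toy data shown in Answer #6 on this page; the variable name result is introduced here for illustration.

```python
import pandas as pd

# Toy data copied from the frame printed in Answer #6.
df = pd.DataFrame({
    "date": ["2013-04-01", "2013-04-01", "2013-04-01", "2013-04-02", "2013-04-02"],
    "duration": [30, 15, 20, 15, 30],
    "user_id": ["0001", "0001", "0002", "0002", "0002"],
})

# String names select the built-in aggregations: total duration and
# number of distinct users per date.
result = df.groupby("date").agg({"duration": "sum", "user_id": "nunique"})

print(result)
```

For this data the totals are 65 and 45, with 2 and 1 distinct users respectively, matching the output shown in Answer #6.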

Answer #5

I believe this is what you want:

table.groupby("YEARMONTH").CLIENTCODE.nunique()

Example:

In [2]: table
Out[2]: 
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby("YEARMONTH").CLIENTCODE.nunique()
Out[3]: 
YEARMONTH
201301       2
201302       3

Answer #6

How about either of:

>>> df
         date  duration user_id
0  2013-04-01        30    0001
1  2013-04-01        15    0001
2  2013-04-01        20    0002
3  2013-04-02        15    0002
4  2013-04-02        30    0002
>>> df.groupby("date").agg({"duration": np.sum, "user_id": pd.Series.nunique})
            duration  user_id
date                         
2013-04-01        65        2
2013-04-02        45        1
>>> df.groupby("date").agg({"duration": np.sum, "user_id": lambda x: x.nunique()})
            duration  user_id
date                         
2013-04-01        65        2
2013-04-02        45        1
