Aggregate

Aggregate over a column

Pandas’ aggregate statistics functions can be used to calculate statistics on a column of a DataFrame. The general syntax for these calculations is:

df.column_name.command()

The following table summarizes some common commands:

Command    Description
mean       Average of all values in column
std        Standard deviation
median     Median
max        Maximum value in column
min        Minimum value in column
count      Number of values in column
nunique    Number of unique values in column
unique     List of unique values in column
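For example, assuming a DataFrame called orders with a numeric price column (the same example data used below), these commands could be applied like so:

# Assumed: orders is a DataFrame with a numeric price column
print(orders.price.mean())     # average price across all orders
print(orders.price.max())      # highest price in the column
print(orders.price.nunique())  # number of distinct prices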

Aggregate over groups

When we have a bunch of data, we often want to calculate aggregate statistics (mean, standard deviation, median, percentiles, etc.) over certain subsets of the data. In general, we use the following syntax to calculate aggregates:

df.groupby('column1').column2.measurement()

where:

  • column1 is the column that we want to group by

  • column2 is the column that we want to perform a measurement on

  • measurement is the measurement function we want to apply

The output will be a Series, not a DataFrame.

pricey_shoes = orders.groupby('shoe_type') \
               .price.max()
print(pricey_shoes)
# shoe_type
# ballet flats    498
# sandals         498
# stilettos       468
# wedges          488
# Name: price, dtype: int64

print(type(pricey_shoes))
# <class 'pandas.core.series.Series'>

In this example, the index of the Series consists of the different values of shoe_type, and the name property is price.

Usually, we’d prefer that those indices were actually a column. In order to get that, we can use reset_index(). This will transform our Series into a DataFrame and move the indices into their own column.

You’ll generally see a groupby() statement followed by reset_index():

df.groupby('column1').column2.measurement() \
    .reset_index()

For the previous example, after using reset_index(), the output is:

pricey_shoes = orders.groupby('shoe_type') \
                .price.max().reset_index()
print(pricey_shoes)
#     shoe_type  price
# 0  ballet flats    498
# 1       sandals    498
# 2     stilettos    468
# 3        wedges    488

print(type(pricey_shoes))
# <class 'pandas.core.frame.DataFrame'>

Note that this pattern requires at least two columns: one column to group by and another column to aggregate.

Aggregate using lambda functions

Sometimes, the operation that you want to perform is more complicated than mean or count. In those cases, you can use the apply method and lambda functions, just like we did for individual column operations. Note that the input to our lambda function will be the group’s values (a Series, which can be treated like a list of values).

A great example of this is calculating percentiles. Suppose we have a DataFrame of employee information called df that has the following columns:

  • id: the employee’s id number

  • name: the employee’s name

  • wage: the employee’s hourly wage

  • category: the type of work that the employee does

Our data might look something like this:

id       name             wage    category
10131    Sarah Carney     39      product
14189    Heather Carey    17      design
15004    Gary Mercado     33      marketing
11204    Cora Copaz       27      design
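For reference, here is a minimal sketch of how a DataFrame like this could be constructed, using the values from the table above:

import pandas as pd

# Illustrative reconstruction of the sample employee data
df = pd.DataFrame({
    'id': [10131, 14189, 15004, 11204],
    'name': ['Sarah Carney', 'Heather Carey', 'Gary Mercado', 'Cora Copaz'],
    'wage': [39, 17, 33, 27],
    'category': ['product', 'design', 'marketing', 'design']
})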

If we want to calculate the 75th percentile (i.e., the point at which 75% of employees have a lower wage and 25% have a higher wage) for each category, we can use the following combination of apply and a lambda function:

import numpy as np

# np.percentile can calculate any percentile
# over an array of values
high_earners = df.groupby('category').wage \
    .apply(lambda x: np.percentile(x, 75)) \
    .reset_index()

The output, high_earners, might look like this:

   category    wage
0  design      23
1  marketing   35
2  product     48

Pivot Tables

When we perform a groupby() across two columns, we may want the resulting DataFrame to be wide instead of long. The pivot() function can help transform a long DataFrame into a wide DataFrame.

In Pandas, the command for pivot() is:

df.pivot(columns='ColumnToPivot',
         index='ColumnToBeRows',
         values='ColumnToBeValues')

Here is an example:

# First use the groupby statement:
unpivoted = df.groupby(['Location', 'Day of Week'])['Total Sales'] \
    .mean().reset_index()

# Now pivot the table
pivoted = unpivoted.pivot(
    columns='Day of Week',
    index='Location',
    values='Total Sales')

Just like with groupby(), the output of a pivot command is a new DataFrame, but the indexing tends to be “weird”, so we usually follow up with .reset_index().
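A minimal sketch of that follow-up step, assuming pivoted is the DataFrame produced by the pivot() call above:

# Move Location from the index back into a regular column,
# leaving one row per Location and one column per Day of Week
pivoted = pivoted.reset_index()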
