Aggregate
Aggregate over a column
Pandas’ aggregate statistics functions can be used to calculate statistics on a column of a DataFrame.The general syntax for these calculations is:
The following table summarizes some common commands:
Command
Description
mean
Average of all values in column
std
Standard deviation
median
Median
max
Maximum value in column
min
Minimum value in column
count
Number of values in column
nunique
Number of unique values in column
unique
List of unique values in column
Aggregate over groups
When we have a bunch of data, we often want to calculate aggregate statistics (mean, standard deviation, median, percentiles, etc.) over certain subsets of the data. In general, we use the following syntax to calculate aggregates:
where:
column1
is the column that we want to group bycolumn2
is the column that we want to perform a measurement onmeasurement
is the measurement function we want to apply
The output will be a series, not a DataFrame.
In this example, the indices of the Series were different values of shoe_type
, and the name property was price
.
Usually, we’d prefer that those indices were actually a column. In order to get that, we can use reset_index()
. This will transform our Series into a DataFrame and move the indices into their own column.
Generally, you’ll always see a groupby()
statement followed by reset_index()
:
For the previous example, after using reset_index()
, the output is:
Note that to use the groupby()
function, at least two columns must be supplied.
Aggregate using lambda functions
Sometimes, the operation that you want to perform is more complicated than mean
or count
. In those cases, you can use the apply
method and lambda functions, just like we did for individual column operations. Note that the input to our lambda function will always be a list of values.
A great example of this is calculating percentiles. Suppose we have a DataFrame of employee information called df
that has the following columns:
id
: the employee’s id numbername
: the employee’s namewage
: the employee’s hourly wagecategory
: the type of work that the employee does
Our data might look something like this:
id
name
wage
category
10131
Sarah Carney
39
product
14189
Heather Carey
17
design
15004
Gary Mercado
33
marketing
11204
Cora Copaz
27
design
…
If we want to calculate the 75th percentile (i.e., the point at which 75% of employees have a lower wage and 25% have a higher wage) for each category
, we can use the following combination of apply
and a lambda function:
The output, high_earners
might look like this:
Sometimes, the operation that you want to perform is more complicated than mean
or count
. In those cases, you can use the apply
method and lambda functions, just like we did for individual column operations. Note that the input to our lambda function will always be a list of values.
A great example of this is calculating percentiles. Suppose we have a DataFrame of employee information called df
that has the following columns:
id
: the employee’s id numbername
: the employee’s namewage
: the employee’s hourly wagecategory
: the type of work that the employee does
Our data might look something like this:
id
name
wage
category
10131
Sarah Carney
39
product
14189
Heather Carey
17
design
15004
Gary Mercado
33
marketing
11204
Cora Copaz
27
design
…
If we want to calculate the 75th percentile (i.e., the point at which 75% of employees have a lower wage and 25% have a higher wage) for each category
, we can use the following combination of apply
and a lambda function:
The output, high_earners
might look like this:
category
wage
0
design
23
1
marketing
35
2
product
48
Pivot Tables
When we perform a groupby()
on 2 columns, we may want the resulting DataFrame to be wide instead of long. pivot()
function can help transfer a long DataFrame to a wide Dataframe.
In Pandas, the command for pivot()
is:
Here is a example:
Just like with groupby()
, the output of a pivot command is a new DataFrame, but the indexing tends to be “weird”, so we usually follow up with .reset_index()
.
Last updated
Was this helpful?