> For the complete documentation index, see [llms.txt](https://lei-d.gitbook.io/python-for-data-analysis/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://lei-d.gitbook.io/python-for-data-analysis/pandas-1/aggregate.md).

# Aggregate

## Aggregate over a column

Pandas’ aggregate statistics functions can be used to calculate statistics on a column of a DataFrame.The general syntax for these calculations is:

```python
df.column_name.command()
```

The following table summarizes some common commands:

| Command   | Description                       |
| --------- | --------------------------------- |
| `mean`    | Average of all values in column   |
| `std`     | Standard deviation                |
| `median`  | Median                            |
| `max`     | Maximum value in column           |
| `min`     | Minimum value in column           |
| `count`   | Number of values in column        |
| `nunique` | Number of unique values in column |
| `unique`  | List of unique values in column   |

## Aggregate over groups

When we have a bunch of data, we often want to calculate aggregate statistics (mean, standard deviation, median, percentiles, etc.) over certain subsets of the data. In general, we use the following syntax to calculate aggregates:

```python
df.groupby('column1').column2.measurement()
```

where:

* `column1` is the column that we want to group by
* `column2` is the column that we want to perform a measurement on&#x20;
* `measurement` is the measurement function we want to apply

The output will be a series, not a DataFrame.

```python
pricey_shoes = orders.groupby('shoe_type') \
               .price.max()
print(pricey_shoes)
# shoe_type
# ballet flats    498
# sandals         498
# stilettos       468
# wedges          488
# Name: price, dtype: int64

print(type(pricey_shoes))
# <class 'pandas.core.series.Series'>
```

In this example, the indices of the Series were different values of `shoe_type`, and the name property was `price`.&#x20;

Usually, we’d prefer that those indices were actually a column. In order to get that, we can use `reset_index()`. This will transform our Series into a DataFrame and move the indices into their own column.

Generally, you’ll always see a `groupby()` statement followed by `reset_index()`:

```python
df.groupby('column1').column2.measurement()
    .reset_index()
```

For the previous example, after using `reset_index()` , the output is:

```python
pricey_shoes = orders.groupby('shoe_type') \
                .price.max().reset_index()
print(pricey_shoes)
#     shoe_type  price
# 0  ballet flats    498
# 1       sandals    498
# 2     stilettos    468
# 3        wedges    488

print(type(pricey_shoes))
# <class 'pandas.core.frame.DataFrame'>
```

Note that to use the `groupby()` function, at least two columns must be supplied.

## Aggregate using lambda functions

Sometimes, the operation that you want to perform is more complicated than `mean` or `count`. In those cases, you can use the `apply` method and lambda functions, just like we did for individual column operations. Note that the input to our lambda function will always be a list of values.

A great example of this is calculating percentiles. Suppose we have a DataFrame of employee information called `df` that has the following columns:

* `id`: the employee’s id number
* `name`: the employee’s name
* `wage`: the employee’s hourly wage
* `category`: the type of work that the employee does

Our data might look something like this:

| id    | name          | wage | category  |
| ----- | ------------- | ---- | --------- |
| 10131 | Sarah Carney  | 39   | product   |
| 14189 | Heather Carey | 17   | design    |
| 15004 | Gary Mercado  | 33   | marketing |
| 11204 | Cora Copaz    | 27   | design    |
| …     |               |      |           |

If we want to calculate the 75th percentile (i.e., the point at which 75% of employees have a lower wage and 25% have a higher wage) for each `category`, we can use the following combination of `apply` and a lambda function:

```python
# np.percentile can calculate any percentile 
# over an array of values
high_earners = df.groupby('category').wage
    .apply(lambda x: np.percentile(x, 75))
    .reset_index()
```

The output, `high_earners` might look like this:\
Sometimes, the operation that you want to perform is more complicated than `mean` or `count`. In those cases, you can use the `apply` method and lambda functions, just like we did for individual column operations. Note that the input to our lambda function will always be a list of values.

A great example of this is calculating percentiles. Suppose we have a DataFrame of employee information called `df` that has the following columns:

* `id`: the employee’s id number
* `name`: the employee’s name
* `wage`: the employee’s hourly wage
* `category`: the type of work that the employee does

Our data might look something like this:

| id    | name          | wage | category  |
| ----- | ------------- | ---- | --------- |
| 10131 | Sarah Carney  | 39   | product   |
| 14189 | Heather Carey | 17   | design    |
| 15004 | Gary Mercado  | 33   | marketing |
| 11204 | Cora Copaz    | 27   | design    |
| …     |               |      |           |

If we want to calculate the 75th percentile (i.e., the point at which 75% of employees have a lower wage and 25% have a higher wage) for each `category`, we can use the following combination of `apply` and a lambda function:

```
# np.percentile can calculate any percentile over an array of values
high_earners = df.groupby('category').wage
    .apply(lambda x: np.percentile(x, 75))
    .reset_index()
```

The output, `high_earners` might look like this:

|   | category  | wage |
| - | --------- | ---- |
| 0 | design    | 23   |
| 1 | marketing | 35   |
| 2 | product   | 48   |

## Pivot Tables

When we perform a `groupby()` on 2 columns, we may want the resulting DataFrame to be wide instead of long. `pivot()` function can help transfer a long DataFrame to a wide Dataframe.

In Pandas, the command for `pivot()` is:

```
df.pivot(columns='ColumnToPivot',
         index='ColumnToBeRows',
         values='ColumnToBeValues')
```

Here is a example:

```
# First use the groupby statement:
unpivoted = df.groupby(['Location', 'Day of Week'])['Total Sales']
                .mean().reset_index()

# Now pivot the table
pivoted = unpivoted.pivot(
    columns='Day of Week',
    index='Location',
    values='Total Sales')
```

Just like with `groupby()`, the output of a pivot command is a new DataFrame, but the indexing tends to be “weird”, so we usually follow up with **`.reset_index()`**.<br>


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://lei-d.gitbook.io/python-for-data-analysis/pandas-1/aggregate.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
