Select in DataFrame

Select Columns

Select a Column

There are two possible syntaxes for selecting all values from a column:

  1. table_name['column_name']

  2. table_name.column_name, if the name of a column follows all of the rules for a variable name (doesn’t start with a number, doesn’t contain spaces or special characters, etc.).

When we select a single column, the result is called a Series.

Selecting Multiple Columns

To select two or more columns from a DataFrame, we use a list of the column names.

new_df = orders[['last_name', 'email']]

When we select multiple columns, the result is a new DataFrame.

Select Rows

Select Rows by Index

We can select a row by its index.

# select a row
orders.iloc[2]

# select multiple rows
orders.iloc[3:7]
orders.iloc[:4]
orders.iloc[-3:]

When we select a single row, the result is a Series. When we select multiple rows, the result is a new DataFrame.

Select Rows with Logic

You can select a subset of a DataFrame by using logical statements:

df[df.MyColumnName == desired_column_value]

Logical statements include:

  • Equal to, ==

  • Greater Than, >

  • Less Than, <

  • Not Equal, !=

We can also combine multiple logical statements, as long as each statement is in parentheses.

df[(df.age < 30) |
   (df.name == 'Martha Jones')]

We could use the isin command to check that df.name is one of a list of values:

df[df.name.isin(['Martha Jones',
     'Rose Tyler',
     'Amy Pond'])]

Setting indices

When we select a subset of a DataFrame using logic, we end up with non-consecutive indices. This is inelegant and makes it hard to use .iloc().

We can fix this using the method .reset_index(). For example, here is a DataFrame called df with non-consecutive indices:

First Name

Last Name

0

John

Smith

4

Jane

Doe

7

Joe

Schmo

If we use the command df.reset_index(), we get a new DataFrame with a new set of indices:

index

First Name

Last Name

0

0

John

Smith

1

4

Jane

Doe

2

7

Joe

Schmo

Note that the old indices have been moved into a new column called 'index'. Unless you need those values for something special, it’s probably better to use the keyword drop=True so that you don’t end up with that extra column. If we run the command df.reset_index(drop=True), we get a new DataFrame that looks like this:

First Name

Last Name

0

John

Smith

1

Jane

Doe

2

Joe

Schmo

Using .reset_index() will return a new DataFrame, but we usually just want to modify our existing DataFrame. If we use the keyword inplace=True we can just modify our existing DataFrame.

Last updated

Was this helpful?