4 minute read

We want to get index of rows that matches a specific column value or a condition based on multiple columns. The pandas index attribute get you the index of dataframe. There are several ways which can be used to get the index of the rows based on column values and conditions.

In this post we will see all such ways that can give us index of the regular and multiindex(hierarchical index) index dataframe.

we will also use numpy functions that can give us the index directly without using the pandas index attribute of the Dataframe.

So let’s create a dataset first to work with, we will create a dataframe of Metals and Prices. There are 50 rows in this data

df = pd.DataFrame({'Metal':np.random.choice(['Gold', 'Platinum', 'Silver'], size=50), 'Price': np.random.randint(0, 100, 50)})
df.sample(5)
  Metal Price
0 Gold 88
1 Platinum 90
2 Silver 76
3 Gold 84
4 Gold 74

Get row Index using Boolean Indexing

We will use pandas popular feature which is known as Boolean indexing to get the subset of data which meets a specific value or condition based on column value

So we want the index of only Gold and Silver from this Dataframe

df[df['Metal'].isin(['Gold','Silver'])].index

Or we can also use .loc on the dataframe

df.loc[df['Metal'].isin(['Gold','Silver'])].index

It returns the index of all the rows where metal column has a value either Gold or Silver.

Output:

Int64Index([ 1,  2,  3,  5,  6,  7,  8, 10, 11, 13, 14, 17, 19, 23, 25, 26, 27,
            29, 31, 32, 37, 38, 39, 40, 41, 42, 43, 47, 48, 49],
           dtype='int64')

Get row Index using Numpy Functions

Alternatively, we can use numpy where and flatnonzero functions to get the row index directly based on any condition. flatnonzero function return indices that are non-zero in flattened version of array, whereas numpy where also returns the indices where condition is True

np.where(df['Metal'].isin(['Gold','Silver']))

Or

np.flatnonzero(df['Metal'].isin(['Gold','Silver']))

The output is same as the above, Returns the indices of matching rows where Metal is either Gold or Silver

Output:

array([ 1,  2,  3,  5,  6,  7,  8, 10, 11, 13, 14, 17, 19, 23, 25, 26, 27,
       29, 31, 32, 37, 38, 39, 40, 41, 42, 43, 47, 48, 49], dtype=int64)

Get row Index using query and eval

We could use numpy query and eval functions to evaluates a string describing operations on DataFrame column and get the rows based on conditions and then using index attribute to get the indices of all such rows in the dataframe.

You can read more about query and eval functions in this post.

In this case, our condition is, Metal is either Gold or Platinum and the Price is greater than 90

df[df.eval('Metal == "Gold" | Metal == "Platinum" & Price > 90')].index

Or

df.query('Metal == "Gold" | Metal == "Platinum" & Price > 90').index

The output is the indices of rows matching the above condition

Output:

Int64Index([1, 3, 7, 8, 11, 13, 14, 16, 20, 21, 22, 25, 30, 31, 33, 38, 40,
            48],
           dtype='int64')

Get row Index using Regex

We can also filter the rows based on a regular expression matching the value in the column. In this case we are using regex to select all rows where column Metal value starts with either character G or S

df.Metal[df.Metal.str.match(r'(^G.*)|(^S.*)')==True].index

Output:

Int64Index([ 1,  2,  3,  5,  6,  7,  8, 10, 11, 13, 14, 17, 19, 23, 25, 26, 27,
            29, 31, 32, 37, 38, 39, 40, 41, 42, 43, 47, 48, 49],
           dtype='int64')

Get row Index in a MultiIndex dataframe

Let’s create a MultiIndex dataframe first

x = np.round(np.random.uniform(1, 5, size=(9, 4)), 2)

rowIndx = pd.MultiIndex.from_product(
    [["East", "North", "South"], ["A", "B", "C"]],
    names=["Region", "Division"],
)
colIndex = pd.MultiIndex.from_product(
    [["Q1", "Q2 "], ["Buy", "Sell"]]
)
multidf = pd.DataFrame(data=x, index=rowIndx, columns=colIndex)
multidf
  Region Division Q1-Buy Q-Sell Q2-Buy Q2-Sell
  East A 2.18 2.92 4.06 2.92
    B 2.44 2.86 4.79 2.93
    C 3.78 1.16 1.71 4.61
  North A 4.90 2.25 3.36 3.55
    B 2.14 4.17 1.97 3.82
    C 1.33 1.24 2.51 2.80
  South A 2.03 1.43 2.94 2.62
    B 1.83 3.25 2.93 2.81
    C 1.20 3.97 2.95 3.41

We want the Index for the first column level - Q1 and Q2

multidf.loc[:, (multidf.columns.get_level_values(level=0)=='Q1')].index

It returns the MultiIndex values i.e. Region and Division for first level column

Output:

MultiIndex([( 'East', 'A'),
            ( 'East', 'B'),
            ( 'East', 'C'),
            ('North', 'A'),
            ('North', 'B'),
            ('North', 'C'),
            ('South', 'A'),
            ('South', 'B'),
            ('South', 'C')],
           names=['Region', 'Division'])

Next we want to see just first level of index for this column level=0 and just the unique values

multidf.loc[:, (multidf.columns.get_level_values(level=0)=='Q1')].index.get_level_values(1).unique()

Output:

Index(['A', 'B', 'C'], dtype='object', name='Division')

Similarly for all unique index at level=0 you can change the get_level_values to zero

Next, we want the index of the rows where column Q1 - Buy is greater than 2.5, we will filter the rows where this condition is matched and then uses index attribute to get the multiIndex values

multidf[multidf[( 'Q1',  'Buy')]>2.5].index
MultiIndex([( 'East', 'C'),
            ('North', 'A'),
            ('North', 'B'),
            ('South', 'A')],
           names=['Region', 'Division'])

Tags: ,

Categories: ,

Updated: