Pandas iterate over rows and update or Update dataframe row values where certain condition is met
We want to iterate over the rows of a dataframe and update the values based on condition. There are three different pandas function available that let you iterate through the dataframe rows and columns of a dataframe.
- Dataframe.iterrows()
- Dataframe.itertuples()
- Dataframe.items()
Before we dive into these three functions, Let me make it very clear that iterating through a dataframe rows and columns should be the last resort since it’s slow and not worth it. we can achieve anything using vectorization, loc and apply function.
You should only use these functions when you are exhausted of all the solutions for your problem
Pandas update dataframe rows with condition
Let’s create a dataframe of students with their marks and grade
df = pd.DataFrame({'name': ['Alex', 'Harry', 'Brenden', 'Alice', 'Ronald', 'Justin'],
'marks': [50, 42, 35, 68, 47, 62],
'grade': ['B', 'B', 'B', 'A+', 'B+','A+']})
df
name | marks | Score | |
---|---|---|---|
0 | Alex | 50 | B+ |
1 | Harry | 42 | B |
2 | Brenden | 35 | B- |
3 | Alice | 68 | A+ |
4 | Ronald | 47 | B+ |
5 | Justin | 62 | A+ |
Condition: We want to upgrade the grade of all students to A+ who has scored greater than 40 and less than 50
Update the dataframe values using Vectorization
Let’s update the value of dataframe using vectorization. We need to first create a function that supports vectorization.
We’ve used numpy.where to get the index of all such rows that meets the condition and returns the new grade otherwise it returns the existing value. The return value of numpy.where condition is stored in the column grade.
Numpy where supports vectorization and provides users with a wide variety of functions capable of performing operations on arrays of data. Its use of vectorization makes these custom functions incredibly fast.
def update_color_col(x):
x.grade = np.where(((x.score>40) & (x.score<50)),'A+',x.grade)
Let’s call this function and pass the above created dataframe
update_color_col(df)
df
name | marks | Score | |
---|---|---|---|
0 | Alex | 50 | B+ |
1 | Harry | 42 | A+ |
2 | Brenden | 35 | B- |
3 | Alice | 68 | A+ |
4 | Ronald | 47 | A+ |
5 | Justin | 62 | A+ |
Let’s timeit this vectorization process
%%timeit
update_color_col(df)
276 µs ± 1.87 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Update the dataframe values using Apply
We are updating the grade column using the lambda function inside apply, The axis is set to 1, to apply this function accross the dataframe columns.
df["grade"] = df.apply(lambda x: 'A+' if ((x['score']>40)&(x['score']<50)) else x["grade"], axis=1)
name | marks | Score | |
---|---|---|---|
0 | Alex | 50 | B+ |
1 | Harry | 42 | A+ |
2 | Brenden | 35 | B- |
3 | Alice | 68 | A+ |
4 | Ronald | 47 | A+ |
5 | Justin | 62 | A+ |
%%timeit
df['color'] = df.apply(update_color_col,axis=1)
707 µs ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
We can see that the apply method is slower than the Vectorization approach above, The Vectorization approach is 10x faster than using apply.
I think if we have a bigger dataframe with more number of rows then you can see an incredible result with the Vectorization approach
Update the dataframe values using loc
We have to locate the row value first which meets the condtion using loc and then set the column value with the new values.
df.loc[((df['score'] > 40)&(df['score'] < 50)), ['grade'] ] = 'A+'
df
name | marks | Score | |
---|---|---|---|
0 | Alex | 50 | B+ |
1 | Harry | 42 | A+ |
2 | Brenden | 35 | B- |
3 | Alice | 68 | A+ |
4 | Ronald | 47 | A+ |
5 | Justin | 62 | A+ |
Let’s timeit to check which is fastest out of these 3 approaches to update the dataframe value without actually iterationg over the rows.
%%timeit
df.loc[((df['score'] > 40)&(df['score'] < 50)), ['grade'] ] = 'A+'
662 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Alright! We can see that Vectorization is the fastest approach out of all the three to update the dataframe, the runner-up is loc and the last in list is apply function.
Pandas iterate over rows of dataframe and update
Now we will see the pandas functions that can be used to iterate the rows and columns of a dataframe
We will use the same above dataframe(df) and the same condition to upgrade the grade of students where row condition is met, However this time we will iterate through the rows and columns of the dataframe to achieve this.
Dataframe.iterrows()
It iterates over Dataframe rows as (Index, Series) pairs and returns a series for each row and itt does not preserve dtypes across the rows
Let’s see the index of all the row values returned by itertuples, The index is the column label of the dataframe
for idx, row in df.iterrows():
print(row.index)
Out:
Index(['name', 'score', 'grade'], dtype='object')
Index(['name', 'score', 'grade'], dtype='object')
Index(['name', 'score', 'grade'], dtype='object')
Index(['name', 'score', 'grade'], dtype='object')
Index(['name', 'score', 'grade'], dtype='object')
Index(['name', 'score', 'grade'], dtype='object')
Let’s check the values of each of these rows returned by itertuples, the value for each row is returned as an array
for idx, row in df.iterrows():
print(row.values)
Out:
['Alex' 50 'B+']
['Harry' 42 'A+']
['Brenden' 35 'B-']
['Alice' 68 'A+']
['Ronald' 47 'A+']
['Justin' 62 'A+']
We can access the column value of each row to check the condition and use dataframe.at() function to set the value of columns which meet our condition.
You can read more about updating the cell value in a dataframe here.
for idx, row in df.iterrows():
if (row.score > 40) & (row.score < 50):
df.at[idx,'grade'] = 'A+'
df
Out:
name | marks | Score | |
---|---|---|---|
0 | Alex | 50 | B+ |
1 | Harry | 42 | A+ |
2 | Brenden | 35 | B- |
3 | Alice | 68 | A+ |
4 | Ronald | 47 | A+ |
5 | Justin | 62 | A+ |
Dataframe.itertuples()
It Iterates over DataFrame rows as namedtuple and returns an object to iterate over namedtuples for each row in the DataFrame with the first field being the index and following fields being the column values
If I just iterate through the return value of Itertuples
for row in df.itertuples():
print(row)
Out:
You can see here, the Pandas is the name of returned named tuple and the other values are the column value of each row
Pandas(Index=0, name='Alex', score=50, grade='B+')
Pandas(Index=1, name='Harry', score=42, grade='A+')
Pandas(Index=2, name='Brenden', score=35, grade='B-')
Pandas(Index=3, name='Alice', score=68, grade='A+')
Pandas(Index=4, name='Ronald', score=47, grade='A+')
Pandas(Index=5, name='Justin', score=62, grade='A+')
We could access the column using the dot notation and check the condition and use dataframe.at() to update the values of the column
for row in df.itertuples():
if (row.score > 40) & (row.score < 50):
df.at[row.Index,'grade'] = 'A+'
df
Out:
name | marks | Score | |
---|---|---|---|
0 | Alex | 50 | B+ |
1 | Harry | 42 | A+ |
2 | Brenden | 35 | B- |
3 | Alice | 68 | A+ |
4 | Ronald | 47 | A+ |
5 | Justin | 62 | A+ |
Dataframe.items()
It iterates over column name, Series pairs and returns a tuple with the column name and the content as a Series.
Let’s check the label of each item in the iterable.
for label, content in df.items():
print(content.name)
# or
print(label)
It throws out the column labels of the dataframe
Out:
name
score
grade
And, each column is stored as Series:
for label, content in df.items():
print(content.values)
Out:
['Alex' 'Harry' 'Brenden' 'Alice' 'Ronald' 'Justin']
[50 42 35 68 47 62]
['B+' 'A+' 'B-' 'A+' 'A+' 'A+']
We will get the index of each row which meets the condition and then update the value of the grade column of these rows using dataframe.at() function
for label, content in df.items():
if content.name == 'score':
for idx in np.where((content>40)&(content<50))[0]:
df.at[idx,'grade'] = 'A+'
df
name | marks | Score | |
---|---|---|---|
0 | Alex | 50 | B+ |
1 | Harry | 42 | A+ |
2 | Brenden | 35 | B- |
3 | Alice | 68 | A+ |
4 | Ronald | 47 | A+ |
5 | Justin | 62 | A+ |