Pandas compare columns in two data frames
We have two dataframes and a common column that we want to compare and find out the matching, missing values and sometimes the difference between the values using a key
We would first concatenate the two dataframes into one and see how the two dataframes looks side by side and then find out the differences between them.
However if you are interested to find the difference between two dataframes then read this post
We will follow the following steps to find the difference between a column in two dataframes:
- Create two dataframes - df1 and df2
- Concatenate two dataframes side by side
- Compare the column in two dataframes on a common key
- Additionally, find the matching rows between two dataframe
- find the non-matching rows between the dataframes
Let’s get started, we will first create two test dataframes(df1 & df2) to work upon
Create Two Dataframes
First Dataframe:
The first dataframe has 2 columns: Items and Sale
import pandas as pd
import numpy as np
df1 = pd.DataFrame([['A', 1], ['B', 2]],
columns=['Items', 'Sale'])
df1
Items | Sale |
---|---|
A | 200 |
B | 410 |
Second Dataframe:
The second dataframe has three columns: Items, Sale and Category
df2 = pd.DataFrame([['A', 320, 'food'], ['B', 320, 'home'], ['C', 530, 'furniture']],
columns=['Items', 'Sale', 'Category'])
df2
Items | Sale | Category |
---|---|---|
A | 320 | Food |
B | 550 | Home |
C | 530 | Furniture |
Concatenate the two dataframes
We have concatenated the two dataframes(df1 and df2) and can see them side by side and the final concatenated dataframe is stored in variable df
df=pd.concat([df1, df2],axis=1, keys = ['df1', 'df2'])
df
df1 | df2 | ||||
---|---|---|---|---|---|
Items | Sale | Items | Sales | Category | |
0 | A | 200 | A | 320 | Food |
1 | B | 410 | B | 550 | Home |
2 | NaN | NaN | C | 530 | Furniture |
Compare the columns in two dataframe
We will find the difference between the sales value between two dataframe for each of the Items
We have added a new column called as sales-diff to find the differences between the sales value in two dataframes where the Item values are similar otherwise difference is set to 0.
numpy.where() is used to return choice depending on condition
df['sales-diff']=np.where(df['df1']['Items']==df['df2']['Items'],
(df['df1']['Sale']-df['df2']['Sale']),
0)
We’ve got a new column that shows exactly the difference between the Sales column between df2 and df1
df1 | df2 | sales-diff | ||||
---|---|---|---|---|---|---|
Items | Sale | Items | Sales | Category | ||
0 | A | 200 | A | 320 | Food | 120 |
1 | B | 410 | B | 550 | Home | 140 |
2 | NaN | NaN | C | 530 | Furniture | 0 |
Non-matching rows between two dataframes
Let’s find the rows not matching between two dataframes(df1 and df2) based on column Items i.e. Elements of Series df1[‘Items’] which are not in df2[‘Items’]
df[~df['df1']['Items'].isin(df['df2']['Items'])]
df1 | df2 | sales-diff | ||||
---|---|---|---|---|---|---|
Items | Sale | Items | Sales | Category | ||
2 | NaN | NaN | C | 530 | Furniture | 0 |
Matching rows between two dataframes
We will find the rows matching between the two dataframes(df1 and df2) based on column Items i.e. Elements of Series df1[‘Items’] which are in df2[‘Items’]
df[df['df1']['letter']==df['df2']['letter']]
df1 | df2 | sales-diff | ||||
---|---|---|---|---|---|---|
Items | Sale | Items | Sales | Category | ||
0 | A | 200 | A | 320 | Food | 120 |
1 | B | 410 | B | 550 | Home | 140 |
Alternatively, we can use pandas.merge() to merge the two dataframes(df1 and df2) on column Items and apply inner join, use intersection of keys from both dataframes, similar to a SQL inner join and preserve the order of the left keys
pd.merge(df1, df2, on='Items', how='inner')
Items | Sale_x | Sale_y | Category | |
---|---|---|---|---|
0 | A | 200 | 320 | Food |
1 | B | 410 | 550 | Home |