Pandas compare columns in two data frames

2 minute read

We have two dataframes and a common column that we want to compare and find out the matching, missing values and sometimes the difference between the values using a key

We would first concatenate the two dataframes into one and see how the two dataframes looks side by side and then find out the differences between them.

However if you are interested to find the difference between two dataframes then read this post

We will follow the following steps to find the difference between a column in two dataframes:

Create two dataframes - df1 and df2
Concatenate two dataframes side by side
Compare the column in two dataframes on a common key
Additionally, find the matching rows between two dataframe
find the non-matching rows between the dataframes

Let’s get started, we will first create two test dataframes(df1 & df2) to work upon

Create Two Dataframes

First Dataframe:

The first dataframe has 2 columns: Items and Sale

import pandas as pd
import numpy as np

df1 = pd.DataFrame([['A', 1], ['B', 2]],
                    columns=['Items', 'Sale'])
df1

Items	Sale
A	200
B	410

Second Dataframe:

The second dataframe has three columns: Items, Sale and Category

df2 = pd.DataFrame([['A', 320, 'food'], ['B', 320, 'home'], ['C', 530, 'furniture']],
                    columns=['Items', 'Sale', 'Category'])
df2

Items	Sale	Category
A	320	Food
B	550	Home
C	530	Furniture

Concatenate the two dataframes

We have concatenated the two dataframes(df1 and df2) and can see them side by side and the final concatenated dataframe is stored in variable df

df=pd.concat([df1, df2],axis=1, keys = ['df1', 'df2'])
df

	df1		df2
	Items	Sale	Items	Sales	Category
0	A	200	A	320	Food
1	B	410	B	550	Home
2	NaN	NaN	C	530	Furniture

Compare the columns in two dataframe

We will find the difference between the sales value between two dataframe for each of the Items

We have added a new column called as sales-diff to find the differences between the sales value in two dataframes where the Item values are similar otherwise difference is set to 0.

numpy.where() is used to return choice depending on condition

df['sales-diff']=np.where(df['df1']['Items']==df['df2']['Items'],
                    (df['df1']['Sale']-df['df2']['Sale']),
                    0)

We’ve got a new column that shows exactly the difference between the Sales column between df2 and df1

	df1		df2			sales-diff
	Items	Sale	Items	Sales	Category
0	A	200	A	320	Food	120
1	B	410	B	550	Home	140
2	NaN	NaN	C	530	Furniture	0

Non-matching rows between two dataframes

Let’s find the rows not matching between two dataframes(df1 and df2) based on column Items i.e. Elements of Series df1[‘Items’] which are not in df2[‘Items’]

df[~df['df1']['Items'].isin(df['df2']['Items'])]

	df1		df2			sales-diff
	Items	Sale	Items	Sales	Category
2	NaN	NaN	C	530	Furniture	0

Matching rows between two dataframes

We will find the rows matching between the two dataframes(df1 and df2) based on column Items i.e. Elements of Series df1[‘Items’] which are in df2[‘Items’]

df[df['df1']['letter']==df['df2']['letter']]

	df1		df2			sales-diff
	Items	Sale	Items	Sales	Category
0	A	200	A	320	Food	120
1	B	410	B	550	Home	140

Alternatively, we can use pandas.merge() to merge the two dataframes(df1 and df2) on column Items and apply inner join, use intersection of keys from both dataframes, similar to a SQL inner join and preserve the order of the left keys

pd.merge(df1, df2, on='Items', how='inner')

	Items	Sale_x	Sale_y	Category
0	A	200	320	Food
1	B	410	550	Home

Share on

Twitter Facebook LinkedIn

Pandas compare columns in two data frames

Create Two Dataframes

Concatenate the two dataframes

Compare the columns in two dataframe

Non-matching rows between two dataframes

Matching rows between two dataframes

Share on

You may also enjoy

pandas count duplicate rows

Pandas value error while merging two dataframes with different data types

How to get True Positive, False Positive, True Negative and False Negative from confusion matrix in scikit learn

Pandas how to use list of values to select rows from a dataframe