Pandas merge on Index and merge dataframe on Index & column
In this post we will discuss how to merge two dataframes either on their Index or on Index & column, Pandas has a merge API with lot of parameters to make this job really simpler.
There are other pandas API’s such as Join, concat etc. but in this post we are focusing just on merge.
Pandas DataFrame merge() function is used to merge two DataFrame objects with a database-style join operation. The joining is performed on columns or indexes where rows share the data.
We will see how to merge dataframes with different index type, A dataframe can have various type of Indexes such as Categorical, DatetimeIndex, TimeDeltaIndex, MultiIndex and many others.
1. Merge on Index
a) Merge on DatetimeIndex:
We want to merge our two dataframes on their DatetimeIndex, This is quite similar to a database join where you can do a left, right or outer join on one or more columns.
Let’s create two dataframes with DatetimeIndex.
First Dataframe:
df1 = pd.DataFrame(
{'lkey': ['foo',
'bar',
'baz',
'foo'],
'value': ['Classic',
'Gold',
'Platinum',
'Silver']},
index = DatetimeIndex(["2021-12-31",
"2022-01-02",
"2022-03-03",
"2022-04-04"])
)
df1
lkey | value | |
---|---|---|
2021-12-31 | foo | Classic |
2022-01-02 | bar | Gold |
2022-03-03 | baz | Platinum |
2022-04-04 | foo | silver |
Second Dataframe:
df2 = pd.DataFrame(
{'rkey': ['bar',
'bar',
'baz',
'foo'],
'value': ['Classic',
'Gold',
'Platinum',
'Silver'], },
index = DatetimeIndex(["2022-05-05",
"2022-02-02",
"2022-03-03",
"2022-04-04"])
)
df2
lkey | value | |
---|---|---|
2022-05-05 | bar | Classic |
2022-02-02 | bar | Gold |
2022-03-03 | baz | Platinum |
2022-04-04 | foo | Silver |
Merge - Inner Join:
We want to merge the two dataframes(df1 and df2) on their DatetimeIndex, if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
This is an inner join since we are not explicitly setting the how parameter to tell type of merge. It’s the default choice.
We have passed the above two dataframes as left(df1) and right(df2) dataframe that needs to be merged.
There is a left_index and right_index parameter which is set to True, which tells to use the index from left(df1) & right(df2) dataframe respectively for the merge operation
pd.merge(df1,df2,left_index=True, right_index=True)
The default suffix is _x and _y for left and right dataframe. The common column “value” is appended with their suffixes in the merged dataframe.
key | value_x | rkey | value_y | |
---|---|---|---|---|
2022-03-03 | baz | Platinum | baz | Platinum |
2022-04-04 | foo | Silver | foo | Silver |
Merge - left join:
We will do a left join on index using the merge function, the how parameter is used to specify the type of merge, by default it’s inner
The left merge use only keys from left frame, similar to a SQL left outer join; preserve key order.
pd.merge(df1,df2,left_index=True, right_index=True, how='left', indicator=True)
There is a new parameter which I added here “indicator”, If True, adds a column to the output DataFrame called “_merge” with information on the source of each row
The merge column shows that the first two rows are from the left dataframe(df1)
lkey | value_x | rkey | value_y | merge | |
---|---|---|---|---|---|
2021-12-31 | foo | Classic | NaN | NaN | left_only |
2022-01-02 | bar | Gold | NaN | NaN | left_only |
2022-03-03 | baz | Platinum | baz | Platinum | both |
2022-04-04 | foo | Silver | foo | Silver | both |
b) Merge on Categorical Index:
Let’s take an another example of Dataframes with Categorical Index
First Dataframe:
df3 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
'value': [7, 8, 32, 51]},
index = CategoricalIndex(['a', 'b', 'c', 'd']))
df3
lkey | value | |
---|---|---|
a | foo | 7 |
b | bar | 8 |
c | baz | 32 |
D | foo | 51 |
Second Dataframe:
df4 = pd.DataFrame({'rkey': ['bar', 'bar', 'baz', 'foo'],
'value': ['a', 'b', 'e', 'f'], },
index=CategoricalIndex(['a', 'b', 'c', 'e']))
df4
rkey | value | |
---|---|---|
a | bar | a |
b | bar | b |
c | baz | e |
e | foo | f |
Merge - Inner Join:
Here we want to merge the two dataframes(df3 & df4) on their CategoricalIndex, we will pass the two dataframes as parameters with left_index and right_index set=True for merge on Index
pd.merge(df3,df4,left_index=True, right_index=True)
lkey | value_x | rkey | value_y | |
---|---|---|---|---|
a | foo | 7 | bar | a |
b | bar | 8 | bar | b |
c | baz | 32 | baz | e |
Merge - right join:
we want to do a right join on index using the merge function, the how parameteris set to “right” and we are passing indicator value as True
The right merge use only keys from right frame, similar to a SQL right outer join; preserve key order.
pd.merge(df3,df4,left_index=True, right_index=True, how='right',indicator = True)
The merge column shows there is only last row which is from the right dataframe(df4)
lkey | value_x | rkey | value_y | merge | |
---|---|---|---|---|---|
a | foo | 7 | bar | a | both |
b | bar | 8 | bar | b | both |
c | baz | 32 | baz | e | both |
d | NaN | NaN | foo | f | right_only |
2. Merge on Index and Column
In this section, we want to merge the dataframes on Index and columns.
Let’s create the two dataframes:
df5 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
'value': [7, 8, 32, 51]})
df5
lkey | value | |
---|---|---|
0 | foo | 7 |
1 | bar | 8 |
2 | baz | 32 |
3 | foo | 51 |
df6 = pd.DataFrame({'rkey': ['bar', 'bar', 'baz', 'foo'],
'value': [0, 7, 32, 1], })
df6
rkey | value | |
---|---|---|
a | bar | 0 |
b | bar | 7 |
c | baz | 32 |
e | foo | 1 |
Merge - Inner Join:
We want to merge on the index of left dataframe(df5) and “value” column of right dataframe(df6)
The merge API has two parameters: left_on and right_on to specify Column or index level names to join on in the left DataFrame.
These two parameters can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.
pd.merge(df5,df6,left_index=True, right_on='value')
This is an inner-join by default since we have not provided the type of merge explicitly here
Value | lkey | value_x | rkey | value_y | |
---|---|---|---|---|---|
0 | 0 | foo | 7 | bar | 0 |
3 | 1 | bar | 8 | foo | 1 |
Merge - Outer Join
We want to perfom an Outer join on index of left dataframe(df5) and “value” column of right dataframe(df6)
We have specified the join as outer by passing it to how parameter and indicator is set to true.
# Index and column - outer join
pd.merge(df5,df6,left_index=True, right_on='value', how='outer', indicator = True)
The _merge column indicates the source of each row, since this is an outer join we have non-matching rows from both the dataframes
For the rows in left dataframe(df5), which doesn’t have merge keys the index is NaN and also the column values of right dataframe for those rows are NaN
value | lkey | value_x | rkey | value_y | _merge | |
---|---|---|---|---|---|---|
0.0 | 0 | foo | 7.0 | bar | 0.0 | both |
3.0 | 1 | bar | 8.0 | foo | 1.0 | both |
NaN | 2 | baz | 32.0 | NaN | NaN | left_only |
NaN | 3 | foo | 51.0 | NaN | NaN | left_only |
1.0 | 7 | NaN | NaN | bar | 7.0 | right_only |
2.0 | 32 | NaN | NaN | baz | 32.0 | right_only |
Merge - Left Join
We want to perfom a left outer join on index of left dataframe(df5) and “value” column of right dataframe(df6)
# Index and column - left join
pd.merge(df5,df6,left_index=True, right_on='value', how='left', indicator = True)
The merge column indicates the rows which doesn’t have merge keys are coming from left dataframe(df5) only.
value | lkey | value_x | rkey | value_y | _merge | |
---|---|---|---|---|---|---|
0.0 | 0 | foo | 7.0 | bar | 0.0 | both |
3.0 | 1 | bar | 8.0 | foo | 1.0 | both |
NaN | 2 | baz | 32.0 | NaN | NaN | left_only |
NaN | 3 | foo | 51.0 | NaN | NaN | left_only |
3. Merge dataframe on common column
a) Merge using on parameter
We want to merge the dataframes on common columns, pandas merge function has a “on” parameter for it
We can pass the Column or index level names to join on. These must be found in both DataFrames.
If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
Let’s see how it works, We will merge the above two dataframes(df5 & df6) on their common column “value”
# Using on
pd.merge(df5,df6, on=['value'])
Even if we don’t pass the on parameter, it will perform an inner-join on the common column
lkey | value | Key | |
---|---|---|---|
0 | foo | 7 | bar |
3 | baz | 32 | Baz |
Let’s take another dataframe (df1 & df2) from above and merge “on” parameter set to the column “value” which is common column between both the dataframe
# Using on
pd.merge(df1,df2, on=['value'])
lkey | value | Key | |
---|---|---|---|
0 | foo | Classic | bar |
1 | bar | Gold | bar |
2 | baz | Platinum | baz |
3 | foo | Silver | foo |
We can also use the how
parameter to perform other merge operations such as left, right and outer.