How to implode in pandas - reverse of explode

2 minute read

In this post we are going to see how to perform reverse of explode

We will be following the below steps to implode a column in the dataframe:

Create a dataframe
Group the dataframe using desired columns
Use Aggregate function to create list of values in a column for each group

Create Dataframe

Let’s create a dataframe with five columns - Area, Sales_million, Month, Gallons and Rank.

df=pd.DataFrame({'Area': ['North America']*5,
                 'Region': ['Midwest', 'Northeast', 
                            'Southwest', 'Midwest', 
                            'Northeast']
             'Sales_million': [35, 45, 50, 45, 50],
             'Month': ['Jan', 'Feb', 'Mar', 'Apr', "May"],
             'Gallons': [200, 215, 320, 220, 195],
             'Rank': [5, 2, 1, 4, 3]})
df

Out:

	Area	Region	Sales_million	Month	Gallons	Rank
0	North America	Midwest	35	Jan	200	5
1	North America	Northeast	45	Feb	215	2
2	North America	Southwest	50	Mar	320	1
3	North America	Midwest	45	Apr	220	4
4	North America	Northeast	50	May	195	3

Use Group by to aggregate columns

We want to group the dataframe on Area and Region and aggregate the values in column Gallons, Rank and Sales_million, and combine the value of Month in all rows into a single list

we will pass the list function for Month column in Groupby aggregate function, It will combine the value of Month column in one single list.

a) Aggregate columns and combine the values as list

(df.groupby(['Area','Region'])
      .agg({'Month': list,
            'Gallons':'mean',
            'Rank':'mean', 
            'Sales_million': 'sum'})
      .reset_index())

Out:

	Area	Region	Month	Gallons	Rank	Sales_million
0	North America	Midwest	[Jan, Apr]	210	4.5	80
1	North America	Northeast	[Feb, May]	205	2.5	95
2	North America	Southwest	[Mar]	320	1.0	50

b) Rename columns

We will rename the columns to make it more meaningful and self explanatory.

NamedAggregation is used to to make it clearer what the arguments are and to control the output names with different aggregations per column

The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column.

df1=(df.groupby(['Area','Region'])
      .agg(
         Month_list = 
            pd.NamedAgg(column = 'Month', 
                        aggfunc = list),
         Avg_gallon = 
            pd.NamedAgg(column = 'Gallons', 
                        aggfunc = 'mean'),
         Avg_rank = 
            pd.NamedAgg(column = 'Rank', 
                        aggfunc = 'mean'),
         Sum_sales_million = 
            pd.NamedAgg(column = 'Sales_million', 
                        aggfunc = 'sum')
      )
      .reset_index())
df1

Out:

	Area	Region	Month_list	Avg_gallon	Avg_rank	Sum_sales_million
0	North America	Midwest	[Jan, Apr]	210	4.5	80
1	North America	Northeast	[Feb, May]	205	2.5	95
2	North America	Southwest	[Mar]	320	1.0	50

b) Combine the values across multiple columns as list

we want to combine the values of Avg_gallon, Avg_rank and Sum_sales_million into a single list. In short we want to combine the values of three columns into a single list horizontally

cols = ['Avg_gallon', 'Avg_rank', 'Sum_sales_million']
df1['group_gallon_rank_sales'] = df1[cols].values.tolist()
df1.drop(cols, axis=1)

Out:

	Area	Region	Month_list	group_gallon_rank_sales
0	North America	Midwest	[Jan, Apr]	[210.0, 4.5, 80.0]
1	North America	Northeast	[Feb, May]	[205.0, 2.5, 95.0]
2	North America	Southwest	[Mar]	[320.0, 1.0, 50.0]

Share on

Twitter Facebook LinkedIn

How to implode in pandas - reverse of explode

Create Dataframe

Use Group by to aggregate columns

Share on

You may also enjoy

pandas count duplicate rows

Pandas value error while merging two dataframes with different data types

How to get True Positive, False Positive, True Negative and False Negative from confusion matrix in scikit learn

Pandas how to use list of values to select rows from a dataframe