Pandas: Splitting (Exploding) a column into multiple rows

#data #python

Recently, while working with on something in my office, I faced a small but interesting problem. I had to clean some data and the data was not normalized. In one of the columns, a single cell had multiple comma seperated values. I could not find out the distribution of how frequently the value was appearing without splitting these cells into individual cells of their own; creating new rows.

Example:

# Input data:
EmployeeId, City
001, Mumbai|Bangalore
002, Pune|Mumbai|Delhi
003, Mumbai|Bangalore
004, Mumbai|Pune
005, Bangalore
...

Do you see the challenge? To plot the histogram, of cities I had no choice to take the individual Series and work on it independently. So I decided to explode the City into multiple rows, so that the data becomes like this:

# Expected format
EmployeeId, City
001, Mumbai
001, Bangalore
002, Pune
002, Mumbai
002, Delhi
003, Mumbai
003, Bangalore
004, Mumbai
004, Pune
005, Bangalore

Apparently, there is no straightforward way to clean this in Pandas, but it’s not that difficult either.

import pandas as pd

# Import the data
df = pd.DataFrame({
   'EmployeeId': ['001', '002', '003', '004', '005'],
   'City': ['Mumbai|Bangalore', 'Pune|Mumbai|Delhi', 'Mumbai|Bangalore', 'Mumbai|Pune', 'Bangalore'] 
})

# Step 1
# We start with creating a new dataframe from the series with EmployeeId as the index
new_df = pd.DataFrame(df.City.str.split('|').tolist(), index=df.EmployeeId).stack()

# Step 2
# We now want to get rid of the secondary index
# To do this, we will make EmployeeId as a column (it can't be an index since the values will be duplicate)
new_df = new_df.reset_index([0, 'EmployeeId'])

# Step 3
# The final step is to set the column names as we want them
new_df.columns = ['EmployeeId', 'City']

# Result
 EmployeeId City
0 	001 	Mumbai
1 	001 	Bangalore
2 	002 	Pune
3 	002 	Mumbai
4 	002 	Delhi
5 	003 	Mumbai
6 	003 	Bangalore
7 	004 	Mumbai
8 	004 	Pune
9 	005 	Bangalore

See the Jupyter Notebook [here].(https://gist.github.com/sureshsarda/00c3b7423ea7b6cba4250a719d6b7424)

Explanation

Step 1 is the real trick here, the other 2 steps are more of cleaning exercises to get the data into correct format. In Step 1, we are asking Pandas to split the series into multiple values and the combine all of them into single column using the stack method. The output of Step 1 without stack looks like this:

            0 	        1 	        2
EmployeeId 			
001 	    Mumbai 	    Bangalore 	None
002 	    Pune 	    Mumbai      Delhi
003 	    Mumbai 	    Bangalore 	None
004 	    Mumbai 	    Pune        None
005 	    Bangalore       None        None

After this, we stack these individual indexes into one and do some cleaning using reset_index and change the column names.

Final Code Snippet

# Explode/Split column into multiple rows
new_df = pd.DataFrame(df.City.str.split('|').tolist(), index=df.EmployeeId).stack()
new_df = new_df.reset_index([0, 'EmployeeId'])
new_df.columns = ['EmployeeId', 'City']

Read more