Pandas Split-Apply-Combine Example
There are times when I want to use split-apply-combine to save the results of a
groupby to a json file while preserving the resulting column values as a list.
Before we start, let’s import Pandas and generate a dataframe with some example
email data
Import Pandas and Create an Email DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({'Sender': ['Alice', 'Alice', 'Bob', 'Carl', 'Bob', 'Alice'],
'Receiver': ['David', 'Eric', 'Frank', 'Ginger', 'Holly', 'Ingrid'],
'Emails': [9, 3, 5, 1, 6, 7]
})
df
Emails | Receiver | Sender | |
---|---|---|---|
0 | 9 | David | Alice |
1 | 3 | Eric | Alice |
2 | 5 | Frank | Bob |
3 | 1 | Ginger | Carl |
4 | 6 | Holly | Bob |
5 | 7 | Ingrid | Alice |
Purpose
Here, the goal is to generate a list of e-mail receivers for each sender. To
accomplish this, we can first group the data by sender:
grouped = df.groupby('Sender')
And if we examine the groups for each sender then we’ll see the row indices that
are associated with each sender:
grouped.groups
{'Alice': [0, 1, 5], 'Bob': [2, 4], 'Carl': [3]}
We usually follow a groupby call by aggregating data along the other columns.
For example:
grouped.aggregate(np.sum)
Emails | |
---|---|
Sender | |
Alice | 19 |
Bob | 11 |
Carl | 1 |
However, notice that the e-mail receiver column doesn’t get aggregated since it
isn’t obvious how to sum up text together or what that would even mean. In this
case, we could provide a custom aggregator that simply concatenates all of the
receivers into a single string.
grouped['Receiver'].agg(lambda x: x.sum())
But how can we end up with a list of receivers?
The Secret Sauce
Well, we could take the custom aggregator a step further and have it return a
list instead of a concatenated string.
grouped['Receiver'].agg({'Receivers':(lambda x: list(x))})
Receivers | |
---|---|
Sender | |
Alice | [David, Eric, Ingrid] |
Bob | [Frank, Holly] |
Carl | [Ginger] |
But note that we’ve excluded the number of e-mails for clarity. Putting it all
together, we can incorporate all of the data and do something like this:
grouped.agg({'Receiver': {'Receiver_List': (lambda x: list(x))},
'Emails': {'Total_Emails': np.sum}
})
Emails | Receiver | |
---|---|---|
Total_Emails | Receiver_List | |
Sender | ||
Alice | 19 | [David, Eric, Ingrid] |
Bob | 11 | [Frank, Holly] |
Carl | 1 | [Ginger] |
Above, the agg
function takes on the format {column: {name: agg_func}}
where
column
is the dataframe column, name
is the column name of the resulting
aggregation result, and agg_func
is the name of the aggregation function to
use. Note that we can simplify things by omitting the name
completely and
breaking free of the inner dict
with:
out = grouped.agg({'Receiver': (lambda x: list(x)),
'Emails': np.sum
})
out
Emails | Receiver | |
---|---|---|
Sender | ||
Alice | 19 | [David, Eric, Ingrid] |
Bob | 11 | [Frank, Holly] |
Carl | 1 | [Ginger] |
Finally, the dataframe can be written to a json file
out.to_json('easy_as_pie.json')
{"Emails": {"Alice":19,"Bob":11,"Carl":1},
"Receiver": {"Alice":["David","Eric","Ingrid"],
"Bob":["Frank","Holly"],
"Carl":["Ginger"]}
}