Pandas Split-Apply-Combine Example
There are times when I want to use split-apply-combine to save the results of a
groupby to a json file while preserving the resulting column values as a list.
Before we start, let’s import Pandas and generate a dataframe with some example
email data
Import Pandas and Create an Email DataFrame
Emails | Receiver | Sender | |
---|---|---|---|
0 | 9 | David | Alice |
1 | 3 | Eric | Alice |
2 | 5 | Frank | Bob |
3 | 1 | Ginger | Carl |
4 | 6 | Holly | Bob |
5 | 7 | Ingrid | Alice |
Purpose
Here, the goal is to generate a list of e-mail receivers for each sender. To
accomplish this, we can first group the data by sender:
And if we examine the groups for each sender then we’ll see the row indices that
are associated with each sender:
We usually follow a groupby call by aggregating data along the other columns.
For example:
Emails | |
---|---|
Sender | |
Alice | 19 |
Bob | 11 |
Carl | 1 |
However, notice that the e-mail receiver column doesn’t get aggregated since it
isn’t obvious how to sum up text together or what that would even mean. In this
case, we could provide a custom aggregator that simply concatenates all of the
receivers into a single string.
But how can we end up with a list of receivers?
The Secret Sauce
Well, we could take the custom aggregator a step further and have it return a
list instead of a concatenated string.
Receivers | |
---|---|
Sender | |
Alice | [David, Eric, Ingrid] |
Bob | [Frank, Holly] |
Carl | [Ginger] |
But note that we’ve excluded the number of e-mails for clarity. Putting it all
together, we can incorporate all of the data and do something like this:
Emails | Receiver | |
---|---|---|
Total_Emails | Receiver_List | |
Sender | ||
Alice | 19 | [David, Eric, Ingrid] |
Bob | 11 | [Frank, Holly] |
Carl | 1 | [Ginger] |
Above, the agg
function takes on the format {column: {name: agg_func}}
where
column
is the dataframe column, name
is the column name of the resulting
aggregation result, and agg_func
is the name of the aggregation function to
use. Note that we can simplify things by omitting the name
completely and
breaking free of the inner dict
with:
Emails | Receiver | |
---|---|---|
Sender | ||
Alice | 19 | [David, Eric, Ingrid] |
Bob | 11 | [Frank, Holly] |
Carl | 1 | [Ginger] |
Finally, the dataframe can be written to a json file