Pandas Split-Apply-Combine Example

There are times when I want to use split-apply-combine to save the results of a groupby to a json file while preserving the resulting column values as a list. Before we start, let’s import Pandas and generate a dataframe with some example email data

Import Pandas and Create an Email DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame({'Sender': ['Alice', 'Alice', 'Bob', 'Carl', 'Bob', 'Alice'],
                   'Receiver': ['David', 'Eric', 'Frank', 'Ginger', 'Holly', 'Ingrid'],
                   'Emails': [9, 3, 5, 1, 6, 7]
                  })
df

	Emails	Receiver	Sender
0	9	David	Alice
1	3	Eric	Alice
2	5	Frank	Bob
3	1	Ginger	Carl
4	6	Holly	Bob
5	7	Ingrid	Alice

Purpose

Here, the goal is to generate a list of e-mail receivers for each sender. To accomplish this, we can first group the data by sender:

grouped = df.groupby('Sender')

And if we examine the groups for each sender then we’ll see the row indices that are associated with each sender:

grouped.groups

{'Alice': [0, 1, 5], 'Bob': [2, 4], 'Carl': [3]}

We usually follow a groupby call by aggregating data along the other columns. For example:

grouped.aggregate(np.sum)

	Emails
Sender
Alice	19
Bob	11
Carl	1

However, notice that the e-mail receiver column doesn’t get aggregated since it isn’t obvious how to sum up text together or what that would even mean. In this case, we could provide a custom aggregator that simply concatenates all of the receivers into a single string.

grouped['Receiver'].agg(lambda x: x.sum())

But how can we end up with a list of receivers?

The Secret Sauce

Well, we could take the custom aggregator a step further and have it return a list instead of a concatenated string.

grouped['Receiver'].agg({'Receivers':(lambda x: list(x))})

	Receivers
Sender
Alice	[David, Eric, Ingrid]
Bob	[Frank, Holly]
Carl	[Ginger]

But note that we’ve excluded the number of e-mails for clarity. Putting it all together, we can incorporate all of the data and do something like this:

grouped.agg({'Receiver': {'Receiver_List': (lambda x: list(x))},
             'Emails': {'Total_Emails': np.sum}
            })

	Emails	Receiver
	Total_Emails	Receiver_List
Sender
Alice	19	[David, Eric, Ingrid]
Bob	11	[Frank, Holly]
Carl	1	[Ginger]

Above, the agg function takes on the format {column: {name: agg_func}} where column is the dataframe column, name is the column name of the resulting aggregation result, and agg_func is the name of the aggregation function to use. Note that we can simplify things by omitting the name completely and breaking free of the inner dict with:

out = grouped.agg({'Receiver': (lambda x: list(x)),
                   'Emails': np.sum
                  })
out

	Emails	Receiver
Sender
Alice	19	[David, Eric, Ingrid]
Bob	11	[Frank, Holly]
Carl	1	[Ginger]

Finally, the dataframe can be written to a json file

out.to_json('easy_as_pie.json')

{"Emails": {"Alice":19,"Bob":11,"Carl":1},
 "Receiver": {"Alice":["David","Eric","Ingrid"],
              "Bob":["Frank","Holly"],
              "Carl":["Ginger"]}
}

Easy-As-Pie!

Update

Here are some ways to write this output (including when you have integers or floating point values instead of strings) to a jagged file with a space delimiter

Published

May 28, 2016