I'm Melting! From Wide to Long Format and Quarterly Groupby
Recently, a colleague of mine asked me how one might go about taking a dataset
that is in wide format and converting it into long format so that you could then
perform some groupby operations by quarter.
Here’s a quick example to illustrate one way to go about this using the Pandas
melt function.
Getting Started
Let’s import the Pandas package
Load Some Data
First, we’ll create a fake dataframe that contains the name of a state and city
along with some data for each month in the year 2000. For simplicity, imagine
that the data are the number of Canadians spotted eating poutine.
state | city | 2000-01 | 2000-02 | 2000-03 | 2000-04 | 2000-05 | 2000-06 | 2000-07 | 2000-08 | 2000-09 | 2000-10 | 2000-11 | 2000-12 | 2001-01 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NY | New York | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
1 | MI | Ann Arbor | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
2 | OR | Portland | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
The Goal
The problem was to take this dataset and try to find the average number of
Canadians eating poutine in each city and state for each quarter in the year
2000.
First, we need to convert the original dataframe into long format and then make
sure that the resulting months are understood to be a datetime column.
state | city | month | number of poutine eaters | |
---|---|---|---|---|
0 | NY | New York | 2000-01-01 | 1 |
1 | MI | Ann Arbor | 2000-01-01 | 5 |
2 | OR | Portland | 2000-01-01 | 9 |
3 | NY | New York | 2000-02-01 | 2 |
4 | MI | Ann Arbor | 2000-02-01 | 6 |
5 | OR | Portland | 2000-02-01 | 10 |
6 | NY | New York | 2000-03-01 | 3 |
7 | MI | Ann Arbor | 2000-03-01 | 7 |
8 | OR | Portland | 2000-03-01 | 11 |
9 | NY | New York | 2000-04-01 | 4 |
10 | MI | Ann Arbor | 2000-04-01 | 8 |
11 | OR | Portland | 2000-04-01 | 12 |
12 | NY | New York | 2000-05-01 | 5 |
13 | MI | Ann Arbor | 2000-05-01 | 9 |
14 | OR | Portland | 2000-05-01 | 13 |
15 | NY | New York | 2000-06-01 | 6 |
16 | MI | Ann Arbor | 2000-06-01 | 10 |
17 | OR | Portland | 2000-06-01 | 14 |
18 | NY | New York | 2000-07-01 | 7 |
19 | MI | Ann Arbor | 2000-07-01 | 11 |
20 | OR | Portland | 2000-07-01 | 15 |
21 | NY | New York | 2000-08-01 | 8 |
22 | MI | Ann Arbor | 2000-08-01 | 12 |
23 | OR | Portland | 2000-08-01 | 16 |
24 | NY | New York | 2000-09-01 | 9 |
25 | MI | Ann Arbor | 2000-09-01 | 13 |
26 | OR | Portland | 2000-09-01 | 17 |
27 | NY | New York | 2000-10-01 | 10 |
28 | MI | Ann Arbor | 2000-10-01 | 14 |
29 | OR | Portland | 2000-10-01 | 18 |
30 | NY | New York | 2000-11-01 | 11 |
31 | MI | Ann Arbor | 2000-11-01 | 15 |
32 | OR | Portland | 2000-11-01 | 19 |
33 | NY | New York | 2000-12-01 | 12 |
34 | MI | Ann Arbor | 2000-12-01 | 16 |
35 | OR | Portland | 2000-12-01 | 20 |
36 | NY | New York | 2001-01-01 | 13 |
37 | MI | Ann Arbor | 2001-01-01 | 17 |
38 | OR | Portland | 2001-01-01 | 21 |
The First Attempt
Let’s see what happens when we simply groupby the city and state:
number of poutine eaters | ||
---|---|---|
state | city | |
MI | Ann Arbor | 11 |
NY | New York | 7 |
OR | Portland | 15 |
But this only gives you the annual average by city and state. Instead, we want
to be able to split the months out by quarter. Now, you can do some special
tricks by mapping each month into a quarter (i.e., 01, 02, and 03 are mapped to
Q1 while 10, 11, and 12 are mapped to Q4) and then using that new column to
groupby. However, we can make use of the fantastic built in PeriodIndex function
to do all of the work for us.
The Final Step
number of poutine eaters | |||
---|---|---|---|
state | city | month | |
MI | Ann Arbor | 2000Q1 | 6 |
2000Q2 | 9 | ||
2000Q3 | 12 | ||
2000Q4 | 15 | ||
2001Q1 | 17 | ||
NY | New York | 2000Q1 | 2 |
2000Q2 | 5 | ||
2000Q3 | 8 | ||
2000Q4 | 11 | ||
2001Q1 | 13 | ||
OR | Portland | 2000Q1 | 10 |
2000Q2 | 13 | ||
2000Q3 | 16 | ||
2000Q4 | 19 | ||
2001Q1 | 21 |
That’s it! Pretty Wicked, right? Let me know what you think in the comments below.