Leveraging Sparse Arrays for Large-ish (Feature) Data Sets
When building a machine learning model, you’ll often have your feature data
stored in one or more database tables (e.g., with a couple of million rows and
maybe a thousand columns) where each row represents, say, an individual user and
each column represents some feature that you’ve engineered (e.g., number of
logins to your site, number of daily purchases, number of ads clicked on, 52
consecutive columns storing a weekly moving average over a year, etc.). Since
storage is cheap and you can’t afford to pay for a compute cluster, your
instinct might be to download the full table onto your laptop as a quick and
dirty CSV file or, possibly, in a more portable and compact parquet format.
However, depending on your local computer hardware, trying to load that full
data set into a Pandas DataFrame might consume all of your memory and cause your
system to grind to a halt!
Luckily, there is one approach that many people overlook but that might work
wonderfully for you. The secret is to exploit the inherent sparsity that likely
exists within your large data set!
As an illustrative example, let’s consider the following list of ones and zeros:
[0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
To capture all of the information to represent this list, all that you really
need is:
- The total length of the list
- The exact position of each `1` (the position of each `0` is implied)
In other words, since the majority of the list is full of zeros, we can start by
assuming that the entire list is all zeros with the exception of the locations of
the `1`s. In doing so, we only need to record a few bits of information, thereby
saving a ton of memory to do other interesting things. The number of nonzero
elements in your matrix/array also defines the density of your matrix. That is,
the more nonzero elements you have (relative to the total number of elements),
the higher the density. A sparse matrix/array has very low density (i.e., less
than 25% dense) but, in the case of machine learning, you run the risk of not
having enough information to learn from if your matrix/array is too sparse
(i.e., less than 5% dense). This philosophical discussion is probably beyond the
scope of this blog post, whose purpose is to explain how to take your data and
create a sparse matrix/array that can be used for building a machine learning
model.
In fact, SciPy already offers fantastic support for building your own sparse
matrices. However, to future-proof ourselves against the soon-to-be-deprecated
2D NumPy matrix format, we’ll be leveraging the PyData Sparse package for all of
our sparse nd-array needs. So, in places below where you see “sparse matrix”,
know that we really mean a “2D array” but, unlike a matrix, the array can be
generalized to higher dimensions.
So What?
If you already have your data in hand then, most of the time, converting that
numpy array or pandas dataframe into a sparse matrix is pretty trivial. However,
things get really tricky when your data is stored within one giant table/file
that you can’t load into memory all at once, or when it is stored across
multiple medium-sized tables that you’ll then need to figure out how to stitch
back together in Python. In the latter case, each chunk/table/file will contain
the same number of rows (i.e., each row represents the same user) but a
different subset of the feature columns.
So, the pseudocode for converting this data into a sparse matrix might look
something like:
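(A rough, hypothetical sketch; none of the helper names below are real APIs.)

```python
# Hypothetical pseudocode sketch -- helper names are illustrative only
sparse_chunks = []
for chunk in fetch_chunks_from_database():    # retrieve one chunk at a time
    processed = process(chunk)                # clean up / reshape the chunk
    sparse_chunk = to_sparse(processed)       # convert the chunk to a sparse matrix
    sparse_chunks.append(sparse_chunk)        # collect it in a growing list
final_sparse_matrix = combine(sparse_chunks)  # stitch everything into one array
```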
At first glance, this seems pretty straightforward. You retrieve only one chunk
at a time, process each chunk, convert that chunk into a sparse matrix, and
append it to a growing list so that it can be combined into a single large
sparse matrix. But once you dig into it, you’ll discover all the ways that
things can go wrong! Below, we’ll show you one relatively clean approach that
will help to keep everything in order.
Getting Started
Large Wide Data
Let’s pretend that the dataframe below represents our large data set, which is
stored in some remote database or chunked across multiple files. Remember, since
the data is “big”, the database only allows you to fetch, say, five columns of
data at a time and, realistically, you don’t want to have to write any of this
data to disk if at all possible.
|   | user_id | feature_a | feature_b | feature_c | feature_d | feature_e | feature_f | feature_g | feature_h | feature_i |
|---|---------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| 0 | x       | 1         | 2         | 3         | 6         | 0         | 0         | 7         | 0         | 0         |
| 1 | y       | 2         | 3         | 0         | 0         | 0         | 4         | 8         | 0         | 5         |
| 2 | z       | 3         | 0         | 1         | 0         | 1         | 2         | 0         | 0         | 9         |
| 3 | w       | 0         | 0         | 0         | 0         | 3         | 0         | 1         | 0         | 0         |
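For reference, a toy dataframe matching the table above (we’ll call it `all_df`,
the name used later in this post) could be built like so:

```python
import pandas as pd

# Toy "large" wide dataframe used throughout this post
all_df = pd.DataFrame(
    {
        "user_id": ["x", "y", "z", "w"],
        "feature_a": [1, 2, 3, 0],
        "feature_b": [2, 3, 0, 0],
        "feature_c": [3, 0, 1, 0],
        "feature_d": [6, 0, 0, 0],
        "feature_e": [0, 0, 1, 3],
        "feature_f": [0, 4, 2, 0],
        "feature_g": [7, 8, 0, 1],
        "feature_h": [0, 0, 0, 0],  # all zeros -- carries no information
        "feature_i": [0, 5, 9, 0],
    }
)
```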
Mimic a Chunked Data Source
Here, we use a Python generator to pretend that we are accessing our fictitious
database one subset of columns at a time. But, in the real world, this would be
replaced by your database connection of choice.
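A minimal sketch of such a generator, assuming the `all_df` dataframe defined
above and a five-column limit, might look like:

```python
def get_df_chunks(df, chunk_size=5):
    """Yield the user_id column plus `chunk_size` feature columns at a time."""
    feature_cols = [c for c in df.columns if c != "user_id"]
    for i in range(0, len(feature_cols), chunk_size):
        yield df[["user_id"] + feature_cols[i : i + chunk_size]]
```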
Now, we could retrieve our first chunk directly from our database, but there are
a couple of issues:
- The chunk itself is too big, as it contains a lot of zeros
- SciPy expects the data to be in a long format instead of a wide format
So, we should try to get the database to do as much of the heavy lifting as
possible, such as pivoting the wide tables into a long format and then removing
all zeros so that only a long table with nonzero entries is returned. Let’s
modify our `get_df_chunks` generator from above to reflect this:
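(A sketch that melts each wide chunk into a long `(user_id, feature, value)`
table and keeps only the nonzero entries.)

```python
def get_df_chunks(df, chunk_size=5):
    """Yield long-format chunks containing only the nonzero entries."""
    feature_cols = [c for c in df.columns if c != "user_id"]
    for i in range(0, len(feature_cols), chunk_size):
        # Pivot the wide chunk into a long (user_id, feature, value) table
        long_df = df[["user_id"] + feature_cols[i : i + chunk_size]].melt(
            id_vars="user_id", var_name="feature", value_name="value"
        )
        # Drop the zeros so that only information-carrying entries are returned
        yield long_df[long_df["value"] != 0]
```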
Additionally, we’ll want to convert the `user_id` and `feature` columns to a
categorical dtype and, to further reduce the memory footprint of our sparse
matrix, we’ll force the `value` column to be `np.float32`:
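(A sketch operating on each `long_df` yielded by the generator above.)

```python
import numpy as np

# Shrink the memory footprint of each incoming chunk
long_df["user_id"] = long_df["user_id"].astype("category")
long_df["feature"] = long_df["feature"].astype("category")
long_df["value"] = long_df["value"].astype(np.float32)
```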
Breaking Things Down
Keeping Track of the Data Chunks
Since each `long_df` chunk may contain a different/overlapping set of `user_id`
and `feature` values, we’ll need to keep track of them in the order in which
they are observed so that we can cross-reference our index-less sparse matrix
with a nice `user_id` or `feature` lookup table. So, as each chunk is retrieved,
we’ll need to:
1. Start with an empty `user_id` list of categories, `all_user_ids`
2. Get an ordered list of all of the `user_id` categories from the incoming `long_df`
3. Append the incoming (ordered) `user_id` list to the (ordered) categories in the existing `all_user_ids` list and remove redundant categories
4. Repeat steps 2-3 for all other chunks
The same is done for `feature`. Now, you might be tempted to use Python’s
built-in `set` operations to help identify a non-redundant list of `user_id` and
`feature` values, but this is insufficient since order isn’t maintained! So,
instead, we initialize a couple of Pandas series for better bookkeeping:
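(A minimal version, starting from empty categorical series.)

```python
import pandas as pd

# Ordered bookkeeping of every user_id (row label) and feature (column label)
# observed so far -- both start out empty
all_user_ids = pd.Series(dtype="category")
all_features = pd.Series(dtype="category")
```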
And we’ll define a helper function to determine whether we’ve already “seen” a
certain category before.
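A hypothetical helper (the name `get_unseen_categories` is ours, for
illustration) might look like:

```python
def get_unseen_categories(incoming_categories, existing_categories):
    """Return the incoming categories not yet observed, preserving their order."""
    # A set is used only for fast membership tests; the *incoming* order is kept
    existing = set(existing_categories)
    return [c for c in incoming_categories if c not in existing]
```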
So, as each chunk comes in, we’ll update the observed `all_user_ids` and
`all_features` categories on the fly:
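(A sketch using the hypothetical helper above.)

```python
# Append any newly observed categories, in order of appearance
all_user_ids = all_user_ids.cat.add_categories(
    get_unseen_categories(long_df["user_id"].unique(), all_user_ids.cat.categories)
)
all_features = all_features.cat.add_categories(
    get_unseen_categories(long_df["feature"].unique(), all_features.cat.categories)
)
```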
Creating a Sparse Array Using the PyData Sparse Package
Installing the PyData Sparse package should be as straightforward as:
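```
pip install sparse
```

(The package is published on PyPI simply as `sparse`.)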
According to their
documentation, you can
construct a sparse nd-array (2D in our case) by doing:
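(A small example along the lines of their docs; the numbers are illustrative.)

```python
import sparse

# Nonzero values live at explicit (row, column) coordinates; all else is zero
coords = [[0, 1, 2, 3, 4],   # row indices
          [0, 1, 2, 3, 4]]   # column indices
data = [10, 20, 30, 40, 50]
s = sparse.COO(coords, data, shape=(5, 5))
```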
In our case, we can do something similar with our incoming long dataframe:
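(A sketch using the categorical codes as coordinates; the row/column bookkeeping
is refined in the full listing below.)

```python
import sparse

# Use the categorical codes of each column as the (row, column) coordinates
row_idx = long_df["user_id"].cat.codes.to_numpy()
col_idx = long_df["feature"].cat.codes.to_numpy()

new_sparse_array = sparse.COO(
    [row_idx, col_idx],
    long_df["value"].to_numpy(),
    shape=(
        long_df["user_id"].cat.categories.size,
        long_df["feature"].cat.categories.size,
    ),
)
```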
And, voila, we have a sparse 2D array! As we process each new incoming
`long_df`, we can add its sparse array to our existing sparse array via the
`concatenate` function:
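(Assuming the two arrays already share the same number of rows.)

```python
import sparse

# Stitch the new block of feature columns onto the right-hand side
existing_sparse_array = sparse.concatenate(
    [existing_sparse_array, new_sparse_array], axis=1
)
```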
Putting It All Together
This is the workhorse, so please review it carefully. A lot of this might seem
obvious or trivial, but there are actually a lot of “gotchas”, and one needs to
be mindful of the pitfalls of this approach. The biggest thing is making sure
that the rows and columns of our sparse array can be referenced correctly after
we’ve stitched everything back together.
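Below is a sketch that is consistent with the pieces above and with the
non-overlapping-columns assumption noted at the end of this post (helper names
such as `get_unseen_categories` are ours):

```python
import numpy as np
import pandas as pd
import sparse

all_user_ids = pd.Series(dtype="category")   # ordered row labels seen so far
all_features = pd.Series(dtype="category")   # ordered column labels seen so far
existing_sparse_array = None

for long_df in get_df_chunks(all_df, chunk_size=5):
    # Shrink the memory footprint of the incoming chunk
    long_df = long_df.copy()
    long_df["user_id"] = long_df["user_id"].astype("category")
    long_df["feature"] = long_df["feature"].astype("category")
    long_df["value"] = long_df["value"].astype(np.float32)

    # Record any categories we have not seen before, in order of appearance
    new_features = get_unseen_categories(
        long_df["feature"].unique(), all_features.cat.categories
    )
    all_user_ids = all_user_ids.cat.add_categories(
        get_unseen_categories(long_df["user_id"].unique(), all_user_ids.cat.categories)
    )
    all_features = all_features.cat.add_categories(new_features)

    # Rows are encoded against the *global* user ordering; columns against the
    # features carried by this chunk (which extend the global feature ordering)
    row_idx = pd.Categorical(
        long_df["user_id"], categories=all_user_ids.cat.categories
    ).codes
    col_idx = pd.Categorical(long_df["feature"], categories=new_features).codes

    new_sparse_array = sparse.COO(
        [row_idx, col_idx],
        long_df["value"].to_numpy(),
        shape=(all_user_ids.cat.categories.size, len(new_features)),
    )

    if existing_sparse_array is None:
        existing_sparse_array = new_sparse_array
    else:
        # If this chunk introduced brand-new users, pad the existing array with
        # empty rows so the two arrays line up before concatenation
        if existing_sparse_array.shape[0] < new_sparse_array.shape[0]:
            existing_sparse_array = sparse.COO(
                existing_sparse_array.coords,
                existing_sparse_array.data,
                shape=(new_sparse_array.shape[0], existing_sparse_array.shape[1]),
            )
        existing_sparse_array = sparse.concatenate(
            [existing_sparse_array, new_sparse_array], axis=1
        )
```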
Indeed, our final `existing_sparse_array` is sparse and has the correct
dimensions:
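(Expected values shown for the toy data and sketch above.)

```python
print(existing_sparse_array.shape)  # (4, 8): 4 users x 8 feature columns
print(existing_sparse_array.nnz)    # 17 stored nonzero values out of 32 cells
```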
To confirm that our sparse array is the same as our original `all_df` dataframe,
we can convert it back to dense form:
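(A sketch that reattaches the row/column labels from our lookup series.)

```python
import pandas as pd

# Convert back to a dense numpy array and reattach the row/column labels
dense = existing_sparse_array.todense()
reconstructed_df = pd.DataFrame(dense, columns=all_features.cat.categories)
reconstructed_df.insert(0, "user_id", all_user_ids.cat.categories)
```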
And we see that it is nearly identical to `all_df`:
|   | user_id | feature_a | feature_b | feature_c | feature_d | feature_e | feature_f | feature_g | feature_h | feature_i |
|---|---------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| 0 | x       | 1         | 2         | 3         | 6         | 0         | 0         | 7         | 0         | 0         |
| 1 | y       | 2         | 3         | 0         | 0         | 0         | 4         | 8         | 0         | 5         |
| 2 | z       | 3         | 0         | 1         | 0         | 1         | 2         | 0         | 0         | 9         |
| 3 | w       | 0         | 0         | 0         | 0         | 3         | 0         | 1         | 0         | 0         |
The observant individual will notice that `existing_sparse_array` has a missing
column. That is, `feature_h`, which contains all zeros, is completely missing,
and this is by design since that column contains no useful information to learn
from. If you recall, this was handled earlier by our fake database generator,
`get_df_chunks`, when we removed all zero-valued elements before returning
`long_df`:
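(Recalling the relevant line from the generator sketch above.)

```python
# Inside get_df_chunks: only the nonzero entries survive
yield long_df[long_df["value"] != 0]
```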
We can also check that our stored `all_features` categories did not accidentally
capture `feature_h` and that the order of the categories is retained:
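(With the toy data and the sketch above, we would expect something like:)

```python
print(list(all_features.cat.categories))
# ['feature_a', 'feature_b', 'feature_c', 'feature_d', 'feature_e',
#  'feature_f', 'feature_g', 'feature_i']   <- no feature_h
```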
Success! You can also access the ordered list of `all_user_ids` via:
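(Again following the sketch above:)

```python
print(list(all_user_ids.cat.categories))
# ['x', 'y', 'z', 'w']
```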
Or save the list to a CSV file:
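(One of many ways to do so; the filename is arbitrary.)

```python
import pandas as pd

pd.Series(all_user_ids.cat.categories, name="user_id").to_csv(
    "all_user_ids.csv", index=False
)
```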
Finally, if you wanted to, say, build a scikit-learn or XGBoost model, then all
you need to do is hand this sparse COO matrix over to the XGBoost model builder
in Compressed Sparse Row (CSR) format, just as if it were a dense numpy array:
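(A sketch; the label vector `y` below is hypothetical and stands in for your own
target.)

```python
import numpy as np
import xgboost as xgb

# Convert the 2D COO array to a SciPy CSR matrix, which XGBoost accepts directly
X_csr = existing_sparse_array.tocsr()

# Hypothetical binary labels, one per user/row
y = np.array([0, 1, 0, 1])

model = xgb.XGBClassifier()
model.fit(X_csr, y)
```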
Conclusion
From our experience, and depending on your local hardware resources, we’ve been
able to leverage this approach for data containing around a million rows by a
thousand features (or many more rows if you decrease the number of features by
an order of magnitude). Additionally, if your data is even bigger than this,
then, depending on the type of model that you are trying to build, you may be
able to use scikit-learn’s `partial_fit` method (or, for some models, the
related `warm_start` option) by breaking your data up into a smaller number of
rows and feeding the chunks back in an iterative fashion.
A final word of warning: with this sparse array approach, you definitely do not
want to normalize or standardize your data (a.k.a. feature scaling), as both
options will typically increase the density of your array and you’ll be back to
square one. Also, note that popular tree-based methods such as random forest and
XGBoost do not require feature scaling, as they are insensitive to monotonic
transformations.
Additionally, in this post, we’ve made one key assumption: that our data is
already cleaned (in our database) and that all of the subset tables derived from
it are non-overlapping and non-redundant. Having said that, the code above can
serve as an excellent starting point for you to build from.
Hopefully, this approach will work for you as you venture down your data science
journey! Let me know what you think in the comments below.
Update
Ethan Rosenthal, a wonderful physicist-turned-data-scientist in his own right,
reminded us that scikit-learn’s `DictVectorizer` is another great option for
generating sparse feature matrices if your data is small enough to fit into
memory. Thanks, Ethan! We should note that, unlike `DictVectorizer`, the
approach presented in this blog post can handle large datasets and requires only
a single pass over the chunked data. An additional goal of this blog post was to
provide a real working example of using the PyData Sparse package.