When building a machine learning model, you’ll often have your feature data stored in one or more database tables (e.g., a couple of million rows and maybe a thousand columns) where each row represents, say, an individual user and each column represents some feature that you’ve engineered (e.g., number of logins to your site, number of daily purchases, number of ads clicked on, 52 consecutive columns storing a weekly moving average over a year, etc.). Since storage is cheap and you can’t afford to pay for a compute cluster, your instinct might be to download the full table onto your laptop as a quick-and-dirty CSV file or, possibly, in a more portable and compact Parquet format. However, depending on your local computer hardware, trying to load that full data set into a Pandas DataFrame might consume all of your memory and grind your system to a halt!

Luckily, there is one approach that many people overlook but that might work wonderfully for you. The secret is to exploit the inherent sparsity that likely exists within your large data set! As an illustrative example, let’s consider the following list of ones and zeros:

[0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

To capture all of the information to represent this list, all that you really need is:

  1. The total length of the list
  2. The exact position of each 1 (the position of each 0 is implied)



In other words, since the majority of the list is full of zeros, we can start by assuming that the entire list is zeros and record only the locations of the 1s. In doing so, we end up needing to store just a few pieces of information, thereby saving a ton of memory for other interesting things. The fraction of nonzero elements in your matrix/array (relative to the total number of elements) defines its density: the more nonzero elements you have, the higher the density. A sparse matrix/array has very low density (say, less than 25% dense) but, in the case of machine learning, you run the risk of not having enough information to learn from if your matrix/array is too sparse (say, less than 5% dense). That philosophical discussion is beyond the scope of this blog post, whose goal is to explain how to take your data and create a sparse matrix/array that can be used for building a machine learning model.
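
Here is a minimal sketch of this idea in plain Python (the variable names are purely illustrative):

dense_list = [0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# All we really need to record:
length = len(dense_list)                                    # 20
one_positions = [i for i, v in enumerate(dense_list) if v]  # [1, 2, 8, 19]

# Density: the fraction of elements that are nonzero
density = len(one_positions) / length                       # 0.2, i.e., 20% dense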

In fact, SciPy already offers fantastic support for building your own sparse matrices. However, to future-proof ourselves against the soon-to-be-deprecated 2D numpy matrix format, we’ll be leveraging the PyData Sparse package for all of our sparse nd-array needs. So, in places below where you see “sparse matrix”, know that we really mean a “2D array”; unlike a matrix, the array can be generalized to higher dimensions.

So What?


If you already have your data in hand then, most of the time, converting that numpy array or pandas dataframe into a sparse matrix is pretty trivial. However, things get really tricky when your data is stored in one giant table/file that you can’t load into memory all at once, or is spread across multiple medium-sized tables that you then need to stitch back together in Python. In the latter case, each chunk/table/file will contain the same number of rows (i.e., each row represents the same user) but only a different subset of the feature columns.

So, the pseudocode for converting this data into a sparse matrix might look something like:

sparse_matrix_list = []
db_cnxn = connect_to_db()
for chunk in db_cnxn.get(chunks):
    sparse_chunk = convert_chunk_to_sparse(chunk)
    sparse_matrix_list.append(sparse_chunk)
full_sparse_matrix = combine_chunks(sparse_matrix_list)



At first glance, this seems pretty straightforward. You retrieve only one chunk at a time, process each chunk, convert that chunk into a sparse matrix, and append it to a growing list so that it can be combined into a single large sparse matrix. But once you dig into it, you’ll discover all the ways that things can go wrong! Below, we’ll show you one relatively clean approach that will help to keep everything in order.

Getting Started


import sparse
import pandas as pd
import numpy as np



Large Wide Data


Let’s pretend that the dataframe below represents our large data set, which is stored in some remote database or chunked across multiple files. Remember, since the data is “big”, the database only allows you to fetch, say, five columns of data at a time and, realistically, you don’t want to write any of this data to disk if you can avoid it.

all_df = pd.DataFrame([['x', 1, 2, 3, 6, 0, 0, 7, 0, 0], 
                       ['y', 2, 3, 0, 0, 0, 4, 8, 0, 5], 
                       ['z', 3, 0, 1, 0, 1, 2, 0, 0, 9], 
                       ['w', 0, 0, 0, 0, 3, 0, 1, 0, 0]], 
                      columns=['user_id', 
                               'feature_a', 
                               'feature_b', 
                               'feature_c', 
                               'feature_d', 
                               'feature_e', 
                               'feature_f', 
                               'feature_g', 
                               'feature_h', 
                               'feature_i'])
all_df



  user_id  feature_a  feature_b  feature_c  feature_d  feature_e  feature_f  feature_g  feature_h  feature_i
0       x          1          2          3          6          0          0          7          0          0
1       y          2          3          0          0          0          4          8          0          5
2       z          3          0          1          0          1          2          0          0          9
3       w          0          0          0          0          3          0          1          0          0



Mimic a Chunked Data Source


Here, we use a Python generator to pretend that we are accessing our fictitious database one subset of columns at a time. In the real world, this would be replaced by your database connection of choice.

def get_df_chunks(df):
    for chunk in [['user_id', 'feature_a', 'feature_b', 'feature_c'], 
                  ['user_id', 'feature_d', 'feature_e', 'feature_f', 'feature_g'], 
                  ['user_id', 'feature_h', 'feature_i']]:
        
        yield df[chunk]



Now, we could retrieve our first chunk directly from our database, but there are a couple of issues:

  1. The chunk itself is too big as it contains a lot of zeros
  2. The sparse COO constructor expects the data to be in long format instead of wide format



So, we should try to get the database to do as much of the heavy lifting as possible, such as pivoting the wide tables into long format and then removing all zeros so that only a long table of nonzero entries is returned. Let’s modify our get_df_chunks generator from above to reflect this:

def get_df_chunks(df):
    for chunk in [['user_id', 'feature_a', 'feature_b', 'feature_c'], 
                  ['user_id', 'feature_d', 'feature_e', 'feature_f', 'feature_g'], 
                  ['user_id', 'feature_h', 'feature_i']]:
        
        long_df = df[chunk].melt(id_vars='user_id', var_name='feature')  # Pivot to long format
        long_df = long_df[long_df['value'] != 0]  # Keep only the nonzero elements
        
        yield long_df


for long_df in get_df_chunks(all_df):
    print(long_df, end="\n\n")


       user_id    feature  value
    0        x  feature_a      1
    1        y  feature_a      2
    2        z  feature_a      3
    4        x  feature_b      2
    5        y  feature_b      3
    8        x  feature_c      3
    10       z  feature_c      1
    
       user_id    feature  value
    0        x  feature_d      6
    6        z  feature_e      1
    7        w  feature_e      3
    9        y  feature_f      4
    10       z  feature_f      2
    12       x  feature_g      7
    13       y  feature_g      8
    15       w  feature_g      1
    
      user_id    feature  value
    5       y  feature_i      5
    6       z  feature_i      9



Additionally, we’ll want to convert the user_id and feature columns to a categorical dtype and, to further reduce the memory footprint of our sparse matrix, we’ll force the value column to np.float32:

for long_df in get_df_chunks(all_df):
    long_df[['user_id', 'feature']] = long_df[['user_id', 'feature']].astype('category')
    long_df['value'] = long_df['value'].astype(np.float32)
    
    # Convert to sparse


Breaking Things Down


Keeping Track of the Data Chunks


Since each long_df chunk may contain a different/overlapping set of user_id and feature values, we’ll need to keep track of them in the order in which they are observed so that we can cross-reference our index-less sparse matrix with a user_id or feature lookup table. So, as each chunk is retrieved, we’ll need to:

  1. Start with an empty user_id list of categories all_user_ids
  2. Get an ordered list of all of the user_id categories from the incoming long_df
  3. Append any incoming user_id categories that are not already present to the (ordered) all_user_ids list
  4. Repeat steps 2-3 for all other chunks



This is done for feature as well. Now, you might be tempted to use Python’s built-in set operations to identify a non-redundant list of user_id and feature values, but this is insufficient since order isn’t maintained! So, instead, we initialize a couple of Pandas Series for better bookkeeping:

all_user_ids = pd.Series([], dtype='category')
all_features = pd.Series([], dtype='category')


And we’ll define a helper function to determine whether we’ve already “seen” a certain category before.

def return_new_categories(add_cat, old_cat):
    old_cat_set = set(old_cat)  # this reduces the lookup time from O(n) to O(1)
    return [cat for cat in add_cat if cat not in old_cat_set]
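
As a quick sanity check of this helper (toy inputs):

return_new_categories(['x', 'y', 'z'], ['x'])
# ['y', 'z']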


So, as each chunk comes in, we’ll update the observed all_user_ids and all_features categories on the fly:

all_user_ids = pd.Series([], dtype='category')
all_features = pd.Series([], dtype='category')

for i, long_df in enumerate(get_df_chunks(all_df)):
    
    long_df[['user_id', 'feature']] = long_df[['user_id', 'feature']].astype('category')
    long_df['value'] = long_df['value'].astype(np.float32)

    new_ids = return_new_categories(list(long_df['user_id'].cat.categories), list(all_user_ids.cat.categories))
    all_user_ids = all_user_ids.cat.add_categories(new_ids)
    
    new_features = return_new_categories(list(long_df['feature'].cat.categories), list(all_features.cat.categories))
    all_features = all_features.cat.add_categories(new_features)
    
    long_df['user_id'] = long_df['user_id'].cat.set_categories(all_user_ids.cat.categories)



Creating a Sparse Array Using the PyData Sparse Package


Installing the PyData Sparse package should be as straightforward as:

pip install sparse


According to their documentation, you can construct a sparse nd-array (2D in our case) by doing:

import sparse

coords = [[0, 1, 2, 3, 4],
          [0, 1, 2, 3, 4]]
data = [10, 20, 30, 40, 50]
s = sparse.COO(coords, data, shape=(5, 5))

s.todense()
# array([[10,  0,  0,  0,  0],
#       [ 0, 20,  0,  0,  0],
#       [ 0,  0, 30,  0,  0],
#       [ 0,  0,  0, 40,  0],
#       [ 0,  0,  0,  0, 50]]) 


In our case, we can do something similar with our incoming long dataframe:

nrow = all_df.shape[0]  # This is the number of rows of user_id in the original dataframe
ncol = long_df['feature'].cat.categories.shape[0]  # This is the number of unique features in this long_df

coords = (long_df['user_id'].cat.codes.copy(), long_df['feature'].cat.codes.copy())
data = long_df['value']

some_sparse_array = sparse.COO(coords, data, shape=(nrow, ncol)).astype(np.float32)


And, voilà, we have a sparse 2D array! As we process each new incoming long_df, we can append its sparse array to our existing sparse array via the concatenate function:

existing_sparse_array = sparse.concatenate([existing_sparse_array, new_sparse_array], axis=1) 
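
To see how this behaves, here is a small, self-contained sketch with toy arrays (not part of the pipeline above); concatenating along axis=1 glues chunks together column-wise while keeping the row count fixed:

import numpy as np
import sparse

left = sparse.COO.from_numpy(np.array([[1, 0], [0, 2]]))   # shape (2, 2)
right = sparse.COO.from_numpy(np.array([[0, 3], [4, 0]]))  # shape (2, 2)

combined = sparse.concatenate([left, right], axis=1)
combined.shape
# (2, 4)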


Putting It All Together


This is the workhorse, so please review it carefully. A lot of this might seem obvious or trivial, but there are quite a few “gotchas”, and one needs to be mindful of the pitfalls of this approach. The biggest thing is making sure that the rows and columns of our sparse array can be referenced correctly after we’ve stitched everything back together.

all_user_ids = pd.Series([], dtype='category')
all_features = pd.Series([], dtype='category')

for i, long_df in enumerate(get_df_chunks(all_df)):
    
    # Convert the lookup columns to categorical and shrink the value dtype
    long_df[['user_id', 'feature']] = long_df[['user_id', 'feature']].astype('category')
    long_df['value'] = long_df['value'].astype(np.float32)

    # Track any user_id or feature categories we haven't seen before,
    # preserving the order in which they are first observed
    new_ids = return_new_categories(list(long_df['user_id'].cat.categories), list(all_user_ids.cat.categories))
    all_user_ids = all_user_ids.cat.add_categories(new_ids)
    new_features = return_new_categories(list(long_df['feature'].cat.categories), list(all_features.cat.categories))
    all_features = all_features.cat.add_categories(new_features)
    
    # Re-code this chunk's user_id against the full, accumulated category list
    # so that row indices line up across chunks
    long_df['user_id'] = long_df['user_id'].cat.set_categories(all_user_ids.cat.categories)
    
    nrow = all_df.shape[0]                             # total number of users (rows)
    ncol = long_df['feature'].cat.categories.shape[0]  # number of features in this chunk
    
    # Build this chunk's sparse 2D array from (row, column) coordinates
    coords = (long_df['user_id'].cat.codes.copy(), long_df['feature'].cat.codes.copy())
    data = long_df['value']
    
    new_sparse_array = sparse.COO(coords, data, shape=(nrow, ncol)).astype(np.float32)
    
    # Stitch the chunks together column-wise
    if i == 0:
        existing_sparse_array = new_sparse_array
    else:
        existing_sparse_array = sparse.concatenate([existing_sparse_array, new_sparse_array], axis=1)


Indeed, our final existing_sparse_array is sparse and has the correct dimensions:

existing_sparse_array

# <COO: shape=(4, 8), dtype=float32, nnz=17, fill_value=0.0>


To confirm that our sparse array is the same as our original all_df dataframe, we can convert it back to dense form:

existing_sparse_array.todense()

# array([[1., 2., 3., 6., 0., 0., 7., 0.],
#        [2., 3., 0., 0., 0., 4., 8., 5.],
#        [3., 0., 1., 0., 1., 2., 0., 9.],
#        [0., 0., 0., 0., 3., 0., 1., 0.]], dtype=float32)
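
If you prefer a programmatic check over eyeballing it, a quick assertion along these lines (using the lookup Series built above, and assuming the toy data from this post) should hold:

expected = all_df.set_index('user_id').loc[list(all_user_ids.cat.categories),
                                           list(all_features.cat.categories)]
assert np.array_equal(existing_sparse_array.todense(),
                      expected.to_numpy(dtype=np.float32))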



And we see that it is nearly identical to all_df:

all_df



  user_id  feature_a  feature_b  feature_c  feature_d  feature_e  feature_f  feature_g  feature_h  feature_i
0       x          1          2          3          6          0          0          7          0          0
1       y          2          3          0          0          0          4          8          0          5
2       z          3          0          1          0          1          2          0          0          9
3       w          0          0          0          0          3          0          1          0          0



The observant reader will notice that existing_sparse_array is missing a column. That is, feature_h, which contains all zeros, is completely absent, and this is by design since that column contains no useful information from which to learn. If you recall, this was handled earlier by our fake database generator, get_df_chunks, when we removed all zero elements before returning long_df:

long_df = long_df[long_df['value'] != 0]  # Keep only the nonzero elements


We can also check that our stored all_features categories did not accidentally capture feature_h and that the order of the categories was retained:

all_features.cat.categories

# Index(['feature_a', 'feature_b', 'feature_c', 'feature_d', 'feature_e',
#        'feature_f', 'feature_g', 'feature_i'],
#       dtype='object')


Success! You can also access the ordered list of all_user_ids via:

all_user_ids.cat.categories
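# For our toy example, the user IDs come back in the order in which they were
# first observed across the chunks:
# Index(['x', 'y', 'z', 'w'], dtype='object')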


Or save the list to a CSV file:

all_user_ids.cat.categories.to_series().to_csv("all_user_ids.csv", header=False)


Finally, if you wanted to, say, build a scikit-learn or XGBoost model, then all you need to do is convert this sparse COO array to Compressed Sparse Row (CSR) format and hand it to the model builder just as if it were a dense numpy array:

clf.fit(existing_sparse_array.tocsr(), responses) 
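
For instance, here is a minimal end-to-end sketch using scikit-learn’s LogisticRegression; the responses labels are hypothetical and, in practice, would come from your own data, in the same row order as all_user_ids.cat.categories:

from sklearn.linear_model import LogisticRegression

# Hypothetical binary labels, one per user, in the same row order as the sparse array
responses = [0, 1, 0, 1]

clf = LogisticRegression()
clf.fit(existing_sparse_array.tocsr(), responses)  # scikit-learn accepts CSR input directly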


Conclusion


From our experience, and depending on your local hardware resources, we’ve been able to leverage this approach for data containing around a million rows by a thousand features (or many more rows if you decrease the number of features by an order of magnitude). Additionally, if your data is even bigger than this then, depending on the type of model you are trying to build, you may be able to use scikit-learn’s partial_fit method (or, for some estimators, the warm_start option) by breaking your data up into smaller sets of rows and feeding the chunks in iteratively.
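
For example, here is a rough sketch using scikit-learn’s SGDClassifier, which supports partial_fit; the labels and chunk size are hypothetical, and the row-chunking is done in memory purely for illustration:

import numpy as np
from sklearn.linear_model import SGDClassifier

X_csr = existing_sparse_array.tocsr()
y = np.array([0, 1, 0, 1])   # hypothetical labels, one per row
classes = np.unique(y)       # partial_fit needs the full set of classes up front

clf = SGDClassifier()
chunk_size = 2               # in practice, as many rows as comfortably fit in memory
for start in range(0, X_csr.shape[0], chunk_size):
    clf.partial_fit(X_csr[start:start + chunk_size], y[start:start + chunk_size], classes=classes)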

A final word of warning: with this sparse array approach, you definitely do not want to normalize or standardize your data (a.k.a. feature scaling), as both options will typically increase the density of your array and put you back at square one. Note that popular tree-based methods such as random forest and XGBoost do not require feature scaling, as they are insensitive to monotonic transformations. Additionally, in this post we’ve made one key assumption: that our data is already cleaned (in our database) and that all of the subset tables derived from it are non-overlapping and non-redundant. Having said that, the code above can serve as an excellent starting point for you to build from.
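
To make the densification point concrete, here is a tiny sketch showing what mean-centering does to a mostly-zero feature column:

import numpy as np

x = np.array([0., 0., 0., 5., 0., 0., 0., 3.])   # one sparse feature column
np.count_nonzero(x) / x.size                      # 0.25 -> 25% dense

x_scaled = (x - x.mean()) / x.std()               # standardization fills in the zeros
np.count_nonzero(x_scaled) / x.size               # 1.0 -> 100% dense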

Hopefully, this approach will work for you as you venture down your data science journey! Let me know what you think in the comments below.

Update


Ethan Rosenthal, a wonderful physicist-turned-data scientist in his own right, reminded us that scikit-learn’s DictVectorizer is another great option for generating sparse feature matrices if your data is small enough to fit into memory. Thanks, Ethan! We should note that, unlike DictVectorizer, the approach presented in this blog post can handle large datasets and requires only a single pass over the chunked data. An additional goal of this post was to provide a real working example of using the PyData Sparse package.
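
For completeness, here is a minimal sketch of the DictVectorizer approach on toy data (not the dataframe from this post):

from sklearn.feature_extraction import DictVectorizer

# Each row is a dict holding only the nonzero features for one user
rows = [{'feature_a': 1, 'feature_b': 2},
        {'feature_a': 2, 'feature_f': 4}]

vec = DictVectorizer(sparse=True)
X = vec.fit_transform(rows)   # a scipy.sparse matrix; the zeros stay implicit
X.shape
# (2, 3)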


Published

Jul 26, 2019