• STUMPY - A Re-newed Approach to Time Series Analysis

    May 13, 2019


    Thanks to the support of TD Ameritrade, I recently open sourced (BSD-3-Clause) a new, powerful, and scalable Python library called STUMPY that can be used for a variety of time series data mining tasks. At the heart of it, this library takes any time series or sequential data and efficiently computes something called the matrix profile, which, with only a few extra lines of code, enables you to perform:

    • pattern/motif (approximately repeated subsequences within a longer time series) discovery
    • anomaly/novelty (discord) discovery
    • shapelet discovery
    • semantic segmentation
    • density estimation
    • time series chains (temporally ordered set of subsequence patterns)
    • and more…



    First, let’s install stumpy with Conda (preferred):

    conda install -c conda-forge stumpy



    or, alternatively, you can install stumpy with Pip:

    pip install stumpy



    Once stumpy is installed, typical usage would be to take your time series and compute the matrix profile:

    import stumpy
    import numpy as np
    
    your_time_series = np.random.rand(10000)
    window_size = 50  # Approximately, how many data points might be found in a pattern
    
    matrix_profile = stumpy.stump(your_time_series, m=window_size)
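
To give a feel for what the matrix profile contains, here is a brute-force sketch in plain NumPy. This is not how STUMPY computes it (STUMPY is far more efficient); the function name naive_matrix_profile, the planted sine pattern, and the exclusion zone of ceil(m / 4) are assumptions made for illustration. For every subsequence of length m, it records the z-normalized Euclidean distance to its nearest non-overlapping neighbor, so the smallest value points at a repeated pattern (motif).

```python
import numpy as np

def naive_matrix_profile(ts, m):
    """Brute-force z-normalized matrix profile, for illustration only.

    Assumes no subsequence is constant (a zero standard deviation would
    cause a division by zero).
    """
    subs = np.lib.stride_tricks.sliding_window_view(ts, m)  # NumPy >= 1.20
    n = subs.shape[0]
    mu = subs.mean(axis=1, keepdims=True)
    sigma = subs.std(axis=1, keepdims=True)
    z = (subs - mu) / sigma

    excl = int(np.ceil(m / 4))  # assumed exclusion zone to skip trivial matches
    profile = np.full(n, np.inf)
    indices = np.full(n, -1)
    for i in range(n):
        d = np.linalg.norm(z - z[i], axis=1)
        d[max(0, i - excl):min(n, i + excl + 1)] = np.inf
        j = np.argmin(d)
        profile[i], indices[i] = d[j], j
    return profile, indices

# Random noise with the same pattern planted at positions 30 and 120
rng = np.random.default_rng(0)
ts = rng.random(200)
pattern = np.sin(np.linspace(0, 2 * np.pi, 20))
ts[30:50] = pattern
ts[120:140] = pattern

profile, indices = naive_matrix_profile(ts, m=20)
motif_idx = int(np.argmin(profile))       # start of the best-matching subsequence
neighbor_idx = int(indices[motif_idx])    # start of its nearest neighbor
```

The lowest point of the profile and its nearest-neighbor index recover the two planted locations, which is the essence of motif discovery; the highest point would instead flag the most unusual subsequence (a discord).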



    For a more detailed example, check out our tutorials and documentation or feel free to file a GitHub issue. We welcome contributions in any form!

    I’d love to hear from you so let me know what you think!




  • Discussing Data Science R&D on the DataFramed Podcast

    Apr 1, 2019


    Due to DataCamp’s internal mishandling of the sexual assault, I will no longer be promoting this podcast recording and, instead, encourage you to read more about what happened here.

    I was recently invited to sit down and chat with Hugo Bowne-Anderson on the DataFramed podcast to talk Data Science R&D. Have a listen and leave your comments below!




  • Setting Values of a Sparse Matrix

    Feb 27, 2019


    Let’s say that you have a sparse matrix:

    import numpy as np
    from scipy.sparse import csr_matrix
    
    x = csr_matrix(np.array([[1, 0, 2, 0, 3], 
                             [0, 4, 0, 5, 0]]))
    print(repr(x))


    <2x5 sparse matrix of type '<class 'numpy.int64'>'
        with 5 stored elements in Compressed Sparse Row format>



    One of the most common things that you might want to do is to make a conditional selection from the matrix and then set those particular elements of the matrix to, say, zero. For example, we can take our matrix from above and set all elements with a value less than three to zero. Naively, one could do:

    x[x < 3] = 0



    This works and is fine for small matrices. However, you’ll likely encounter a warning message such as the following:

    /home/miniconda3/lib/python3.6/site-packages/scipy/sparse/compressed.py:282: SparseEfficiencyWarning: Comparing a sparse matrix with a scalar greater than zero using < is inefficient, try using >= instead.
      warn(bad_scalar_msg, SparseEfficiencyWarning)



    The problem here is that for large sparse matrices, the majority of the matrix is full of zeros and so the < comparison becomes highly inefficient. Instead, you really want to perform your comparison only with the nonzero elements of the matrix. However, this takes a little more work and a few more lines of code to accomplish the same thing. Additionally, we want to avoid converting our sparse matrices into costly dense arrays.
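
One way to restrict the comparison to the stored (nonzero) elements is to operate directly on the matrix's data attribute. The following is a minimal sketch assuming the same CSR matrix as above; it may differ from the approach taken in the full post.

```python
import numpy as np
from scipy.sparse import csr_matrix

x = csr_matrix(np.array([[1, 0, 2, 0, 3],
                         [0, 4, 0, 5, 0]]))

# x.data holds only the explicitly stored values, so the comparison
# never touches the implicit zeros
x.data[x.data < 3] = 0

# Drop the entries that were just set to zero so they are no longer stored
x.eliminate_zeros()

dense = x.toarray()  # only for inspecting this tiny example
```

Since the implicit zeros are already zero, skipping them gives the same result as the naive version without ever building a dense array.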




  • Pip Installing Wheels with Conda GCC/G++

    Jan 17, 2019


    I was trying to pip install a simple package that contained wheels that needed to be compiled with both GCC and G++. Of course, not being able to use sudo (i.e., yum install gcc) meant that I needed to rely on my good friend, Conda:

    conda install gcc_linux-64
    conda install gxx_linux-64



    Now, we aren’t done yet! According to the conda documentation, the compilers are found in /path/to/anaconda/bin but the gcc and g++ executables are prefixed with something like x86_64-conda-cos6-linux-gnu-gcc. So, we’ll need to create some symbolic links to these executables:

    ln -s /path/to/anaconda/bin/x86_64-conda-cos6-linux-gnu-gcc /path/to/anaconda/bin/gcc
    ln -s /path/to/anaconda/bin/x86_64-conda-cos6-linux-gnu-g++ /path/to/anaconda/bin/g++



    Now, my pip install command is able to compile the wheel successfully!




  • Compiling Facebook's StarSpace with Conda Boost

    Jun 17, 2018


    Recently, I was playing around with Facebook’s StarSpace, a general-purpose neural model for efficient learning of entity embeddings for solving a wide variety of problems. According to the installation instructions, you need a C++11 compiler and the Boost library. I already had GCC installed and Boost was only a quick conda command away:

    conda install boost



    The StarSpace Makefile is hardcoded to look for the Boost library in /usr/local/bin/boost_1_63_0/, which is a problem. But how should I modify the StarSpace Makefile so that it knows where to find the Boost library? After a little digging, I found the Boost files in /path/to/anaconda/include. So, all I had to do was modify the following line in the StarSpace Makefile:

    #BOOST_DIR = /usr/local/bin/boost_1_63_0/
    BOOST_DIR = /path/to/anaconda/include/



    I then executed make on the command line and everything compiled nicely! Yay!




  • I'm Melting! From Wide to Long Format and Quarterly Groupby

    Dec 27, 2017


    Recently, a colleague of mine asked me how one might go about taking a dataset that is in wide format and converting it into long format so that you could then perform some groupby operations by quarter.

    Here’s a quick example to illustrate one way to go about this using the Pandas melt function.

    Getting Started



    Let’s import the Pandas package

    import pandas as pd

    Load Some Data



    First, we’ll create a fake dataframe that contains the name of a state and city along with some data for each month in the year 2000 (plus January 2001). For simplicity, imagine that the data are the number of Canadians spotted eating poutine.

    df = pd.DataFrame([['NY', 'New York', 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], 
                       ['MI', 'Ann Arbor', 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
                       ['OR', 'Portland', 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
                      ],
                      columns=['state', 'city', 
                               '2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06',
                               '2000-07', '2000-08', '2000-09', '2000-10', '2000-11', '2000-12',
                               '2001-01'
                              ])
    df
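
Here is a minimal sketch of where this is heading, using a trimmed-down version of the dataframe above. The column names month, count, and quarter are illustrative choices and are not necessarily the ones used in the full post.

```python
import pandas as pd

# A trimmed-down version of the wide dataframe (four months only)
df = pd.DataFrame([['NY', 'New York', 1, 2, 3, 4],
                   ['MI', 'Ann Arbor', 5, 6, 7, 8]],
                  columns=['state', 'city',
                           '2000-01', '2000-02', '2000-03', '2000-04'])

# Wide -> long: one row per (state, city, month)
long_df = df.melt(id_vars=['state', 'city'],
                  var_name='month', value_name='count')

# Derive a quarter label such as '2000Q1' from the 'YYYY-MM' month string
long_df['quarter'] = (pd.to_datetime(long_df['month'], format='%Y-%m')
                        .dt.to_period('Q')
                        .astype(str))

# Sum the monthly counts within each quarter
quarterly = long_df.groupby(['state', 'city', 'quarter'],
                            as_index=False)['count'].sum()
```

Keeping state and city as id_vars means they survive the melt unchanged, so they can be used directly as groupby keys alongside the derived quarter.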




  • Select Rows with Keys Matching Multiple Columns in Subquery

    Oct 17, 2016


    When you query a database table using SQL, you might find the need to:

    1. select rows from table A using certain criteria (i.e., a WHERE clause)
    2. then, use one or more columns from the result set (coming from the above query) as a subquery to subselect from table B

    You can do this quite easily in SQL:

    import pandas as pd
    from pandasql import sqldf  # pip install pandasql from Yhat



    df_vals = pd.DataFrame({'key1': ['A', 'A','C', 'E', 'G'], 
                            'key2': ['B', 'Z', 'D', 'F', 'H'], 
                            'val': ['2','3','4','5','6']})
    
    df_vals



    key1 key2 val
    0 A B 2
    1 A Z 3
    2 C D 4
    3 E F 5
    4 G H 6



    df_colors = pd.DataFrame({'key1': ['A', 'A','C', 'E', 'G'], 
                              'key2': ['B', 'Z', 'D', 'F', 'H'], 
                              'color': ['red','orange','yellow','green','blue']})
    
    df_colors



    color key1 key2
    0 red A B
    1 orange A Z
    2 yellow C D
    3 green E F
    4 blue G H



    So, if we wanted to grab all rows from df_colors where the value in df_vals is inclusively between 2 and 6, then:
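
One way to write such a query is with a correlated EXISTS subquery that matches on both key columns. The sketch below swaps pandasql for Python's built-in sqlite3 module so that it runs without extra dependencies; the table aliases and the CAST (needed because val is stored as text) are assumptions of this sketch, and the full post may phrase the query differently.

```python
import sqlite3
import pandas as pd

df_vals = pd.DataFrame({'key1': ['A', 'A', 'C', 'E', 'G'],
                        'key2': ['B', 'Z', 'D', 'F', 'H'],
                        'val': ['2', '3', '4', '5', '6']})
df_colors = pd.DataFrame({'key1': ['A', 'A', 'C', 'E', 'G'],
                          'key2': ['B', 'Z', 'D', 'F', 'H'],
                          'color': ['red', 'orange', 'yellow', 'green', 'blue']})

# Load both dataframes into an in-memory SQLite database
conn = sqlite3.connect(':memory:')
df_vals.to_sql('df_vals', conn, index=False)
df_colors.to_sql('df_colors', conn, index=False)

# Keep a df_colors row only if a df_vals row with the same (key1, key2)
# pair has a value between 2 and 6 (inclusive)
query = """
SELECT c.*
FROM df_colors AS c
WHERE EXISTS (
    SELECT 1
    FROM df_vals AS v
    WHERE v.key1 = c.key1
      AND v.key2 = c.key2
      AND CAST(v.val AS INTEGER) BETWEEN 2 AND 6
)
"""
result = pd.read_sql_query(query, conn)
conn.close()
```

With the example data, every value falls within the range, so all five colors come back; tightening the BETWEEN bounds narrows the result accordingly.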




  • Pandas Split-Apply-Combine Example

    May 28, 2016


    There are times when I want to use split-apply-combine to save the results of a groupby to a JSON file while preserving the resulting column values as a list. Before we start, let’s import Pandas and generate a dataframe with some example email data.

    Import Pandas and Create an Email DataFrame



    import pandas as pd
    import numpy as np
    df = pd.DataFrame({'Sender': ['Alice', 'Alice', 'Bob', 'Carl', 'Bob', 'Alice'],
                       'Receiver': ['David', 'Eric', 'Frank', 'Ginger', 'Holly', 'Ingrid'],
                       'Emails': [9, 3, 5, 1, 6, 7]
                      })
    df
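
As a rough sketch of the end goal (not necessarily the exact steps in the full post), the dataframe can be grouped by sender, each remaining column collected into a list, and the result serialized to JSON. The variable names grouped, json_str, and records are illustrative.

```python
import json
import pandas as pd

df = pd.DataFrame({'Sender': ['Alice', 'Alice', 'Bob', 'Carl', 'Bob', 'Alice'],
                   'Receiver': ['David', 'Eric', 'Frank', 'Ginger', 'Holly', 'Ingrid'],
                   'Emails': [9, 3, 5, 1, 6, 7]})

# Split by sender, then combine each column's values into a Python list
grouped = df.groupby('Sender').agg(list).reset_index()

# One JSON object per sender, with list-valued fields
json_str = grouped.to_json(orient='records')

# Parse it back to confirm that the lists survived the round trip
records = json.loads(json_str)
```

Passing a file path to to_json instead of capturing the returned string would write the same content to disk.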




  • Pandas End-to-End Example

    May 25, 2016


    The indexing capabilities that come with Pandas are incredibly useful. However, I find myself forgetting the concepts beyond the basics when I haven’t touched Pandas in a while. This tutorial serves as my own personal reminder but I hope others will find it helpful as well.

    To motivate this, we’ll explore a baseball dataset and plot batting averages for some of the greatest players of all time.




  • Fetching Conda Packages Behind a Firewall

    Dec 23, 2015


    One of the most annoying things is not being able to update software if you’re behind a network firewall that requires SSL verification. You can turn this off in Anaconda via

    conda config --set ssl_verify no



    and for pip via

    pip install --trusted-host pypi.python.org --trusted-host pypi.org --trusted-host files.pythonhosted.org <package name>



    Optionally, you can also specify the package version like this:

    pip install --trusted-host pypi.python.org --trusted-host pypi.org --trusted-host files.pythonhosted.org <package name>==0.1.2



    Better yet, you can permanently set the trusted-host by adding the following to the $HOME/.pip/pip.conf file:

    [global]
    trusted-host = pypi.python.org
                   files.pythonhosted.org
                   pypi.org




  • Convert a Pandas DataFrame to Numeric

    Dec 15, 2015


    Pandas has deprecated the use of convert_objects to convert a dataframe into, say, float or datetime. Instead, for a series, one should use:

    df['A'] = pd.to_numeric(df['A'])


    or, for an entire dataframe:

    df = df.apply(pd.to_numeric)
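
Here is a small self-contained sketch of both forms. The example data and the use of errors='coerce' (which turns unparseable values into NaN instead of raising) are additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['1', '2.5', 'oops'],
                   'B': ['4', '5', '6']})

# Convert a single series; 'oops' cannot be parsed and becomes NaN
a_numeric = pd.to_numeric(df['A'], errors='coerce')

# Convert every column; extra keyword arguments are forwarded to pd.to_numeric
df_numeric = df.apply(pd.to_numeric, errors='coerce')
```

Without errors='coerce', the default behavior is to raise on the first value that cannot be converted.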





  • Python 2 Unicode Problem

    Dec 1, 2015


    The following Python error is one of the most annoying ones I’ve ever encountered:

    Traceback (most recent call last):
      File "./test.py", line 3, in <module>
        print out
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 0: ordinal not in range(128)


    Essentially, in Python 2, you can’t print unicode characters unless you’ve first encoded the text as a byte string (e.g., UTF-8). A detailed explanation can be found in Kumar McMillan’s wonderful talk titled ‘Unicode in Python, Completely Demystified’. To summarize, McMillan offers three useful yet simple rules:




  • 2015 Nobel Prize in Chemistry Awarded to DNA Repair

    Oct 7, 2015


    Today, the 2015 Nobel Prize in Chemistry was awarded to the field of DNA repair. I am especially excited by this news since I spent six years researching the role that DNA base-flipping plays in DNA repair when I was a graduate student at Michigan State University under the mentorship of Dr. Michael Feig. Thus, my research sat at the crossroads between the exciting worlds of computational chemistry (which was awarded the Nobel Prize in Chemistry two years ago in 2013) and DNA repair, both of which have ultimately shaped my appreciation for doing science.

    Dr. Paul Modrich, one of the three Nobel Prize recipients this year, is a pioneer in the field of DNA mismatch repair and has spent decades trying to understand the mechanism by which humans (and other eukaryotes) maintain the efficacy and fidelity of their genome. As a computational biochemist/biophysicist, I am honored to have had the opportunity to make significant contributions to this field of research and am delighted to see this area be recognized!

    Other scientists who have also made an impact in the area of DNA mismatch repair include (in no particular order) Drs. Richard Kolodner, Richard Fishel, Thomas Kunkel, Dorothy Erie, Manju Hingorani, Peggy Hsieh, Shayantani Mukherjee, Alexander Predeus, Meindert Lamers, Titia Sixma, et al.

    Congratulations to all!




  • Installing Downloaded Anaconda Python Packages

    Sep 22, 2015


    If you work in a secure network at your job then conda may not be able to hit the Anaconda repositories directly even if it’s for accessing free packages. Additionally, it’s not recommended to use pip over conda when installing new packages. However, installing new packages can be done manually by:

    1. Downloading the package(s) (and its necessary dependencies) directly from the Continuum Repo
    2. And installing the tar.bz2 file using conda install ./package_name.tar.bz2




  • Using NumPy Argmin or Argmax Along with a Conditional

    Sep 10, 2015


    It’s no secret that I love me some Python! Yes, even more than Perl, my first love from my graduate school days.

    I’ve always found NumPy to be great for manipulating, analyzing, or transforming arrays containing large numerical data sets. It is both fast and efficient and it comes with a tonne of great functions.
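
As the title suggests, a common need is to find the position of the smallest (or largest) element among only those elements that satisfy some condition, while still getting an index into the original array. One possible way to do this is sketched below; the array, the condition, and the variable names are made up for illustration and this is not necessarily the approach taken in the full post.

```python
import numpy as np

a = np.array([5, 1, 7, 3, 9])

# Condition: only consider elements greater than 2
mask = a > 2

# Original positions of the elements that satisfy the condition
candidates = np.flatnonzero(mask)

# argmin runs on the filtered values, so map its result back
# to an index into the original array
idx = candidates[np.argmin(a[mask])]
smallest = a[idx]
```

Swapping np.argmin for np.argmax gives the position of the largest element that meets the condition. Note that this sketch assumes at least one element satisfies the condition; otherwise argmin raises an error on the empty selection.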




  • Modern Data Scientist

    Aug 8, 2015


    The qualities of a modern data scientist are summed up very nicely in this article/guide and image by Marketing Distillery. As they point out, the team should be composed of people with a “mixture of broad, multidisciplinary skills ranging from an intersection of mathematics, statistics, computer science, communication and business”. More importantly:

    “Being a data scientist is not only about data crunching. It’s about understanding the business challenge, creating some valuable actionable insights to the data, and communicating their findings to the business”.

    I couldn’t agree more!




  • Anaconda Environment

    Jul 9, 2015


    I’ve been using Continuum’s enterprise Python distribution package, Anaconda, for several months now and I love it. Recently, people have been asking about Python 2.7 vs Python 3.x and so I looked into how to switch between these environments using Anaconda.

    In fact, it’s quite straightforward and painless with Anaconda.

    To set up a new environment with Python 3.4:




  • Stitch Fix Loves UNIX

    May 27, 2015


    The wonderful group of people at Stitch Fix has shared an informative list of useful UNIX commands. Go check it out now!




  • Drafts in Jekyll

    Mar 14, 2015


    The great thing about Jekyll is that you can start writing a draft without publishing it and still be able to see the post locally.

    1. Create a draft directory called _drafts in the root directory
    2. Create a new post in this directory but omit the date in the file name
    3. Serve up the page locally using jekyll serve --drafts

    Then, Jekyll will automatically adjust the date of the post to today’s date and display the post as the most recent post. Note that this post won’t be displayed on your GitHub Pages site since it doesn’t use the --drafts option. So, you’ll be able to save all of your drafts without worrying about them showing up on your live site. Once the post is ready for prime time, simply move it over to the _posts directory and prepend a date to the file name. That’s it!




  • 3D Coordinates Represented on a 2D Triangle

    Mar 6, 2015


    I came across this interesting way of showing 3D coordinates on a 2D triangle published in the Journal of Physical Chemistry B. It takes a minute to orient yourself and figure out how best to interpret the results but the idea is pretty cool. I wonder what type of geometric transformation is needed to create this plot. Once I figure it out, I’ll be sure to blog about it and prototype it in Python!




  • MathJax

    Feb 28, 2015


    As I embark on the PyESL project, I’ll need to include math equations in future blog posts. The easiest way to accomplish this is to use MathJax so that I can incorporate TeX/LaTeX/MathML-based equations within HTML. In Jekyll, all you need to do is add the MathJax JavaScript to the header section of your default.html and add a new variable to your _config.yml file.

    <head>
        ...
        <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
        ...
    </head>
    


    and add the following to your _config.yml file:

    markdown: kramdown
    


    For example, this markdown:

    Inline equation \\( {y} = {m}{x} + {b} \\) and block equation \\[ {y} = {m}{x}+{b} \\] 
    


    produces:

    Inline equation \( {y} = {m}{x}+{b} \) and block equation \[ {y} = {m}{x}+{b} \]

    Here, the parentheses denote an inline equation while the square brackets denote a block equation.


    And this is a multiline equation:




  • Tag Aware Previous/Next Links for Jekyll

    Feb 22, 2015


    Creating and maintaining a vanilla Jekyll-Bootstrap website is pretty straightforward. However, I couldn’t find an obvious way to customize the previous/next links below each blog post so that:

    1. The links were aware of the tags listed in the front matter
    2. The method did not depend on a plugin (since my site is being hosted on GitHub)


    After tonnes of digging, I managed to piece together a Liquid-based solution (see my last post to see how I add Liquid code in Jekyll)!




  • Escaping Liquid Code in Jekyll

    Feb 21, 2015


    To document some of my challenges in customizing this site, I’ve had to delve into Liquid code. However, adding Liquid code tags in Jekyll can be quite tricky and painful. Luckily, some smart people have identified a couple of nice solutions. Below is the markdown code that I’ve adopted for use in future posts:

    {% highlight html %}{% raw %}
    \\Place Liquid and HTML code here
    {% endraw %}{% endhighlight %}




  • Elements of Statistical Learning

    Feb 20, 2015


    My new book purchase, Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, arrived in the mail last week and I’m excited to get reading! Springer was also kind enough to make this classic book available free to download. Get your copy here! Python implementations of each chapter will be added in the PyESL section.




  • Github and Jekyll-Bootstrap, FTW!

    Feb 17, 2015


    My blog is finally up and running! Currently, it’s being hosted (for free) and backed up on GitHub Pages using vanilla Jekyll-Bootstrap. Font Awesome, which definitely lives up to its name, was used to produce the social icons along the navigation bar and Dropbox is being used for redundancy. More design customizations will follow but I’m loving how easy the process has been!