When you query a database table using SQL, you might find the need to:

  1. select rows from table A using a certain criteria (i.e., a WHERE clause)
  2. then, use one or more columns from result set (coming from the above query) as a subquery to subselect from table B

    You can do this quite easily in SQL

import pandas as pd
from pandasql import sqldf  # pip install pandasql from Yhat



df_vals = pd.DataFrame({'key1': ['A', 'A','C', 'E', 'G'], 
                        'key2': ['B', 'Z', 'D', 'F', 'H'], 
                        'val': ['2','3','4','5','6']})

df_vals



key1 key2 val
0 A B 2
1 A Z 3
2 C D 4
3 E F 5
4 G H 6



df_colors = pd.DataFrame({'key1': ['A', 'A','C', 'E', 'G'], 
                          'key2': ['B', 'Z', 'D', 'F', 'H'], 
                          'color': ['red','orange','yellow','green','blue']})

df_colors



color key1 key2
0 red A B
1 orange A Z
2 yellow C D
3 green E F
4 blue G H



So, if we wanted to grab all rows from df_colors where the value in df_vals is inclusively between 2 and 6, then:

query = '''
        SELECT a.* FROM df_colors a
        INNER JOIN (
            SELECT key1, key2 FROM df_vals
            WHERE val > 2 and val < 6
        ) b
        ON a.key1 = b.key1 and a.key2 = b.key2
        '''
sqldf(query, locals())



color key1 key2
0 orange A Z
1 yellow C D
2 green E F



Notice that the inner subquery produces the keys where the values are between 2 and 6

query = '''
        SELECT key1, key2 FROM df_vals
        WHERE val > 2 and val < 6
        '''
sqldf(query, locals())



key1 key2
0 A Z
1 C D
2 E F



Then, each pair of keys are used to subselect rows from df_colors. So, the ON keyword:

  1. Grabs the first pair of keys (A, B) from df_colors and compares them to each pair of keys returned by the subquery [(A, Z), (C,D), (E, F)]
  2. If there is a match, then that row from df_colors is returned
  3. This process (Steps 1-2) is repeated for the remaining rows in df_colors [(A, Z), (C, D), (E, F), (G, H)]

    The query above is equivalent to the following:

query = """
        SELECT a.* from df_colors a
        WHERE (a.key1, a.key2) in (
            SELECT key1, key2 from df_vals where val > 2 and val < 6
        ) 
        """



Note: This latter query doesn’t work in sqlite3 since tuples aren’t supported but it should work fine for a real database. Note that it might be slower than the first query and is a bit less obvious. So, it is advised that you stick with the first query.


Published

Oct 17, 2016