Select Rows with Keys Matching Multiple Columns in Subquery

When you query a database table using SQL, you might find the need to:

select rows from table A using a certain criteria (i.e., a WHERE clause)
then, use one or more columns from result set (coming from the above query) as a subquery to subselect from table B

You can do this quite easily in SQL

import pandas as pd
from pandasql import sqldf  # pip install pandasql from Yhat

df_vals = pd.DataFrame({'key1': ['A', 'A','C', 'E', 'G'], 
                        'key2': ['B', 'Z', 'D', 'F', 'H'], 
                        'val': ['2','3','4','5','6']})

df_vals

	key1	key2	val
0	A	B	2
1	A	Z	3
2	C	D	4
3	E	F	5
4	G	H	6

df_colors = pd.DataFrame({'key1': ['A', 'A','C', 'E', 'G'], 
                          'key2': ['B', 'Z', 'D', 'F', 'H'], 
                          'color': ['red','orange','yellow','green','blue']})

df_colors

	color	key1	key2
0	red	A	B
1	orange	A	Z
2	yellow	C	D
3	green	E	F
4	blue	G	H

So, if we wanted to grab all rows from df_colors where the value in df_vals is inclusively between 2 and 6, then:

query = '''
        SELECT a.* FROM df_colors a
        INNER JOIN (
            SELECT key1, key2 FROM df_vals
            WHERE val > 2 and val < 6
        ) b
        ON a.key1 = b.key1 and a.key2 = b.key2
        '''
sqldf(query, locals())

	color	key1	key2
0	orange	A	Z
1	yellow	C	D
2	green	E	F

Notice that the inner subquery produces the keys where the values are between 2 and 6

query = '''
        SELECT key1, key2 FROM df_vals
        WHERE val > 2 and val < 6
        '''
sqldf(query, locals())

	key1	key2
0	A	Z
1	C	D
2	E	F

Then, each pair of keys are used to subselect rows from df_colors. So, the ON keyword:

Grabs the first pair of keys (A, B) from df_colors and compares them to each pair of keys returned by the subquery [(A, Z), (C,D), (E, F)]
If there is a match, then that row from df_colors is returned
This process (Steps 1-2) is repeated for the remaining rows in df_colors [(A, Z), (C, D), (E, F), (G, H)]

The query above is equivalent to the following:

query = """
        SELECT a.* from df_colors a
        WHERE (a.key1, a.key2) in (
            SELECT key1, key2 from df_vals where val > 2 and val < 6
        ) 
        """

Note: This latter query doesn’t work in sqlite3 since tuples aren’t supported but it should work fine for a real database. Note that it might be slower than the first query and is a bit less obvious. So, it is advised that you stick with the first query.

Published

Oct 17, 2016