Select Rows with Keys Matching Multiple Columns in Subquery
When you query a database table using SQL, you might find the need to:
- select rows from table A using a certain criteria (i.e., a WHERE clause)
- then, use one or more columns from result set (coming from the above query)
as a subquery to subselect from table B
You can do this quite easily in SQL
import pandas as pd
from pandasql import sqldf # pip install pandasql from Yhat
df_vals = pd.DataFrame({'key1': ['A', 'A','C', 'E', 'G'],
'key2': ['B', 'Z', 'D', 'F', 'H'],
'val': ['2','3','4','5','6']})
df_vals
key1 | key2 | val | |
---|---|---|---|
0 | A | B | 2 |
1 | A | Z | 3 |
2 | C | D | 4 |
3 | E | F | 5 |
4 | G | H | 6 |
df_colors = pd.DataFrame({'key1': ['A', 'A','C', 'E', 'G'],
'key2': ['B', 'Z', 'D', 'F', 'H'],
'color': ['red','orange','yellow','green','blue']})
df_colors
color | key1 | key2 | |
---|---|---|---|
0 | red | A | B |
1 | orange | A | Z |
2 | yellow | C | D |
3 | green | E | F |
4 | blue | G | H |
So, if we wanted to grab all rows from df_colors where the value in df_vals is
inclusively between 2 and 6, then:
query = '''
SELECT a.* FROM df_colors a
INNER JOIN (
SELECT key1, key2 FROM df_vals
WHERE val > 2 and val < 6
) b
ON a.key1 = b.key1 and a.key2 = b.key2
'''
sqldf(query, locals())
color | key1 | key2 | |
---|---|---|---|
0 | orange | A | Z |
1 | yellow | C | D |
2 | green | E | F |
Notice that the inner subquery produces the keys where the values are between 2
and 6
query = '''
SELECT key1, key2 FROM df_vals
WHERE val > 2 and val < 6
'''
sqldf(query, locals())
key1 | key2 | |
---|---|---|
0 | A | Z |
1 | C | D |
2 | E | F |
Then, each pair of keys are used to subselect rows from df_colors.
So, the ON keyword:
- Grabs the first pair of keys (A, B) from df_colors and compares them to each pair of keys returned by the subquery [(A, Z), (C,D), (E, F)]
- If there is a match, then that row from df_colors is returned
- This process (Steps 1-2) is repeated for the remaining rows in df_colors [(A,
Z), (C, D), (E, F), (G, H)]
The query above is equivalent to the following:
query = """
SELECT a.* from df_colors a
WHERE (a.key1, a.key2) in (
SELECT key1, key2 from df_vals where val > 2 and val < 6
)
"""
Note: This latter query doesn’t work in sqlite3 since tuples aren’t supported
but it should work fine for a real database. Note that it might be slower than
the first query and is a bit less obvious. So, it is advised that you stick with
the first query.