How-To#
Query CSV files with SQL#
You can use JupySQL and DuckDB to query CSV files with SQL in a Jupyter notebook.
Installation#
%pip install jupysql duckdb duckdb-engine --quiet
Note: you may need to restart the kernel to use updated packages.
Setup#
Load JupySQL:
%load_ext sql
Create an in-memory DuckDB database:
%sql duckdb://
Download some sample data:
from urllib.request import urlretrieve
_ = urlretrieve(
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv",
    "penguins.csv",
)
Query#
%%sql
SELECT *
FROM penguins.csv
LIMIT 3
* duckdb://
Done.
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | MALE |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | FEMALE |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | FEMALE |
%%sql
SELECT species, COUNT(*) AS count
FROM penguins.csv
GROUP BY species
ORDER BY count DESC
* duckdb://
Done.
species | count |
---|---|
Adelie | 152 |
Gentoo | 124 |
Chinstrap | 68 |
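As an aside, the same aggregation can be reproduced in plain Python with the standard library, which makes it clear what the GROUP BY is doing. This is an illustrative sketch using a made-up three-row inline sample rather than the real penguins.csv:

```python
import csv
import io
from collections import Counter

# Tiny made-up sample standing in for penguins.csv
sample = """species,island
Adelie,Torgersen
Gentoo,Biscoe
Adelie,Dream
"""

# Equivalent of: SELECT species, COUNT(*) AS count ... GROUP BY species ORDER BY count DESC
counts = Counter(row["species"] for row in csv.DictReader(io.StringIO(sample)))
print(counts.most_common())  # [('Adelie', 2), ('Gentoo', 1)]
```

DuckDB runs the query without loading the whole file into Python, so for anything beyond a quick sanity check the SQL version above is the better tool.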
Convert to polars.DataFrame#
%%sql results <<
SELECT species, COUNT(*) AS count
FROM penguins.csv
GROUP BY species
ORDER BY count DESC
* duckdb://
Done.
import polars as pl
pl.DataFrame((tuple(row) for row in results), schema=results.keys)
species | count |
---|---|
str | i64 |
"Adelie" | 152 |
"Gentoo" | 124 |
"Chinstrap" | 68 |
Register SQLite UDF#
To register a user-defined function (UDF) when using SQLite, you can use SQLAlchemy’s @event.listens_for and SQLite’s create_function:
Install JupySQL#
%pip install jupysql --quiet
Note: you may need to restart the kernel to use updated packages.
Create engine and register function#
from sqlalchemy import create_engine
from sqlalchemy import event
def mysum(x, y):
    return x + y

engine = create_engine("sqlite://")

@event.listens_for(engine, "connect")
def connect(dbapi_connection, connection_record):
    # expose the Python function mysum as the SQL function MYSUM
    dbapi_connection.create_function(name="MYSUM", narg=2, func=mysum)
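For reference, SQLAlchemy’s SQLite dialect forwards create_function to the underlying stdlib sqlite3 connection, so the same idea can be sketched without SQLAlchemy at all (stdlib-only, in-memory database):

```python
import sqlite3

def mysum(x, y):
    return x + y

conn = sqlite3.connect(":memory:")
# name, number of arguments, Python callable
conn.create_function("MYSUM", 2, mysum)
result = conn.execute("SELECT MYSUM(1, 2)").fetchone()[0]
print(result)  # 3
```

The SQLAlchemy event listener above does exactly this, once per new DBAPI connection.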
Create connection with existing engine#
New in version 0.5.1: Pass existing engines to %sql
%load_ext sql
The sql extension is already loaded. To reload it, use:
%reload_ext sql
%sql engine
Query#
%%sql
SELECT MYSUM(1, 2)
duckdb://
* sqlite://
Done.
MYSUM(1, 2) |
---|
3 |
Connect to a SQLite database with spaces#
Currently, due to a limitation in the argument parser, it’s not possible to connect directly to SQLite databases whose path contains spaces; however, you can work around this by creating the engine first.
Setup#
%pip install jupysql --quiet
Note: you may need to restart the kernel to use updated packages.
%load_ext sql
Connect to db#
from sqlalchemy import create_engine
engine = create_engine("sqlite:///my database.db")
Add some sample data:
import pandas as pd
_ = pd.DataFrame({"x": range(5)}).to_sql("numbers", engine)
%sql engine
%%sql
SELECT * FROM numbers
duckdb://
sqlite://
* sqlite:///my database.db
Done.
index | x |
---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
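The limitation lies in how %sql parses its argument string, not in SQLite itself: a path with spaces is unremarkable once it reaches the driver as a single Python string. A stdlib-only sketch (temporary directory, hypothetical numbers table):

```python
import os
import sqlite3
import tempfile

# SQLite has no problem with spaces in the path; only the %sql
# argument parser does.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my database.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE numbers (x INTEGER)")
    conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(5)])
    rows = conn.execute("SELECT x FROM numbers ORDER BY x").fetchall()
    conn.close()

print(rows)  # [(0,), (1,), (2,), (3,), (4,)]
```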
Switch connections#
New in version 0.5.2: -A/--alias
from sqlalchemy import create_engine
import pandas as pd
engine_one = create_engine("sqlite:///one.db")
_ = pd.DataFrame({"x": range(5)}).to_sql("one", engine_one)
engine_two = create_engine("sqlite:///two.db")
_ = pd.DataFrame({"x": range(5)}).to_sql("two", engine_two)
%load_ext sql
Assign an alias to each connection so we can switch between them by name:
%sql sqlite:///one.db --alias one
%sql sqlite:///two.db --alias two
%sql
duckdb://
(one) sqlite:///one.db
sqlite://
sqlite:///my database.db
* (two) sqlite:///two.db
Pass the alias to use that connection:
%%sql one
SELECT * FROM one
Done.
index | x |
---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
It becomes the current active connection:
%sql
duckdb://
* (one) sqlite:///one.db
sqlite://
sqlite:///my database.db
(two) sqlite:///two.db
Hence, we can skip it in upcoming queries:
%%sql
SELECT * FROM one
duckdb://
* (one) sqlite:///one.db
sqlite://
sqlite:///my database.db
(two) sqlite:///two.db
Done.
index | x |
---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
Switch connection:
%%sql two
SELECT * FROM two
Done.
index | x |
---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
%sql
duckdb://
(one) sqlite:///one.db
sqlite://
sqlite:///my database.db
* (two) sqlite:///two.db
Close by passing the alias:
%sql --close one
%sql
duckdb://
sqlite://
sqlite:///my database.db
* (two) sqlite:///two.db
%sql --close two
%sql -l
{'duckdb://': <sql.connection.Connection at 0x7f83ec265bd0>,
'sqlite://': <sql.connection.Connection at 0x7f83ec15faf0>,
'sqlite:///my database.db': <sql.connection.Connection at 0x7f83db845ab0>}
Connect to existing engine#
Pass the name of the engine:
some_engine = create_engine("sqlite:///some.db")
%sql some_engine
Use %sql/%%sql in Databricks#
Databricks uses the same names (%sql/%%sql) for its SQL magics; however, JupySQL exposes a %jupysql/%%jupysql alias so you can use both:
%jupysql duckdb://
%jupysql SELECT * FROM "penguins.csv" LIMIT 3
* duckdb://
sqlite://
sqlite:///my database.db
sqlite:///some.db
Done.
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | MALE |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | FEMALE |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | FEMALE |
%%jupysql
SELECT *
FROM "penguins.csv"
LIMIT 3
* duckdb://
sqlite://
sqlite:///my database.db
sqlite:///some.db
Done.
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | MALE |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | FEMALE |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | FEMALE |
Ignore deprecation warnings#
We display warnings to let you know when the API will change so you have enough time to update your code. If you want to suppress these warnings, add this at the top of your notebook:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
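As a quick stdlib-only illustration, once the filter is installed a FutureWarning is dropped while other warning categories still get through:

```python
import warnings

# Suppress FutureWarning for the rest of the session
warnings.filterwarnings("ignore", category=FutureWarning)

with warnings.catch_warnings(record=True) as caught:
    warnings.warn("this API will change", FutureWarning)  # suppressed
    warnings.warn("something else", UserWarning)  # still emitted

print([w.category.__name__ for w in caught])  # ['UserWarning']
```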