DuckDB integration#

JupySQL integrates with DuckDB so you can run SQL queries in a Jupyter notebook. Jump into any section to learn more!

Querying a .csv file#

Installation and setup#

%pip install jupysql duckdb duckdb-engine --quiet
%load_ext sql
%sql duckdb://
Note: you may need to restart the kernel to use updated packages.

Get a sample .csv file:

from urllib.request import urlretrieve

_ = urlretrieve(
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv",
    "penguins.csv",
)

Query#

%%sql
SELECT *
FROM penguins.csv
LIMIT 3
*  duckdb://
Done.
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
Adelie Torgersen 39.1 18.7 181 3750 MALE
Adelie Torgersen 39.5 17.4 186 3800 FEMALE
Adelie Torgersen 40.3 18.0 195 3250 FEMALE
%%sql
SELECT species, COUNT(*) AS count
FROM penguins.csv
GROUP BY species
ORDER BY count DESC
*  duckdb://
Done.
species count
Adelie 152
Gentoo 124
Chinstrap 68

Plot#

%%sql species_count <<
SELECT species, COUNT(*) AS count
FROM penguins.csv
GROUP BY species
ORDER BY count DESC
*  duckdb://
Done.
ax = species_count.bar()
# customize plot (this is a matplotlib Axes object)
_ = ax.set_title("Num of penguins by species")
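The object returned by `.bar()` is a standard matplotlib `Axes`, so any matplotlib customization applies. A minimal standalone sketch (with the counts hardcoded to stand in for the query result) that also saves the figure to disk:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, safe outside notebooks
import matplotlib.pyplot as plt

# hardcoded values standing in for the species_count query result
species = ["Adelie", "Gentoo", "Chinstrap"]
counts = [152, 124, 68]

fig, ax = plt.subplots()
ax.bar(species, counts)
ax.set_title("Num of penguins by species")
ax.set_ylabel("count")
fig.savefig("species_count.png")  # persist the plot to disk
```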

Querying a .parquet file#

Installation and setup#

%pip install jupysql duckdb duckdb-engine pyarrow --quiet
%load_ext sql
%sql duckdb://
Note: you may need to restart the kernel to use updated packages.
The sql extension is already loaded. To reload it, use:
  %reload_ext sql

Download sample .parquet file:

from urllib.request import urlretrieve

_ = urlretrieve(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet",
    "yellow_tripdata_2021-01.parquet",
)

Query#

%%sql
SELECT tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count
FROM "yellow_tripdata_2021-01.parquet"
LIMIT 3
*  duckdb://
Done.
tpep_pickup_datetime tpep_dropoff_datetime passenger_count
2021-01-01 00:30:10 2021-01-01 00:36:12 1.0
2021-01-01 00:51:20 2021-01-01 00:52:19 1.0
2021-01-01 00:43:30 2021-01-01 01:11:06 1.0
%%sql
SELECT
    passenger_count, AVG(trip_distance) AS avg_trip_distance
FROM "yellow_tripdata_2021-01.parquet"
GROUP BY passenger_count
ORDER BY passenger_count ASC
*  duckdb://
Done.
passenger_count avg_trip_distance
None 29.665125772734566
0.0 2.5424466811344635
1.0 2.6805563237139625
2.0 2.7948325921160815
3.0 2.757641060657793
4.0 2.8681984015618216
5.0 2.6940995207307994
6.0 2.5745177825092593
7.0 11.134
8.0 1.05

Plot#

%%sql avg_trip_distance <<
SELECT
    passenger_count, AVG(trip_distance) AS avg_trip_distance
FROM "yellow_tripdata_2021-01.parquet"
GROUP BY passenger_count
ORDER BY passenger_count ASC
*  duckdb://
Done.
ax = avg_trip_distance.plot()
# customize plot (this is a matplotlib Axes object)
_ = ax.set_title("Avg trip distance by num of passengers")

Reading from a SQLite database#

If you have a large SQLite database, you can use DuckDB to run analytical queries on it with much better performance.

%load_ext sql
The sql extension is already loaded. To reload it, use:
  %reload_ext sql
import urllib.request
from pathlib import Path

# download sample database
if not Path("my.db").is_file():
    url = "https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"  # noqa
    urllib.request.urlretrieve(url, "my.db")

We’ll use the sqlite_scanner extension to load a sample SQLite database into DuckDB:

%%sql duckdb:///
INSTALL 'sqlite_scanner';
LOAD 'sqlite_scanner';
CALL sqlite_attach('my.db');
Done.
Done.
Done.
Success
%%sql
SELECT * FROM track LIMIT 5
   duckdb://
*  duckdb:///
Done.
TrackId Name AlbumId MediaTypeId GenreId Composer Milliseconds Bytes UnitPrice
1 For Those About To Rock (We Salute You) 1 1 1 Angus Young, Malcolm Young, Brian Johnson 343719 11170334 0.99
2 Balls to the Wall 2 2 1 None 342562 5510424 0.99
3 Fast As a Shark 3 2 1 F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman 230619 3990994 0.99
4 Restless and Wild 3 2 1 F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman 252051 4331779 0.99
5 Princess of the Dawn 3 2 1 Deaffy & R.A. Smith-Diesel 375418 6290521 0.99

Plotting large datasets#

New in version 0.5.2.

This section demonstrates how we can efficiently plot large datasets with DuckDB and JupySQL without blowing up our machine’s memory. %sqlplot performs all aggregations in DuckDB.

Let’s install the required package:

%pip install jupysql duckdb duckdb-engine pyarrow --quiet
Note: you may need to restart the kernel to use updated packages.

Now, let's download some sample data: NYC taxi data split into 3 parquet files:

N_MONTHS = 3

# https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
for i in range(1, N_MONTHS + 1):
    filename = f"yellow_tripdata_2021-{str(i).zfill(2)}.parquet"
    if not Path(filename).is_file():
        print(f"Downloading: {filename}")
        url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{filename}"
        urllib.request.urlretrieve(url, filename)
Downloading: yellow_tripdata_2021-02.parquet
Downloading: yellow_tripdata_2021-03.parquet

In total, this contains more than 4.6M observations:

%%sql
SELECT count(*) FROM 'yellow_tripdata_2021-*.parquet'
   duckdb://
*  duckdb:///
Done.
count_star()
4666630

Let’s use JupySQL to get a histogram of trip_distance across all 3 files:

%sqlplot histogram --table yellow_tripdata_2021-*.parquet --column trip_distance --bins 50
<Axes: title={'center': "'trip_distance' from 'yellow_tripdata_2021-*.parquet'"}, xlabel='trip_distance', ylabel='Count'>

We have some outliers, let’s find the 99th percentile:

%%sql
SELECT percentile_disc(0.99) WITHIN GROUP (ORDER BY trip_distance)
FROM 'yellow_tripdata_2021-*.parquet'
   duckdb://
*  duckdb:///
Done.
quantile_disc(0.99 ORDER BY trip_distance)
18.93

We now write a query to remove everything above that number:

%%sql --save no_outliers --no-execute
SELECT trip_distance
FROM 'yellow_tripdata_2021-*.parquet'
WHERE trip_distance < 18.93
   duckdb://
*  duckdb:///
Skipping execution...

Histogram#

%sqlplot histogram --table no_outliers --column trip_distance --bins 50 --with no_outliers
<Axes: title={'center': "'trip_distance' from 'no_outliers'"}, xlabel='trip_distance', ylabel='Count'>

Boxplot#

%sqlplot boxplot --table no_outliers --column trip_distance --with no_outliers
<Axes: title={'center': "'trip_distance' from 'no_outliers'"}, ylabel='trip_distance'>

Querying existing dataframes#

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("duckdb:///:memory:")
engine.execute("register", ("df", pd.DataFrame({"x": range(100)})))
<sqlalchemy.engine.cursor.LegacyCursorResult at 0x7f47ce453190>
%sql engine
%%sql
SELECT *
FROM df
WHERE x > 95
   duckdb://
   duckdb:///
*  duckdb:///:memory:
Done.
x
96
97
98
99

Passing parameters to connection#

from sqlalchemy import create_engine

some_engine = create_engine(
    "duckdb:///:memory:",
    connect_args={
        "preload_extensions": ["excel"],
    },
)
%sql some_engine

Listing Tables#

This section demonstrates how to list tables from both the .csv and .parquet files introduced in the previous sections.

Listing tables from a .csv file#

The data from the .csv file must first be registered as a table before it can be listed.

%%sql
CREATE TABLE penguins AS SELECT * FROM penguins.csv
   duckdb://
   duckdb:///
*  duckdb:///:memory:
Done.
Count
344

Having created the table in the cell above, we can now list it:

%sqlcmd tables
Name
penguins

Listing tables from a .parquet file#

Similarly, to list the data from a .parquet file, it must first be registered as a table.

%%sql
CREATE TABLE tripdata AS SELECT * FROM "yellow_tripdata_2021-01.parquet"
   duckdb://
   duckdb:///
*  duckdb:///:memory:
Done.
Count
1369769

The data can now be listed as a table:

%sqlcmd tables
Name
tripdata
penguins

Listing Columns#

After registering the data from either the .csv or .parquet files as tables, their respective columns can be listed with the following commands:

%sqlcmd columns -t penguins
name type nullable default autoincrement comment
species VARCHAR True None False None
island VARCHAR True None False None
bill_length_mm DOUBLE_PRECISION True None False None
bill_depth_mm DOUBLE_PRECISION True None False None
flipper_length_mm BIGINT True None False None
body_mass_g BIGINT True None False None
sex VARCHAR True None False None
%sqlcmd columns -t tripdata
name type nullable default autoincrement comment
VendorID BIGINT True None False None
tpep_pickup_datetime TIMESTAMP True None False None
tpep_dropoff_datetime TIMESTAMP True None False None
passenger_count DOUBLE_PRECISION True None False None
trip_distance DOUBLE_PRECISION True None False None
RatecodeID DOUBLE_PRECISION True None False None
store_and_fwd_flag VARCHAR True None False None
PULocationID BIGINT True None False None
DOLocationID BIGINT True None False None
payment_type BIGINT True None False None
fare_amount DOUBLE_PRECISION True None False None
extra DOUBLE_PRECISION True None False None
mta_tax DOUBLE_PRECISION True None False None
tip_amount DOUBLE_PRECISION True None False None
tolls_amount DOUBLE_PRECISION True None False None
improvement_surcharge DOUBLE_PRECISION True None False None
total_amount DOUBLE_PRECISION True None False None
congestion_surcharge DOUBLE_PRECISION True None False None
airport_fee DOUBLE_PRECISION True None False None