Plotting#

New in version 0.5.2: %sqlplot was introduced in 0.5.2; however, the underlying Python API was introduced in 0.4.4

The most common way for plotting datasets in Python is to load them using pandas and then use matplotlib or seaborn for plotting. This approach requires loading all your data into memory which is highly inefficient, since you can easily run out of memory as you perform data transformations.

The plotting module in JupySQL runs computations in the SQL engine (database, warehouse, or embedded engine). This delegates memory management to the engine and ensures that intermediate computations do not keep eating up memory, allowing you to efficiently plot massive datasets. There are two primary use cases:

1. Plotting large remote tables

If your data is stored in a data warehouse such as Snowflake, Redshift, or BigQuery, downloading entire tables locally is extremely inefficient, and you might not even have enough memory in your laptop to load the entire dataset. With JupySQL, the data is aggregated and summarized in the warehouse, and only the summary statistics are fetched over the network. Keeping memory usage at minimum and allowing you to quickly plot entire warehouse tables efficiently.

2. Plotting large local files

If you have large .csv or .parquet files, plotting them locally is challenging. You might not have enough memory in your laptop. Furthermore, as you transform your data, those transformed datasets will consume memory, making it even more challenging. With JupySQL, loading, aggregating, and summarizing is performed in DuckDB, an embedded SQL engine; allowing you to plot larger-than-memory datasets from your laptop.

Download data#

In this example, we’ll demonstrate this second use case and query a .parquet file using DuckDB. However, the same code applies for plotting data stored in a database or data warehoouse such as Snowflake, Redshift, BigQuery, PostgreSQL, etc.

from pathlib import Path
from urllib.request import urlretrieve

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"

if not Path("yellow_tripdata_2021-01.parquet").is_file():
    urlretrieve(url, "yellow_tripdata_2021-01.parquet")

Setup#

Note

%sqlplot requires matplotlib: pip install matplotlib and this example requires duckdb-engine: pip install duckdb-engine

Load the extension and connect to an in-memory DuckDB database:

%load_ext sql
%sql duckdb://

We’ll be using a sample dataset that contains historical taxi data from NYC:

Data preview#

%%sql
SELECT * FROM "yellow_tripdata_2021-01.parquet" LIMIT 3
Running query in 'duckdb://'
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
1 2021-01-01 00:30:10 2021-01-01 00:36:12 1.0 2.1 1.0 N 142 43 2 8.0 3.0 0.5 0.0 0.0 0.3 11.8 2.5 None
1 2021-01-01 00:51:20 2021-01-01 00:52:19 1.0 0.2 1.0 N 238 151 2 3.0 0.5 0.5 0.0 0.0 0.3 4.3 0.0 None
1 2021-01-01 00:43:30 2021-01-01 01:11:06 1.0 14.7 1.0 N 132 165 1 42.0 0.5 0.5 8.65 0.0 0.3 51.95 0.0 None
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()
%%sql
SELECT COUNT(*) FROM "yellow_tripdata_2021-01.parquet"
Running query in 'duckdb://'
count_star()
1369769
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Boxplot#

Note

To use %sqlplot boxplot, your SQL engine must support:

percentile_disc(...) WITHIN GROUP (ORDER BY ...)

Snowflake, Postgres, DuckDB, and others support this.

To create a boxplot, call %sqlplot boxplot, and pass the name of the table, and the column you want to plot. Since we’re using DuckDB for this example, the table is the path to the parquet file.

%sqlplot boxplot --table yellow_tripdata_2021-01.parquet --column trip_distance
<Axes: title={'center': "'trip_distance' from 'yellow_tripdata_2021-01.parquet'"}, ylabel='trip_distance'>
_images/c22ed97070224b999501ca063fc98adb17941a1b3576f5323bd24d131e964b45.png

There are many outliers in the data, let’s find the 90th percentile to use it as cutoff value, this will allow us to create a cleaner visualization:

%%sql
SELECT percentile_disc(0.90) WITHIN GROUP (ORDER BY trip_distance),
FROM 'yellow_tripdata_2021-01.parquet'
Running query in 'duckdb://'
quantile_disc(0.90 ORDER BY trip_distance)
6.3
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Now, let’s create a query that filters by the 90th percentile. Note that we’re using the --save, and --no-execute functions. This tells JupySQL to store the query, but skips execution. We’ll reference it in our next plotting call.

%%sql --save short_trips --no-execute
SELECT *
FROM "yellow_tripdata_2021-01.parquet"
WHERE trip_distance < 6.3
Running query in 'duckdb://'
Skipping execution...

Now, let’s plot again, but this time let’s pass --table short_trips. Note that this table doesn’t exist; JupySQL will automatically infer and use the saved snippet defined above.

%sqlplot boxplot --table short_trips --column trip_distance
Plotting using saved snippet : short_trips
<Axes: title={'center': "'trip_distance' from 'short_trips'"}, ylabel='trip_distance'>
_images/3ddced877aadc58df9349932f85ce6ff36d7d721cff9c6c596063003184b0abd.png

We can see the highest value is a bit over 6, that’s expected since we set a 6.3 cutoff value.

If you wish to specify the saved snippet explicitly, please use the --with argument. Click here for more details on when to specify --with explicitly.

%sqlplot boxplot --table short_trips --column trip_distance --with short_trips
<Axes: title={'center': "'trip_distance' from 'short_trips'"}, ylabel='trip_distance'>
_images/3ddced877aadc58df9349932f85ce6ff36d7d721cff9c6c596063003184b0abd.png

Histogram#

To create a histogram, call %sqlplot histogram, and pass the name of the table, the column you want to plot, and the number of bins. Similarly to what we did in the Boxplot example, JupySQL detects a saved snippet and only plots such data subset.

%sqlplot histogram --table short_trips --column trip_distance --bins 10
Plotting using saved snippet : short_trips
<Axes: title={'center': "'trip_distance' from 'short_trips'"}, xlabel='trip_distance', ylabel='Count'>
_images/c765ff7200e4e8713cbd44dab395f9e755bfa0f70abbb70ff31fbd619950b062.png

Customize plot#

%sqlplot returns a matplotlib.Axes object that you can further customize:

ax = %sqlplot histogram --table short_trips --column trip_distance --bins 50
ax.grid()
ax.set_title("Trip distance from trips < 6.3")
_ = ax.set_xlabel("Trip distance")
Plotting using saved snippet : short_trips
_images/5369e8e69dd8da0f4a56ebe3d71feff391e98506bb9cee84217b391487c6e16f.png

Bar plot#

To create a bar plot, call %sqlplot bar, and pass the name of the table and the column you want to plot. We will use the snippet created in the Boxplot example and JupySQL will plot for that subset of data.

%sqlplot bar --table short_trips --column payment_type
Plotting using saved snippet : short_trips
Removing NULLs, if there exists any from payment_type
<Axes: title={'center': 'short_trips'}, xlabel='payment_type', ylabel='Count'>
_images/ddbe1866ba587854be53ddd3f19fb005935b63f819c554eea40dafa638f13544.png

Pie plot#

To create a pie plot, call %sqlplot pie, and pass the name of the table and the column you want to plot. We will reuse the code snippet from the previous example on Boxplot, and JupySQL will generate a plot for that specific subset of data.

%sqlplot pie --table short_trips --column payment_type
Plotting using saved snippet : short_trips
Removing NULLs, if there exists any from payment_type
<Axes: title={'center': 'short_trips'}>
_images/1a7fc4896d4bf6dbb04c3ce22a8d4847673d39d86684a2e09ad0c0d39759a68e.png