%sqlplot

`%sqlplot`#

New in version 0.5.2.

Note

%sqlplot requires matplotlib: pip install matplotlib and this example requires duckdb-engine: pip install duckdb-engine

%load_ext sql

%sql duckdb://

Connecting to 'duckdb://'

from pathlib import Path
from urllib.request import urlretrieve

if not Path("penguins.csv").is_file():
    urlretrieve(
        "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv",
        "penguins.csv",
    )

%%sql
SELECT * FROM "penguins.csv" LIMIT 3

Running query in 'duckdb://'

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
Adelie	Torgersen	39.1	18.7	181	3750	MALE
Adelie	Torgersen	39.5	17.4	186	3800	FEMALE
Adelie	Torgersen	40.3	18.0	195	3250	FEMALE

Note

You can view the documentation and command line arguments by running %sqlplot?

`%sqlplot boxplot`#

Note

To use %sqlplot boxplot, your SQL engine must support:

percentile_disc(...) WITHIN GROUP (ORDER BY ...)

Snowflake, Postgres, DuckDB, and others support this.

Shortcut: %sqlplot box

-t/--table Table to use (if using DuckDB: path to the file to query)

-s/--schema Schema to use (No need to pass if using a default schema)

-c/--column Column(s) to plot. You might pass one than one value (e.g., -c a b c)

-o/--orient Boxplot orientation (h for horizontal, v for vertical)

-w/--with Use a previously saved query as input data

%sqlplot boxplot --table penguins.csv --column body_mass_g

<Axes: title={'center': "'body_mass_g' from 'penguins.csv'"}, ylabel='body_mass_g'>

../_images/3d7d90b3910074aeca13768a26f0b4001e61844040926e73d9bb69e90378a223.png

Transform data before plotting#

%%sql
SELECT island, COUNT(*)
FROM penguins.csv
GROUP BY island
ORDER BY COUNT(*) DESC

Running query in 'duckdb://'

island	count_star()
Biscoe	168
Dream	124
Torgersen	52

%%sql --save biscoe --no-execute
SELECT *
FROM penguins.csv
WHERE island = 'Biscoe'

Running query in 'duckdb://'

Skipping execution...

Since we saved biscoe from the cell above, we can pass it as an argument to --table since jupysql autogenerates the CTE.

%sqlplot boxplot --table biscoe --column body_mass_g

<Axes: title={'center': "'body_mass_g' from 'biscoe'"}, ylabel='body_mass_g'>

../_images/76c3f854db0b56d2b331042bdd322f451654048a33f301b46de2b2ce1f7182c5.png

Horizontal boxplot#

%sqlplot boxplot --table penguins.csv --column bill_length_mm --orient h

<Axes: title={'center': "'bill_length_mm' from 'penguins.csv'"}, xlabel='bill_length_mm'>

../_images/80fcce83b6a2767d0393b8ccbd136cd27bcb62401d8deefaf2acbb2385e9977d.png

Multiple columns#

%sqlplot boxplot --table penguins.csv --column bill_length_mm bill_depth_mm flipper_length_mm

<Axes: title={'center': "Boxplot from 'penguins.csv'"}>

../_images/d86cb80706874dd16853590ce314aaf2066245467f1fe415bbfc475f692114ac.png

`%sqlplot histogram`#

Shortcut: %sqlplot hist

-t/--table Table to use (if using DuckDB: path to the file to query)

-s/--schema Schema to use (No need to pass if using a default schema)

-c/--column Column to plot

-b/--bins (default: 50) Number of bins

-B/--breaks Custom bin intervals

-W/--binwidth Width of each bin

-w/--with Use a previously saved query as input data

Note

When using -b/–bins, -B/–breaks, or -W/–binwidth, you can only specify one of them. If none of them is specified, the default value for -b/–bins will be used.

Histogram supports NULL values by skipping them. Now we can generate histograms without explicitly removing NULL entries.

New in version 0.7.9.

%sqlplot histogram --table penguins.csv --column body_mass_g

<Axes: title={'center': "'body_mass_g' from 'penguins.csv'"}, xlabel='body_mass_g', ylabel='Count'>

../_images/05119926676fc93281f7cadfe669bf48c1127eb5f8c4aa0e8be34f230e32d5d0.png

When plotting a histogram, it divides a range with the number of bins - 1 to calculate a bin size. Then, it applies round half down relative to the bin size and categorizes continuous values into bins to replicate right closed intervals from the ggplot histogram in R.

body_mass_g

Specifying bins#

Bins allow you to set the number of bins in a histogram, and it’s useful when you are interested in the overall distribution.

%sqlplot histogram --table penguins.csv --column body_mass_g  --bins 100

<Axes: title={'center': "'body_mass_g' from 'penguins.csv'"}, xlabel='body_mass_g', ylabel='Count'>

../_images/5b4f1b882e9e6b459e169a2c41a66e2308d781995947b8bf6259854fdf1a7cbd.png

Specifying breaks#

Breaks allow you to set custom intervals for a histogram. It is useful when you want to view distribution within a specific range. You can specify breaks by passing desired each end and break points separated by whitespace after -B/--breaks. Since those break points define a range of data points to plot, bar width, and number of bars in a histogram, make sure to pass more than 1 point that is strictly increasing and includes at least one data point.

%sqlplot histogram --table penguins.csv --column body_mass_g --breaks 3200 3400 3600 3800 4000 4200 4400 4600 4800

<Axes: title={'center': "'body_mass_g' from 'penguins.csv'"}, xlabel='body_mass_g', ylabel='Count'>

../_images/cb2e50d41f458aa150db65eb093d46dfdd8ea07db8c370378616e2bed096cb50.png

Specifying binwidth#

Binwidth allows you to set the width of bins in a histogram. It is useful when you directly aim to adjust the granularity of the histogram. To specify the binwidth, pass a desired width after -W/--binwidth. Since the binwidth determines details of distribution, make sure to pass a suitable positive numeric value based on your data.

%sqlplot histogram --table penguins.csv --column body_mass_g --binwidth 150

<Axes: title={'center': "'body_mass_g' from 'penguins.csv'"}, xlabel='body_mass_g', ylabel='Count'>

../_images/478c631677960b3036025071624bcdc58973994713ac341157f2dc24d500255d.png

Multiple columns#

%sqlplot histogram --table penguins.csv --column bill_length_mm bill_depth_mm

<Axes: title={'center': "Histogram from 'penguins.csv'"}, ylabel='Count'>

../_images/c70e6e2abd261f2c6c10444de535524e54f59f5a22496241c268f5f29b549aa7.png

Customize plot#

%sqlplot returns a matplotlib.Axes object.

ax = %sqlplot histogram --table penguins.csv --column body_mass_g
ax.set_title("Body mass (grams)")
_ = ax.grid()

../_images/b36397ea18ff38393d3915b507d5af41617f3f318136546622100f8a86488499.png

`%sqlplot bar`#

New in version 0.7.6.

Shortcut: %sqlplot bar

-t/--table Table to use (if using DuckDB: path to the file to query)

-s/--schema Schema to use (No need to pass if using a default schema)

-c/--column Column to plot.

-o/--orient Barplot orientation (h for horizontal, v for vertical)

-w/--with Use a previously saved query as input data

-S/--show-numbers Show numbers on top of the bar

Bar plot does not support NULL values, so we automatically remove them, when plotting.

%sqlplot bar --table penguins.csv --column species

Removing NULLs, if there exists any from species

<Axes: title={'center': 'penguins.csv'}, xlabel='species', ylabel='Count'>

../_images/bf8ef44d48378ba359af3ae0abc368d87bf71273e816417478221e726b3f31d9.png

You can additionally pass two columns to bar plot i.e. x and height columns.

%%sql --save add_col --no-execute
SELECT species, count(species) as cnt
FROM penguins.csv
group by species

Running query in 'duckdb://'

Skipping execution...

%sqlplot bar --table add_col --column species cnt

Removing NULLs, if there exists any from species and cnt

<Axes: title={'center': 'add_col'}, xlabel='species', ylabel='cnt'>

../_images/800b27c7dbb29da23561093b8eaaf5db88abfece1ea7c46cd4c342d1834335ad.png

You can also pass the orientation using the orient argument.

%sqlplot bar --table add_col --column species cnt --orient h

Removing NULLs, if there exists any from species and cnt

<Axes: title={'center': 'add_col'}, xlabel='cnt', ylabel='species'>

../_images/6d234a8f5e21111bfd88557b0c4102961dcdab078ab92b619c30392c0b719c4e.png

You can also show the number on top of the bar using the S/show-numbers argument.

%sqlplot bar --table penguins.csv --column species -S

Removing NULLs, if there exists any from species

<Axes: title={'center': 'penguins.csv'}, xlabel='species', ylabel='Count'>

../_images/76c92e207b029e613e56cf97c027997288d568b9b70543ffe7df64f26a71dc9b.png

`%sqlplot pie`#

New in version 0.7.6.

Shortcut: %sqlplot pie

-t/--table Table to use (if using DuckDB: path to the file to query)

-s/--schema Schema to use (No need to pass if using a default schema)

-c/--column Column to plot

-w/--with Use a previously saved query as input data

-S/--show-numbers Show the percentage on top of the pie

Pie chart does not support NULL values, so we automatically remove them, when plotting the pie chart.

%sqlplot pie --table penguins.csv --column species

Removing NULLs, if there exists any from species

<Axes: title={'center': 'penguins.csv'}>

../_images/e5e856fac5b6ecb6073ae16e5670c124f7120a283bafaabd5bbbfd479fa4cc96.png

You can additionally pass two columns to bar plot i.e. labels and x columns.

%%sql --save add_col --no-execute
SELECT species, count(species) as cnt
FROM penguins.csv
group by species

Running query in 'duckdb://'

Skipping execution...

%sqlplot pie --table add_col --column species cnt

Removing NULLs, if there exists any from species and cnt

<Axes: title={'center': 'add_col'}>

../_images/28b42fac1b06b9f18a9d2e1010020e60ba701a7f592bcc9e0e3542053d6403e2.png

Here, species is the labels column and cnt is the x column.

You can also show the percentage on top of the pie using the S/show-numbers argument.

%sqlplot pie --table penguins.csv --column species -S

Removing NULLs, if there exists any from species

<Axes: title={'center': 'penguins.csv'}>

../_images/d012749ca0a6b385b817656780bae730ebdf211cc8649fbfdaeecc867825849d.png

Parametrizing arguments#

JupySQL supports variable expansion of arguments in the form of {{variable}}. This allows the user to specify arguments with placeholders that can be replaced by variables dynamically.

%%sql 
DROP TABLE IF EXISTS penguins;
CREATE SCHEMA IF NOT EXISTS s1;
CREATE TABLE s1.penguins (
    species VARCHAR(255),
    island VARCHAR(255),
    bill_length_mm DECIMAL(5, 2),
    bill_depth_mm DECIMAL(5, 2),
    flipper_length_mm DECIMAL(5, 2),
    body_mass_g INTEGER,
    sex VARCHAR(255)
);
COPY s1.penguins FROM 'penguins.csv' WITH (FORMAT CSV, HEADER TRUE);

Running query in 'duckdb://'

Count

table = "penguins"
schema = "s1"
orient = "h"
column = "bill_length_mm"

%sqlplot boxplot --table {{table}} --schema {{schema}} --column {{column}} --orient {{orient}}

<Axes: title={'center': "'bill_length_mm' from 'penguins'"}, xlabel='bill_length_mm'>

../_images/375bc6f8ff0867220923580cddd7da9c5fea1ac6868556542f3264748c61df4f.png

Now let’s see another example using --with:

snippet = "gentoo"

%%sql --save {{snippet}}
SELECT * FROM {{schema}}.{{table}} 
WHERE species == 'Gentoo'

Running query in 'duckdb://'

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
Gentoo	Biscoe	46.10	13.20	211.00	4500	FEMALE
Gentoo	Biscoe	50.00	16.30	230.00	5700	MALE
Gentoo	Biscoe	48.70	14.10	210.00	4450	FEMALE
Gentoo	Biscoe	50.00	15.20	218.00	5700	MALE
Gentoo	Biscoe	47.60	14.50	215.00	5400	MALE
Gentoo	Biscoe	46.50	13.50	210.00	4550	FEMALE
Gentoo	Biscoe	45.40	14.60	211.00	4800	FEMALE
Gentoo	Biscoe	46.70	15.30	219.00	5200	MALE
Gentoo	Biscoe	43.30	13.40	209.00	4400	FEMALE
Gentoo	Biscoe	46.80	15.40	215.00	5150	MALE

Truncated to displaylimit of 10.

%sqlplot boxplot --table {{snippet}} --with {{snippet}} --column {{column}}

<Axes: title={'center': "'bill_length_mm' from 'gentoo'"}, ylabel='bill_length_mm'>

../_images/59ad8e94058a6637ab96ff41c90fe963e07a8aeba2b1b578d62c0306b957b1c3.png

%sqlplot

Contents

%sqlplot#

%sqlplot boxplot#

Transform data before plotting#

Horizontal boxplot#

Multiple columns#

%sqlplot histogram#

Specifying bins#

Specifying breaks#

Specifying binwidth#

Multiple columns#

Customize plot#

%sqlplot bar#

%sqlplot pie#

Parametrizing arguments#

`%sqlplot`#

`%sqlplot boxplot`#

`%sqlplot histogram`#

`%sqlplot bar`#

`%sqlplot pie`#