%sqlcmd profile#

%sqlcmd profile allows you to obtain summary statistics of a table quickly. The code used here is compatible with all major databases.

Note

You can view the documentation and command line arguments by running %sqlcmd?

Arguments:

-t/--table (Required) Get the profile of a table

-s/--schema (Optional) Get the profile of a table under a specified schema

-o/--output (Optional) Output the profile at a specified location (path name expected)

Note

This example requires duckdb-engine: pip install duckdb-engine

Load CSV Data with DuckDB#

Load the extension and connect to an in-memory DuckDB database:

%load_ext sql
%sql duckdb://
Connecting to 'duckdb://'

Load and download penguins.csv dataset , using DuckDB.

from pathlib import Path
from urllib.request import urlretrieve

if not Path("penguins.csv").is_file():
    urlretrieve(
        "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv",
        "penguins.csv",
    )
%%sql
SELECT * FROM "penguins.csv" LIMIT 3
Running query in 'duckdb://'
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
Adelie Torgersen 39.1 18.7 181 3750 MALE
Adelie Torgersen 39.5 17.4 186 3800 FEMALE
Adelie Torgersen 40.3 18.0 195 3250 FEMALE

Load Parquet Data with DuckDB#

Load and download a sample dataset that contains historical taxi data from NYC, using DuckDB.

import os
from pathlib import Path
from urllib.request import urlretrieve

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"
new_filename = "yellow_tripdata_2021.parquet"

if not Path(new_filename).is_file():
    urlretrieve(url, new_filename)
    # Rename the downloaded file to remove the month ("-" interferes with the SQL query)
    os.rename(new_filename, new_filename.replace("-01", ""))
%%sql
SELECT * FROM yellow_tripdata_2021.parquet LIMIT 3
Running query in 'duckdb://'
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
1 2021-01-01 00:30:10 2021-01-01 00:36:12 1.0 2.1 1.0 N 142 43 2 8.0 3.0 0.5 0.0 0.0 0.3 11.8 2.5 None
1 2021-01-01 00:51:20 2021-01-01 00:52:19 1.0 0.2 1.0 N 238 151 2 3.0 0.5 0.5 0.0 0.0 0.3 4.3 0.0 None
1 2021-01-01 00:43:30 2021-01-01 01:11:06 1.0 14.7 1.0 N 132 165 1 42.0 0.5 0.5 8.65 0.0 0.3 51.95 0.0 None

Profile#

Let us profile the penguins.csv data

%sqlcmd profile --table "penguins.csv"
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
count 344 344 342 342 342 342 333
unique 3 3 164 80 55 94 2
top Adelie Biscoe nan nan nan nan MALE
freq 152 168 nan nan nan nan 168
mean nan nan 43.9219 17.1512 200.9152 4201.7544 nan
std nan nan 5.4516 1.9719 14.0411 800.7812 nan
min nan nan 32.1 13.1 172 2700 nan
25% nan nan 39.2000 15.6000 190.0000 3550.0000 nan
50% nan nan 44.4000 17.3000 197.0000 4050.0000 nan
75% nan nan 48.5000 18.7000 213.0000 4750.0000 nan
max nan nan 59.6 21.5 231 6300 nan

Let us profile the yellow_tripdata_2021.parquet data

%sqlcmd profile --table "yellow_tripdata_2021.parquet"
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
count 1369769 1369769 1369769 1271417 1369769 1271417 1271417 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1271417 5
unique 3 939020 935992 9 3787 7 2 258 260 5 6017 91 3 2155 307 3 8321 5 1
top nan 2021-01-14 13:52:00 2021-01-27 00:00:00 nan nan nan N nan nan nan nan nan nan nan nan nan nan nan 0.0
freq nan 13 21 nan nan nan 1252433 nan nan nan nan nan nan nan nan nan nan nan 5
mean 1.7217 nan nan 1.4115 4.6320 1.0351 nan 165.2474 161.4957 1.1886 12.0966 0.9705 0.4930 1.9181 0.2477 0.2969 17.4744 2.2390 nan
std 0.5925 nan nan 1.0598 393.9035 0.5995 nan 67.8385 72.1079 0.5777 12.9134 1.2313 0.0763 2.5972 1.6728 0.0422 14.6934 0.7989 nan
min 1 nan nan 0.0 0.0 1.0 nan 1 1 0 -490.0 -5.5 -0.5 -100.0 -31.12 -0.3 -492.8 -2.5 nan
25% 1.0000 nan nan 1.0000 1.0000 1.0000 nan 124.0000 107.0000 1.0000 6.0000 0.0000 0.5000 0.0000 0.0000 0.3000 10.8000 2.5000 nan
50% 2.0000 nan nan 1.0000 1.7000 1.0000 nan 162.0000 162.0000 1.0000 8.5000 0.0000 0.5000 1.8600 0.0000 0.3000 13.8000 2.5000 nan
75% 2.0000 nan nan 1.0000 3.0200 1.0000 nan 236.0000 236.0000 1.0000 13.5000 2.5000 0.5000 2.7500 0.0000 0.3000 19.1200 2.5000 nan
max 6 nan nan 8.0 263163.28 99.0 nan 265 265 4 6960.5 8.25 0.5 1140.44 811.75 0.3 7661.28 3.0 nan

Saving report as HTML#

To save the generated report as an HTML file, use the --output/-o attribute followed by the desired file name.

To save the profile of the penguins.csv data as an HTML file:

%sqlcmd profile --table "penguins.csv" --output penguins-report.html
Hide code cell output
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
count 344 344 342 342 342 342 333
unique 3 3 164 80 55 94 2
top Adelie Biscoe nan nan nan nan MALE
freq 152 168 nan nan nan nan 168
mean nan nan 43.9219 17.1512 200.9152 4201.7544 nan
std nan nan 5.4516 1.9719 14.0411 800.7812 nan
min nan nan 32.1 13.1 172 2700 nan
25% nan nan 39.2000 15.6000 190.0000 3550.0000 nan
50% nan nan 44.4000 17.3000 197.0000 4050.0000 nan
75% nan nan 48.5000 18.7000 213.0000 4750.0000 nan
max nan nan 59.6 21.5 231 6300 nan
from IPython.display import HTML

HTML("penguins-report.html")
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
count 344 344 342 342 342 342 333
unique 3 3 164 80 55 94 2
top Adelie Biscoe nan nan nan nan MALE
freq 152 168 nan nan nan nan 168
mean nan nan 43.9219 17.1512 200.9152 4201.7544 nan
std nan nan 5.4516 1.9719 14.0411 800.7812 nan
min nan nan 32.1 13.1 172 2700 nan
25% nan nan 39.2000 15.6000 190.0000 3550.0000 nan
50% nan nan 44.4000 17.3000 197.0000 4050.0000 nan
75% nan nan 48.5000 18.7000 213.0000 4750.0000 nan
max nan nan 59.6 21.5 231 6300 nan

To save the profile of the yellow_tripdata_2021.parquet data as an HTML file:

%sqlcmd profile --table "yellow_tripdata_2021.parquet" --output taxi-report.html
Hide code cell output
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
count 1369769 1369769 1369769 1271417 1369769 1271417 1271417 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1271417 5
unique 3 939020 935992 9 3787 7 2 258 260 5 6017 91 3 2155 307 3 8321 5 1
top nan 2021-01-14 13:52:00 2021-01-09 11:01:04 nan nan nan N nan nan nan nan nan nan nan nan nan nan nan 0.0
freq nan 13 21 nan nan nan 1252433 nan nan nan nan nan nan nan nan nan nan nan 5
mean 1.7217 nan nan 1.4115 4.6320 1.0351 nan 165.2474 161.4957 1.1886 12.0966 0.9705 0.4930 1.9181 0.2477 0.2969 17.4744 2.2390 nan
std 0.5925 nan nan 1.0598 393.9035 0.5995 nan 67.8385 72.1079 0.5777 12.9134 1.2313 0.0763 2.5972 1.6728 0.0422 14.6934 0.7989 nan
min 1 nan nan 0.0 0.0 1.0 nan 1 1 0 -490.0 -5.5 -0.5 -100.0 -31.12 -0.3 -492.8 -2.5 nan
25% 1.0000 nan nan 1.0000 1.0000 1.0000 nan 124.0000 107.0000 1.0000 6.0000 0.0000 0.5000 0.0000 0.0000 0.3000 10.8000 2.5000 nan
50% 2.0000 nan nan 1.0000 1.7000 1.0000 nan 162.0000 162.0000 1.0000 8.5000 0.0000 0.5000 1.8600 0.0000 0.3000 13.8000 2.5000 nan
75% 2.0000 nan nan 1.0000 3.0200 1.0000 nan 236.0000 236.0000 1.0000 13.5000 2.5000 0.5000 2.7500 0.0000 0.3000 19.1200 2.5000 nan
max 6 nan nan 8.0 263163.28 99.0 nan 265 265 4 6960.5 8.25 0.5 1140.44 811.75 0.3 7661.28 3.0 nan
from IPython.display import HTML

HTML("taxi-report.html")
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
count 1369769 1369769 1369769 1271417 1369769 1271417 1271417 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1271417 5
unique 3 939020 935992 9 3787 7 2 258 260 5 6017 91 3 2155 307 3 8321 5 1
top nan 2021-01-14 13:52:00 2021-01-09 11:01:04 nan nan nan N nan nan nan nan nan nan nan nan nan nan nan 0.0
freq nan 13 21 nan nan nan 1252433 nan nan nan nan nan nan nan nan nan nan nan 5
mean 1.7217 nan nan 1.4115 4.6320 1.0351 nan 165.2474 161.4957 1.1886 12.0966 0.9705 0.4930 1.9181 0.2477 0.2969 17.4744 2.2390 nan
std 0.5925 nan nan 1.0598 393.9035 0.5995 nan 67.8385 72.1079 0.5777 12.9134 1.2313 0.0763 2.5972 1.6728 0.0422 14.6934 0.7989 nan
min 1 nan nan 0.0 0.0 1.0 nan 1 1 0 -490.0 -5.5 -0.5 -100.0 -31.12 -0.3 -492.8 -2.5 nan
25% 1.0000 nan nan 1.0000 1.0000 1.0000 nan 124.0000 107.0000 1.0000 6.0000 0.0000 0.5000 0.0000 0.0000 0.3000 10.8000 2.5000 nan
50% 2.0000 nan nan 1.0000 1.7000 1.0000 nan 162.0000 162.0000 1.0000 8.5000 0.0000 0.5000 1.8600 0.0000 0.3000 13.8000 2.5000 nan
75% 2.0000 nan nan 1.0000 3.0200 1.0000 nan 236.0000 236.0000 1.0000 13.5000 2.5000 0.5000 2.7500 0.0000 0.3000 19.1200 2.5000 nan
max 6 nan nan 8.0 263163.28 99.0 nan 265 265 4 6960.5 8.25 0.5 1140.44 811.75 0.3 7661.28 3.0 nan

Use schemas with DuckDB#

To profile a specific table from various tables in different schemas, we can use the --schema/-s attribute.

Let’s save the file penguins.csv as a table penguins under the schema s1.

%%sql 
DROP TABLE IF EXISTS penguins;
CREATE SCHEMA IF NOT EXISTS s1;
CREATE TABLE s1.penguins (
    species VARCHAR(255),
    island VARCHAR(255),
    bill_length_mm DECIMAL(5, 2),
    bill_depth_mm DECIMAL(5, 2),
    flipper_length_mm DECIMAL(5, 2),
    body_mass_g INTEGER,
    sex VARCHAR(255)
);
COPY s1.penguins FROM 'penguins.csv' WITH (FORMAT CSV, HEADER TRUE);
Running query in 'duckdb://'
Count
%sqlcmd profile --table penguins --schema s1 
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
count 344 344 342 342 342 342 333
unique 3 3 164 80 55 94 2
top Adelie Biscoe 41.10 17.00 190.00 nan MALE
freq 152 168 7 12 22 nan 168
mean nan nan nan nan nan 4201.7544 nan
std nan nan nan nan nan 800.7812 nan
min nan nan nan nan nan 2700 nan
25% nan nan nan nan nan 3550.0000 nan
50% nan nan nan nan nan 4050.0000 nan
75% nan nan nan nan nan 4750.0000 nan
max nan nan nan nan nan 6300 nan

Use schemas with SQLite#

%%sql duckdb:///
INSTALL 'sqlite_scanner';
LOAD 'sqlite_scanner';
Connecting and switching to connection 'duckdb:///'
Success
import sqlite3

with sqlite3.connect("a.db") as conn:
    conn.execute("CREATE TABLE my_numbers (number FLOAT)")
    conn.execute("INSERT INTO my_numbers VALUES (1)")
    conn.execute("INSERT INTO my_numbers VALUES (2)")
    conn.execute("INSERT INTO my_numbers VALUES (3)")
%%sql
ATTACH DATABASE 'a.db' AS a_schema
Running query in 'duckdb:///'
Success
import sqlite3

with sqlite3.connect("b.db") as conn:
    conn.execute("CREATE TABLE my_numbers (number FLOAT)")
    conn.execute("INSERT INTO my_numbers VALUES (11)")
    conn.execute("INSERT INTO my_numbers VALUES (22)")
    conn.execute("INSERT INTO my_numbers VALUES (33)")
%%sql
ATTACH DATABASE 'b.db' AS b_schema
Running query in 'duckdb:///'
Success

Let’s profile my_numbers of b_schema

%sqlcmd profile --table my_numbers --schema b_schema
number
count 3
unique 3
top nan
freq nan
mean 22.0000
std 8.9815
min 11.0
25% 11.0000
50% 22.0000
75% 33.0000
max 33.0

Parametrizing arguments#

JupySQL supports variable expansion of arguments in the form of {{variable}}. This allows the user to specify arguments with placeholders that can be replaced by variables dynamically.

Let’s look at an example that uses variable expansion for table, schema and output arguments:

table = "my_numbers"
schema = "b_schema"
output = "numbers-report.html"
%sqlcmd profile --table {{table}} --schema {{schema}} --output {{output}}
Hide code cell output
number
count 3
unique 3
top nan
freq nan
mean 22.0000
std 8.9815
min 11.0
25% 11.0000
50% 22.0000
75% 33.0000
max 33.0
from IPython.display import HTML

HTML(output)
number
count 3
unique 3
top nan
freq nan
mean 22.0000
std 8.9815
min 11.0
25% 11.0000
50% 22.0000
75% 33.0000
max 33.0