%sqlcmd profile#

%sqlcmd profile allows you to obtain summary statistics of a table quickly. The code used here is compatible with all major databases.

Note

You can view the documentation and command line arguments by running %sqlcmd?

Arguments:

-t/--table (Required) Get the profile of a table

-s/--schema (Optional) Get the profile of a table under a specified schema

-o/--output (Optional) Output the profile at a specified location (path name expected)

Note

This example requires duckdb-engine: pip install duckdb-engine

Load CSV Data with DuckDB#

Load the extension and connect to an in-memory DuckDB database:

%load_ext sql
%sql duckdb://
There's a new jupysql version available (0.7.6), you're running 0.7.7.dev0. To upgrade: pip install jupysql --upgrade

Load and download penguins.csv dataset , using DuckDB.

from pathlib import Path
from urllib.request import urlretrieve

if not Path("penguins.csv").is_file():
    urlretrieve(
        "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv",
        "penguins.csv",
    )
%%sql
SELECT * FROM "penguins.csv" LIMIT 3
*  duckdb://
Done.
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
Adelie Torgersen 39.1 18.7 181 3750 MALE
Adelie Torgersen 39.5 17.4 186 3800 FEMALE
Adelie Torgersen 40.3 18.0 195 3250 FEMALE

Load Parquet Data with DuckDB#

Load and download a sample dataset that contains historical taxi data from NYC, using DuckDB.

import os
from pathlib import Path
from urllib.request import urlretrieve

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"
new_filename = "yellow_tripdata_2021.parquet"

if not Path(new_filename).is_file():
    urlretrieve(url, new_filename)
    # Rename the downloaded file to remove the month ("-" interferes with the SQL query)
    os.rename(new_filename, new_filename.replace("-01", ""))
%%sql
SELECT * FROM yellow_tripdata_2021.parquet LIMIT 3
*  duckdb://
Done.
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
1 2021-01-01 00:30:10 2021-01-01 00:36:12 1.0 2.1 1.0 N 142 43 2 8.0 3.0 0.5 0.0 0.0 0.3 11.8 2.5 None
1 2021-01-01 00:51:20 2021-01-01 00:52:19 1.0 0.2 1.0 N 238 151 2 3.0 0.5 0.5 0.0 0.0 0.3 4.3 0.0 None
1 2021-01-01 00:43:30 2021-01-01 01:11:06 1.0 14.7 1.0 N 132 165 1 42.0 0.5 0.5 8.65 0.0 0.3 51.95 0.0 None

Profile#

Let us profile the penguins.csv data

%sqlcmd profile --table "penguins.csv"
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
unique 3 3 164 80 55 94 2
top Adelie Biscoe 41.1 17.0 190 3800 MALE
std nan nan 5.452e+00 1.972e+00 1.404e+01 8.008e+02 nan
min Adelie Biscoe 32.1 13.1 172 2700 FEMALE
mean nan nan 4.392e+01 1.715e+01 2.009e+02 4.202e+03 nan
max Gentoo Torgersen 59.6 21.5 231 6300 MALE
freq 152 168 7 12 22 12 168
count 344 344 342 342 342 342 333
75% nan nan 48.5 18.7 213.0 4750.0 nan
50% nan nan 44.4 17.3 197.0 4050.0 nan
25% nan nan 39.2 15.6 190.0 3550.0 nan

Let us profile the yellow_tripdata_2021.parquet data

%sqlcmd profile --table "yellow_tripdata_2021.parquet"
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
unique 3 939020 935992 9 3787 7 2 258 260 5 6017 91 3 2155 307 3 8321 5 1
top 2 2021-01-14 13:52:00 2021-01-27 00:00:00 1.0 0.8 1.0 N 236 236 1 6.0 0.0 0.5 0.0 0.0 0.3 9.3 2.5 0.0
std 5.925e-01 nan nan 1.060e+00 3.939e+02 5.995e-01 nan 6.784e+01 7.211e+01 5.777e-01 1.291e+01 1.231e+00 7.632e-02 2.597e+00 1.673e+00 4.222e-02 1.469e+01 7.989e-01 0.0
min 1 2008-12-31 23:05:14 2008-12-31 23:07:22 0.0 0.0 1.0 N 1 1 0 -490.0 -5.5 -0.5 -100.0 -31.12 -0.3 -492.8 -2.5 0.0
mean 1.722e+00 nan nan 1.412e+00 4.632e+00 1.035e+00 nan 1.652e+02 1.615e+02 1.189e+00 1.210e+01 9.705e-01 4.930e-01 1.918e+00 2.477e-01 2.969e-01 1.747e+01 2.239e+00 0.0
max 6 2021-02-22 16:52:16 2021-02-22 16:56:15 8.0 263163.28 99.0 Y 265 265 4 6960.5 8.25 0.5 1140.44 811.75 0.3 7661.28 3.0 0.0
freq 937141 13 21 966236 22942 1249243 1252433 74397 73700 934475 78226 686053 1357263 421482 1317902 1362457 27973 1144198 5
count 1369769 1369769 1369769 1271417 1369769 1271417 1271417 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1271417 5
75% 2.0 nan nan 1.0 3.02 1.0 nan 236.0 236.0 1.0 13.5 2.5 0.5 2.75 0.0 0.3 19.12 2.5 0.0
50% 2.0 nan nan 1.0 1.7 1.0 nan 162.0 162.0 1.0 8.5 0.0 0.5 1.86 0.0 0.3 13.8 2.5 0.0
25% 1.0 nan nan 1.0 1.0 1.0 nan 124.0 107.0 1.0 6.0 0.0 0.5 0.0 0.0 0.3 10.8 2.5 0.0

Saving report as HTML#

To save the generated report as an HTML file, use the --output/-o attribute followed by the desired file name.

To save the profile of the penguins.csv data as an HTML file:

%sqlcmd profile --table "penguins.csv" --output penguins-report.html
Hide code cell output
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
unique 3 3 164 80 55 94 2
top Adelie Biscoe 41.1 17.0 190 3800 MALE
std nan nan 5.452e+00 1.972e+00 1.404e+01 8.008e+02 nan
min Adelie Biscoe 32.1 13.1 172 2700 FEMALE
mean nan nan 4.392e+01 1.715e+01 2.009e+02 4.202e+03 nan
max Gentoo Torgersen 59.6 21.5 231 6300 MALE
freq 152 168 7 12 22 12 168
count 344 344 342 342 342 342 333
75% nan nan 48.5 18.7 213.0 4750.0 nan
50% nan nan 44.4 17.3 197.0 4050.0 nan
25% nan nan 39.2 15.6 190.0 3550.0 nan
from IPython.display import HTML

HTML("penguins-report.html")
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
unique 3 3 164 80 55 94 2
top Adelie Biscoe 41.1 17.0 190 3800 MALE
std nan nan 5.452e+00 1.972e+00 1.404e+01 8.008e+02 nan
min Adelie Biscoe 32.1 13.1 172 2700 FEMALE
mean nan nan 4.392e+01 1.715e+01 2.009e+02 4.202e+03 nan
max Gentoo Torgersen 59.6 21.5 231 6300 MALE
freq 152 168 7 12 22 12 168
count 344 344 342 342 342 342 333
75% nan nan 48.5 18.7 213.0 4750.0 nan
50% nan nan 44.4 17.3 197.0 4050.0 nan
25% nan nan 39.2 15.6 190.0 3550.0 nan

To save the profile of the yellow_tripdata_2021.parquet data as an HTML file:

%sqlcmd profile --table "yellow_tripdata_2021.parquet" --output taxi-report.html
Hide code cell output
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
unique 3 939020 935992 9 3787 7 2 258 260 5 6017 91 3 2155 307 3 8321 5 1
top 2 2021-01-14 13:52:00 2021-01-09 11:01:04 1.0 0.8 1.0 N 236 236 1 6.0 0.0 0.5 0.0 0.0 0.3 9.3 2.5 0.0
std 5.925e-01 nan nan 1.060e+00 3.939e+02 5.995e-01 nan 6.784e+01 7.211e+01 5.777e-01 1.291e+01 1.231e+00 7.632e-02 2.597e+00 1.673e+00 4.222e-02 1.469e+01 7.989e-01 0.0
min 1 2008-12-31 23:05:14 2008-12-31 23:07:22 0.0 0.0 1.0 N 1 1 0 -490.0 -5.5 -0.5 -100.0 -31.12 -0.3 -492.8 -2.5 0.0
mean 1.722e+00 nan nan 1.412e+00 4.632e+00 1.035e+00 nan 1.652e+02 1.615e+02 1.189e+00 1.210e+01 9.705e-01 4.930e-01 1.918e+00 2.477e-01 2.969e-01 1.747e+01 2.239e+00 0.0
max 6 2021-02-22 16:52:16 2021-02-22 16:56:15 8.0 263163.28 99.0 Y 265 265 4 6960.5 8.25 0.5 1140.44 811.75 0.3 7661.28 3.0 0.0
freq 937141 13 21 966236 22942 1249243 1252433 74397 73700 934475 78226 686053 1357263 421482 1317902 1362457 27973 1144198 5
count 1369769 1369769 1369769 1271417 1369769 1271417 1271417 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1271417 5
75% 2.0 nan nan 1.0 3.02 1.0 nan 236.0 236.0 1.0 13.5 2.5 0.5 2.75 0.0 0.3 19.12 2.5 0.0
50% 2.0 nan nan 1.0 1.7 1.0 nan 162.0 162.0 1.0 8.5 0.0 0.5 1.86 0.0 0.3 13.8 2.5 0.0
25% 1.0 nan nan 1.0 1.0 1.0 nan 124.0 107.0 1.0 6.0 0.0 0.5 0.0 0.0 0.3 10.8 2.5 0.0
from IPython.display import HTML

HTML("taxi-report.html")
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
unique 3 939020 935992 9 3787 7 2 258 260 5 6017 91 3 2155 307 3 8321 5 1
top 2 2021-01-14 13:52:00 2021-01-09 11:01:04 1.0 0.8 1.0 N 236 236 1 6.0 0.0 0.5 0.0 0.0 0.3 9.3 2.5 0.0
std 5.925e-01 nan nan 1.060e+00 3.939e+02 5.995e-01 nan 6.784e+01 7.211e+01 5.777e-01 1.291e+01 1.231e+00 7.632e-02 2.597e+00 1.673e+00 4.222e-02 1.469e+01 7.989e-01 0.0
min 1 2008-12-31 23:05:14 2008-12-31 23:07:22 0.0 0.0 1.0 N 1 1 0 -490.0 -5.5 -0.5 -100.0 -31.12 -0.3 -492.8 -2.5 0.0
mean 1.722e+00 nan nan 1.412e+00 4.632e+00 1.035e+00 nan 1.652e+02 1.615e+02 1.189e+00 1.210e+01 9.705e-01 4.930e-01 1.918e+00 2.477e-01 2.969e-01 1.747e+01 2.239e+00 0.0
max 6 2021-02-22 16:52:16 2021-02-22 16:56:15 8.0 263163.28 99.0 Y 265 265 4 6960.5 8.25 0.5 1140.44 811.75 0.3 7661.28 3.0 0.0
freq 937141 13 21 966236 22942 1249243 1252433 74397 73700 934475 78226 686053 1357263 421482 1317902 1362457 27973 1144198 5
count 1369769 1369769 1369769 1271417 1369769 1271417 1271417 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1271417 5
75% 2.0 nan nan 1.0 3.02 1.0 nan 236.0 236.0 1.0 13.5 2.5 0.5 2.75 0.0 0.3 19.12 2.5 0.0
50% 2.0 nan nan 1.0 1.7 1.0 nan 162.0 162.0 1.0 8.5 0.0 0.5 1.86 0.0 0.3 13.8 2.5 0.0
25% 1.0 nan nan 1.0 1.0 1.0 nan 124.0 107.0 1.0 6.0 0.0 0.5 0.0 0.0 0.3 10.8 2.5 0.0

Use schemas with DuckDB#

To profile a specific table from various tables in different schemas, we can use the --schema/-s attribute.

Let’s save the file penguins.csv as a table penguins under the schema s1.

%%sql 
DROP TABLE IF EXISTS penguins;
CREATE SCHEMA IF NOT EXISTS s1;
CREATE TABLE s1.penguins (
    species VARCHAR(255),
    island VARCHAR(255),
    bill_length_mm DECIMAL(5, 2),
    bill_depth_mm DECIMAL(5, 2),
    flipper_length_mm DECIMAL(5, 2),
    body_mass_g INTEGER,
    sex VARCHAR(255)
);
COPY s1.penguins FROM 'penguins.csv' WITH (FORMAT CSV, HEADER TRUE);
*  duckdb://
Done.
Done.
Done.
Done.
Count
344
%sqlcmd profile --table penguins --schema s1 
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
unique 3 3 164 80 55 94 2
top Adelie Biscoe 41.10 17.00 190.00 3800 MALE
std nan nan 5.452e+00 1.972e+00 1.404e+01 8.008e+02 nan
min Adelie Biscoe 32.10 13.10 172.00 2700 FEMALE
mean nan nan 4.392e+01 1.715e+01 2.009e+02 4.202e+03 nan
max Gentoo Torgersen 59.60 21.50 231.00 6300 MALE
freq 152 168 7 12 22 12 168
count 344 344 342 342 342 342 333
75% nan nan 48.5 18.7 213.0 4750.0 nan
50% nan nan 44.4 17.3 197.0 4050.0 nan
25% nan nan 39.2 15.6 190.0 3550.0 nan

Use schemas with SQLite#

%%sql duckdb:///
INSTALL 'sqlite_scanner';
LOAD 'sqlite_scanner';
Done.
Done.
Success
import sqlite3

with sqlite3.connect("a.db") as conn:
    conn.execute("CREATE TABLE my_numbers (number FLOAT)")
    conn.execute("INSERT INTO my_numbers VALUES (1)")
    conn.execute("INSERT INTO my_numbers VALUES (2)")
    conn.execute("INSERT INTO my_numbers VALUES (3)")
%%sql
ATTACH DATABASE 'a.db' AS a_schema
   duckdb://
*  duckdb:///
Done.
Success
import sqlite3

with sqlite3.connect("b.db") as conn:
    conn.execute("CREATE TABLE my_numbers (number FLOAT)")
    conn.execute("INSERT INTO my_numbers VALUES (11)")
    conn.execute("INSERT INTO my_numbers VALUES (22)")
    conn.execute("INSERT INTO my_numbers VALUES (33)")
%%sql
ATTACH DATABASE 'b.db' AS b_schema
   duckdb://
*  duckdb:///
Done.
Success

Let’s profile my_numbers of b_schema

%sqlcmd profile --table my_numbers --schema b_schema
number
unique 3
top 11.0
std 8.981e+00
min 11.0
mean 22.0
max 33.0
freq 1
count 3
75% 33.0
50% 22.0
25% 11.0