%sqlcmd profile
#
%sqlcmd profile
allows you to obtain summary statistics of a table quickly. The code used here is compatible with all major databases.
Note
You can view the documentation and command line arguments by running %sqlcmd?
Arguments:
-t
/--table
(Required) Get the profile of a table
-s
/--schema
(Optional) Get the profile of a table under a specified schema
-o
/--output
(Optional) Output the profile at a specified location (path name expected)
Note
This example requires duckdb-engine: pip install duckdb-engine
Load CSV Data with DuckDB#
Load the extension and connect to an in-memory DuckDB database:
%load_ext sql
%sql duckdb://
There's a new jupysql version available (0.7.6), you're running 0.7.7.dev0. To upgrade: pip install jupysql --upgrade
Load and download penguins.csv
dataset , using DuckDB.
from pathlib import Path
from urllib.request import urlretrieve
if not Path("penguins.csv").is_file():
urlretrieve(
"https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv",
"penguins.csv",
)
%%sql
SELECT * FROM "penguins.csv" LIMIT 3
* duckdb://
Done.
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | MALE |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | FEMALE |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | FEMALE |
Load Parquet Data with DuckDB#
Load and download a sample dataset that contains historical taxi data from NYC, using DuckDB.
import os
from pathlib import Path
from urllib.request import urlretrieve
url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"
new_filename = "yellow_tripdata_2021.parquet"
if not Path(new_filename).is_file():
urlretrieve(url, new_filename)
# Rename the downloaded file to remove the month ("-" interferes with the SQL query)
os.rename(new_filename, new_filename.replace("-01", ""))
%%sql
SELECT * FROM yellow_tripdata_2021.parquet LIMIT 3
* duckdb://
Done.
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2021-01-01 00:30:10 | 2021-01-01 00:36:12 | 1.0 | 2.1 | 1.0 | N | 142 | 43 | 2 | 8.0 | 3.0 | 0.5 | 0.0 | 0.0 | 0.3 | 11.8 | 2.5 | None |
1 | 2021-01-01 00:51:20 | 2021-01-01 00:52:19 | 1.0 | 0.2 | 1.0 | N | 238 | 151 | 2 | 3.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 4.3 | 0.0 | None |
1 | 2021-01-01 00:43:30 | 2021-01-01 01:11:06 | 1.0 | 14.7 | 1.0 | N | 132 | 165 | 1 | 42.0 | 0.5 | 0.5 | 8.65 | 0.0 | 0.3 | 51.95 | 0.0 | None |
Profile#
Let us profile the penguins.csv
data
%sqlcmd profile --table "penguins.csv"
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
unique | 3 | 3 | 164 | 80 | 55 | 94 | 2 |
top | Adelie | Biscoe | 41.1 | 17.0 | 190 | 3800 | MALE |
std | nan | nan | 5.452e+00 | 1.972e+00 | 1.404e+01 | 8.008e+02 | nan |
min | Adelie | Biscoe | 32.1 | 13.1 | 172 | 2700 | FEMALE |
mean | nan | nan | 4.392e+01 | 1.715e+01 | 2.009e+02 | 4.202e+03 | nan |
max | Gentoo | Torgersen | 59.6 | 21.5 | 231 | 6300 | MALE |
freq | 152 | 168 | 7 | 12 | 22 | 12 | 168 |
count | 344 | 344 | 342 | 342 | 342 | 342 | 333 |
75% | nan | nan | 48.5 | 18.7 | 213.0 | 4750.0 | nan |
50% | nan | nan | 44.4 | 17.3 | 197.0 | 4050.0 | nan |
25% | nan | nan | 39.2 | 15.6 | 190.0 | 3550.0 | nan |
Let us profile the yellow_tripdata_2021.parquet
data
%sqlcmd profile --table "yellow_tripdata_2021.parquet"
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
unique | 3 | 939020 | 935992 | 9 | 3787 | 7 | 2 | 258 | 260 | 5 | 6017 | 91 | 3 | 2155 | 307 | 3 | 8321 | 5 | 1 |
top | 2 | 2021-01-14 13:52:00 | 2021-01-27 00:00:00 | 1.0 | 0.8 | 1.0 | N | 236 | 236 | 1 | 6.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 9.3 | 2.5 | 0.0 |
std | 5.925e-01 | nan | nan | 1.060e+00 | 3.939e+02 | 5.995e-01 | nan | 6.784e+01 | 7.211e+01 | 5.777e-01 | 1.291e+01 | 1.231e+00 | 7.632e-02 | 2.597e+00 | 1.673e+00 | 4.222e-02 | 1.469e+01 | 7.989e-01 | 0.0 |
min | 1 | 2008-12-31 23:05:14 | 2008-12-31 23:07:22 | 0.0 | 0.0 | 1.0 | N | 1 | 1 | 0 | -490.0 | -5.5 | -0.5 | -100.0 | -31.12 | -0.3 | -492.8 | -2.5 | 0.0 |
mean | 1.722e+00 | nan | nan | 1.412e+00 | 4.632e+00 | 1.035e+00 | nan | 1.652e+02 | 1.615e+02 | 1.189e+00 | 1.210e+01 | 9.705e-01 | 4.930e-01 | 1.918e+00 | 2.477e-01 | 2.969e-01 | 1.747e+01 | 2.239e+00 | 0.0 |
max | 6 | 2021-02-22 16:52:16 | 2021-02-22 16:56:15 | 8.0 | 263163.28 | 99.0 | Y | 265 | 265 | 4 | 6960.5 | 8.25 | 0.5 | 1140.44 | 811.75 | 0.3 | 7661.28 | 3.0 | 0.0 |
freq | 937141 | 13 | 21 | 966236 | 22942 | 1249243 | 1252433 | 74397 | 73700 | 934475 | 78226 | 686053 | 1357263 | 421482 | 1317902 | 1362457 | 27973 | 1144198 | 5 |
count | 1369769 | 1369769 | 1369769 | 1271417 | 1369769 | 1271417 | 1271417 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1271417 | 5 |
75% | 2.0 | nan | nan | 1.0 | 3.02 | 1.0 | nan | 236.0 | 236.0 | 1.0 | 13.5 | 2.5 | 0.5 | 2.75 | 0.0 | 0.3 | 19.12 | 2.5 | 0.0 |
50% | 2.0 | nan | nan | 1.0 | 1.7 | 1.0 | nan | 162.0 | 162.0 | 1.0 | 8.5 | 0.0 | 0.5 | 1.86 | 0.0 | 0.3 | 13.8 | 2.5 | 0.0 |
25% | 1.0 | nan | nan | 1.0 | 1.0 | 1.0 | nan | 124.0 | 107.0 | 1.0 | 6.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 10.8 | 2.5 | 0.0 |
Saving report as HTML#
To save the generated report as an HTML file, use the --output/-o
attribute followed by the desired file name.
To save the profile of the penguins.csv
data as an HTML file:
%sqlcmd profile --table "penguins.csv" --output penguins-report.html
Show code cell output
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
unique | 3 | 3 | 164 | 80 | 55 | 94 | 2 |
top | Adelie | Biscoe | 41.1 | 17.0 | 190 | 3800 | MALE |
std | nan | nan | 5.452e+00 | 1.972e+00 | 1.404e+01 | 8.008e+02 | nan |
min | Adelie | Biscoe | 32.1 | 13.1 | 172 | 2700 | FEMALE |
mean | nan | nan | 4.392e+01 | 1.715e+01 | 2.009e+02 | 4.202e+03 | nan |
max | Gentoo | Torgersen | 59.6 | 21.5 | 231 | 6300 | MALE |
freq | 152 | 168 | 7 | 12 | 22 | 12 | 168 |
count | 344 | 344 | 342 | 342 | 342 | 342 | 333 |
75% | nan | nan | 48.5 | 18.7 | 213.0 | 4750.0 | nan |
50% | nan | nan | 44.4 | 17.3 | 197.0 | 4050.0 | nan |
25% | nan | nan | 39.2 | 15.6 | 190.0 | 3550.0 | nan |
from IPython.display import HTML
HTML("penguins-report.html")
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
unique | 3 | 3 | 164 | 80 | 55 | 94 | 2 |
top | Adelie | Biscoe | 41.1 | 17.0 | 190 | 3800 | MALE |
std | nan | nan | 5.452e+00 | 1.972e+00 | 1.404e+01 | 8.008e+02 | nan |
min | Adelie | Biscoe | 32.1 | 13.1 | 172 | 2700 | FEMALE |
mean | nan | nan | 4.392e+01 | 1.715e+01 | 2.009e+02 | 4.202e+03 | nan |
max | Gentoo | Torgersen | 59.6 | 21.5 | 231 | 6300 | MALE |
freq | 152 | 168 | 7 | 12 | 22 | 12 | 168 |
count | 344 | 344 | 342 | 342 | 342 | 342 | 333 |
75% | nan | nan | 48.5 | 18.7 | 213.0 | 4750.0 | nan |
50% | nan | nan | 44.4 | 17.3 | 197.0 | 4050.0 | nan |
25% | nan | nan | 39.2 | 15.6 | 190.0 | 3550.0 | nan |
To save the profile of the yellow_tripdata_2021.parquet
data as an HTML file:
%sqlcmd profile --table "yellow_tripdata_2021.parquet" --output taxi-report.html
Show code cell output
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
unique | 3 | 939020 | 935992 | 9 | 3787 | 7 | 2 | 258 | 260 | 5 | 6017 | 91 | 3 | 2155 | 307 | 3 | 8321 | 5 | 1 |
top | 2 | 2021-01-14 13:52:00 | 2021-01-09 11:01:04 | 1.0 | 0.8 | 1.0 | N | 236 | 236 | 1 | 6.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 9.3 | 2.5 | 0.0 |
std | 5.925e-01 | nan | nan | 1.060e+00 | 3.939e+02 | 5.995e-01 | nan | 6.784e+01 | 7.211e+01 | 5.777e-01 | 1.291e+01 | 1.231e+00 | 7.632e-02 | 2.597e+00 | 1.673e+00 | 4.222e-02 | 1.469e+01 | 7.989e-01 | 0.0 |
min | 1 | 2008-12-31 23:05:14 | 2008-12-31 23:07:22 | 0.0 | 0.0 | 1.0 | N | 1 | 1 | 0 | -490.0 | -5.5 | -0.5 | -100.0 | -31.12 | -0.3 | -492.8 | -2.5 | 0.0 |
mean | 1.722e+00 | nan | nan | 1.412e+00 | 4.632e+00 | 1.035e+00 | nan | 1.652e+02 | 1.615e+02 | 1.189e+00 | 1.210e+01 | 9.705e-01 | 4.930e-01 | 1.918e+00 | 2.477e-01 | 2.969e-01 | 1.747e+01 | 2.239e+00 | 0.0 |
max | 6 | 2021-02-22 16:52:16 | 2021-02-22 16:56:15 | 8.0 | 263163.28 | 99.0 | Y | 265 | 265 | 4 | 6960.5 | 8.25 | 0.5 | 1140.44 | 811.75 | 0.3 | 7661.28 | 3.0 | 0.0 |
freq | 937141 | 13 | 21 | 966236 | 22942 | 1249243 | 1252433 | 74397 | 73700 | 934475 | 78226 | 686053 | 1357263 | 421482 | 1317902 | 1362457 | 27973 | 1144198 | 5 |
count | 1369769 | 1369769 | 1369769 | 1271417 | 1369769 | 1271417 | 1271417 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1271417 | 5 |
75% | 2.0 | nan | nan | 1.0 | 3.02 | 1.0 | nan | 236.0 | 236.0 | 1.0 | 13.5 | 2.5 | 0.5 | 2.75 | 0.0 | 0.3 | 19.12 | 2.5 | 0.0 |
50% | 2.0 | nan | nan | 1.0 | 1.7 | 1.0 | nan | 162.0 | 162.0 | 1.0 | 8.5 | 0.0 | 0.5 | 1.86 | 0.0 | 0.3 | 13.8 | 2.5 | 0.0 |
25% | 1.0 | nan | nan | 1.0 | 1.0 | 1.0 | nan | 124.0 | 107.0 | 1.0 | 6.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 10.8 | 2.5 | 0.0 |
from IPython.display import HTML
HTML("taxi-report.html")
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
unique | 3 | 939020 | 935992 | 9 | 3787 | 7 | 2 | 258 | 260 | 5 | 6017 | 91 | 3 | 2155 | 307 | 3 | 8321 | 5 | 1 |
top | 2 | 2021-01-14 13:52:00 | 2021-01-09 11:01:04 | 1.0 | 0.8 | 1.0 | N | 236 | 236 | 1 | 6.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 9.3 | 2.5 | 0.0 |
std | 5.925e-01 | nan | nan | 1.060e+00 | 3.939e+02 | 5.995e-01 | nan | 6.784e+01 | 7.211e+01 | 5.777e-01 | 1.291e+01 | 1.231e+00 | 7.632e-02 | 2.597e+00 | 1.673e+00 | 4.222e-02 | 1.469e+01 | 7.989e-01 | 0.0 |
min | 1 | 2008-12-31 23:05:14 | 2008-12-31 23:07:22 | 0.0 | 0.0 | 1.0 | N | 1 | 1 | 0 | -490.0 | -5.5 | -0.5 | -100.0 | -31.12 | -0.3 | -492.8 | -2.5 | 0.0 |
mean | 1.722e+00 | nan | nan | 1.412e+00 | 4.632e+00 | 1.035e+00 | nan | 1.652e+02 | 1.615e+02 | 1.189e+00 | 1.210e+01 | 9.705e-01 | 4.930e-01 | 1.918e+00 | 2.477e-01 | 2.969e-01 | 1.747e+01 | 2.239e+00 | 0.0 |
max | 6 | 2021-02-22 16:52:16 | 2021-02-22 16:56:15 | 8.0 | 263163.28 | 99.0 | Y | 265 | 265 | 4 | 6960.5 | 8.25 | 0.5 | 1140.44 | 811.75 | 0.3 | 7661.28 | 3.0 | 0.0 |
freq | 937141 | 13 | 21 | 966236 | 22942 | 1249243 | 1252433 | 74397 | 73700 | 934475 | 78226 | 686053 | 1357263 | 421482 | 1317902 | 1362457 | 27973 | 1144198 | 5 |
count | 1369769 | 1369769 | 1369769 | 1271417 | 1369769 | 1271417 | 1271417 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1369769 | 1271417 | 5 |
75% | 2.0 | nan | nan | 1.0 | 3.02 | 1.0 | nan | 236.0 | 236.0 | 1.0 | 13.5 | 2.5 | 0.5 | 2.75 | 0.0 | 0.3 | 19.12 | 2.5 | 0.0 |
50% | 2.0 | nan | nan | 1.0 | 1.7 | 1.0 | nan | 162.0 | 162.0 | 1.0 | 8.5 | 0.0 | 0.5 | 1.86 | 0.0 | 0.3 | 13.8 | 2.5 | 0.0 |
25% | 1.0 | nan | nan | 1.0 | 1.0 | 1.0 | nan | 124.0 | 107.0 | 1.0 | 6.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 10.8 | 2.5 | 0.0 |
Use schemas with DuckDB#
To profile a specific table from various tables in different schemas, we can use the --schema/-s
attribute.
Let’s save the file penguins.csv as a table penguins
under the schema s1
.
%%sql
DROP TABLE IF EXISTS penguins;
CREATE SCHEMA IF NOT EXISTS s1;
CREATE TABLE s1.penguins (
species VARCHAR(255),
island VARCHAR(255),
bill_length_mm DECIMAL(5, 2),
bill_depth_mm DECIMAL(5, 2),
flipper_length_mm DECIMAL(5, 2),
body_mass_g INTEGER,
sex VARCHAR(255)
);
COPY s1.penguins FROM 'penguins.csv' WITH (FORMAT CSV, HEADER TRUE);
* duckdb://
Done.
Done.
Done.
Done.
Count |
---|
344 |
%sqlcmd profile --table penguins --schema s1
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
unique | 3 | 3 | 164 | 80 | 55 | 94 | 2 |
top | Adelie | Biscoe | 41.10 | 17.00 | 190.00 | 3800 | MALE |
std | nan | nan | 5.452e+00 | 1.972e+00 | 1.404e+01 | 8.008e+02 | nan |
min | Adelie | Biscoe | 32.10 | 13.10 | 172.00 | 2700 | FEMALE |
mean | nan | nan | 4.392e+01 | 1.715e+01 | 2.009e+02 | 4.202e+03 | nan |
max | Gentoo | Torgersen | 59.60 | 21.50 | 231.00 | 6300 | MALE |
freq | 152 | 168 | 7 | 12 | 22 | 12 | 168 |
count | 344 | 344 | 342 | 342 | 342 | 342 | 333 |
75% | nan | nan | 48.5 | 18.7 | 213.0 | 4750.0 | nan |
50% | nan | nan | 44.4 | 17.3 | 197.0 | 4050.0 | nan |
25% | nan | nan | 39.2 | 15.6 | 190.0 | 3550.0 | nan |
Use schemas with SQLite#
%%sql duckdb:///
INSTALL 'sqlite_scanner';
LOAD 'sqlite_scanner';
Done.
Done.
Success |
---|
import sqlite3
with sqlite3.connect("a.db") as conn:
conn.execute("CREATE TABLE my_numbers (number FLOAT)")
conn.execute("INSERT INTO my_numbers VALUES (1)")
conn.execute("INSERT INTO my_numbers VALUES (2)")
conn.execute("INSERT INTO my_numbers VALUES (3)")
%%sql
ATTACH DATABASE 'a.db' AS a_schema
duckdb://
* duckdb:///
Done.
Success |
---|
import sqlite3
with sqlite3.connect("b.db") as conn:
conn.execute("CREATE TABLE my_numbers (number FLOAT)")
conn.execute("INSERT INTO my_numbers VALUES (11)")
conn.execute("INSERT INTO my_numbers VALUES (22)")
conn.execute("INSERT INTO my_numbers VALUES (33)")
%%sql
ATTACH DATABASE 'b.db' AS b_schema
duckdb://
* duckdb:///
Done.
Success |
---|
Let’s profile my_numbers
of b_schema
%sqlcmd profile --table my_numbers --schema b_schema
number | |
---|---|
unique | 3 |
top | 11.0 |
std | 8.981e+00 |
min | 11.0 |
mean | 22.0 |
max | 33.0 |
freq | 1 |
count | 3 |
75% | 33.0 |
50% | 22.0 |
25% | 11.0 |