Data profiling#

New in version 0.7:

pip install jupysql --upgrade

When working with a new dataset, practitioners need to explore and summarize it quickly to extract valuable insights, and doing that by hand can be time-consuming. %sqlcmd profile generates statistics and descriptive information in a single command, so you can gain a deeper understanding of your data in far less time.

Available statistics (see the pandas comparison after this list):

  • The count of non-empty values

  • The number of unique values

  • The top (most frequent) value

  • The frequency of your top value

  • The mean, standard deviation, min, and max values

  • The percentiles of your data: 25%, 50%, and 75%
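
If you are used to pandas, these are essentially the statistics that DataFrame.describe(include="all") reports. The snippet below is only a point of comparison on a toy DataFrame; it is not part of the examples that follow.

import pandas as pd

# Toy data, only to show which statistics correspond to the profile rows:
# count, unique, top, freq, mean, std, min, 25%, 50%, 75%, and max.
df = pd.DataFrame({"payment_type": ["card", "cash", "card"], "fare": [6.0, 8.5, 13.5]})
print(df.describe(include="all"))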

Examples#

DuckDB#

In this example, we'll profile a sample dataset of historical NYC taxi data using DuckDB. However, the code shown here is compatible with all major databases.

Download the data

from pathlib import Path
from urllib.request import urlretrieve

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"

if not Path("yellow_tripdata_2021-01.parquet").is_file():
    urlretrieve(url, "yellow_tripdata_2021-01.parquet")

Setup

Note

This example requires duckdb-engine: pip install duckdb-engine

Load the extension and connect to an in-memory DuckDB database:

%load_ext sql
%sql duckdb://
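
The rest of this example uses the in-memory database. If you would rather keep the data on disk, a file-backed DuckDB connection works the same way (the filename below is just a placeholder):

# Connect to a DuckDB database stored in a local file instead of in memory
# (my.duckdb is a hypothetical filename; use whatever path you prefer).
%sql duckdb:///my.duckdb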

Profile table

%sqlcmd profile --table "yellow_tripdata_2021-01.parquet"
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
unique 3 939020 935992 9 3787 7 2 258 260 5 6017 91 3 2155 307 3 8321 5 1
top 2 2021-01-14 13:52:00 2021-01-27 00:00:00 1.0 0.8 1.0 N 236 236 1 6.0 0.0 0.5 0.0 0.0 0.3 9.3 2.5 0.0
std 5.925e-01 nan nan 1.060e+00 3.939e+02 5.995e-01 nan 6.784e+01 7.211e+01 5.777e-01 1.291e+01 1.231e+00 7.632e-02 2.597e+00 1.673e+00 4.222e-02 1.469e+01 7.989e-01 0.0
min 1 2008-12-31 23:05:14 2008-12-31 23:07:22 0.0 0.0 1.0 N 1 1 0 -490.0 -5.5 -0.5 -100.0 -31.12 -0.3 -492.8 -2.5 0.0
mean 1.722e+00 nan nan 1.412e+00 4.632e+00 1.035e+00 nan 1.652e+02 1.615e+02 1.189e+00 1.210e+01 9.705e-01 4.930e-01 1.918e+00 2.477e-01 2.969e-01 1.747e+01 2.239e+00 0.0
max 6 2021-02-22 16:52:16 2021-02-22 16:56:15 8.0 263163.28 99.0 Y 265 265 4 6960.5 8.25 0.5 1140.44 811.75 0.3 7661.28 3.0 0.0
freq 937141 13 21 966236 22942 1249243 1252433 74397 73700 934475 78226 686053 1357263 421482 1317902 1362457 27973 1144198 5
count 1369769 1369769 1369769 1271417 1369769 1271417 1271417 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1271417 5
75% 2.0 nan nan 1.0 3.02 1.0 nan 236.0 236.0 1.0 13.5 2.5 0.5 2.75 0.0 0.3 19.12 2.5 0.0
50% 2.0 nan nan 1.0 1.7 1.0 nan 162.0 162.0 1.0 8.5 0.0 0.5 1.86 0.0 0.3 13.8 2.5 0.0
25% 1.0 nan nan 1.0 1.0 1.0 nan 124.0 107.0 1.0 6.0 0.0 0.5 0.0 0.0 0.3 10.8 2.5 0.0
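
If you want to sanity-check a few of these numbers, the same aggregations can be computed with plain SQL. The query below is only a cross-check on the trip_distance column, not how %sqlcmd profile is implemented:

%%sql
SELECT
    COUNT(trip_distance) AS non_null_count,           -- count of non-empty values
    COUNT(DISTINCT trip_distance) AS unique_values,    -- number of unique values
    MIN(trip_distance) AS min_value,
    MAX(trip_distance) AS max_value,
    AVG(trip_distance) AS mean_value,
    STDDEV(trip_distance) AS std_value,
    quantile_cont(trip_distance, 0.5) AS median_value  -- 50th percentile
FROM "yellow_tripdata_2021-01.parquet"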

SQLite#

We can easily explore a large SQLite database using DuckDB.

import urllib.request
from pathlib import Path

if not Path("example.db").is_file():
    url = "https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"  # noqa
    urllib.request.urlretrieve(url, "example.db")
%%sql duckdb:///
INSTALL 'sqlite_scanner';
LOAD 'sqlite_scanner';
CALL sqlite_attach('example.db');
Done.
Done.
Done.
Success
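
With the SQLite file attached, you can confirm that the track table is visible before profiling it; %sqlcmd tables lists the tables in the current connection:

%sqlcmd tables
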
%sqlcmd profile -t track
TrackId Name AlbumId MediaTypeId GenreId Composer Milliseconds Bytes UnitPrice
unique 3503 3257 347 5 25 852 3080 3501 2
top 1 The Number Of The Beast 141 1 1 Steve Harris 240091 17760384 0.99
std 1.011e+03 nan 8.176e+01 5.804e-01 6.189e+00 nan 534929.0658628318 105377489.40893556 2.390e-01
min 1 "40" 1 1 1 A. F. Iommi, W. Ward, T. Butler, J. Osbourne 1071 38747 0.99
mean 1752.0 nan 1.409e+02 1.208e+00 5.725e+00 nan 393599.2121039109 33510207.065372538 1.051e+00
max 3503 Último Pau-De-Arara 347 5 25 roger glover 5286953 1059546140 1.99
freq 1 5 57 3034 1297 80 4 2 3290
count 3503 3503 3503 3503 3503 2525 3503 3503 3503
75% 2628.0 nan 212.0 1.0 7.0 nan 321671.0 10267572.0 0.99
50% 1752.0 nan 141.0 1.0 3.0 nan 255634.0 8107896.0 0.99
25% 876.0 nan 70.0 1.0 1.0 nan 207229.0 6342021.0 0.99

Saving report as HTML#

To save the generated report as an HTML file, use the --output/-o argument followed by the desired file name:

%sqlcmd profile -t track --output my-report.html

To display the saved report inside the notebook, load it with IPython's HTML display:

from IPython.display import HTML

HTML("my-report.html")
TrackId Name AlbumId MediaTypeId GenreId Composer Milliseconds Bytes UnitPrice
unique 3503 3257 347 5 25 852 3080 3501 2
top 1 The Number Of The Beast 141 1 1 Steve Harris 240091 17760384 0.99
std 1.011e+03 nan 8.176e+01 5.804e-01 6.189e+00 nan 534929.0658628318 105377489.40893556 2.390e-01
min 1 "40" 1 1 1 A. F. Iommi, W. Ward, T. Butler, J. Osbourne 1071 38747 0.99
mean 1752.0 nan 1.409e+02 1.208e+00 5.725e+00 nan 393599.2121039109 33510207.065372538 1.051e+00
max 3503 Último Pau-De-Arara 347 5 25 roger glover 5286953 1059546140 1.99
freq 1 5 57 3034 1297 80 4 2 3290
count 3503 3503 3503 3503 3503 2525 3503 3503 3503
75% 2628.0 nan 212.0 1.0 7.0 nan 321671.0 10267572.0 0.99
50% 1752.0 nan 141.0 1.0 3.0 nan 255634.0 8107896.0 0.99
25% 876.0 nan 70.0 1.0 1.0 nan 207229.0 6342021.0 0.99
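
Since the report is a regular standalone HTML file, you can also open it outside the notebook, for example with Python's standard library:

import webbrowser

# Open the saved report in the default web browser.
webbrowser.open("my-report.html")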

Use schemas#

When tables are spread across different schemas, use the --schema/-s argument to tell %sqlcmd profile which schema to look in.

import sqlite3

with sqlite3.connect("a.db") as conn:
    conn.execute("CREATE TABLE my_numbers (number FLOAT)")
    conn.execute("INSERT INTO my_numbers VALUES (1)")
    conn.execute("INSERT INTO my_numbers VALUES (2)")
    conn.execute("INSERT INTO my_numbers VALUES (3)")
%%sql
ATTACH DATABASE 'a.db' AS a_schema
   duckdb://
*  duckdb:///
Done.
Success
import sqlite3

with sqlite3.connect("b.db") as conn:
    conn.execute("CREATE TABLE my_numbers (number FLOAT)")
    conn.execute("INSERT INTO my_numbers VALUES (11)")
    conn.execute("INSERT INTO my_numbers VALUES (22)")
    conn.execute("INSERT INTO my_numbers VALUES (33)")
%%sql
ATTACH DATABASE 'b.db' AS b_schema
   duckdb://
*  duckdb:///
Done.
Success
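
Both attached databases contain a table called my_numbers, so we need the schema to disambiguate. You can first list what a given schema exposes; %sqlcmd tables accepts the same --schema argument:

%sqlcmd tables --schema b_schema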

Let's profile the my_numbers table in b_schema:

%sqlcmd profile --table my_numbers --schema b_schema
number
unique 3
top 11.0
std 8.981e+00
min 11.0
mean 22.0
max 33.0
freq 1
count 3
75% 33.0
50% 22.0
25% 11.0
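
The values (min 11, mean 22, max 33) confirm we profiled the copy stored in b_schema. Pointing the same command at a_schema profiles the other copy:

%sqlcmd profile --table my_numbers --schema a_schema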