Data profiling#

New in version 0.7:

pip install jupysql --upgrade

When working with a new dataset, practitioners need to explore and summarize it quickly to extract valuable insights, and doing that by hand can be time-consuming. %sqlcmd profile generates statistics and descriptive information in a single command, so you can gain a deeper understanding of your data in far less time.

Available statistics (see the pandas comparison after this list):

  • The count of non-empty values

  • The number of unique values

  • The top (most frequent) value

  • The frequency of your top value

  • The mean, standard deviation, min, and max values

  • The percentiles of your data: 25%, 50%, and 75%
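
If you are used to pandas, these are essentially the statistics that DataFrame.describe(include="all") reports. The snippet below is only a point of comparison on a toy DataFrame; it is not part of the examples that follow.

import pandas as pd

# Toy data, only to show which statistics correspond to the profile rows:
# count, unique, top, freq, mean, std, min, 25%, 50%, 75%, and max.
df = pd.DataFrame({"payment_type": ["card", "cash", "card"], "fare": [6.0, 8.5, 13.5]})
print(df.describe(include="all"))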

Examples#

DuckDB#

In this example, we'll profile a sample dataset of historical NYC taxi data using DuckDB. However, the code shown here is compatible with all major databases.

Download the data

from pathlib import Path
from urllib.request import urlretrieve

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"

if not Path("yellow_tripdata_2021-01.parquet").is_file():
    urlretrieve(url, "yellow_tripdata_2021-01.parquet")

Setup

Note

This example requires duckdb-engine: pip install duckdb-engine

Load the extension and connect to an in-memory DuckDB database:

%load_ext sql
%sql duckdb://
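
The rest of this example uses the in-memory database. If you would rather keep the data on disk, a file-backed DuckDB connection works the same way (the filename below is just a placeholder):

# Connect to a DuckDB database stored in a local file instead of in memory
# (my.duckdb is a hypothetical filename; use whatever path you prefer).
%sql duckdb:///my.duckdb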

Profile table

%sqlcmd profile --table "yellow_tripdata_2021-01.parquet"
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
unique 3 939020 935992 9 3787 7 2 258 260 5 6017 91 3 2155 307 3 8321 5 1
top 2 2021-01-14 13:52:00 2021-01-27 00:00:00 1.0 0.8 1.0 N 236 236 1 6.0 0.0 0.5 0.0 0.0 0.3 9.3 2.5 0.0
std 5.925e-01 nan nan 1.060e+00 3.939e+02 5.995e-01 nan 6.784e+01 7.211e+01 5.777e-01 1.291e+01 1.231e+00 7.632e-02 2.597e+00 1.673e+00 4.222e-02 1.469e+01 7.989e-01 0.0
min 1 2008-12-31 23:05:14 2008-12-31 23:07:22 0.0 0.0 1.0 N 1 1 0 -490.0 -5.5 -0.5 -100.0 -31.12 -0.3 -492.8 -2.5 0.0
mean 1.722e+00 nan nan 1.412e+00 4.632e+00 1.035e+00 nan 1.652e+02 1.615e+02 1.189e+00 1.210e+01 9.705e-01 4.930e-01 1.918e+00 2.477e-01 2.969e-01 1.747e+01 2.239e+00 0.0
max 6 2021-02-22 16:52:16 2021-02-22 16:56:15 8.0 263163.28 99.0 Y 265 265 4 6960.5 8.25 0.5 1140.44 811.75 0.3 7661.28 3.0 0.0
freq 937141 13 21 966236 22942 1249243 1252433 74397 73700 934475 78226 686053 1357263 421482 1317902 1362457 27973 1144198 5
count 1369769 1369769 1369769 1271417 1369769 1271417 1271417 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1369769 1271417 5
75% 2.0 nan nan 1.0 3.02 1.0 nan 236.0 236.0 1.0 13.5 2.5 0.5 2.75 0.0 0.3 19.12 2.5 0.0
50% 2.0 nan nan 1.0 1.7 1.0 nan 162.0 162.0 1.0 8.5 0.0 0.5 1.86 0.0 0.3 13.8 2.5 0.0
25% 1.0 nan nan 1.0 1.0 1.0 nan 124.0 107.0 1.0 6.0 0.0 0.5 0.0 0.0 0.3 10.8 2.5 0.0
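
If you want to sanity-check a few of these numbers, the same aggregations can be computed with plain SQL. The query below is only a cross-check on the trip_distance column, not how %sqlcmd profile is implemented:

%%sql
SELECT
    COUNT(trip_distance) AS non_null_count,           -- count of non-empty values
    COUNT(DISTINCT trip_distance) AS unique_values,    -- number of unique values
    MIN(trip_distance) AS min_value,
    MAX(trip_distance) AS max_value,
    AVG(trip_distance) AS mean_value,
    STDDEV(trip_distance) AS std_value,
    quantile_cont(trip_distance, 0.5) AS median_value  -- 50th percentile
FROM "yellow_tripdata_2021-01.parquet"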

SQLite#

We can easily explore a large SQLite database using DuckDB.

import urllib.request
from pathlib import Path

if not Path("example.db").is_file():
    url = "https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"  # noqa
    urllib.request.urlretrieve(url, "example.db")
%%sql duckdb:///
INSTALL 'sqlite_scanner';
LOAD 'sqlite_scanner';
CALL sqlite_attach('example.db');
Done.
Done.
Done.
Success
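
With the SQLite file attached, you can confirm that the track table is visible before profiling it; %sqlcmd tables lists the tables in the current connection:

%sqlcmd tables
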
%sqlcmd profile -t track
TrackId Name AlbumId MediaTypeId GenreId Composer Milliseconds Bytes UnitPrice
unique 3503 3257 347 5 25 852 3080 3501 2
top 1 The Number Of The Beast 141 1 1 Steve Harris 240091 17760384 0.99
std 1.011e+03 nan 8.176e+01 5.804e-01 6.189e+00 nan 534929.0658628318 105377489.40893556 2.390e-01
min 1 "40" 1 1 1 A. F. Iommi, W. Ward, T. Butler, J. Osbourne 1071 38747 0.99
mean 1752.0 nan 1.409e+02 1.208e+00 5.725e+00 nan 393599.2121039109 33510207.065372538 1.051e+00
max 3503 Último Pau-De-Arara 347 5 25 roger glover 5286953 1059546140 1.99
freq 1 5 57 3034 1297 80 4 2 3290
count 3503 3503 3503 3503 3503 2525 3503 3503 3503
75% 2628.0 nan 212.0 1.0 7.0 nan 321671.0 10267572.0 0.99
50% 1752.0 nan 141.0 1.0 3.0 nan 255634.0 8107896.0 0.99
25% 876.0 nan 70.0 1.0 1.0 nan 207229.0 6342021.0 0.99

Saving report as HTML#

To save the generated report as an HTML file, use the --output/-o argument followed by the desired file name:

%sqlcmd profile -t track --output my-report.html

To display the saved report inside the notebook, load it with IPython's HTML display:

from IPython.display import HTML

HTML("my-report.html")
TrackId Name AlbumId MediaTypeId GenreId Composer Milliseconds Bytes UnitPrice
unique 3503 3257 347 5 25 852 3080 3501 2
top 1 The Number Of The Beast 141 1 1 Steve Harris 240091 17760384 0.99
std 1.011e+03 nan 8.176e+01 5.804e-01 6.189e+00 nan 534929.0658628318 105377489.40893556 2.390e-01
min 1 "40" 1 1 1 A. F. Iommi, W. Ward, T. Butler, J. Osbourne 1071 38747 0.99
mean 1752.0 nan 1.409e+02 1.208e+00 5.725e+00 nan 393599.2121039109 33510207.065372538 1.051e+00
max 3503 Último Pau-De-Arara 347 5 25 roger glover 5286953 1059546140 1.99
freq 1 5 57 3034 1297 80 4 2 3290
count 3503 3503 3503 3503 3503 2525 3503 3503 3503
75% 2628.0 nan 212.0 1.0 7.0 nan 321671.0 10267572.0 0.99
50% 1752.0 nan 141.0 1.0 3.0 nan 255634.0 8107896.0 0.99
25% 876.0 nan 70.0 1.0 1.0 nan 207229.0 6342021.0 0.99
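
Since the report is a regular standalone HTML file, you can also open it outside the notebook, for example with Python's standard library:

import webbrowser

# Open the saved report in the default web browser.
webbrowser.open("my-report.html")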

Use schemas#

When tables are spread across different schemas, use the --schema/-s argument to tell %sqlcmd profile which schema to look in.

import sqlite3

with sqlite3.connect("a.db") as conn:
    conn.execute("CREATE TABLE my_numbers (number FLOAT)")
    conn.execute("INSERT INTO my_numbers VALUES (1)")
    conn.execute("INSERT INTO my_numbers VALUES (2)")
    conn.execute("INSERT INTO my_numbers VALUES (3)")
%%sql
ATTACH DATABASE 'a.db' AS a_schema
   duckdb://
*  duckdb:///
Done.
Success
import sqlite3

with sqlite3.connect("b.db") as conn:
    conn.execute("CREATE TABLE my_numbers (number FLOAT)")
    conn.execute("INSERT INTO my_numbers VALUES (11)")
    conn.execute("INSERT INTO my_numbers VALUES (22)")
    conn.execute("INSERT INTO my_numbers VALUES (33)")
%%sql
ATTACH DATABASE 'b.db' AS b_schema
   duckdb://
*  duckdb:///
Done.
Success
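
Both attached databases contain a table called my_numbers, so we need the schema to disambiguate. You can first list what a given schema exposes; %sqlcmd tables accepts the same --schema argument:

%sqlcmd tables --schema b_schema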

Let's profile the my_numbers table in b_schema:

%sqlcmd profile --table my_numbers --schema b_schema
number
unique 3
top 11.0
std 8.981e+00
min 11.0
mean 22.0
max 33.0
freq 1
count 3
75% 33.0
50% 22.0
25% 11.0
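
The values (min 11, mean 22, max 33) confirm we profiled the copy stored in b_schema. Pointing the same command at a_schema profiles the other copy:

%sqlcmd profile --table my_numbers --schema a_schema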