chDB

JupySQL integrates with chDB so you can run SQL queries in a Jupyter notebook. Jump into any section to learn more!

Prerequisites for `.parquet` files

%pip install jupysql chdb pyarrow --quiet

Note: you may need to restart the kernel to use updated packages.

from chdb import dbapi

conn = dbapi.connect()

%load_ext sql
%sql conn --alias chdb

Get a sample `.parquet` file

from urllib.request import urlretrieve

_ = urlretrieve(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet",
    "yellow_tripdata_2021-01.parquet",
)
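
Before querying the download, it can help to confirm that the file really is Parquet and not, say, an HTML error page saved under a `.parquet` name. The sketch below relies on the fact that Parquet files begin and end with the four magic bytes `PAR1`. The helper name `looks_like_parquet` is hypothetical and not part of JupySQL or chDB, and this is only a quick sanity check, not a full validation of the file.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # Parquet files start and end with these four bytes


def looks_like_parquet(path):
    """Hypothetical helper: True if `path` starts and ends with the Parquet magic bytes."""
    p = Path(path)
    # A valid file needs at least the header magic, a 4-byte footer length,
    # and the footer magic, so anything under 12 bytes cannot be Parquet.
    if not p.is_file() or p.stat().st_size < 12:
        return False
    with p.open("rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # move to the last four bytes of the file
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Prints False if the file is missing or is not a Parquet file.
print(looks_like_parquet("yellow_tripdata_2021-01.parquet"))
```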

Query on S3/HTTP/File

Query a local file

%%sql
SELECT
    passenger_count, AVG(trip_distance) AS avg_trip_distance
FROM file('yellow_tripdata_2021-01.parquet')
GROUP BY passenger_count

Running query in 'chdb'

passenger_count	avg_trip_distance
None	29.665125772734516
0	2.5424466811344746
7	11.134
4	2.8681984015618376
3	2.7576410606578126
5	2.694099520730797
1	2.6805563237138768
2	2.794832592116103
8	1.05
6	2.5745177825092656

Truncated to displaylimit of 10.

Query a file over HTTP

%%sql
SELECT
    RegionID, SUM(AdvEngineID), COUNT(*) AS c, AVG(ResolutionWidth), COUNT(DISTINCT UserID)
FROM url('https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_0.parquet')
-- To query S3 instead, replace the FROM clause above with:
-- FROM s3('xxxx')
GROUP BY
    RegionID
ORDER BY c DESC
LIMIT 10
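
The queries in this section differ only in the table function used in the FROM clause: `file(...)` for a local path, `url(...)` for HTTP, and `s3(...)` for S3. As a rough sketch of that choice, the helper below picks the function from the location string. Both `quote_literal` and `table_function` are hypothetical names introduced here for illustration, and the `s3` flag is an assumption: S3 locations are selected explicitly instead of being detected, and credentials or other `s3` arguments are not covered.

```python
from urllib.parse import urlparse


def quote_literal(value):
    """Hypothetical helper: wrap a string in single quotes for use in SQL."""
    # Escape backslashes first, then single quotes.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def table_function(location, s3=False):
    """Hypothetical helper: build the FROM-clause expression for a data location."""
    if s3:
        # Assumption: the caller flags S3 locations explicitly.
        return f"s3({quote_literal(location)})"
    scheme = urlparse(location).scheme
    if scheme in ("http", "https"):
        return f"url({quote_literal(location)})"
    # Anything else is treated as a local path.
    return f"file({quote_literal(location)})"


print(table_function("yellow_tripdata_2021-01.parquet"))
print(table_function("https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_0.parquet"))
```

The returned string can then be placed in the FROM clause of a query such as the ones shown in this section.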