Clickhouse#
In this tutorial, we’ll see how to query Clickhouse from Jupyter. Optionally, you can spin up a testing server.
Tip
If you encounter issues, feel free to join our community and we’ll be happy to help!
Pre-requisites#
To run this tutorial, you need to install the clickhouse-sqlalchemy
package.
%pip install clickhouse-sqlalchemy --quiet
Note: you may need to restart the kernel to use updated packages.
Start Clickhoouse instance#
If you don’t have a Clickhouse server running or you want to spin up one for testing, you can do it with the official Docker image.
To start the server:
%%bash
docker run --detach --name clickhouse \
-e CLICKHOUSE_DB=my_database \
-e CLICKHOUSE_USER=username \
-e CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1 \
-e CLICKHOUSE_PASSWORD=password \
-p 9000:9000/tcp clickhouse/clickhouse-server
cefe171d72a8b46a529dc15105dca08e1c7cfa90aabbbcb32ffe023d22418ee9
Ensure that the container is running:
%%bash
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
cefe171d72a8 clickhouse/clickhouse-server "/entrypoint.sh" 2 seconds ago Up 1 second 8123/tcp, 9009/tcp, 0.0.0.0:9000->9000/tcp clickhouse
Load sample data#
We’ll now uplod sample data.
First, let’s install and load JupySQL:
%pip install jupysql --quiet
%load_ext sql
Note: you may need to restart the kernel to use updated packages.
The sql extension is already loaded. To reload it, use:
%reload_ext sql
Start the connection:
%sql clickhouse+native://username:password@localhost/my_database
Create a table:
%%sql
CREATE TABLE taxi
(
VendorID Int32,
tpep_pickup_datetime DateTime,
tpep_dropoff_datetime DateTime,
passenger_count Float32,
trip_distance Float32,
RatecodeID Float32,
store_and_fwd_flag String,
PULocationID Int32,
DOLocationID Int32,
payment_type Int32,
fare_amount Float32,
extra Float32,
mta_tax Float32,
tip_amount Float32,
tolls_amount Float32,
improvement_surcharge Float32,
total_amount Float32,
congestion_surcharge Float32,
airport_fee Float32
)
ENGINE = MergeTree()
PRIMARY KEY (VendorID)
* clickhouse+native://username:***@localhost/my_database
Done.
Now, we’ll load 1.4 million rows into our table.
If you’re using the Docker container, you can execute the followig in a terminal to start a bash session:
docker exec -it clickhouse bash
Now, to load the data:
apt update
apt install curl -y
curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet | clickhouse-client --query="INSERT INTO my_database.taxi FORMAT Parquet"
Query#
Let’s query our data!
%%sql
SELECT * FROM taxi LIMIT 5
* clickhouse+native://username:***@localhost/my_database
Done.
VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2021-01-01 00:30:10 | 2021-01-01 00:36:12 | 1.0 | 2.0999999046325684 | 1.0 | N | 142 | 43 | 2 | 8.0 | 3.0 | 0.5 | 0.0 | 0.0 | 0.30000001192092896 | 11.800000190734863 | 2.5 | 0.0 |
1 | 2021-01-01 00:51:20 | 2021-01-01 00:52:19 | 1.0 | 0.20000000298023224 | 1.0 | N | 238 | 151 | 2 | 3.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.30000001192092896 | 4.300000190734863 | 0.0 | 0.0 |
1 | 2021-01-01 00:43:30 | 2021-01-01 01:11:06 | 1.0 | 14.699999809265137 | 1.0 | N | 132 | 165 | 1 | 42.0 | 0.5 | 0.5 | 8.649999618530273 | 0.0 | 0.30000001192092896 | 51.95000076293945 | 0.0 | 0.0 |
1 | 2021-01-01 00:15:48 | 2021-01-01 00:31:01 | 0.0 | 10.600000381469727 | 1.0 | N | 138 | 132 | 1 | 29.0 | 0.5 | 0.5 | 6.050000190734863 | 0.0 | 0.30000001192092896 | 36.349998474121094 | 0.0 | 0.0 |
1 | 2021-01-01 00:16:29 | 2021-01-01 00:24:30 | 1.0 | 1.600000023841858 | 1.0 | N | 224 | 68 | 1 | 8.0 | 3.0 | 0.5 | 2.3499999046325684 | 0.0 | 0.30000001192092896 | 14.149999618530273 | 2.5 | 0.0 |
List the tables in the database:
%sqlcmd tables
Name |
---|
taxi |
List columns in the taxi table:
%sqlcmd columns --table taxi
name | type | nullable | default | comment |
---|---|---|---|---|
VendorID | Int32 | False | None | None |
tpep_pickup_datetime | DateTime | False | None | None |
tpep_dropoff_datetime | DateTime | False | None | None |
passenger_count | Float32 | False | None | None |
trip_distance | Float32 | False | None | None |
RatecodeID | Float32 | False | None | None |
store_and_fwd_flag | String | False | None | None |
PULocationID | Int32 | False | None | None |
DOLocationID | Int32 | False | None | None |
payment_type | Int32 | False | None | None |
fare_amount | Float32 | False | None | None |
extra | Float32 | False | None | None |
mta_tax | Float32 | False | None | None |
tip_amount | Float32 | False | None | None |
tolls_amount | Float32 | False | None | None |
improvement_surcharge | Float32 | False | None | None |
total_amount | Float32 | False | None | None |
congestion_surcharge | Float32 | False | None | None |
airport_fee | Float32 | False | None | None |
Plotting#
Let’s compute the 99th quantile of the trip_distance
column to remove outliers:
%%sql
SELECT quantile(0.99)(trip_distance)
FROM taxi
* clickhouse+native://username:***@localhost/my_database
Done.
quantile(0.99)(trip_distance) |
---|
19.21179912567139 |
We now use --save
to store this SQL SELECT statement:
%%sql --save no_outliers --no-execute
SELECT trip_distance
FROM taxi
WHERE trip_distance < 18.7
* clickhouse+native://username:***@localhost/my_database
Skipping execution...
Now, we can pass it to the plotting command:
%sqlplot histogram --table no_outliers --column trip_distance --with no_outliers
<Axes: title={'center': "'trip_distance' from 'no_outliers'"}, xlabel='trip_distance', ylabel='Count'>

Clean up#
To stop and remove the container:
%%bash
docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
cefe171d72a8 clickhouse/clickhouse-server "/entrypoint.sh" 51 seconds ago Up 49 seconds 8123/tcp, 9009/tcp, 0.0.0.0:9000->9000/tcp clickhouse
%%bash
docker container stop clickhouse
clickhouse
%%bash
docker container rm clickhouse
clickhouse
%%bash
docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES