Clickhouse#

In this tutorial, we’ll see how to query Clickhouse from Jupyter. Optionally, you can spin up a testing server.

Tip

If you encounter issues, feel free to join our community and we’ll be happy to help!

Pre-requisites#

To run this tutorial, you need to install the clickhouse-sqlalchemy package.

%pip install clickhouse-sqlalchemy --quiet
Note: you may need to restart the kernel to use updated packages.

Start Clickhoouse instance#

If you don’t have a Clickhouse server running or you want to spin up one for testing, you can do it with the official Docker image.

To start the server:

%%bash
docker run --detach --name clickhouse \
    -e CLICKHOUSE_DB=my_database \
    -e CLICKHOUSE_USER=username \
    -e CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1 \
    -e CLICKHOUSE_PASSWORD=password \
    -p 9000:9000/tcp clickhouse/clickhouse-server
cefe171d72a8b46a529dc15105dca08e1c7cfa90aabbbcb32ffe023d22418ee9

Ensure that the container is running:

%%bash
docker ps
CONTAINER ID   IMAGE                          COMMAND            CREATED         STATUS        PORTS                                        NAMES
cefe171d72a8   clickhouse/clickhouse-server   "/entrypoint.sh"   2 seconds ago   Up 1 second   8123/tcp, 9009/tcp, 0.0.0.0:9000->9000/tcp   clickhouse

Load sample data#

We’ll now uplod sample data.

First, let’s install and load JupySQL:

%pip install jupysql --quiet
%load_ext sql
Note: you may need to restart the kernel to use updated packages.
The sql extension is already loaded. To reload it, use:
  %reload_ext sql

Start the connection:

%sql clickhouse+native://username:password@localhost/my_database

Create a table:

%%sql
CREATE TABLE taxi
(
    VendorID Int32,
    tpep_pickup_datetime DateTime,
    tpep_dropoff_datetime DateTime,
    passenger_count Float32,
    trip_distance Float32,
    RatecodeID Float32,
    store_and_fwd_flag String,
    PULocationID Int32,
    DOLocationID Int32,
    payment_type Int32,
    fare_amount Float32,
    extra Float32,
    mta_tax Float32,
    tip_amount Float32,
    tolls_amount Float32,
    improvement_surcharge Float32,
    total_amount Float32,
    congestion_surcharge Float32,
    airport_fee Float32
)
ENGINE = MergeTree()
PRIMARY KEY (VendorID)
*  clickhouse+native://username:***@localhost/my_database
Done.

Now, we’ll load 1.4 million rows into our table.

If you’re using the Docker container, you can execute the following in a terminal to start a bash session:

docker exec -it clickhouse bash

Now, to load the data:

apt update
apt install curl -y

curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet | clickhouse-client --query="INSERT INTO my_database.taxi FORMAT Parquet"

Query#

Let’s query our data!

%%sql
SELECT * FROM taxi LIMIT 5
*  clickhouse+native://username:***@localhost/my_database
Done.
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
1 2021-01-01 00:30:10 2021-01-01 00:36:12 1.0 2.0999999046325684 1.0 N 142 43 2 8.0 3.0 0.5 0.0 0.0 0.30000001192092896 11.800000190734863 2.5 0.0
1 2021-01-01 00:51:20 2021-01-01 00:52:19 1.0 0.20000000298023224 1.0 N 238 151 2 3.0 0.5 0.5 0.0 0.0 0.30000001192092896 4.300000190734863 0.0 0.0
1 2021-01-01 00:43:30 2021-01-01 01:11:06 1.0 14.699999809265137 1.0 N 132 165 1 42.0 0.5 0.5 8.649999618530273 0.0 0.30000001192092896 51.95000076293945 0.0 0.0
1 2021-01-01 00:15:48 2021-01-01 00:31:01 0.0 10.600000381469727 1.0 N 138 132 1 29.0 0.5 0.5 6.050000190734863 0.0 0.30000001192092896 36.349998474121094 0.0 0.0
1 2021-01-01 00:16:29 2021-01-01 00:24:30 1.0 1.600000023841858 1.0 N 224 68 1 8.0 3.0 0.5 2.3499999046325684 0.0 0.30000001192092896 14.149999618530273 2.5 0.0

List the tables in the database:

%sqlcmd tables
Name
taxi

List columns in the taxi table:

%sqlcmd columns --table taxi
name type nullable default comment
VendorID Int32 False None None
tpep_pickup_datetime DateTime False None None
tpep_dropoff_datetime DateTime False None None
passenger_count Float32 False None None
trip_distance Float32 False None None
RatecodeID Float32 False None None
store_and_fwd_flag String False None None
PULocationID Int32 False None None
DOLocationID Int32 False None None
payment_type Int32 False None None
fare_amount Float32 False None None
extra Float32 False None None
mta_tax Float32 False None None
tip_amount Float32 False None None
tolls_amount Float32 False None None
improvement_surcharge Float32 False None None
total_amount Float32 False None None
congestion_surcharge Float32 False None None
airport_fee Float32 False None None

Plotting#

Let’s compute the 99th quantile of the trip_distance column to remove outliers:

%%sql
SELECT quantile(0.99)(trip_distance)
FROM taxi
*  clickhouse+native://username:***@localhost/my_database
Done.
quantile(0.99)(trip_distance)
19.21179912567139

We now use --save to store this SQL SELECT statement:

%%sql --save no_outliers --no-execute
SELECT trip_distance
FROM taxi
WHERE trip_distance < 18.7
*  clickhouse+native://username:***@localhost/my_database
Skipping execution...

Now, we can pass it to the plotting command:

%sqlplot histogram --table no_outliers --column trip_distance --with no_outliers
<Axes: title={'center': "'trip_distance' from 'no_outliers'"}, xlabel='trip_distance', ylabel='Count'>
../_images/9838428a627e22edd266e278a762e506aff14f5a1ac182700c0f7969ceb9ba75.png

Clean up#

To stop and remove the container:

%%bash
docker container ls
CONTAINER ID   IMAGE                          COMMAND            CREATED          STATUS          PORTS                                        NAMES
cefe171d72a8   clickhouse/clickhouse-server   "/entrypoint.sh"   51 seconds ago   Up 49 seconds   8123/tcp, 9009/tcp, 0.0.0.0:9000->9000/tcp   clickhouse
%%bash
docker container stop clickhouse
clickhouse
%%bash
docker container rm clickhouse
clickhouse
%%bash
docker container ls
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES