Trino

Trino#

This tutorial will show you how to get a Trino (f.k.a PrestoSQL) instance up and running locally to test JupySQL. You can run this in a Jupyter notebook.

Pre-requisites#

To run this tutorial, you need to install following Python packages:

%pip install jupysql trino pandas pyarrow --quiet

Note: you may need to restart the kernel to use updated packages.

You also need a Trino connector. Here is the supported connector. You can install it with:

%pip install sqlalchemy-trino --quiet

Note: you may need to restart the kernel to use updated packages.

Start Trino instance#

We fetch the official image, create a new database, and user (this will take a few seconds).

%%bash
docker run -p 8080:8080 --name trino -d trinodb/trino

cba8365556d3f35dd56cfd06747276ef1c7b7661eb4268b74e665d8d4d44a7e7

Our database is running, let’s load some data!

Load sample data#

Now, let’s fetch some sample data. We’ll be using the NYC taxi dataset:

import pandas as pd

df = pd.read_parquet(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"
)
df.shape

(1369769, 19)

Trino has maximum query text length of 1000000. Therefore, writing the whole NYC taxi dataset (~1.4M rows) will throw errors. A workaround is to increase the http-server.max-request-size configuration parameter to Trino’s maximum allowed characters of 1,000,000,000 in the Trino server configuration file (config.properties). We’ll write a subset of the data instead:

df = df.head(1000)

Trino uses a schema named “default” to store tables. Therefore, schema='default' is required in the connection string.

from sqlalchemy import create_engine

engine = create_engine(
    "trino://user@localhost:8080/memory", connect_args={"user": "user"}
)

df.to_sql(
    con=engine,
    name="taxi",
    schema="default",
    method="multi",
    index=False,
)

engine.dispose()

Query#

Now, let’s start JupySQL, authenticate, and query the data!

%load_ext sql

Tip: You may define configurations in /Users/neelashasen/Dev/jupysql_master/jupysql/pyproject.toml. Please review our configuration guideline.

Connecting to 'default'

%sql trino://user@localhost:8080/memory

Connecting and switching to connection 'trino://user@localhost:8080/memory'

Important

If the cell above fails, you might have some missing packages. Message us on Slack and we’ll help you!

List the tables in the database:

%sqlcmd tables --schema default

Name
taxi

List columns in the taxi table:

%sqlcmd columns --table taxi --schema default

name	type	nullable	default
vendorid	BIGINT	True	None
tpep_pickup_datetime	TIMESTAMP	True	None
tpep_dropoff_datetime	TIMESTAMP	True	None
passenger_count	DOUBLE	True	None
trip_distance	DOUBLE	True	None
ratecodeid	DOUBLE	True	None
store_and_fwd_flag	VARCHAR	True	None
pulocationid	BIGINT	True	None
dolocationid	BIGINT	True	None
payment_type	BIGINT	True	None
fare_amount	DOUBLE	True	None
extra	DOUBLE	True	None
mta_tax	DOUBLE	True	None
tip_amount	DOUBLE	True	None
tolls_amount	DOUBLE	True	None
improvement_surcharge	DOUBLE	True	None
total_amount	DOUBLE	True	None
congestion_surcharge	DOUBLE	True	None
airport_fee	DOUBLE	True	None

Query our data:

%%sql
SELECT COUNT(*) FROM default.taxi

Running query in 'trino://user@localhost:8080/memory'

_col0
1000

Parameterize queries#

threshold = 10

%%sql
SELECT COUNT(*) FROM default.taxi
WHERE trip_distance < {{threshold}}

Running query in 'trino://user@localhost:8080/memory'

_col0
949

threshold = 0.5

%%sql
SELECT COUNT(*) FROM default.taxi
WHERE trip_distance < {{threshold}}

Running query in 'trino://user@localhost:8080/memory'

_col0
64

CTEs#

%%sql --save many_passengers --no-execute
SELECT *
FROM default.taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93

Running query in 'trino://user@localhost:8080/memory'

Skipping execution...

%%sql --save trip_stats --with many_passengers
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers

Running query in 'trino://user@localhost:8080/memory'

_col0	_col1	_col2
0.25	3.16470588235294	11.15

This is what JupySQL executes:

query = %sqlcmd snippets trip_stats
print(query)

WITH many_passengers AS (
SELECT *
FROM default.taxi
WHERE passenger_count > 3

AND trip_distance < 18.93)
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers

Plotting#

The %sqlplot magic command currently does not directly support the --schema option for specifying the schema name. To work around this, you can specify the schema in the SQL query itself.

result = %sql SELECT trip_distance FROM default.taxi

import matplotlib.pyplot as plt

data = result.DataFrame()

plt.hist(data["trip_distance"])
plt.xlabel("Trip Distance")
plt.ylabel("Frequency")
plt.title("Histogram of Trip Distance")
plt.show()

Running query in 'trino://user@localhost:8080/memory'

../_images/016bf72fcaa6959099dbf5f89144d9a089bb68d1a1117cf89c9e1675256dc70d.png

result = %sql SELECT trip_distance FROM default.taxi

import matplotlib.pyplot as plt

data = result.DataFrame()

plt.boxplot(data["trip_distance"])
plt.xlabel("Trip Distance")
plt.ylabel("Value")
plt.title("Boxplot of Trip Distance")
plt.show()

Running query in 'trino://user@localhost:8080/memory'

../_images/eb3b71a8f91c96badbae678b455679c726b5293f84123f0be1c6d49b845fc4b3.png

Persist#

result = %sql SELECT * FROM default.taxi WHERE passenger_count > 3

Running query in 'trino://user@localhost:8080/memory'

df = result.DataFrame()

We need to pass --no-index since index creation is not supported in Trino DB.

%sql --persist default.df --no-index

Running query in 'trino://user@localhost:8080/memory'

Success! Persisted df to the database.

%sql SELECT * FROM default.df LIMIT 5

Running query in 'trino://user@localhost:8080/memory'

vendorid	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	ratecodeid	store_and_fwd_flag	pulocationid	dolocationid	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount	congestion_surcharge	airport_fee
2	2021-01-01 00:31:06	2021-01-01 00:38:52	5.0	1.7	1.0	N	142	50	1	8.0	0.5	0.5	2.36	0.3	14.16	2.5	None
2	2021-01-01 00:42:11	2021-01-01 00:44:24	5.0	0.81	1.0	N	50	142	2	4.5	0.5	0.5	0.0	0.3	8.3	2.5	None
2	2021-01-01 00:31:06	2021-01-01 00:38:52	5.0	1.7	1.0	N	142	50	1	8.0	0.5	0.5	2.36	0.3	14.16	2.5	None
2	2021-01-01 00:42:11	2021-01-01 00:44:24	5.0	0.81	1.0	N	50	142	2	4.5	0.5	0.5	0.0	0.3	8.3	2.5	None
2	2021-01-01 00:34:37	2021-01-01 00:47:22	4.0	3.15	1.0	N	238	162	1	12.5	0.5	0.5	2.0	0.3	18.3	2.5	None

Clean up#

To stop and remove the container:

! docker container ls

CONTAINER ID   IMAGE           COMMAND                  CREATED         STATUS                   PORTS                    NAMES
cba8365556d3   trinodb/trino   "/usr/lib/trino/bin/…"   2 minutes ago   Up 2 minutes (healthy)   0.0.0.0:8080->8080/tcp   trino

%%capture out
! docker container ls --filter ancestor=trinodb/trino --quiet

container_id = out.stdout.strip()
print(f"Container id: {container_id}")

Container id: cba8365556d3

! docker container stop {container_id}

cba8365556d3

! docker container rm {container_id}

cba8365556d3

! docker container ls

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

Trino

Contents

Trino#

Pre-requisites#

Start Trino instance#

Load sample data#

Query#

Parameterize queries#

CTEs#

Plotting#

Persist#

Clean up#