Trino#

This tutorial will show you how to get a Trino (f.k.a PrestoSQL) instance up and running locally to test JupySQL. You can run this in a Jupyter notebook.

Pre-requisites#

To run this tutorial, you need to install following Python packages:

%pip install jupysql trino pandas pyarrow --quiet
Note: you may need to restart the kernel to use updated packages.

You also need a Trino connector. Here is the supported connector. You can install it with:

%pip install sqlalchemy-trino --quiet
Note: you may need to restart the kernel to use updated packages.

Start Trino instance#

We fetch the official image, create a new database, and user (this will take a few seconds).

%%bash
docker run -p 8080:8080 --name trino -d trinodb/trino
f9e0890fa2dea4de751347f4c6a4194dbbbdc7865535caeea0ae27e1f20245e0

Our database is running, let’s load some data!

Load sample data#

Now, let’s fetch some sample data. We’ll be using the NYC taxi dataset:

import pandas as pd

df = pd.read_parquet(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"
)
df.shape
(1369769, 19)

Trino has maximum query text length of 1000000. Therefore, writing the whole NYC taxi dataset (~1.4M rows) will throw errors. A workaround is to increase the http-server.max-request-size configuration parameter to Trino’s maximum allowed characters of 1,000,000,000 in the Trino server configuration file (config.properties). We’ll write a subset of the data instead:

df = df.head(1000)

Trino uses a schema named “default” to store tables. Therefore, schema='default' is required in the connection string.

from sqlalchemy import create_engine

engine = create_engine(
    "trino://user@localhost:8080/memory", connect_args={"user": "user"}
)

df.to_sql(
    con=engine,
    name="taxi",
    schema="default",
    method="multi",
    index=False,
)

engine.dispose()
/var/folders/r6/q0qh39ts70582wth4g1dhjhm0000gn/T/ipykernel_48433/191114213.py:3: SADeprecationWarning: The dbapi() classmethod on dialect classes has been renamed to import_dbapi().  Implement an import_dbapi() classmethod directly on class <class 'trino.sqlalchemy.dialect.TrinoDialect'> to remove this warning; the old .dbapi() classmethod may be maintained for backwards compatibility.
  engine = create_engine("trino://user@localhost:8080/memory", connect_args={"user": "user"})

Query#

Now, let’s start JupySQL, authenticate, and query the data!

%load_ext sql
%sql trino://user@localhost:8080/memory

Important

If the cell above fails, you might have some missing packages. Message us on Slack and we’ll help you!

List the tables in the database:

%sqlcmd tables --schema default
Name
taxi

List columns in the taxi table:

%sqlcmd columns --table taxi --schema default
name type nullable default
vendorid BIGINT True None
tpep_pickup_datetime TIMESTAMP True None
tpep_dropoff_datetime TIMESTAMP True None
passenger_count DOUBLE True None
trip_distance DOUBLE True None
ratecodeid DOUBLE True None
store_and_fwd_flag VARCHAR True None
pulocationid BIGINT True None
dolocationid BIGINT True None
payment_type BIGINT True None
fare_amount DOUBLE True None
extra DOUBLE True None
mta_tax DOUBLE True None
tip_amount DOUBLE True None
tolls_amount DOUBLE True None
improvement_surcharge DOUBLE True None
total_amount DOUBLE True None
congestion_surcharge DOUBLE True None
airport_fee DOUBLE True None

Query our data:

%%sql
SELECT COUNT(*) FROM default.taxi
*  trino://user@localhost:8080/memory
Done.
_col0
1000

Parameterize queries#

threshold = 10
%%sql
SELECT COUNT(*) FROM default.taxi
WHERE trip_distance < {{threshold}}
*  trino://user@localhost:8080/memory
Done.
_col0
949
threshold = 0.5
%%sql
SELECT COUNT(*) FROM default.taxi
WHERE trip_distance < {{threshold}}
*  trino://user@localhost:8080/memory
Done.
_col0
64

CTEs#

%%sql --save many_passengers --no-execute
SELECT *
FROM default.taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93
*  trino://user@localhost:8080/memory
Skipping execution...
%%sql --save trip_stats --with many_passengers
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers
*  trino://user@localhost:8080/memory
Done.
_col0 _col1 _col2
0.25 3.16470588235294 11.15

This is what JupySQL executes:

query = %sqlcmd snippets trip_stats
print(query)
WITH many_passengers AS (
SELECT *
FROM default.taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93)
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers

Plotting#

The %sqlplot magic command currently does not directly support the --schema option for specifying the schema name. To work around this, you can specify the schema in the SQL query itself.

result = %sql SELECT trip_distance FROM default.taxi

import matplotlib.pyplot as plt

data = result.DataFrame()

plt.hist(data["trip_distance"])
plt.xlabel("Trip Distance")
plt.ylabel("Frequency")
plt.title("Histogram of Trip Distance")
plt.show()
*  trino://user@localhost:8080/memory
Done.
../_images/6c3eb1a3910970f85731c30d5d303a87b7aaf81a5b7fb1e4ae897530b8a3c06b.png
result = %sql SELECT trip_distance FROM default.taxi

import matplotlib.pyplot as plt

data = result.DataFrame()

plt.boxplot(data["trip_distance"])
plt.xlabel("Trip Distance")
plt.ylabel("Value")
plt.title("Boxplot of Trip Distance")
plt.show()
*  trino://user@localhost:8080/memory
Done.
../_images/92da41001b35ad0a628d14384d5965e02d44da0fdbd729767b11f9c130ead38d.png

Clean up#

To stop and remove the container:

! docker container ls
CONTAINER ID   IMAGE           COMMAND                  CREATED          STATUS                    PORTS                    NAMES
819a48a82c7d   trinodb/trino   "/usr/lib/trino/bin/…"   33 seconds ago   Up 33 seconds (healthy)   0.0.0.0:8080->8080/tcp   trino
%%capture out
! docker container ls --filter ancestor=trinodb/trino --quiet
container_id = out.stdout.strip()
print(f"Container id: {container_id}")
Container id: 819a48a82c7d
! docker container stop {container_id}
819a48a82c7d
! docker container rm {container_id}
819a48a82c7d
! docker container ls
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES