PostgreSQL#
This tutorial will show you how to get a PostgreSQL instance up and running locally to test JupySQL. You can run this in a Jupyter notebook.
Pre-requisites#
To run this tutorial, you need to install following Python packages:
%pip install jupysql pandas pyarrow --quiet
Note: you may need to restart the kernel to use updated packages.
You also need a PostgreSQL connector. Here’s a list of supported connectors. We recommend using psycopg2
. The easiest way to install it is via:
%pip install psycopg2-binary --quiet
Note: you may need to restart the kernel to use updated packages.
Tip
If you have issues, check out our installation guide or message us on Slack.
You also need Docker installed and running to start the PostgreSQL instance.
Start PostgreSQL instance#
We fetch the official image, create a new database, and user (this will take 1-2 minutes):
%%bash
docker run --name postgres -e POSTGRES_DB=db \
-e POSTGRES_USER=user \
-e POSTGRES_PASSWORD=password \
-p 5432:5432 -d postgres
94fa1f440c4c8e632b59fc630dd513c4d653c95c964fd4deddf3430db1223c1b
Our database is running, let’s load some data!
Load sample data#
Now, let’s fetch some sample data. We’ll be using the NYC taxi dataset:
import pandas as pd
df = pd.read_parquet(
"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"
)
df.shape
(1369769, 19)
As you can see, this chunk of data contains ~1.4M rows, loading the data will take about a minute:
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:password@localhost/db")
df.to_sql(name="taxi", con=engine, chunksize=100_000)
engine.dispose()
Query#
Now, let’s start JuppySQL, authenticate and start querying the data!
%load_ext sql
%sql postgresql://user:password@localhost/db
Important
If the cell above fails, you might have some missing packages. Message us on Slack and we’ll help you!
List the tables in the database:
%sqlcmd tables
Name |
---|
taxi |
List columns in the taxi table:
%sqlcmd columns --table taxi
name | type | nullable | default | autoincrement | comment |
---|---|---|---|---|---|
index | BIGINT | True | None | False | None |
VendorID | BIGINT | True | None | False | None |
tpep_pickup_datetime | TIMESTAMP | True | None | False | None |
tpep_dropoff_datetime | TIMESTAMP | True | None | False | None |
passenger_count | DOUBLE_PRECISION | True | None | False | None |
trip_distance | DOUBLE_PRECISION | True | None | False | None |
RatecodeID | DOUBLE_PRECISION | True | None | False | None |
store_and_fwd_flag | TEXT | True | None | False | None |
PULocationID | BIGINT | True | None | False | None |
DOLocationID | BIGINT | True | None | False | None |
payment_type | BIGINT | True | None | False | None |
fare_amount | DOUBLE_PRECISION | True | None | False | None |
extra | DOUBLE_PRECISION | True | None | False | None |
mta_tax | DOUBLE_PRECISION | True | None | False | None |
tip_amount | DOUBLE_PRECISION | True | None | False | None |
tolls_amount | DOUBLE_PRECISION | True | None | False | None |
improvement_surcharge | DOUBLE_PRECISION | True | None | False | None |
total_amount | DOUBLE_PRECISION | True | None | False | None |
congestion_surcharge | DOUBLE_PRECISION | True | None | False | None |
airport_fee | DOUBLE_PRECISION | True | None | False | None |
Query our data:
%%sql
SELECT COUNT(*) FROM taxi
* postgresql://user:***@localhost/db
1 rows affected.
count |
---|
1369769 |
Parametrize queries#
threshold = 10
%%sql
SELECT COUNT(*) FROM taxi
WHERE trip_distance < {{threshold}}
* postgresql://user:***@localhost/db
1 rows affected.
count |
---|
1297415 |
threshold = 0.5
%%sql
SELECT COUNT(*) FROM taxi
WHERE trip_distance < {{threshold}}
* postgresql://user:***@localhost/db
1 rows affected.
count |
---|
73849 |
CTEs#
%%sql --save many_passengers --no-execute
SELECT *
FROM taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93
* postgresql://user:***@localhost/db
Skipping execution...
%%sql --save trip_stats --with many_passengers
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers
* postgresql://user:***@localhost/db
1 rows affected.
min | avg | max |
---|---|---|
0.0 | 2.5010889812889756 | 18.92 |
This is what JupySQL executes:
query = %sqlrender trip_stats
print(query)
WITH "many_passengers" AS (
SELECT *
FROM taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93)
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers
Plotting#
%sqlplot histogram --table taxi --column trip_distance
<Axes: title={'center': "'trip_distance' from 'taxi'"}, xlabel='trip_distance', ylabel='Count'>

%sqlplot boxplot --table taxi --column trip_distance
<Axes: title={'center': "'trip_distance' from 'taxi'"}, ylabel='trip_distance'>

Executing Meta-Commands#
JupySQL supports psql
-style “backslash” meta-commands (\d
, \dt
, etc.). To run these, PGSpecial must be installed— information on how to do so can be found here. Example:
%sql \dt
* postgresql://user:***@localhost/db
1 rows affected.
Schema | Name | Type | Owner |
---|---|---|---|
public | taxi | table | user |
Clean up#
To stop and remove the container:
! docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4a6478b19d45 mariadb:latest "docker-entrypoint.s…" 21 minutes ago Up 21 minutes 0.0.0.0:3306->3306/tcp mariadb
%%capture out
! docker container ls --filter ancestor=postgres --quiet
container_id = out.stdout.strip()
print(f"Container id: {container_id}")
Container id:
! docker container stop {container_id}
"docker container stop" requires at least 1 argument.
See 'docker container stop --help'.
Usage: docker container stop [OPTIONS] CONTAINER [CONTAINER...]
Stop one or more running containers
! docker container rm {container_id}
"docker container rm" requires at least 1 argument.
See 'docker container rm --help'.
Usage: docker container rm [OPTIONS] CONTAINER [CONTAINER...]
Remove one or more containers
! docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4a6478b19d45 mariadb:latest "docker-entrypoint.s…" 21 minutes ago Up 21 minutes 0.0.0.0:3306->3306/tcp mariadb