PostgreSQL#

This tutorial will show you how to get a PostgreSQL instance up and running locally to test JupySQL. You can run this in a Jupyter notebook.

Pre-requisites#

To run this tutorial, you need to install following Python packages:

%pip install jupysql pandas pyarrow --quiet
Note: you may need to restart the kernel to use updated packages.

You also need a PostgreSQL connector. Here’s a list of supported connectors. We recommend using psycopg2. The easiest way to install it is via:

%pip install psycopg2-binary --quiet
Note: you may need to restart the kernel to use updated packages.

Tip

If you have issues, check out our installation guide or message us on Slack.

You also need Docker installed and running to start the PostgreSQL instance.

Start PostgreSQL instance#

We fetch the official image, create a new database, and user (this will take 1-2 minutes):

%%bash
docker run --name postgres -e POSTGRES_DB=db \
  -e POSTGRES_USER=user \
  -e POSTGRES_PASSWORD=password \
  -p 5432:5432 -d postgres
94fa1f440c4c8e632b59fc630dd513c4d653c95c964fd4deddf3430db1223c1b

Our database is running, let’s load some data!

Load sample data#

Now, let’s fetch some sample data. We’ll be using the NYC taxi dataset:

import pandas as pd

df = pd.read_parquet(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"
)
df.shape
(1369769, 19)

As you can see, this chunk of data contains ~1.4M rows, loading the data will take about a minute:

from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/db")
df.to_sql(name="taxi", con=engine, chunksize=100_000)
engine.dispose()

Query#

Now, let’s start JupySQL, authenticate and start querying the data!

%load_ext sql
%sql postgresql://user:password@localhost/db

Important

If the cell above fails, you might have some missing packages. Message us on Slack and we’ll help you!

List the tables in the database:

%sqlcmd tables
Name
taxi

List columns in the taxi table:

%sqlcmd columns --table taxi
name type nullable default autoincrement comment
index BIGINT True None False None
VendorID BIGINT True None False None
tpep_pickup_datetime TIMESTAMP True None False None
tpep_dropoff_datetime TIMESTAMP True None False None
passenger_count DOUBLE_PRECISION True None False None
trip_distance DOUBLE_PRECISION True None False None
RatecodeID DOUBLE_PRECISION True None False None
store_and_fwd_flag TEXT True None False None
PULocationID BIGINT True None False None
DOLocationID BIGINT True None False None
payment_type BIGINT True None False None
fare_amount DOUBLE_PRECISION True None False None
extra DOUBLE_PRECISION True None False None
mta_tax DOUBLE_PRECISION True None False None
tip_amount DOUBLE_PRECISION True None False None
tolls_amount DOUBLE_PRECISION True None False None
improvement_surcharge DOUBLE_PRECISION True None False None
total_amount DOUBLE_PRECISION True None False None
congestion_surcharge DOUBLE_PRECISION True None False None
airport_fee DOUBLE_PRECISION True None False None

Query our data:

%%sql
SELECT COUNT(*) FROM taxi
*  postgresql://user:***@localhost/db
1 rows affected.
count
1369769

Parametrize queries#

threshold = 10
%%sql
SELECT COUNT(*) FROM taxi
WHERE trip_distance < {{threshold}}
*  postgresql://user:***@localhost/db
1 rows affected.
count
1297415
threshold = 0.5
%%sql
SELECT COUNT(*) FROM taxi
WHERE trip_distance < {{threshold}}
*  postgresql://user:***@localhost/db
1 rows affected.
count
73849

CTEs#

%%sql --save many_passengers --no-execute
SELECT *
FROM taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93
*  postgresql://user:***@localhost/db
Skipping execution...
%%sql --save trip_stats --with many_passengers
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers
*  postgresql://user:***@localhost/db
1 rows affected.
min avg max
0.0 2.5010889812889756 18.92

This is what JupySQL executes:

query = %sqlcmd snippets trip_stats
print(query)
WITH "many_passengers" AS (
SELECT *
FROM taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93)
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers

Plotting#

%sqlplot histogram --table taxi --column trip_distance
<Axes: title={'center': "'trip_distance' from 'taxi'"}, xlabel='trip_distance', ylabel='Count'>
../_images/36145f060a6c90a5c3cc368814eed707d15f6256db09f3f44eb8f5cc3bb8a22f.png
%sqlplot boxplot --table taxi --column trip_distance
<Axes: title={'center': "'trip_distance' from 'taxi'"}, ylabel='trip_distance'>
../_images/1b4c104178fdb39306fae963507ca4039848c1bfed7107ea731839696d235741.png

Executing Meta-Commands#

JupySQL supports psql-style “backslash” meta-commands (\d, \dt, etc.). To run these, PGSpecial must be installed— information on how to do so can be found here. Example:

%sql \dt
*  postgresql://user:***@localhost/db
1 rows affected.
Schema Name Type Owner
public taxi table user

Clean up#

To stop and remove the container:

! docker container ls
CONTAINER ID   IMAGE            COMMAND                  CREATED          STATUS          PORTS                    NAMES
4a6478b19d45   mariadb:latest   "docker-entrypoint.s…"   21 minutes ago   Up 21 minutes   0.0.0.0:3306->3306/tcp   mariadb
%%capture out
! docker container ls --filter ancestor=postgres --quiet
container_id = out.stdout.strip()
print(f"Container id: {container_id}")
Container id: 
! docker container stop {container_id}
"docker container stop" requires at least 1 argument.
See 'docker container stop --help'.

Usage:  docker container stop [OPTIONS] CONTAINER [CONTAINER...]

Stop one or more running containers
! docker container rm {container_id}
"docker container rm" requires at least 1 argument.
See 'docker container rm --help'.

Usage:  docker container rm [OPTIONS] CONTAINER [CONTAINER...]

Remove one or more containers
! docker container ls
CONTAINER ID   IMAGE            COMMAND                  CREATED          STATUS          PORTS                    NAMES
4a6478b19d45   mariadb:latest   "docker-entrypoint.s…"   21 minutes ago   Up 21 minutes   0.0.0.0:3306->3306/tcp   mariadb