Redshift#

Important

sqlalchemy-redshift requires SQLAlchemy 1.x (as of version 0.8.14)

This tutorial will show you how to use JupySQL with Redshift, a data warehouse service provided by AWS.

Pre-requisites#

First, let’s install the required packages.

%pip install jupysql sqlalchemy-redshift redshift-connector 'sqlalchemy<2' --quiet
Note: you may need to restart the kernel to use updated packages.

Load JupySQL:

%load_ext sql
Found pyproject.toml from '/Users/eduardo/dev/jupysql'

Connect to Redshift#

Here, we create a connection and pass it to JupySQL:

from os import environ
from sqlalchemy import create_engine
from sqlalchemy.engine import URL

user = environ["REDSHIFT_USERNAME"]
password = environ["REDSHIFT_PASSWORD"]
host = environ["REDSHIFT_HOST"]

url = URL.create(
    drivername="redshift+redshift_connector",
    username=user,
    password=password,
    host=host,
    port=5439,
    database="dev",
)

engine = create_engine(url)
%sql engine --alias redshift-sqlalchemy

Load data#

We’ll load some sample data. First, we create the table:

%%sql
DROP TABLE taxi;

CREATE TABLE taxi  (
    VendorID                BIGINT,
    tpep_pickup_datetime    TIMESTAMP,
    tpep_dropoff_datetime   TIMESTAMP,
    passenger_count         DOUBLE PRECISION,
    trip_distance           DOUBLE PRECISION,
    RatecodeID              DOUBLE PRECISION,
    store_and_fwd_flag      VARCHAR(1),
    PULocationID            BIGINT,
    DOLocationID            BIGINT,
    payment_type            BIGINT,
    fare_amount             DOUBLE PRECISION,
    extra                   DOUBLE PRECISION,
    mta_tax                 DOUBLE PRECISION,
    tip_amount              DOUBLE PRECISION,
    tolls_amount            DOUBLE PRECISION,
    improvement_surcharge   DOUBLE PRECISION,
    total_amount            DOUBLE PRECISION,
    congestion_surcharge    DOUBLE PRECISION,
    airport_fee             DOUBLE PRECISION
);
Running query in 'redshift-sqlalchemy'
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Now, we use COPY to copy a .parquet file stored in an S3 bucket:

%%sql
COPY taxi
FROM 's3:///some-bucket/yellow_tripdata_2023-01.parquet'
IAM_ROLE 'arn:aws:iam::XYZ:role/some-role'
FORMAT AS PARQUET;

Query#

%%sql
SELECT * FROM taxi LIMIT 3
Running query in 'redshift-sqlalchemy'
vendorid tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance ratecodeid store_and_fwd_flag pulocationid dolocationid payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
2 2023-01-01 01:11:31 2023-01-01 01:21:50 1.0 4.59 1.0 N 132 139 2 -20.5 -1.0 -0.5 0.0 0.0 -1.0 -24.25 0.0 -1.25
2 2023-01-01 01:11:31 2023-01-01 01:21:50 1.0 4.59 1.0 N 132 139 2 20.5 1.0 0.5 0.0 0.0 1.0 24.25 0.0 1.25
2 2023-01-01 01:06:46 2023-01-01 01:42:58 5.0 6.8 1.0 N 68 179 1 36.6 1.0 0.5 8.32 0.0 1.0 49.92 2.5 0.0
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Pandas/Polars integration#

Tip

Learn more about the pandas and polars integrations.

You can convert results to pandas and polars data frames

%%sql results <<
SELECT tpep_pickup_datetime, tpep_dropoff_datetime FROM taxi LIMIT 100
Running query in 'redshift-sqlalchemy'
results.DataFrame().head()
tpep_pickup_datetime tpep_dropoff_datetime
0 2023-01-01 01:30:53 2023-01-01 02:03:29
1 2023-01-01 01:28:54 2023-01-01 01:53:11
2 2023-01-01 01:54:52 2023-01-01 02:00:54
3 2023-01-01 01:25:54 2023-01-01 01:35:49
4 2023-01-01 01:54:10 2023-01-01 02:11:43
results.PolarsDataFrame().head()
shape: (5, 2)
tpep_pickup_datetimetpep_dropoff_datetime
datetime[μs]datetime[μs]
2023-01-01 01:30:532023-01-01 02:03:29
2023-01-01 01:28:542023-01-01 01:53:11
2023-01-01 01:54:522023-01-01 02:00:54
2023-01-01 01:25:542023-01-01 01:35:49
2023-01-01 01:54:102023-01-01 02:11:43

List tables#

%sqlcmd tables
Name
taxi

List columns#

%sqlcmd columns --table taxi
name type nullable default autoincrement comment info
vendorid BIGINT True None False None {'encode': 'az64'}
tpep_pickup_datetime TIMESTAMP True None False None {'encode': 'az64'}
tpep_dropoff_datetime TIMESTAMP True None False None {'encode': 'az64'}
passenger_count DOUBLE_PRECISION True None False None {}
trip_distance DOUBLE_PRECISION True None False None {}
ratecodeid DOUBLE_PRECISION True None False None {}
store_and_fwd_flag VARCHAR(1) True None False None {'encode': 'lzo'}
pulocationid BIGINT True None False None {'encode': 'az64'}
dolocationid BIGINT True None False None {'encode': 'az64'}
payment_type BIGINT True None False None {'encode': 'az64'}
fare_amount DOUBLE_PRECISION True None False None {}
extra DOUBLE_PRECISION True None False None {}
mta_tax DOUBLE_PRECISION True None False None {}
tip_amount DOUBLE_PRECISION True None False None {}
tolls_amount DOUBLE_PRECISION True None False None {}
improvement_surcharge DOUBLE_PRECISION True None False None {}
total_amount DOUBLE_PRECISION True None False None {}
congestion_surcharge DOUBLE_PRECISION True None False None {}
airport_fee DOUBLE_PRECISION True None False None {}

Profile a dataset#

%sqlcmd profile --table taxi
Following statistics are not available in redshift_connector: STD, 25%, 50%, 75%
vendorid tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance ratecodeid store_and_fwd_flag pulocationid dolocationid payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge airport_fee
count 3066766 3066766 3066766 2995023 3066766 2995023 2995023 3066766 3066766 3066766 3066766 3066766 3066766 3066766 3066766 3066766 3066766 2995023 2995023
unique 2 1610975 1611319 10 4387 7 2 257 261 5 6873 68 10 4036 776 5 15871 3 3
mean 1.0000 nan nan 1.3625 3.8473 1.4974 nan 166.0000 164.0000 1.0000 18.3671 1.5378 0.4883 3.3679 0.5185 0.9821 27.0204 2.2742 0.1074
min 1 nan nan 0.0 0.0 1.0 nan 1 1 0 -900.0 -7.5 -0.5 -96.22 -65.0 -1.0 -751.0 -2.5 -1.25
max 2 nan nan 9.0 258928.15 99.0 nan 265 265 4 1160.1 12.5 53.16 380.8 196.99 1.0 1169.4 2.5 1.25

Plotting#

Let’s create a histogram for the trip_distance. Since there are outliers, we’ll use the 99th percentile as a cutoff value.

%%sql
SELECT 
    APPROXIMATE PERCENTILE_DISC(0.99) WITHIN GROUP (ORDER BY trip_distance)
FROM 
    taxi;
Running query in 'redshift-sqlalchemy'
percentile_disc
20.0
ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

Let’s create a new snippet by filtering out the outliers using --save:

%%sql --save taxi_no_outliers --no-execute
select * from taxi  where trip_distance < 20
Running query in 'redshift-sqlalchemy'
Skipping execution...

Histogram#

%sqlplot histogram --table taxi_no_outliers --column trip_distance
Plotting using saved snippet : taxi_no_outliers
<Axes: title={'center': "'trip_distance' from 'taxi_no_outliers'"}, xlabel='trip_distance', ylabel='Count'>
../_images/aa83f9771011b1a19f34d93472a141cb4d52b6fd41ae5213d9618fda42a906b6.png

Boxplot#

%sqlplot boxplot --table taxi_no_outliers --column trip_distance
Plotting using saved snippet : taxi_no_outliers
<Axes: title={'center': "'trip_distance' from 'taxi_no_outliers'"}, ylabel='trip_distance'>
../_images/3c16b361a3a2fb1bbf11e779920f0ff84ecbdac0eb0f9182fc49ca9362f49390.png

Bar#

%sqlplot bar --table taxi --column passenger_count
Removing NULLs, if there exists any from passenger_count
<Axes: title={'center': 'taxi'}, xlabel='passenger_count', ylabel='Count'>
../_images/fd9c710b410dd5e6b682195facfca9892baa526e1d78002b2133997e84eafa32.png

Plotting using the ggplot API#

You can also use the ggplot API to create visualizations:

from sql.ggplot import ggplot, aes, geom_histogram
(
    ggplot("taxi_no_outliers", aes(x="trip_distance"), with_="taxi_no_outliers")
    + geom_histogram(bins=30, fill="vendorid")
)
<sql.ggplot.ggplot.ggplot at 0x15e0bae30>
../_images/df9765dde2eee5c85fd4ca5496cb14f4e930abbe9318f6f47255b19321d17256.png

Using a native connection#

Using a native connection is also supported.

%pip install redshift-connector --quiet
Note: you may need to restart the kernel to use updated packages.
import redshift_connector

conn = redshift_connector.connect(
    host=host,
    database="dev",
    port=5439,
    user=user,
    password=password,
    timeout=60,
)
%sql conn --alias redshift-native