Snowflake#

Important

snowflake-sqlalchemy requires SQLAlchemy 1.x (as of snowflake-sqlalchemy 1.4.7)

Snowflake is a cloud-based data warehousing platform that provides organizations with a powerful and flexible solution for storing, managing, and analyzing large amounts of data. Unlike traditional data warehouses, Snowflake operates entirely in the cloud, utilizing a distributed architecture that allows it to process and store data across multiple computing resources.

In this guide, we’ll demonstrate how to integrate with Snowflake using JupySQL magics.

Tip

If you encounter any issues, feel free to join our community and we’ll be happy to help!

Pre-requisites#

We will need the snowflake-sqlalchemy package for connecting to the warehouse.

%pip install --upgrade snowflake-sqlalchemy 'sqlalchemy<2' --quiet
Note: you may need to restart the kernel to use updated packages.
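
Since snowflake-sqlalchemy requires SQLAlchemy 1.x, it can be worth confirming the installed version before proceeding; a quick check:

import sqlalchemy

print(sqlalchemy.__version__)  # expect a 1.x release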

Now let’s define the URL connection parameters and create an Engine object.

from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL


# replace each placeholder below with your actual Snowflake credentials
engine = create_engine(
    URL(
        user="user",
        password="password",
        account="account",
        database="database",
        role="role",
        schema="schema",
        warehouse="warehouse",
    )
)
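
Alternatively, you can pass Snowflake's documented snowflake:// URL format directly to create_engine. A minimal sketch using the same placeholder values as above:

from sqlalchemy import create_engine

# same placeholder credentials as the URL() example above;
# URL-encode any special characters in the password
engine = create_engine(
    "snowflake://user:password@account/database/schema"
    "?warehouse=warehouse&role=role"
)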

Load sample data#

Now, let’s load the penguins dataset. We’ll download the .csv file, read it into a dataframe, and create a table in the Snowflake database from the data.

import urllib.request

urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv",
    "penguins.csv",
)
('penguins.csv', <http.client.HTTPMessage at 0x12ea1f100>)
%load_ext sql
%sql engine --alias connection
import pandas as pd

df = pd.read_csv("penguins.csv")
connection = engine.connect()
df.to_sql(name="penguins", con=connection, index=False, if_exists="replace")
344

Query#

List the tables in the database:

%sqlcmd tables
Name
penguins

List columns in the penguins table:

%sqlcmd columns --table penguins
name               type               nullable  default  autoincrement  comment  primary_key
species            VARCHAR(16777216)  True      None     False          None     False
island             VARCHAR(16777216)  True      None     False          None     False
bill_length_mm     FLOAT              True      None     False          None     False
bill_depth_mm      FLOAT              True      None     False          None     False
flipper_length_mm  FLOAT              True      None     False          None     False
body_mass_g        FLOAT              True      None     False          None     False
sex                VARCHAR(16777216)  True      None     False          None     False
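
JupySQL can also compute summary statistics for a table via %sqlcmd profile; for example (output omitted here):

%sqlcmd profile --table penguins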

Query our data:

%%sql
SELECT COUNT(*) FROM penguins 
Running query in 'connection'
1 rows affected.
COUNT(*)
344
%%sql
SELECT species, COUNT(*) AS count
FROM penguins
GROUP BY species
ORDER BY count DESC
Running query in 'connection'
3 rows affected.
species count
Adelie 152
Gentoo 124
Chinstrap 68

Parametrize queries#

JupySQL supports variable expansion in this format: {{variable}}.

dynamic_limit = 5
dynamic_column = "island, sex"
%sql SELECT {{dynamic_column}} FROM penguins LIMIT {{dynamic_limit}}
Running query in 'connection'
5 rows affected.
island sex
Torgersen MALE
Torgersen FEMALE
Torgersen FEMALE
Torgersen None
Torgersen FEMALE
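
Variables can parametrize any part of the query. A hypothetical example that filters on a species name stored in a Python variable (note the quotes, since the value is expanded as-is):

species = "Adelie"
%sql SELECT COUNT(*) FROM penguins WHERE species = '{{species}}'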

CTEs#

Using JupySQL, we can save query snippets and use them to compose larger queries. Let’s see CTEs in action:

%%sql --save no_nulls --no-execute
SELECT *
FROM penguins
WHERE body_mass_g IS NOT NULL and
sex IS NOT NULL
Running query in 'connection'
Skipping execution...
%%sql
SELECT island, avg(body_mass_g) as avg_body_mass_g
FROM no_nulls
GROUP BY island;
Generating CTE with stored snippets : no_nulls
Running query in 'connection'
3 rows affected.
island avg_body_mass_g
Torgersen 3708.5106382978724
Biscoe 4719.171779141105
Dream 3718.9024390243903

We can retrieve the stored snippet like so:

final = %sqlcmd snippets no_nulls
print(final)
SELECT *
FROM penguins
WHERE body_mass_g IS NOT NULL and
sex IS NOT NULL
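
Snippets can also build on one another. A hypothetical sketch that saves a second snippet on top of no_nulls; querying it would compile both snippets into nested CTEs:

%%sql --save adelie_only --no-execute
SELECT *
FROM no_nulls
WHERE species = 'Adelie'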

Plotting#

Let’s plot the distribution of bill length with %sqlplot:

%sqlplot histogram --table penguins --column bill_length_mm
<Axes: title={'center': "'bill_length_mm' from 'penguins'"}, xlabel='bill_length_mm', ylabel='Count'>
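
%sqlplot supports other plot types as well; for example, a boxplot of body mass:

%sqlplot boxplot --table penguins --column body_mass_g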

Clean up#

To ensure that the Python connector closes the session properly, execute connection.close() before engine.dispose(). This prevents the garbage collector from removing the resources required to communicate with Snowflake.

connection.close()  # close the Snowflake session first
engine.dispose()  # then release the engine's connection pool