Microsoft SQL Server

Microsoft SQL Server#

In this tutorial, we’ll see how to query Microsoft SQL Server from Jupyter. Optionally, you can spin up a testing server.

Tip

If you encounter issues, feel free to join our community and we’ll be happy to help!

Pre-requisites#

The first step is to install the ODBC driver for SQL Server.

Instructions for Linux
Instructions for Mac

For example, if you’re on a Mac, you can install the driver with brew:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release
brew update
HOMEBREW_ACCEPT_EULA=Y brew install msodbcsql18 mssql-tools18

Start Microsoft SQL Server instance#

If you don’t have a SQL Server running or you want to spin up one for testing, you can do it with the official Docker image.

Important

If you’re on a Mac with Apple Silicon (e.g., M1 processor), ensure you’re running the latest Docker Desktop version. More info here.

To start the server:

%%bash
docker run -e "ACCEPT_EULA=Y" \
    -e "MSSQL_SA_PASSWORD=MyPassword!" \
    -p 1433:1433 \
    -d mcr.microsoft.com/mssql/server:2022-latest

WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested

00721df70ea8d5f91c792a84f28f4e0fc6c0ff53f1f4d04cb6911a3a4714deba

Important

Ensure you set a strong password, otherwise the container will shut down silently!

Ensure that your container is running (run the command a few seconds after running the previous one to ensure it dind’t shut down silently):

%%bash
docker ps

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

If you have issues with the previous command, you can try with SQL Edge:

%%bash
docker run -e "ACCEPT_EULA=1" -e "MSSQL_SA_PASSWORD=MyPassword!" \
    -e "MSSQL_PID=Developer" -e "MSSQL_USER=sa" \
    -p 1433:1433 -d --name=sql mcr.microsoft.com/azure-sql-edge

fabfc30490a17dc0a48313c35289218ff563070b11622bc43f07e82080b2a201

Ensure the server is running (wait for a few seconds before running it):

%%bash
docker ps

CONTAINER ID   IMAGE                              COMMAND                  CREATED         STATUS         PORTS                              NAMES
fabfc30490a1   mcr.microsoft.com/azure-sql-edge   "/opt/mssql/bin/perm…"   5 seconds ago   Up 4 seconds   1401/tcp, 0.0.0.0:1433->1433/tcp   sql

Installing `pyodbc`#

pyodbc will allow us to connect to SQL Server. If you’re on macOS or Linux, you need to install unixODBC. Note that when installing the ODBC driver on macOS using brew, unixODBC is also installed.

Install pyodbc with:

pip install pyodbc

Note

If you’re on a Mac with Apple Silicon (e.g., M1 processor), you might encounter issues, if so, try thi:

pip install pyodbc==4.0.34

Verify a successful installation with:

import pyodbc

Verify that pyodbc is able to findn the SQL Server driver:

pyodbc.drivers()

['ODBC Driver 18 for SQL Server']

Tip

If the driver doesn’t appear, uninstalling pyodbc and re-installing it again might fix the problem.

If you’re on a Mac with Apple Silicon, ensure you installed pyodbc with pip, since conda might lead to issues.

Starting the connection#

To start the connection, execute the following, change the values to match your SQL Server’s configurationo:

from sqlalchemy import create_engine
from sqlalchemy.engine import URL

connection_url = URL.create(
    "mssql+pyodbc",
    username="sa",
    password="MyPassword!",
    host="localhost",
    port=1433,
    database="master",
    query={
        "driver": "ODBC Driver 18 for SQL Server",
        "Encrypt": "yes",
        "TrustServerCertificate": "yes",
    },
)
engine = create_engine(connection_url)

Note

If using pytds, the autocommit feature is disabled since it’s not compatible with JupySQL.

Install, load the Jupyter extension and start the connection:

%pip install jupysql --quiet

Note: you may need to restart the kernel to use updated packages.

%load_ext sql
%sql engine

Found pyproject.toml from '/Users/eduardo/dev/jupysql'

Note

If you see the following error:

InterfaceError: (pyodbc.InterfaceError) ('IM002', '[IM002] [unixODBC][Driver Manager]Data source name not found and no default driver specified (0) (SQLDriverConnect)')
(Background on this error at: https://sqlalche.me/e/14/rvf5)

It might be that you’re missing the SQL Server ODBC driver or that pyodbc cannot find it.

Load sample data#

Let’s upload some sample data:

import pandas as pd

df = pd.read_parquet(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"
)
df.shape

(1369769, 19)

df.sample(100_000).to_sql(
    name="taxi", con=engine, chunksize=100_000, if_exists="replace"
)

Query#

Query the new table:

%%sql
select COUNT(*) FROM taxi

Running query in 'mssql+pyodbc://sa:***@localhost:1433/master?Encrypt=yes&TrustServerCertificate=yes&driver=ODBC+Driver+18+for+SQL+Server'


100000

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

List the tables in the database:

%sqlcmd tables

Name
MSreplication_options
spt_fallback_db
spt_fallback_dev
spt_fallback_usg
spt_monitor
taxi

List columns in the taxi table:

%sqlcmd columns --table taxi

name	type	nullable	default	autoincrement	comment
index	BIGINT	True	None	False	None
VendorID	BIGINT	True	None	False	None
tpep_pickup_datetime	DATETIME	True	None	False	None
tpep_dropoff_datetime	DATETIME	True	None	False	None
passenger_count	FLOAT	True	None	False	None
trip_distance	FLOAT	True	None	False	None
RatecodeID	FLOAT	True	None	False	None
store_and_fwd_flag	VARCHAR COLLATE "SQL_Latin1_General_CP1_CI_AS"	True	None	False	None
PULocationID	BIGINT	True	None	False	None
DOLocationID	BIGINT	True	None	False	None
payment_type	BIGINT	True	None	False	None
fare_amount	FLOAT	True	None	False	None
extra	FLOAT	True	None	False	None
mta_tax	FLOAT	True	None	False	None
tip_amount	FLOAT	True	None	False	None
tolls_amount	FLOAT	True	None	False	None
improvement_surcharge	FLOAT	True	None	False	None
total_amount	FLOAT	True	None	False	None
congestion_surcharge	FLOAT	True	None	False	None
airport_fee	FLOAT	True	None	False	None

Parametrize queries#

threshold = 10

%%sql
SELECT COUNT(*) FROM taxi
WHERE trip_distance < {{threshold}}

Running query in 'mssql+pyodbc://sa:***@localhost:1433/master?Encrypt=yes&TrustServerCertificate=yes&driver=ODBC+Driver+18+for+SQL+Server'


94705

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

threshold = 0.5

%%sql
SELECT COUNT(*) FROM taxi
WHERE trip_distance < {{threshold}}

Running query in 'mssql+pyodbc://sa:***@localhost:1433/master?Encrypt=yes&TrustServerCertificate=yes&driver=ODBC+Driver+18+for+SQL+Server'


5326

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

CTEs#

You can break down queries into multiple cells, JupySQL will build a CTE for you:

%%sql --save many_passengers --no-execute
SELECT *
FROM taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93

Running query in 'mssql+pyodbc://sa:***@localhost:1433/master?Encrypt=yes&TrustServerCertificate=yes&driver=ODBC+Driver+18+for+SQL+Server'

Skipping execution...

%%sql --save trip_stats --with many_passengers
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers

Running query in 'mssql+pyodbc://sa:***@localhost:1433/master?Encrypt=yes&TrustServerCertificate=yes&driver=ODBC+Driver+18+for+SQL+Server'

	_1	_2
0.0	2.5377720207253898	18.83

ResultSet : to convert to pandas, call .DataFrame() or to polars, call .PolarsDataFrame()

This is what JupySQL executes:

query = %sqlcmd snippets trip_stats
print(query)

WITH many_passengers AS (
SELECT *
FROM taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93)
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers

Plotting#

Boxplot#

%sqlplot boxplot --table many_passengers --column trip_distance

Plotting using saved snippet : many_passengers

<Axes: title={'center': "'trip_distance' from 'many_passengers'"}, ylabel='trip_distance'>

../_images/5e7936eb890c173681c92ae58d07dec87872def0c344efe05d835be47fbde5e7.png

Bar#

%sqlplot bar --table taxi --column vendorid

Removing NULLs, if there exists any from vendorid

<Axes: title={'center': 'taxi'}, xlabel='vendorid', ylabel='Count'>

../_images/571a54a6f7e04bfcbb0649ab808a0771b88f488845bbf853e8018be6b9a25f9c.png

Pie#

%sqlplot pie --table taxi --column vendorid

Removing NULLs, if there exists any from vendorid

<Axes: title={'center': 'taxi'}>

../_images/596d55e550d63f9ece0e3b1c0f1c3cde20ece7d8e0b8a5830130b1aebc428393.png

Clean up#

%%bash
docker container ls

CONTAINER ID   IMAGE                              COMMAND                  CREATED         STATUS         PORTS                              NAMES
fabfc30490a1   mcr.microsoft.com/azure-sql-edge   "/opt/mssql/bin/perm…"   7 minutes ago   Up 7 minutes   1401/tcp, 0.0.0.0:1433->1433/tcp   sql

%%bash
docker container stop sql

sql

%%bash
docker container rm sql

sql

%%bash
docker container ls

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

Microsoft SQL Server

Contents

Microsoft SQL Server#

Pre-requisites#

Start Microsoft SQL Server instance#

Installing pyodbc#

Starting the connection#

Load sample data#

Query#

Parametrize queries#

CTEs#

Plotting#

Boxplot#

Bar#

Pie#

Clean up#

Installing `pyodbc`#