MySQL#
In this tutorial, we’ll see how to query MySQL from Jupyter. Optionally, you can spin up a testing server.
Tip
If you encounter issues, feel free to join our community and we’ll be happy to help!
Pre-requisites#
To run this tutorial, you need to install the mysqlclient
package.
Note
We highly recommend you that you install it using conda
, since it’ll also install mysql-connector-c
; if you want to use pip
, then you need to install mysql-connector-c
and then mysqlclient
.
%conda install mysqlclient -c conda-forge --quiet
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
# All requested packages already installed.
Note: you may need to restart the kernel to use updated packages.
Start MySQL instance#
If you don’t have a MySQL Server running or you want to spin up one for testing, you can do it with the official Docker image.
To start the server:
%%bash
docker run --name mysql -e MYSQL_DATABASE=db \
-e MYSQL_USER=user \
-e MYSQL_PASSWORD=password \
-e MYSQL_ROOT_PASSWORD=password \
-p 3306:3306 -d mysql
b9f7f973698a0063198a7e6358445e942de4905d18b99145a7dfc8bb947bfa97
Ensure that the container is running:
%%bash
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b9f7f973698a mysql "docker-entrypoint.s…" 1 second ago Up Less than a second 0.0.0.0:3306->3306/tcp, 33060/tcp mysql
We need to make a small configuration change, so do the following:
Open a new terminal and execute: docker exec -it mysql bash
Then: mysql --user=root --password
When prompted for a password, type: password
Once the MySQL console appears, execute:
ALTER USER user
IDENTIFIED WITH mysql_native_password
BY 'password';
Exit the MySQL console with: exit
Exit the container with: exit
The session should look like this:
docker exec -it mysql bash
bash-4.4# mysql --user=root --password
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 8.0.31 MySQL Community Server - GPL
Copyright (c) 2000, 2022, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> ALTER USER user
-> IDENTIFIED WITH mysql_native_password
-> BY 'password';
Query OK, 0 rows affected (0.01 sec)
mysql> exit
Bye
bash-4.4# exit
exit
Load sample data#
Now, let’s fetch some sample data. We’ll be using the NYC taxi dataset:
%pip install pandas pyarrow --quiet
Note: you may need to restart the kernel to use updated packages.
import pandas as pd
df = pd.read_parquet(
"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"
)
df.shape
(1369769, 19)
As you can see, this chunk of data contains ~1.4M rows, loading the data will take about a minute:
from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://user:password@127.0.0.1:3306/db")
df.to_sql(name="taxi", con=engine, chunksize=100_000)
engine.dispose()
Query#
Note
mysql
and mysql+pymysql
connections (and perhaps others) don’t read your client character set information from .my.cnf.
You need to specify it in the connection string:
mysql+pymysql://scott:tiger@localhost/foo?charset=utf8
Now, let’s install JupySQL, authenticate and start querying the data!
%pip install jupysql --quiet
%load_ext sql
%sql mysql+mysqldb://user:password@127.0.0.1:3306/db
Note: you may need to restart the kernel to use updated packages.
Important
If the cell above fails, you might have some missing packages. Message us on Slack and we’ll help you!
List the tables in the database:
%sqlcmd tables
Name |
---|
taxi |
List columns in the taxi table:
%sqlcmd columns --table taxi
name | type | default | comment | nullable | autoincrement |
---|---|---|---|---|---|
index | BIGINT | None | None | True | False |
VendorID | BIGINT | None | None | True | False |
tpep_pickup_datetime | DATETIME | None | None | True | |
tpep_dropoff_datetime | DATETIME | None | None | True | |
passenger_count | DOUBLE | None | None | True | |
trip_distance | DOUBLE | None | None | True | |
RatecodeID | DOUBLE | None | None | True | |
store_and_fwd_flag | TEXT | None | None | True | |
PULocationID | BIGINT | None | None | True | False |
DOLocationID | BIGINT | None | None | True | False |
payment_type | BIGINT | None | None | True | False |
fare_amount | DOUBLE | None | None | True | |
extra | DOUBLE | None | None | True | |
mta_tax | DOUBLE | None | None | True | |
tip_amount | DOUBLE | None | None | True | |
tolls_amount | DOUBLE | None | None | True | |
improvement_surcharge | DOUBLE | None | None | True | |
total_amount | DOUBLE | None | None | True | |
congestion_surcharge | DOUBLE | None | None | True | |
airport_fee | DOUBLE | None | None | True |
Query our data:
%%sql
SELECT COUNT(*) FROM taxi
* mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
COUNT(*) |
---|
1369769 |
Parametrize queries#
threshold = 10
%%sql
SELECT COUNT(*) FROM taxi
WHERE trip_distance < {{threshold}}
* mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
COUNT(*) |
---|
1297415 |
threshold = 0.5
%%sql
SELECT COUNT(*) FROM taxi
WHERE trip_distance < {{threshold}}
* mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
COUNT(*) |
---|
73849 |
CTEs#
You can break down queries into multiple cells, JupySQL will build a CTE for you:
%%sql --save many_passengers --no-execute
SELECT *
FROM taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93
* mysql+mysqldb://user:***@127.0.0.1:3306/db
Skipping execution...
%%sql --save trip_stats --with many_passengers
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers
* mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
MIN(trip_distance) | AVG(trip_distance) | MAX(trip_distance) |
---|---|---|
0.0 | 2.5010889812889836 | 18.92 |
This is what JupySQL executes:
query = %sqlcmd snippets trip_stats
print(query)
WITH `many_passengers` AS (
SELECT *
FROM taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93)
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers
Plotting#
%sqlplot histogram --table many_passengers --column trip_distance --with many_passengers
<AxesSubplot: title={'center': "'trip_distance' from 'many_passengers'"}, xlabel='trip_distance', ylabel='Count'>

Clean up#
To stop and remove the container:
%%bash
docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b9f7f973698a mysql "docker-entrypoint.s…" 2 minutes ago Up 2 minutes 0.0.0.0:3306->3306/tcp, 33060/tcp mysql
%%bash
docker container stop mysql
mysql
%%bash
docker container rm mysql
mysql
%%bash
docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES