MySQL#

In this tutorial, we’ll see how to query MySQL from Jupyter. Optionally, you can spin up a testing server.

Tip

If you encounter issues, feel free to join our community and we’ll be happy to help!

Pre-requisites#

To run this tutorial, you need to install the mysqlclient package.

Note

We highly recommend you that you install it using conda, since it’ll also install mysql-connector-c; if you want to use pip, then you need to install mysql-connector-c and then mysqlclient.

%conda install mysqlclient -c conda-forge --quiet
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.

Start MySQL instance#

If you don’t have a MySQL Server running or you want to spin up one for testing, you can do it with the official Docker image.

To start the server:

%%bash
docker run --name mysql -e MYSQL_DATABASE=db \
  -e MYSQL_USER=user \
  -e MYSQL_PASSWORD=password \
  -e MYSQL_ROOT_PASSWORD=password \
  -p 3306:3306 -d mysql
b9f7f973698a0063198a7e6358445e942de4905d18b99145a7dfc8bb947bfa97

Ensure that the container is running:

%%bash
docker ps
CONTAINER ID   IMAGE     COMMAND                  CREATED        STATUS                  PORTS                               NAMES
b9f7f973698a   mysql     "docker-entrypoint.s…"   1 second ago   Up Less than a second   0.0.0.0:3306->3306/tcp, 33060/tcp   mysql

We need to make a small configuration change, so do the following:

Open a new terminal and execute: docker exec -it mysql bash

Then: mysql --user=root --password

When prompted for a password, type: password

Once the MySQL console appears, execute:

ALTER USER user
IDENTIFIED WITH mysql_native_password
BY 'password';

Exit the MySQL console with: exit Exit the container with: exit

The session should look like this:

docker exec -it mysql bash

bash-4.4# mysql --user=root --password
Enter password:

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 8.0.31 MySQL Community Server - GPL

Copyright (c) 2000, 2022, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> ALTER USER user
    -> IDENTIFIED WITH mysql_native_password
    -> BY 'password';
Query OK, 0 rows affected (0.01 sec)

mysql> exit
Bye
bash-4.4# exit
exit

Load sample data#

Now, let’s fetch some sample data. We’ll be using the NYC taxi dataset:

%pip install pandas pyarrow --quiet
Note: you may need to restart the kernel to use updated packages.
import pandas as pd

df = pd.read_parquet(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"
)
df.shape
(1369769, 19)

As you can see, this chunk of data contains ~1.4M rows, loading the data will take about a minute:

from sqlalchemy import create_engine

engine = create_engine("mysql+mysqldb://user:password@127.0.0.1:3306/db")
df.to_sql(name="taxi", con=engine, chunksize=100_000)
engine.dispose()

Query#

Note

mysql and mysql+pymysql connections (and perhaps others) don’t read your client character set information from .my.cnf. You need to specify it in the connection string:

mysql+pymysql://scott:tiger@localhost/foo?charset=utf8

Now, let’s install JupySQL, authenticate and start querying the data!

%pip install jupysql --quiet
%load_ext sql
%sql mysql+mysqldb://user:password@127.0.0.1:3306/db
Note: you may need to restart the kernel to use updated packages.

Important

If the cell above fails, you might have some missing packages. Message us on Slack and we’ll help you!

List the tables in the database:

%sqlcmd tables
Name
taxi

List columns in the taxi table:

%sqlcmd columns --table taxi
name type default comment nullable autoincrement
index BIGINT None None True False
VendorID BIGINT None None True False
tpep_pickup_datetime DATETIME None None True
tpep_dropoff_datetime DATETIME None None True
passenger_count DOUBLE None None True
trip_distance DOUBLE None None True
RatecodeID DOUBLE None None True
store_and_fwd_flag TEXT None None True
PULocationID BIGINT None None True False
DOLocationID BIGINT None None True False
payment_type BIGINT None None True False
fare_amount DOUBLE None None True
extra DOUBLE None None True
mta_tax DOUBLE None None True
tip_amount DOUBLE None None True
tolls_amount DOUBLE None None True
improvement_surcharge DOUBLE None None True
total_amount DOUBLE None None True
congestion_surcharge DOUBLE None None True
airport_fee DOUBLE None None True

Query our data:

%%sql
SELECT COUNT(*) FROM taxi
*  mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
COUNT(*)
1369769

Parametrize queries#

threshold = 10
%%sql
SELECT COUNT(*) FROM taxi
WHERE trip_distance < {{threshold}}
*  mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
COUNT(*)
1297415
threshold = 0.5
%%sql
SELECT COUNT(*) FROM taxi
WHERE trip_distance < {{threshold}}
*  mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
COUNT(*)
73849

CTEs#

You can break down queries into multiple cells, JupySQL will build a CTE for you:

%%sql --save many_passengers --no-execute
SELECT *
FROM taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93
*  mysql+mysqldb://user:***@127.0.0.1:3306/db
Skipping execution...
%%sql --save trip_stats --with many_passengers
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers
*  mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
MIN(trip_distance) AVG(trip_distance) MAX(trip_distance)
0.0 2.5010889812889836 18.92

This is what JupySQL executes:

query = %sqlcmd snippets trip_stats
print(query)
WITH `many_passengers` AS (
SELECT *
FROM taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93)
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers

Plotting#

%sqlplot histogram --table many_passengers --column trip_distance --with many_passengers
<AxesSubplot: title={'center': "'trip_distance' from 'many_passengers'"}, xlabel='trip_distance', ylabel='Count'>
../_images/20ac402f376b1941dfb5dae5747e0b2c512b09fa1a6ebf60bf2134f818a98588.png

Clean up#

To stop and remove the container:

%%bash
docker container ls
CONTAINER ID   IMAGE     COMMAND                  CREATED         STATUS         PORTS                               NAMES
b9f7f973698a   mysql     "docker-entrypoint.s…"   2 minutes ago   Up 2 minutes   0.0.0.0:3306->3306/tcp, 33060/tcp   mysql
%%bash
docker container stop mysql
mysql
%%bash
docker container rm mysql
mysql
%%bash
docker container ls
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES