MariaDB#

In this tutorial, we’ll see how to query MariaDB from Jupyter. Optionally, you can spin up a testing server.

Tip

If you encounter issues, feel free to join our community and we’ll be happy to help!

Pre-requisites#

To run this tutorial, you need to install the mysqlclient package.

Note

We highly recommend you that you install it using conda, since it’ll also install mysql-connector-c; if you want to use pip, then you need to install mysql-connector-c and then mysqlclient.

%conda install mysqlclient -c conda-forge --quiet
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /Users/eduardo/miniconda3/envs/jupysql

  added / updated specs:
    - mysqlclient


The following NEW packages will be INSTALLED:

  mysql-connector-c  pkgs/main/osx-arm64::mysql-connector-c-6.1.11-h4a942e0_1 
  mysqlclient        pkgs/main/osx-arm64::mysqlclient-2.0.3-py310hc377ac9_1 


Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done

Note: you may need to restart the kernel to use updated packages.

Start MariaDB instance#

If you don’t have a MariaDB Server running or you want to spin up one for testing, you can do it with the official Docker image.

To start the server:

%%bash
docker run --detach --name mariadb \
    --env MARIADB_USER=user \
    --env MARIADB_PASSWORD=password \
    --env MARIADB_ROOT_PASSWORD=password \
    --env MARIADB_DATABASE=db \
    -p 3306:3306 mariadb:latest
c2a0a18f9c37285ffdb17b22d75a3a8ae789a93f58a59c9c1892a4f30f7bf9a2

Ensure that the container is running:

%%bash
docker ps
CONTAINER ID   IMAGE            COMMAND                  CREATED        STATUS                  PORTS                    NAMES
c2a0a18f9c37   mariadb:latest   "docker-entrypoint.s…"   1 second ago   Up Less than a second   0.0.0.0:3306->3306/tcp   mariadb

Load sample data#

Now, let’s fetch some sample data. We’ll be using the NYC taxi dataset:

%pip install pandas pyarrow --quiet
Note: you may need to restart the kernel to use updated packages.
import pandas as pd

df = pd.read_parquet(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"
)
df.shape
(1369769, 19)

As you can see, this chunk of data contains ~1.4M rows, loading the data will take about a minute:

from sqlalchemy import create_engine

engine = create_engine("mysql+mysqldb://user:password@127.0.0.1:3306/db")
df.to_sql(name="taxi", con=engine, chunksize=100_000)
engine.dispose()

Query#

Note

mysql and mysql+pymysql connections (and perhaps others) don’t read your client character set information from .my.cnf. You need to specify it in the connection string:

mysql+pymysql://scott:tiger@localhost/foo?charset=utf8

Now, let’s install JupySQL, authenticate and start querying the data!

%pip install jupysql --quiet
%load_ext sql
%sql mysql+mysqldb://user:password@127.0.0.1:3306/db
Note: you may need to restart the kernel to use updated packages.

Important

If the cell above fails, you might have some missing packages. Message us on Slack and we’ll help you!

List the tables in the database:

%sqlcmd tables
Name
taxi

List columns in the taxi table:

%sqlcmd columns --table taxi
name type default comment nullable autoincrement
index BIGINT None None True False
VendorID BIGINT None None True False
tpep_pickup_datetime DATETIME None None True
tpep_dropoff_datetime DATETIME None None True
passenger_count DOUBLE None None True
trip_distance DOUBLE None None True
RatecodeID DOUBLE None None True
store_and_fwd_flag TEXT None None True
PULocationID BIGINT None None True False
DOLocationID BIGINT None None True False
payment_type BIGINT None None True False
fare_amount DOUBLE None None True
extra DOUBLE None None True
mta_tax DOUBLE None None True
tip_amount DOUBLE None None True
tolls_amount DOUBLE None None True
improvement_surcharge DOUBLE None None True
total_amount DOUBLE None None True
congestion_surcharge DOUBLE None None True
airport_fee DOUBLE None None True

Query our data:

%%sql
SELECT COUNT(*) FROM taxi
*  mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
COUNT(*)
1369769

Parametrize queries#

threshold = 10
%%sql
SELECT COUNT(*) FROM taxi
WHERE trip_distance < {{threshold}}
*  mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
COUNT(*)
1297415
threshold = 0.5
%%sql
SELECT COUNT(*) FROM taxi
WHERE trip_distance < {{threshold}}
*  mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
COUNT(*)
73849

CTEs#

You can break down queries into multiple cells, JupySQL will build a CTE for you:

%%sql --save many_passengers --no-execute
SELECT *
FROM taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93
*  mysql+mysqldb://user:***@127.0.0.1:3306/db
Skipping execution...
%%sql --save trip_stats --with many_passengers
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers
*  mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
MIN(trip_distance) AVG(trip_distance) MAX(trip_distance)
0.0 2.5010889812889836 18.92

This is what JupySQL executes:

query = %sqlcmd snippets trip_stats
print(query)
WITH `many_passengers` AS (
SELECT *
FROM taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93)
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers

Plotting#

%sqlplot histogram --table many_passengers --column trip_distance --with many_passengers
<AxesSubplot: title={'center': "'trip_distance' from 'many_passengers'"}, xlabel='trip_distance', ylabel='Count'>
../_images/20ac402f376b1941dfb5dae5747e0b2c512b09fa1a6ebf60bf2134f818a98588.png

Clean up#

To stop and remove the container:

%%bash
docker container ls
CONTAINER ID   IMAGE            COMMAND                  CREATED         STATUS         PORTS                    NAMES
c2a0a18f9c37   mariadb:latest   "docker-entrypoint.s…"   2 minutes ago   Up 2 minutes   0.0.0.0:3306->3306/tcp   mariadb
%%bash
docker container stop mariadb
mariadb
%%bash
docker container rm mariadb
mariadb
%%bash
docker container ls
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES