MariaDB#
In this tutorial, we’ll see how to query MariaDB from Jupyter. Optionally, you can spin up a testing server.
Tip
If you encounter issues, feel free to join our community and we’ll be happy to help!
Pre-requisites#
To run this tutorial, you need to install the mysqlclient
package.
Note
We highly recommend you that you install it using conda
, since it’ll also install mysql-connector-c
; if you want to use pip
, then you need to install mysql-connector-c
and then mysqlclient
.
%conda install mysqlclient -c conda-forge --quiet
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
## Package Plan ##
environment location: /Users/eduardo/miniconda3/envs/jupysql
added / updated specs:
- mysqlclient
The following NEW packages will be INSTALLED:
mysql-connector-c pkgs/main/osx-arm64::mysql-connector-c-6.1.11-h4a942e0_1
mysqlclient pkgs/main/osx-arm64::mysqlclient-2.0.3-py310hc377ac9_1
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Note: you may need to restart the kernel to use updated packages.
Start MariaDB instance#
If you don’t have a MariaDB Server running or you want to spin up one for testing, you can do it with the official Docker image.
To start the server:
%%bash
docker run --detach --name mariadb \
--env MARIADB_USER=user \
--env MARIADB_PASSWORD=password \
--env MARIADB_ROOT_PASSWORD=password \
--env MARIADB_DATABASE=db \
-p 3306:3306 mariadb:latest
c2a0a18f9c37285ffdb17b22d75a3a8ae789a93f58a59c9c1892a4f30f7bf9a2
Ensure that the container is running:
%%bash
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c2a0a18f9c37 mariadb:latest "docker-entrypoint.s…" 1 second ago Up Less than a second 0.0.0.0:3306->3306/tcp mariadb
Load sample data#
Now, let’s fetch some sample data. We’ll be using the NYC taxi dataset:
%pip install pandas pyarrow --quiet
Note: you may need to restart the kernel to use updated packages.
import pandas as pd
df = pd.read_parquet(
"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"
)
df.shape
(1369769, 19)
As you can see, this chunk of data contains ~1.4M rows, loading the data will take about a minute:
from sqlalchemy import create_engine
engine = create_engine("mysql+mysqldb://user:password@127.0.0.1:3306/db")
df.to_sql(name="taxi", con=engine, chunksize=100_000)
engine.dispose()
Query#
Note
mysql
and mysql+pymysql
connections (and perhaps others) don’t read your client character set information from .my.cnf.
You need to specify it in the connection string:
mysql+pymysql://scott:tiger@localhost/foo?charset=utf8
Now, let’s install JupySQL, authenticate and start querying the data!
%pip install jupysql --quiet
%load_ext sql
%sql mysql+mysqldb://user:password@127.0.0.1:3306/db
Note: you may need to restart the kernel to use updated packages.
Important
If the cell above fails, you might have some missing packages. Message us on Slack and we’ll help you!
List the tables in the database:
%sqlcmd tables
Name |
---|
taxi |
List columns in the taxi table:
%sqlcmd columns --table taxi
name | type | default | comment | nullable | autoincrement |
---|---|---|---|---|---|
index | BIGINT | None | None | True | False |
VendorID | BIGINT | None | None | True | False |
tpep_pickup_datetime | DATETIME | None | None | True | |
tpep_dropoff_datetime | DATETIME | None | None | True | |
passenger_count | DOUBLE | None | None | True | |
trip_distance | DOUBLE | None | None | True | |
RatecodeID | DOUBLE | None | None | True | |
store_and_fwd_flag | TEXT | None | None | True | |
PULocationID | BIGINT | None | None | True | False |
DOLocationID | BIGINT | None | None | True | False |
payment_type | BIGINT | None | None | True | False |
fare_amount | DOUBLE | None | None | True | |
extra | DOUBLE | None | None | True | |
mta_tax | DOUBLE | None | None | True | |
tip_amount | DOUBLE | None | None | True | |
tolls_amount | DOUBLE | None | None | True | |
improvement_surcharge | DOUBLE | None | None | True | |
total_amount | DOUBLE | None | None | True | |
congestion_surcharge | DOUBLE | None | None | True | |
airport_fee | DOUBLE | None | None | True |
Query our data:
%%sql
SELECT COUNT(*) FROM taxi
* mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
COUNT(*) |
---|
1369769 |
Parametrize queries#
threshold = 10
%%sql
SELECT COUNT(*) FROM taxi
WHERE trip_distance < {{threshold}}
* mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
COUNT(*) |
---|
1297415 |
threshold = 0.5
%%sql
SELECT COUNT(*) FROM taxi
WHERE trip_distance < {{threshold}}
* mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
COUNT(*) |
---|
73849 |
CTEs#
You can break down queries into multiple cells, JupySQL will build a CTE for you:
%%sql --save many_passengers --no-execute
SELECT *
FROM taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93
* mysql+mysqldb://user:***@127.0.0.1:3306/db
Skipping execution...
%%sql --save trip_stats --with many_passengers
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers
* mysql+mysqldb://user:***@127.0.0.1:3306/db
1 rows affected.
MIN(trip_distance) | AVG(trip_distance) | MAX(trip_distance) |
---|---|---|
0.0 | 2.5010889812889836 | 18.92 |
This is what JupySQL executes:
query = %sqlcmd snippets trip_stats
print(query)
WITH `many_passengers` AS (
SELECT *
FROM taxi
WHERE passenger_count > 3
-- remove top 1% outliers for better visualization
AND trip_distance < 18.93)
SELECT MIN(trip_distance), AVG(trip_distance), MAX(trip_distance)
FROM many_passengers
Plotting#
%sqlplot histogram --table many_passengers --column trip_distance --with many_passengers
<AxesSubplot: title={'center': "'trip_distance' from 'many_passengers'"}, xlabel='trip_distance', ylabel='Count'>
Clean up#
To stop and remove the container:
%%bash
docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c2a0a18f9c37 mariadb:latest "docker-entrypoint.s…" 2 minutes ago Up 2 minutes 0.0.0.0:3306->3306/tcp mariadb
%%bash
docker container stop mariadb
mariadb
%%bash
docker container rm mariadb
mariadb
%%bash
docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES