Run SQL on JSON files#
In this tutorial, we’ll show you how to query JSON with JupySQL and DuckDB.
First, let’s install the required dependencies:
%pip install jupysql duckdb duckdb-engine rich --quiet
Note: you may need to restart the kernel to use updated packages.
Now, let’s generate some data.
We’ll write it in typical JSON format as well as JSON Lines. JSON Lines, or newline-delimited JSON, is a structured file format in which each line is a valid JSON object, with objects separated by a newline character (\n). Our sample data contains four rows:
from pathlib import Path
import json
data = [
    {
        "name": "John",
        "age": 25,
        "friends": ["Jake", "Kelly"],
        "likes": {"pizza": True, "tacos": True},
    },
    {
        "name": "Jake",
        "age": 20,
        "friends": ["John"],
        "likes": {"pizza": False, "tacos": True},
    },
    {
        "name": "Kelly",
        "age": 21,
        "friends": ["John", "Sam"],
        "likes": {"pizza": True, "tacos": True},
    },
    {
        "name": "Sam",
        "age": 22,
        "friends": ["Kelly"],
        "likes": {"pizza": False, "tacos": True},
    },
]
Next, let’s dump our JSON data into a .json file:
_ = Path("people.json").write_text(json.dumps(data))
print(data)
[{'name': 'John', 'age': 25, 'friends': ['Jake', 'Kelly'], 'likes': {'pizza': True, 'tacos': True}}, {'name': 'Jake', 'age': 20, 'friends': ['John'], 'likes': {'pizza': False, 'tacos': True}}, {'name': 'Kelly', 'age': 21, 'friends': ['John', 'Sam'], 'likes': {'pizza': True, 'tacos': True}}, {'name': 'Sam', 'age': 22, 'friends': ['Kelly'], 'likes': {'pizza': False, 'tacos': True}}]
We should also produce a .jsonl file. Because the format is newline-delimited, each object in our data array must be written on its own line, separated by \n.
lines = ""
for d in data:
    lines += json.dumps(d) + "\n"
_ = Path("people.jsonl").write_text(lines)
print(lines)
{"name": "John", "age": 25, "friends": ["Jake", "Kelly"], "likes": {"pizza": true, "tacos": true}}
{"name": "Jake", "age": 20, "friends": ["John"], "likes": {"pizza": false, "tacos": true}}
{"name": "Kelly", "age": 21, "friends": ["John", "Sam"], "likes": {"pizza": true, "tacos": true}}
{"name": "Sam", "age": 22, "friends": ["Kelly"], "likes": {"pizza": false, "tacos": true}}
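Before handing the file to DuckDB, it can help to see that JSON Lines round-trips with the standard library alone. This is a minimal sketch using an inline sample rather than the people.jsonl file written above: each non-empty line is a complete document for json.loads.

```python
import json

# Parse a JSON Lines string back into Python objects:
# one json.loads call per non-empty line.
jsonl_text = (
    '{"name": "John", "age": 25}\n'
    '{"name": "Jake", "age": 20}\n'
)
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
print(records)
```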
Query#
Note
Documentation for DuckDB’s JSON capabilities is available in the DuckDB docs.
Load the extension and start a DuckDB in-memory database:
%load_ext sql
%sql duckdb://
Read the JSON data:
%%sql
SELECT *
FROM read_json_auto('people.json')
* duckdb://
Done.
name | age | friends | likes |
---|---|---|---|
John | 25 | ['Jake', 'Kelly'] | {'pizza': True, 'tacos': True} |
Jake | 20 | ['John'] | {'pizza': False, 'tacos': True} |
Kelly | 21 | ['John', 'Sam'] | {'pizza': True, 'tacos': True} |
Sam | 22 | ['Kelly'] | {'pizza': False, 'tacos': True} |
Extract fields#
Extract fields from a JSON record. Keep in mind that when using read_json_auto, arrays are 1-indexed (the first element is at index 1, not 0):
%%sql
SELECT
name,
friends[1] AS first_friend,
likes.pizza AS likes_pizza,
likes.tacos AS likes_tacos
FROM read_json_auto('people.json')
* duckdb://
Done.
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | True | True |
Jake | John | False | True |
Kelly | John | True | True |
Sam | Kelly | False | True |
JSON lines format is also supported:
%%sql
SELECT
name,
friends[1] AS first_friend,
likes.pizza AS likes_pizza,
likes.tacos AS likes_tacos
FROM read_json_auto('people.jsonl')
* duckdb://
Done.
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | True | True |
Jake | John | False | True |
Kelly | John | True | True |
Sam | Kelly | False | True |
We can also use read_json_objects and format our queries differently. In this case, arrays are zero-indexed:
%%sql
SELECT
json ->> '$.name' AS name,
json ->> '$.friends[0]' AS first_friend,
json ->> '$.likes.pizza' AS likes_pizza,
json ->> '$.likes.tacos' AS likes_tacos
FROM read_json_objects('people.jsonl')
* duckdb://
Done.
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | true | true |
Jake | John | false | true |
Kelly | John | true | true |
Sam | Kelly | false | true |
Looks like everybody likes tacos!
Extract schema#
Infer the JSON schema:
%%sql
SELECT
json_structure(json),
json_structure(json ->> '$.likes')
FROM read_json_objects('people.jsonl')
* duckdb://
Done.
json_structure("json") | json_structure(("json" ->> '$.likes')) |
---|---|
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} | {"pizza":"BOOLEAN","tacos":"BOOLEAN"} |
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} | {"pizza":"BOOLEAN","tacos":"BOOLEAN"} |
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} | {"pizza":"BOOLEAN","tacos":"BOOLEAN"} |
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} | {"pizza":"BOOLEAN","tacos":"BOOLEAN"} |
%%sql schema <<
SELECT
json_structure(json) AS schema_all,
json_structure(json ->> '$.likes') AS schema_likes
FROM read_json_objects('people.jsonl')
* duckdb://
Done.
Pretty print the inferred schema:
from rich import print_json
row = schema.DataFrame().iloc[0]
print("Schema:")
print_json(row.schema_all)
print("\n\nSchema (likes):")
print_json(row.schema_likes)
Schema:
{ "name": "VARCHAR", "age": "UBIGINT", "friends": [ "VARCHAR" ], "likes": { "pizza": "BOOLEAN", "tacos": "BOOLEAN" } }
Schema (likes):
{ "pizza": "BOOLEAN", "tacos": "BOOLEAN" }
Store snippets#
You can use JupySQL’s --save feature to store a SQL snippet so you can keep your queries succinct:
%%sql --save clean_data_json
SELECT
name,
friends[1] AS first_friend,
likes.pizza AS likes_pizza,
likes.tacos AS likes_tacos
FROM read_json_auto('people.json')
* duckdb://
Done.
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | True | True |
Jake | John | False | True |
Kelly | John | True | True |
Sam | Kelly | False | True |
%%sql --with clean_data_json
SELECT * FROM clean_data_json
* duckdb://
Done.
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | True | True |
Jake | John | False | True |
Kelly | John | True | True |
Sam | Kelly | False | True |
Or using our .jsonl file:
%%sql --save clean_data_jsonl
SELECT
json ->> '$.name' AS name,
json ->> '$.friends[0]' AS first_friend,
json ->> '$.likes.pizza' AS likes_pizza,
json ->> '$.likes.tacos' AS likes_tacos
FROM read_json_objects('people.jsonl')
* duckdb://
Done.
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | true | true |
Jake | John | false | true |
Kelly | John | true | true |
Sam | Kelly | false | true |
%%sql --with clean_data_jsonl
SELECT * FROM clean_data_jsonl
* duckdb://
(duckdb.CatalogException) Catalog Error: Scalar Function with name json_extract_scalar does not exist!
Did you mean "json_extract"?
[SQL: WITH clean_data_jsonl AS (SELECT JSON_EXTRACT_SCALAR(json, '$.name') AS name, JSON_EXTRACT_SCALAR(json, '$.friends[0]') AS first_friend, JSON_EXTRACT_SCALAR(json, '$.likes.pizza') AS likes_pizza, JSON_EXTRACT_SCALAR(json, '$.likes.tacos') AS likes_tacos FROM READ_JSON_OBJECTS('people.jsonl')) SELECT * FROM clean_data_jsonl]
(Background on this error at: https://sqlalche.me/e/14/f405)
Export to CSV#
Note
Using --with isn’t supported when exporting to CSV.
To export to CSV:
%%sql
COPY (
SELECT
name,
friends[1] AS first_friend,
likes.pizza AS likes_pizza,
likes.tacos AS likes_tacos
FROM read_json_auto('people.json')
)
TO 'people.csv' (HEADER, DELIMITER ',');
* duckdb://
Done.
Count |
---|
4 |
%%sql
SELECT * FROM 'people.csv'
* duckdb://
Done.
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | True | True |
Jake | John | False | True |
Kelly | John | True | True |
Sam | Kelly | False | True |
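The exported file is plain comma-delimited text with a header row, so it reads back with the standard library alone. This sketch uses an inline sample mirroring the exported columns rather than opening people.csv itself:

```python
import csv
import io

# Read comma-delimited CSV with a header row back into dicts;
# DictReader keys each row by the header names.
sample = "name,first_friend,likes_pizza,likes_tacos\nJohn,Jake,True,True\n"
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["name"], rows[0]["likes_tacos"])
```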