Run SQL on JSON files#
In this tutorial, we’ll show you how to query JSON with JupySQL and DuckDB.
First, let’s install the required dependencies:
%pip install jupysql duckdb duckdb-engine rich --quiet
Show code cell output
Note: you may need to restart the kernel to use updated packages.
Now, let’s generate some data.
We’ll write it in typical JSON format as well as JSON Lines. JSON Lines, or newline-delimited JSON, is a structured file format in which each individual line is a valid JSON object, separated by a newline character (/n
). Our sample data contains four rows:
from pathlib import Path
import json
data = [
{
"name": "John",
"age": 25,
"friends": ["Jake", "Kelly"],
"likes": {"pizza": True, "tacos": True},
},
{
"name": "Jake",
"age": 20,
"friends": ["John"],
"likes": {"pizza": False, "tacos": True},
},
{
"name": "Kelly",
"age": 21,
"friends": ["John", "Sam"],
"likes": {"pizza": True, "tacos": True},
},
{
"name": "Sam",
"age": 22,
"friends": ["Kelly"],
"likes": {"pizza": False, "tacos": True},
},
]
Next, let’s dump our json data into a .json
file:
_ = Path("people.json").write_text(json.dumps(data))
print(data)
[{'name': 'John', 'age': 25, 'friends': ['Jake', 'Kelly'], 'likes': {'pizza': True, 'tacos': True}}, {'name': 'Jake', 'age': 20, 'friends': ['John'], 'likes': {'pizza': False, 'tacos': True}}, {'name': 'Kelly', 'age': 21, 'friends': ['John', 'Sam'], 'likes': {'pizza': True, 'tacos': True}}, {'name': 'Sam', 'age': 22, 'friends': ['Kelly'], 'likes': {'pizza': False, 'tacos': True}}]
We should also produce a .jsonl
file. Due to its newline-delimited nature, we will need to format our data in a way such that each object in our data array is separated by /n
.
lines = ""
for d in data:
lines += json.dumps(d) + "\n"
_ = Path("people.jsonl").write_text(lines)
print(lines)
{"name": "John", "age": 25, "friends": ["Jake", "Kelly"], "likes": {"pizza": true, "tacos": true}}
{"name": "Jake", "age": 20, "friends": ["John"], "likes": {"pizza": false, "tacos": true}}
{"name": "Kelly", "age": 21, "friends": ["John", "Sam"], "likes": {"pizza": true, "tacos": true}}
{"name": "Sam", "age": 22, "friends": ["Kelly"], "likes": {"pizza": false, "tacos": true}}
Query#
Note
Documentation for DuckDB’s JSON capabilities is available here.
Load the extension and start a DuckDB in-memory database:
%load_ext sql
%sql duckdb://
Read the JSON data:
%%sql
SELECT *
FROM read_json_auto('people.json')
name | age | friends | likes |
---|---|---|---|
John | 25 | ['Jake', 'Kelly'] | {'pizza': True, 'tacos': True} |
Jake | 20 | ['John'] | {'pizza': False, 'tacos': True} |
Kelly | 21 | ['John', 'Sam'] | {'pizza': True, 'tacos': True} |
Sam | 22 | ['Kelly'] | {'pizza': False, 'tacos': True} |
Extract fields#
Extract fields from a JSON record. Keep in mind when using read_json_auto
, arrays are 1-indexed (start at 1 rather than 0):
%%sql
SELECT
name,
friends[1] AS first_friend,
likes.pizza AS likes_pizza,
likes.tacos AS likes_tacos
FROM read_json_auto('people.json')
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | True | True |
Jake | John | False | True |
Kelly | John | True | True |
Sam | Kelly | False | True |
JSON lines format is also supported:
%%sql
SELECT
name,
friends[1] AS first_friend,
likes.pizza AS likes_pizza,
likes.tacos AS likes_tacos
FROM read_json_auto('people.jsonl')
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | True | True |
Jake | John | False | True |
Kelly | John | True | True |
Sam | Kelly | False | True |
We can also use read_json_objects
and format our queries differently. In this case, arrays are zero-indexed:
%%sql
SELECT
json ->> '$.name' AS name,
json ->> '$.friends[0]' AS first_friend,
json ->> '$.likes.pizza' AS likes_pizza,
json ->> '$.likes.tacos' AS likes_tacos
FROM read_json_objects('people.jsonl', format="auto")
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | true | true |
Jake | John | false | true |
Kelly | John | true | true |
Sam | Kelly | false | true |
Looks like everybody likes tacos!
Extract schema#
Infer the JSON schema:
%%sql
SELECT
json_structure(json),
json_structure(json ->> '$.likes'),
FROM read_json_objects('people.jsonl', format="auto")
json_structure("json") | json_structure(("json" ->> '$.likes')) |
---|---|
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} | {"pizza":"BOOLEAN","tacos":"BOOLEAN"} |
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} | {"pizza":"BOOLEAN","tacos":"BOOLEAN"} |
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} | {"pizza":"BOOLEAN","tacos":"BOOLEAN"} |
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} | {"pizza":"BOOLEAN","tacos":"BOOLEAN"} |
%%sql schema <<
SELECT
json_structure(json) AS schema_all,
json_structure(json ->> '$.likes') AS schema_likes,
FROM read_json_objects('people.jsonl', format="auto")
Pretty print the inferred schema:
from rich import print_json
row = schema.DataFrame().iloc[0]
print("Schema:")
print_json(row.schema_all)
print("\n\nSchema (likes):")
print_json(row.schema_likes)
Schema:
{ "name": "VARCHAR", "age": "UBIGINT", "friends": [ "VARCHAR" ], "likes": { "pizza": "BOOLEAN", "tacos": "BOOLEAN" } }
Schema (likes):
{ "pizza": "BOOLEAN", "tacos": "BOOLEAN" }
Store snippets#
You can use JupySQL’s --save
feature to store a SQL snippet so you can keep your queries succinct:
%%sql --save clean_data_json
SELECT
name,
friends[1] AS first_friend,
likes.pizza AS likes_pizza,
likes.tacos AS likes_tacos
FROM read_json_auto('people.json')
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | True | True |
Jake | John | False | True |
Kelly | John | True | True |
Sam | Kelly | False | True |
%%sql
SELECT * FROM clean_data_json
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | True | True |
Jake | John | False | True |
Kelly | John | True | True |
Sam | Kelly | False | True |
Or using our .jsonl
file:
%%sql --save clean_data_jsonl
SELECT
json ->> '$.name' AS name,
json ->> '$.friends[0]' AS first_friend,
json ->> '$.likes.pizza' AS likes_pizza,
json ->> '$.likes.tacos' AS likes_tacos
FROM read_json_objects('people.jsonl', format="auto")
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | true | true |
Jake | John | false | true |
Kelly | John | true | true |
Sam | Kelly | false | true |
%%sql
SELECT * FROM clean_data_jsonl
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | true | true |
Jake | John | false | true |
Kelly | John | true | true |
Sam | Kelly | false | true |
Export to CSV#
Note
Using --with
isn’t supported when exporting to CSV.
To export to CSV:
%%sql
COPY (
SELECT
name,
friends[1] AS first_friend,
likes.pizza AS likes_pizza,
likes.tacos AS likes_tacos
FROM read_json_auto('people.json', format="auto")
)
TO 'people.csv' (HEADER, DELIMITER ',');
Count |
---|
%%sql
SELECT * FROM 'people.csv'
name | first_friend | likes_pizza | likes_tacos |
---|---|---|---|
John | Jake | True | True |
Jake | John | False | True |
Kelly | John | True | True |
Sam | Kelly | False | True |