Run SQL on JSON files#

In this tutorial, we’ll show you how to query JSON with JupySQL and DuckDB.

First, let’s install the required dependencies:

%pip install jupysql duckdb duckdb-engine rich --quiet
Hide code cell output
Note: you may need to restart the kernel to use updated packages.

Now, let’s generate some data.

We’ll write it in typical JSON format as well as JSON Lines. JSON Lines, or newline-delimited JSON, is a structured file format in which each individual line is a valid JSON object, separated by a newline character (/n). Our sample data contains four rows:

from pathlib import Path
import json

data = [
    {
        "name": "John",
        "age": 25,
        "friends": ["Jake", "Kelly"],
        "likes": {"pizza": True, "tacos": True},
    },
    {
        "name": "Jake",
        "age": 20,
        "friends": ["John"],
        "likes": {"pizza": False, "tacos": True},
    },
    {
        "name": "Kelly",
        "age": 21,
        "friends": ["John", "Sam"],
        "likes": {"pizza": True, "tacos": True},
    },
    {
        "name": "Sam",
        "age": 22,
        "friends": ["Kelly"],
        "likes": {"pizza": False, "tacos": True},
    },
]

Next, let’s dump our json data into a .json file:

_ = Path("people.json").write_text(json.dumps(data))
print(data)
[{'name': 'John', 'age': 25, 'friends': ['Jake', 'Kelly'], 'likes': {'pizza': True, 'tacos': True}}, {'name': 'Jake', 'age': 20, 'friends': ['John'], 'likes': {'pizza': False, 'tacos': True}}, {'name': 'Kelly', 'age': 21, 'friends': ['John', 'Sam'], 'likes': {'pizza': True, 'tacos': True}}, {'name': 'Sam', 'age': 22, 'friends': ['Kelly'], 'likes': {'pizza': False, 'tacos': True}}]

We should also produce a .jsonl file. Due to its newline-delimited nature, we will need to format our data in a way such that each object in our data array is separated by /n.

lines = ""

for d in data:
    lines += json.dumps(d) + "\n"

_ = Path("people.jsonl").write_text(lines)
print(lines)
{"name": "John", "age": 25, "friends": ["Jake", "Kelly"], "likes": {"pizza": true, "tacos": true}}
{"name": "Jake", "age": 20, "friends": ["John"], "likes": {"pizza": false, "tacos": true}}
{"name": "Kelly", "age": 21, "friends": ["John", "Sam"], "likes": {"pizza": true, "tacos": true}}
{"name": "Sam", "age": 22, "friends": ["Kelly"], "likes": {"pizza": false, "tacos": true}}

Query#

Note

Documentation for DuckDB’s JSON capabilities is available here.

Load the extension and start a DuckDB in-memory database:

%load_ext sql
%sql duckdb://
Connecting to 'duckdb://'

Read the JSON data:

%%sql
SELECT *
FROM read_json_auto('people.json')
Running query in 'duckdb://'
name age friends likes
John 25 ['Jake', 'Kelly'] {'pizza': True, 'tacos': True}
Jake 20 ['John'] {'pizza': False, 'tacos': True}
Kelly 21 ['John', 'Sam'] {'pizza': True, 'tacos': True}
Sam 22 ['Kelly'] {'pizza': False, 'tacos': True}

Extract fields#

Extract fields from a JSON record. Keep in mind when using read_json_auto, arrays are 1-indexed (start at 1 rather than 0):

%%sql
SELECT 
    name, 
    friends[1] AS first_friend, 
    likes.pizza AS likes_pizza, 
    likes.tacos AS likes_tacos
FROM read_json_auto('people.json')
Running query in 'duckdb://'
name first_friend likes_pizza likes_tacos
John Jake True True
Jake John False True
Kelly John True True
Sam Kelly False True

JSON lines format is also supported:

%%sql
SELECT 
    name, 
    friends[1] AS first_friend, 
    likes.pizza AS likes_pizza, 
    likes.tacos AS likes_tacos
FROM read_json_auto('people.jsonl')
Running query in 'duckdb://'
name first_friend likes_pizza likes_tacos
John Jake True True
Jake John False True
Kelly John True True
Sam Kelly False True

We can also use read_json_objects and format our queries differently. In this case, arrays are zero-indexed:

%%sql
SELECT
    json ->> '$.name' AS name,
    json ->> '$.friends[0]' AS first_friend,
    json ->> '$.likes.pizza' AS likes_pizza,
    json ->> '$.likes.tacos' AS likes_tacos
FROM read_json_objects('people.jsonl', format="auto")
Running query in 'duckdb://'
name first_friend likes_pizza likes_tacos
John Jake true true
Jake John false true
Kelly John true true
Sam Kelly false true

Looks like everybody likes tacos!

Extract schema#

Infer the JSON schema:

%%sql
SELECT
    json_structure(json),
    json_structure(json ->> '$.likes'),
FROM read_json_objects('people.jsonl', format="auto")
Running query in 'duckdb://'
json_structure("json") json_structure(("json" ->> '$.likes'))
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} {"pizza":"BOOLEAN","tacos":"BOOLEAN"}
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} {"pizza":"BOOLEAN","tacos":"BOOLEAN"}
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} {"pizza":"BOOLEAN","tacos":"BOOLEAN"}
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} {"pizza":"BOOLEAN","tacos":"BOOLEAN"}
%%sql schema <<
SELECT
    json_structure(json) AS schema_all,
    json_structure(json ->> '$.likes') AS schema_likes,
FROM read_json_objects('people.jsonl', format="auto")
Running query in 'duckdb://'

Pretty print the inferred schema:

from rich import print_json

row = schema.DataFrame().iloc[0]

print("Schema:")
print_json(row.schema_all)

print("\n\nSchema (likes):")
print_json(row.schema_likes)
Schema:
{
  "name": "VARCHAR",
  "age": "UBIGINT",
  "friends": [
    "VARCHAR"
  ],
  "likes": {
    "pizza": "BOOLEAN",
    "tacos": "BOOLEAN"
  }
}
Schema (likes):
{
  "pizza": "BOOLEAN",
  "tacos": "BOOLEAN"
}

Store snippets#

You can use JupySQL’s --save feature to store a SQL snippet so you can keep your queries succinct:

%%sql --save clean_data_json
SELECT 
    name, 
    friends[1] AS first_friend, 
    likes.pizza AS likes_pizza, 
    likes.tacos AS likes_tacos
FROM read_json_auto('people.json')
Running query in 'duckdb://'
name first_friend likes_pizza likes_tacos
John Jake True True
Jake John False True
Kelly John True True
Sam Kelly False True
%%sql
SELECT * FROM clean_data_json
Generating CTE with stored snippets: 'clean_data_json'
Running query in 'duckdb://'
name first_friend likes_pizza likes_tacos
John Jake True True
Jake John False True
Kelly John True True
Sam Kelly False True

Or using our .jsonl file:

%%sql --save clean_data_jsonl
SELECT 
    json ->> '$.name' AS name,
    json ->> '$.friends[0]' AS first_friend,
    json ->> '$.likes.pizza' AS likes_pizza,
    json ->> '$.likes.tacos' AS likes_tacos
FROM read_json_objects('people.jsonl', format="auto")
Running query in 'duckdb://'
name first_friend likes_pizza likes_tacos
John Jake true true
Jake John false true
Kelly John true true
Sam Kelly false true
%%sql
SELECT * FROM clean_data_jsonl
Generating CTE with stored snippets: 'clean_data_jsonl'
Running query in 'duckdb://'
name first_friend likes_pizza likes_tacos
John Jake true true
Jake John false true
Kelly John true true
Sam Kelly false true

Export to CSV#

Note

Using --with isn’t supported when exporting to CSV.

To export to CSV:

%%sql
COPY (
    SELECT 
    name, 
    friends[1] AS first_friend, 
    likes.pizza AS likes_pizza, 
    likes.tacos AS likes_tacos
    FROM read_json_auto('people.json', format="auto")
)

TO 'people.csv' (HEADER, DELIMITER ',');
Running query in 'duckdb://'
Count
%%sql
SELECT * FROM 'people.csv'
Running query in 'duckdb://'
name first_friend likes_pizza likes_tacos
John Jake True True
Jake John False True
Kelly John True True
Sam Kelly False True