Run SQL on JSON files#

In this tutorial, we’ll show you how to query JSON with JupySQL and DuckDB.

First, let’s install the required dependencies:

%pip install jupysql duckdb duckdb-engine rich --quiet
Note: you may need to restart the kernel to use updated packages.

Now, let’s generate some data.

We’ll write it in standard JSON format as well as JSON Lines. JSON Lines, or newline-delimited JSON, is a structured file format in which each line is a valid JSON object, with records separated by newline characters (\n). Our sample data contains four rows:

from pathlib import Path
import json

data = [
    {
        "name": "John",
        "age": 25,
        "friends": ["Jake", "Kelly"],
        "likes": {"pizza": True, "tacos": True},
    },
    {
        "name": "Jake",
        "age": 20,
        "friends": ["John"],
        "likes": {"pizza": False, "tacos": True},
    },
    {
        "name": "Kelly",
        "age": 21,
        "friends": ["John", "Sam"],
        "likes": {"pizza": True, "tacos": True},
    },
    {
        "name": "Sam",
        "age": 22,
        "friends": ["Kelly"],
        "likes": {"pizza": False, "tacos": True},
    },
]

Next, let’s dump our JSON data into a .json file:

_ = Path("people.json").write_text(json.dumps(data))
print(data)
[{'name': 'John', 'age': 25, 'friends': ['Jake', 'Kelly'], 'likes': {'pizza': True, 'tacos': True}}, {'name': 'Jake', 'age': 20, 'friends': ['John'], 'likes': {'pizza': False, 'tacos': True}}, {'name': 'Kelly', 'age': 21, 'friends': ['John', 'Sam'], 'likes': {'pizza': True, 'tacos': True}}, {'name': 'Sam', 'age': 22, 'friends': ['Kelly'], 'likes': {'pizza': False, 'tacos': True}}]

We should also produce a .jsonl file. Since the format is newline-delimited, each object in our data array must be written on its own line, separated by \n:

lines = ""

for d in data:
    lines += json.dumps(d) + "\n"

_ = Path("people.jsonl").write_text(lines)
print(lines)
{"name": "John", "age": 25, "friends": ["Jake", "Kelly"], "likes": {"pizza": true, "tacos": true}}
{"name": "Jake", "age": 20, "friends": ["John"], "likes": {"pizza": false, "tacos": true}}
{"name": "Kelly", "age": 21, "friends": ["John", "Sam"], "likes": {"pizza": true, "tacos": true}}
{"name": "Sam", "age": 22, "friends": ["Kelly"], "likes": {"pizza": false, "tacos": true}}

Query#

Note

Documentation for DuckDB’s JSON capabilities is available on the DuckDB website.

Load the extension and start a DuckDB in-memory database:

%load_ext sql
%sql duckdb://
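
duckdb:// creates an in-memory database that disappears when the kernel stops. To keep the data between sessions, duckdb-engine also accepts a file path in the connection string; a sketch (the people.db filename is just an example):

%sql duckdb:///people.db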

Read the JSON data:

%%sql
SELECT *
FROM read_json_auto('people.json')
*  duckdb://
Done.
name age friends likes
John 25 ['Jake', 'Kelly'] {'pizza': True, 'tacos': True}
Jake 20 ['John'] {'pizza': False, 'tacos': True}
Kelly 21 ['John', 'Sam'] {'pizza': True, 'tacos': True}
Sam 22 ['Kelly'] {'pizza': False, 'tacos': True}
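
The result behaves like a regular table, so the usual SQL clauses apply directly; for example, filtering and sorting (a quick sketch):

%%sql
SELECT name, age
FROM read_json_auto('people.json')
WHERE age > 21
ORDER BY age DESC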

Extract fields#

Extract fields from a JSON record. Keep in mind that when using read_json_auto, arrays are 1-indexed (they start at 1 rather than 0):

%%sql
SELECT 
    name, 
    friends[1] AS first_friend, 
    likes.pizza AS likes_pizza, 
    likes.tacos AS likes_tacos
FROM read_json_auto('people.json')
*  duckdb://
Done.
name first_friend likes_pizza likes_tacos
John Jake True True
Jake John False True
Kelly John True True
Sam Kelly False True
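
Because friends is read as a list, DuckDB’s list handling applies as well; for instance, UNNEST expands it into one row per friend (a sketch):

%%sql
SELECT
    name,
    UNNEST(friends) AS friend
FROM read_json_auto('people.json')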

JSON Lines format is also supported:

%%sql
SELECT 
    name, 
    friends[1] AS first_friend, 
    likes.pizza AS likes_pizza, 
    likes.tacos AS likes_tacos
FROM read_json_auto('people.jsonl')
*  duckdb://
Done.
name first_friend likes_pizza likes_tacos
John Jake True True
Jake John False True
Kelly John True True
Sam Kelly False True

We can also use read_json_objects and extract values with JSON path expressions. In this case, arrays are zero-indexed:

%%sql
SELECT
    json ->> '$.name' AS name,
    json ->> '$.friends[0]' AS first_friend,
    json ->> '$.likes.pizza' AS likes_pizza,
    json ->> '$.likes.tacos' AS likes_tacos
FROM read_json_objects('people.jsonl')
*  duckdb://
Done.
name first_friend likes_pizza likes_tacos
John Jake true true
Jake John false true
Kelly John true true
Sam Kelly false true

Looks like everybody likes tacos!
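
One thing to keep in mind: ->> extracts values as text (notice the lowercase true/false above), so an explicit cast is needed to get typed columns back; a sketch:

%%sql
SELECT
    json ->> '$.name' AS name,
    (json ->> '$.age')::INTEGER AS age,
    (json ->> '$.likes.tacos')::BOOLEAN AS likes_tacos
FROM read_json_objects('people.jsonl')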

Extract schema#

Infer the JSON schema:

%%sql
SELECT
    json_structure(json),
    json_structure(json ->> '$.likes'),
FROM read_json_objects('people.jsonl')
*  duckdb://
Done.
json_structure("json") json_structure(("json" ->> '$.likes'))
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} {"pizza":"BOOLEAN","tacos":"BOOLEAN"}
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} {"pizza":"BOOLEAN","tacos":"BOOLEAN"}
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} {"pizza":"BOOLEAN","tacos":"BOOLEAN"}
{"name":"VARCHAR","age":"UBIGINT","friends":["VARCHAR"],"likes":{"pizza":"BOOLEAN","tacos":"BOOLEAN"}} {"pizza":"BOOLEAN","tacos":"BOOLEAN"}
%%sql schema <<
SELECT
    json_structure(json) AS schema_all,
    json_structure(json ->> '$.likes') AS schema_likes,
FROM read_json_objects('people.jsonl')
*  duckdb://
Done.

Pretty print the inferred schema:

from rich import print_json

row = schema.DataFrame().iloc[0]

print("Schema:")
print_json(row.schema_all)

print("\n\nSchema (likes):")
print_json(row.schema_likes)
Schema:
{
  "name": "VARCHAR",
  "age": "UBIGINT",
  "friends": [
    "VARCHAR"
  ],
  "likes": {
    "pizza": "BOOLEAN",
    "tacos": "BOOLEAN"
  }
}
Schema (likes):
{
  "pizza": "BOOLEAN",
  "tacos": "BOOLEAN"
}
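
If you only need the column names and types DuckDB will infer, DESCRIBE on a query over the file gives a compact summary; a sketch:

%%sql
DESCRIBE SELECT * FROM read_json_auto('people.json')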

Store snippets#

You can use JupySQL’s --save feature to store a SQL snippet so you can keep your queries succinct:

%%sql --save clean_data_json
SELECT 
    name, 
    friends[1] AS first_friend, 
    likes.pizza AS likes_pizza, 
    likes.tacos AS likes_tacos
FROM read_json_auto('people.json')
*  duckdb://
Done.
name first_friend likes_pizza likes_tacos
John Jake True True
Jake John False True
Kelly John True True
Sam Kelly False True
%%sql --with clean_data_json
SELECT * FROM clean_data_json
*  duckdb://
Done.
name first_friend likes_pizza likes_tacos
John Jake True True
Jake John False True
Kelly John True True
Sam Kelly False True
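
The saved snippet composes with whatever SQL you layer on top of it; for example, filtering it down (a sketch):

%%sql --with clean_data_json
SELECT name, first_friend
FROM clean_data_json
WHERE likes_pizza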

Or using our .jsonl file:

%%sql --save clean_data_jsonl
SELECT 
    json ->> '$.name' AS name,
    json ->> '$.friends[0]' AS first_friend,
    json ->> '$.likes.pizza' AS likes_pizza,
    json ->> '$.likes.tacos' AS likes_tacos
FROM read_json_objects('people.jsonl')
*  duckdb://
Done.
name first_friend likes_pizza likes_tacos
John Jake true true
Jake John false true
Kelly John true true
Sam Kelly false true
%%sql --with clean_data_jsonl
SELECT * FROM clean_data_jsonl
*  duckdb://
(duckdb.CatalogException) Catalog Error: Scalar Function with name json_extract_scalar does not exist!
Did you mean "json_extract"?
[SQL: WITH clean_data_jsonl AS (SELECT JSON_EXTRACT_SCALAR(json, '$.name') AS name, JSON_EXTRACT_SCALAR(json, '$.friends[0]') AS first_friend, JSON_EXTRACT_SCALAR(json, '$.likes.pizza') AS likes_pizza, JSON_EXTRACT_SCALAR(json, '$.likes.tacos') AS likes_tacos FROM READ_JSON_OBJECTS('people.jsonl')) SELECT * FROM clean_data_jsonl]
(Background on this error at: https://sqlalche.me/e/14/f405)

At the time of writing, composing this snippet with --with fails: the ->> extraction is rewritten as JSON_EXTRACT_SCALAR, a function this DuckDB version doesn’t define (the error suggests json_extract instead). The read_json_auto-based snippet above composes without this issue.

Export to CSV#

Note

Using --with isn’t supported when exporting to CSV.

To export to CSV:

%%sql
COPY (
    SELECT 
    name, 
    friends[1] AS first_friend, 
    likes.pizza AS likes_pizza, 
    likes.tacos AS likes_tacos
    FROM read_json_auto('people.json')
)
TO 'people.csv' (HEADER, DELIMITER ',');
*  duckdb://
Done.
Count
4
%%sql
SELECT * FROM 'people.csv'
*  duckdb://
Done.
name first_friend likes_pizza likes_tacos
John Jake True True
Jake John False True
Kelly John True True
Sam Kelly False True
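
Finally, if you want to remove the files generated in this tutorial, a small cleanup sketch (it only targets the files written above):

from pathlib import Path

# delete the generated files, ignoring any that are missing
for name in ("people.json", "people.jsonl", "people.csv"):
    Path(name).unlink(missing_ok=True)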