Loading and Querying Excel Files

Loading and Querying Excel Files#

In this tutorial, we will be using small financial data stored in an Excel file containing over 700 records. The dataset is publicly available here. We will use the read_excel function from the pandas library to read the Excel file and store it in the database using the %sql --persist command of jupysql, which works across multiple databases. For additional compatibility between different databases and jupysql, please check out this page.

Note

DuckDB doesn’t support reading excel files. Their excel extension provides excel like formatting.

Note

For this tutorial, we aim to showcase the versatility of jupysql as a framework by using --persist. However, DuckDB natively supports Pandas DataFrame and you do not need to use --persist. With DuckDB, complex queries such as aggregations and joins can run more efficiently on the DataFrame compared to Pandas native functions. You can refer to this blog for a detailed comparison (Note: the comparison is based on Pandas v1.*, not the recently released Pandas v2.*, which uses PyArrow as a backend).

Installing dependencies:

%pip install jupysql duckdb duckdb-engine pandas openpyxl --quiet

Note: you may need to restart the kernel to use updated packages.

Reading dataframe using pandas.read_excel:

import pandas as pd

df = pd.read_excel("https://go.microsoft.com/fwlink/?LinkID=521962")

Initializing jupysql and connecting to duckdb database

%load_ext sql
%sql duckdb://

Persisting the dataframe in duckdb database. It is stored in table named df.

# If you are using DuckDB, you can omit this cell
%sql --persist df

*  duckdb://

'Persisted df'

Running some standard queries#

Selecting first 3 queries

%%sql 
SELECT *
FROM df
LIMIT 3

*  duckdb://
Done.

Segment	Country	Product	Discount Band	Units Sold	Manufacturing Price	Sale Price	Gross Sales	Sales	COGS	Profit	Date	Month Number	Month Name	Year
Government	Canada	Carretera	None	1618.5	3	20	32370.0	32370.0	16185.0	16185.0	2014-01-01 00:00:00	1	January	2014
Government	Germany	Carretera	None	1321.0	3	20	26420.0	26420.0	13210.0	13210.0	2014-01-01 00:00:00	1	January	2014
Midmarket	France	Carretera	None	2178.0	3	15	32670.0	32670.0	21780.0	10890.0	2014-06-01 00:00:00	6	June	2014

Countries in the database

%%sql 
SELECT DISTINCT Country
FROM df

*  duckdb://
Done.

Country
Canada
Germany
France
Mexico
United States of America

Evaluating total profit country-wise and ordering them in desceding order according to profit.

%%sql
select Country, SUM(Profit) Total_Profit
from df
group by Country
order by Total_Profit DESC

*  duckdb://
Done.

Country	Total_Profit
France	3781020.7800000007
Germany	3680388.8200000008
Canada	3529228.8850000002
United States of America	2995540.664999999
Mexico	2907523.1100000003

Loading and Querying Excel Files

Contents

Loading and Querying Excel Files#

Running some standard queries#