Loading and Querying Excel Files#

In this tutorial, we will be using small financial data stored in an Excel file containing over 700 records. The dataset is publicly available here. We will use the read_excel function from the pandas library to read the Excel file and store it in the database using the %sql --persist command of jupysql, which works across multiple databases. For additional compatibility between different databases and jupysql, please check out this page.

Note

DuckDB doesn’t support reading excel files. Their excel extension provides excel like formatting.

Note

For this tutorial, we aim to showcase the versatility of jupysql as a framework by using --persist. However, DuckDB natively supports Pandas DataFrame and you do not need to use --persist. With DuckDB, complex queries such as aggregations and joins can run more efficiently on the DataFrame compared to Pandas native functions. You can refer to this blog for a detailed comparison (Note: the comparison is based on Pandas v1.*, not the recently released Pandas v2.*, which uses PyArrow as a backend).

Installing dependencies:

%pip install jupysql duckdb duckdb-engine pandas openpyxl --quiet
Note: you may need to restart the kernel to use updated packages.

Reading dataframe using pandas.read_excel:

import pandas as pd

df = pd.read_excel("https://go.microsoft.com/fwlink/?LinkID=521962")

Initializing jupysql and connecting to duckdb database

%load_ext sql
%sql duckdb://

Persisting the dataframe in duckdb database. It is stored in table named df.

# If you are using DuckDB, you can omit this cell
%sql --persist df
*  duckdb://
'Persisted df'

Running some standard queries#

  • Selecting first 3 queries

%%sql 
SELECT *
FROM df
LIMIT 3
*  duckdb://
Done.
Segment Country Product Discount Band Units Sold Manufacturing Price Sale Price Gross Sales Discounts Sales COGS Profit Date Month Number Month Name Year
Government Canada Carretera None 1618.5 3 20 32370.0 0.0 32370.0 16185.0 16185.0 2014-01-01 00:00:00 1 January 2014
Government Germany Carretera None 1321.0 3 20 26420.0 0.0 26420.0 13210.0 13210.0 2014-01-01 00:00:00 1 January 2014
Midmarket France Carretera None 2178.0 3 15 32670.0 0.0 32670.0 21780.0 10890.0 2014-06-01 00:00:00 6 June 2014
  • Countries in the database

%%sql 
SELECT DISTINCT Country
FROM df
*  duckdb://
Done.
Country
Canada
Germany
France
Mexico
United States of America
  • Evaluating total profit country-wise and ordering them in desceding order according to profit.

%%sql
select Country, SUM(Profit) Total_Profit
from df
group by Country
order by Total_Profit DESC
*  duckdb://
Done.
Country Total_Profit
France 3781020.7800000007
Germany 3680388.8200000008
Canada 3529228.8850000002
United States of America 2995540.664999999
Mexico 2907523.1100000003