Loading and Querying Excel Files#
In this tutorial, we will be using small financial data stored in an Excel file containing over 700 records. The dataset is publicly available here. We will use the read_excel
function from the pandas library to read the Excel file and store it in the database using the %sql --persist
command of jupysql, which works across multiple databases. For additional compatibility between different databases and jupysql, please check out this page.
Note
DuckDB doesn’t support reading excel files. Their excel
extension provides excel like formatting.
Note
For this tutorial, we aim to showcase the versatility of jupysql as a framework by using --persist
. However, DuckDB natively supports Pandas DataFrame and you do not need to use --persist
. With DuckDB, complex queries such as aggregations and joins can run more efficiently on the DataFrame compared to Pandas native functions. You can refer to this blog for a detailed comparison (Note: the comparison is based on Pandas v1.*, not the recently released Pandas v2.*, which uses PyArrow as a backend).
Installing dependencies:
%pip install jupysql duckdb duckdb-engine pandas openpyxl --quiet
Note: you may need to restart the kernel to use updated packages.
Reading dataframe using pandas.read_excel
:
import pandas as pd
df = pd.read_excel("https://go.microsoft.com/fwlink/?LinkID=521962")
Initializing jupysql and connecting to duckdb
database
%load_ext sql
%sql duckdb://
Persisting the dataframe in duckdb database. It is stored in table named df
.
# If you are using DuckDB, you can omit this cell
%sql --persist df
* duckdb://
'Persisted df'
Running some standard queries#
Selecting first 3 queries
%%sql
SELECT *
FROM df
LIMIT 3
* duckdb://
Done.
Segment | Country | Product | Discount Band | Units Sold | Manufacturing Price | Sale Price | Gross Sales | Discounts | Sales | COGS | Profit | Date | Month Number | Month Name | Year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Government | Canada | Carretera | None | 1618.5 | 3 | 20 | 32370.0 | 0.0 | 32370.0 | 16185.0 | 16185.0 | 2014-01-01 00:00:00 | 1 | January | 2014 |
Government | Germany | Carretera | None | 1321.0 | 3 | 20 | 26420.0 | 0.0 | 26420.0 | 13210.0 | 13210.0 | 2014-01-01 00:00:00 | 1 | January | 2014 |
Midmarket | France | Carretera | None | 2178.0 | 3 | 15 | 32670.0 | 0.0 | 32670.0 | 21780.0 | 10890.0 | 2014-06-01 00:00:00 | 6 | June | 2014 |
Countries in the database
%%sql
SELECT DISTINCT Country
FROM df
* duckdb://
Done.
Country |
---|
Canada |
Germany |
France |
Mexico |
United States of America |
Evaluating total profit country-wise and ordering them in desceding order according to profit.
%%sql
select Country, SUM(Profit) Total_Profit
from df
group by Country
order by Total_Profit DESC
* duckdb://
Done.
Country | Total_Profit |
---|---|
France | 3781020.7800000007 |
Germany | 3680388.8200000008 |
Canada | 3529228.8850000002 |
United States of America | 2995540.664999999 |
Mexico | 2907523.1100000003 |