Logo

4,222 Introduction to Programming

Lecture 10: Queries

Dr. Aurélien Sallin

2026-05-08

Today

Today: Querying

  • Review the queries introduced in the previous lecture
  • Combine tables with JOIN
  • Write aggregate queries: GROUP BY, COUNT, AVG, HAVING
  • If time 👓 CTEs and window functions

Recap

Two ways to look at data

The benefits of databases

Relational model

  • Data is organized in tables (relations) with rows (records) and columns (attributes).
  • Each table has a primary key that uniquely identifies each record.
  • Tables can be linked through foreign keys, allowing for complex relationships between data.

Relational model

  1. We used a relational model with a countries and a trade_flows table in the lecture.
  2. We explored a star schema ⭐ with a fact table (observations) and dimension tables (states, indicators) in the exercises using FRED data.

FRED data model using ER diagrams and mermaid 🧜‍♀️

  • states and indicators are dimension tables: they describe the entities being measured.
  • observations is the fact table: it holds the measurements and references the dimensions via foreign keys.
states(state_id: INT PK, abbr: VARCHAR UNIQUE, name: VARCHAR, census_region: VARCHAR, census_division: VARCHAR)

indicators(indicator_id: INT PK, suffix: VARCHAR UNIQUE, title: VARCHAR, units: VARCHAR, frequency: VARCHAR)

observations(state_id: INT FK->states, indicator_id: INT FK->indicators, date: DATE, value: DOUBLE)
  PK: (state_id, indicator_id, date)

erDiagram
    STATES {
        int     state_id        PK
        varchar abbr
        varchar name
        varchar census_region
        varchar census_division
    }
    INDICATORS {
        int     indicator_id    PK
        varchar suffix
        varchar title
        varchar units
        varchar frequency
    }
    OBSERVATIONS {
        int    state_id      FK
        int    indicator_id  FK
        date   date
        double value
    }
    STATES       ||--o{ OBSERVATIONS : "measured in"
    INDICATORS   ||--o{ OBSERVATIONS : "recorded as"

Constructing a database and the ETL process

Note: This graph is my own representation of the process. See the Appendix for the process in text.

Write SQL queries with DuckDB from Python

Our Data Management System (DBMS) is the software that manages the database and allows us to interact with it using SQL.

We will use DuckDB

  • built for analytics and research
  • embedded (runs inside Python/R, no server)
  • Reads CSV, Parquet, JSON directly
  • One .db file, shareable

How it works in Python

  • Open (or create) a database;
  • Queries are passed as strings, and the result is a new table converted to a pandas DataFrame with .df();
  • Close the connection when done.
#| eval: false
import duckdb

# Create a connection to a .db file
conn = duckdb.connect("week_10/trade.db")

# Query
conn.execute("""
    SQL QUERY
""").df()

# Close the connection when done
conn.close()

SQL has five main purposes

Data Definition Language DDL

  • Use CREATE TABLE to define the structure of a table (columns, data types, constraints).
  • Use ALTER TABLE to modify the structure of an existing table (e.g. add a column). ⚠️

Data Manipulation Language DML

  • Use INSERT INTO to add data to the table, either row by row or in bulk.
  • Use UPDATE to modify existing rows (based on a condition). ⚠️
  • Use DELETE to remove rows based on a condition. ⚠️

Data Query Language DQL

  • Use SELECT to retrieve data from the database.

Data Control Language DCL + Transaction Control Language TCL

⚠️ Covered in the take-home exercise

The basic query structure

A query is a request for specific information from the database. It takes one or more tables as input and returns a new table as output.

SQL clauses always follow a specific order:

-- Comment: the basic structure of a SQL query
SELECT   column1, column2   -- which columns to return
FROM     table_name         -- which table
WHERE    condition          -- filter rows (optional)
ORDER BY column ASC/DESC    -- sort (optional)
LIMIT    n                  -- truncate output (optional)
;
  • The database executes
    FROMWHERESELECTORDER BYLIMIT.
  • SELECT is written first but evaluated last.

Source: w3

The role of databases in your group project 🧑‍🎓👩‍🎓

  • Principle: all data flows from the database. Your scripts should read from the database, not from raw files.
  • Simulate a “firm” or “organization” environment where the data is shared to analysts (you 🫵) through queries.

Advantages

  • All your analyses are based on the same cleaned and transformed data (consistency)
  • You can easily update your results when new data comes in by simply reloading the database.

Queries

Let’s work!

  1. Go on https://www.cepii.fr/CEPII/en/bdd_modele/bdd_modele.asp
  2. Download the “BACI” dataset (trade flows between countries) named “HS22”.
  • This contains bilateral trade flows for 200 countries at the product level (5000 products) from 2022 to 2024.
  • Products use the “Harmonized System” nomenclature (6-digit codes).
  1. Unzip the file in the week_10/data folder. You should have the following structure:
course_materials/
├── week_10/
│   ├── build_cepii_db.py
│   ├── data/
│   │   ├── BACI_HS22_V202601
│   │   │   ├── BACI_HS22_Y2022_V202601.csv
│   │   │   ├── BACI_HS22_Y2023_V202601.csv
│   │   │   ├── BACI_HS22_Y2024_V202601.csv
│   │   │   ├── country_codes_V202601.csv
│   │   │   ├── product_codes_HS22_V202601.csv
  1. Initialize your environment using uv init --python 3.13 and uv add pandas duckdb skimpy.
  2. Open the build_cepii_db.py script and code along with me.

Let’s work!

  • We will create a database from a real dataset and run queries on it.
  • Open the python script and code along with me.
  • This will be a more hands-on session, and we will cover more advanced queries than in the previous lecture.

Extract and sketch the model

  • Section # ── EXTRACT ─────────────────────
  • Section # ── Sketch ─────────────────────

CREATE our TABLEs

  • Creating a table means defining its schema: the column names, their types, and any constraints.
  • The schema is self-documenting: anyone reading it knows the rules.

It follows a simple syntax:

CREATE TABLE <table name> (
    <var name>  <var type> <constraint> ,
    <var name>  <var type> <constraint> ,
    ...
);

Create tables, load data

  • Section # ── Create ─────────────────────
  • Section # ── Load ─────────────────────

Queries: some useful operators and functions

COUNT, * and UNION ALL

  • The * operator in SELECT * means “all columns”.
  • The COUNT(*) function counts the number of rows in the result, regardless of NULLs.
  • UNION ALL combines results from two queries (keeping duplicates). UNION removes duplicates. It is like row binding in R or pandas.

Query data

Section ── Queries ─────────────

Some useful operators and functions

Strings

  • String functions: UPPER(), LOWER(), CONCAT(), SUBSTRING(), etc.
SELECT CONCAT(name, ' (', region, ')') AS label FROM countries;
-- → 'Switzerland (Europe)'

SELECT SUBSTRING(name, 1, 3) AS code FROM countries;
-- → 'Swi'

Some observations on SQL syntax

Queries

  • All queries start with SELECT even if you don’t want to select any column (e.g. SELECT COUNT(*)).
  • Queries don’t modify data: only INSERT, UPDATE, DELETE do

Formatting

  • SQL ignores whitespace and line breaks: formatting is purely for readability
  • Strings use single quotes (‘states’), not double quotes
  • All keywords are case-insensitive, but it is a common convention to write them in CAPS to distinguish them from column names.
  • Column/table names are case-insensitive by default, but values are case-sensitive (‘CA’ ≠ ‘ca’)

JOINS

Joins

Bring columns from two tables together by matching keys (often a foreign key).

  • In BACI, we have exporter codes; without a JOIN we only see numbers…

Different types based on how rows from two tables are matched and combined:

  • INNER JOIN: returns only matching rows from both tables
  • LEFT JOIN: returns all rows from the left table and matching rows from the right table (NULL if no match, useful to find gaps)
  • RIGHT JOIN: returns all rows from the right table and matching rows from the left table (NULL if no match)
  • FULL OUTER JOIN: returns all rows when there is a match in either left or right table (NULL if no match in either)
Remember the course data handling: same logic, different tools (SQL vs R)

Common syntax

The most common join syntax in DuckDB follows this pattern:

SELECT *
  FROM table1 t1
  JOIN table2 t2
    ON t1.column = t2.column
  • JOIN ... ON: the matching condition (the foreign key relationship)
  • t1, t2 are table aliases: shorthand written after the table name
  • You can chain multiple JOINs to reach all the dimensions you need

Types of joins

SELECT TableA.column1,
    TableB.column2,
    ....
  FROM TableA
  INNER JOIN TableB
    ON TableA.matching_column = TableB.matching_column;

SELECT TableA.column1,
    TableB.column1,
    ....
  FROM TableA
  FULL JOIN TableB
    ON TableA.matching_column = TableB.matching_column;

Note: FULL JOIN = FULL OUTER JOIN (the OUTER keyword is optional), JOIN = INNER JOIN.

Right and left joins

SELECT TableA.column1,
    TableB.column1,
    ....
  FROM TableA
  LEFT JOIN TableB
    ON TableA.matching_column = TableB.matching_column;

SELECT TableA.column1,
    TableB.column1,
    ....
  FROM TableA
  RIGHT JOIN TableB
    ON TableA.matching_column = TableB.matching_column;

Cross join

CROSS JOIN creates a Cartesian product of two tables:

  • returns every possible combination of rows from both tables.
  • does not use a join condition
  • the total number of rows in the result is the number of rows in the first table multiplied by the number of rows in the second table
  • useful for generating all combinations of data
  • dangerous 💣

Remember:

Joining tables

Section ── JOINS ─────────────

Group by and aggregation

Summarize data with GROUP BY

GROUP BY collapses rows that share a value into a single summary row
  • It is often used with aggregate functions (e.g. COUNT, AVG, SUM, MIN, MAX) to perform calculations on each group of data.
  • Typical analytical questions: “how many per group?”, “what’s the average per group?”, “what’s the total per group?”
SELECT   region,
         COUNT(*) AS n_countries
FROM     countries
GROUP BY region
ORDER BY n_countries DESC

⚠️ SELECT can only reference GROUP BY columns or aggregate functions (common error!)

GROUP BY: common aggregate functions

Function What it computes
COUNT(*) Number of rows in the group
AVG(col) Mean value
SUM(col) Total
MIN(col) / MAX(col) Extremes

Aggregation

Section ── GROUP BY ─────────────

HAVING: filter on groups

WHERE filters rows (before grouping). HAVING filters groups (after grouping).

SELECT   exporter_id,
         SUM(value)     AS total_exports
FROM     flows
WHERE    year = 2022
GROUP BY exporter_id
HAVING   SUM(value) > 1000000
ORDER BY total_exports DESC
  • If the condition uses an aggregate (COUNT, AVG, …), it goes in HAVING. Everything else goes in WHERE.

Filter on aggregations

Section ── HAVING ─────────────

The full query order

SQL clauses must appear in this order:

SELECT   ...          -- 6. choose output columns
FROM     ...          -- 1. which table(s)
JOIN     ... ON ...   -- 2. combine tables
WHERE    ...          -- 3. filter rows
GROUP BY ...          -- 4. group rows
HAVING   ...          -- 5. filter groups
ORDER BY ...          -- 7. sort
LIMIT    ...          -- 8. truncate
The database executes FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT.

The numbers show execution order, not writing order. SELECT is written first but evaluated sixth, which is why you cannot use a SELECT alias in a WHERE clause.

👓 Timing test duckdb vs pandas

Section ── TIMING ─────────────

👓 Advanced

CTES are reusable subqueries

A Common Table Expression names an intermediate result so you can reference it like a table.

WITH gdp_2020 AS (
    SELECT   c.name,
             c.region,
             o.value   AS gdp_per_capita
    FROM     observations  o
    JOIN     countries     c  ON o.country_id   = c.country_id
    JOIN     indicators    i  ON o.indicator_id = i.indicator_id
    WHERE    i.code = 'NY.GDP.PCAP.CD'
      AND    o.year = 2020
),
region_avg AS (
    SELECT   region,
             ROUND(AVG(gdp_per_capita), 0) AS avg_gdp
    FROM     gdp_2020
    GROUP BY region
)
SELECT * FROM region_avg
ORDER BY avg_gdp DESC
  • Chain as many CTEs as you need, separated by commas
  • Each CTE can reference the ones defined before it

CTES can build on each other

Source: datawithbaraa.com

Advanced: CTE

Section ── CTE ─────────────

Window functions

Window functions

  • Perform calculations across a set of rows related to the current row, without collapsing them into groups
  • Allow you to compute running totals, ranks, and other cumulative metrics while still keeping the individual row details
  • Aggregation OVER a “window” of rows defined by PARTITION BY or ORDER BY
  • PARTITION BY: restart the window for each partition (like GROUP BY, but rows stay separate)
  • ORDER BY inside OVER: defines the sequence within each partition
  • LAG(col, n): value from n rows back; LEAD(col, n): value from n rows ahead
  • ROW_NUMBER(): sequential rank within each partition

Other useful window functions: RANK(), DENSE_RANK(), SUM(...) OVER (...) for running totals.

Advanced: Window functions

Section ── Window Function ─────────────

Looking ahead

What we covered today

  • Aggregation: GROUP BY, COUNT, AVG, SUM, HAVING
  • JOINs: INNER JOIN and LEFT JOIN to combine tables on foreign keys
  • CTEs: WITH to build readable, reusable subqueries

What you can now do

  • Write a SELECT query with WHERE, ORDER BY, and LIMIT
  • Combine tables using INNER JOIN and LEFT JOIN on a foreign key
  • Write aggregate queries with GROUP BY and filter groups with HAVING
  • Break a complex query into readable steps with a CTE

Sources

  • Simon Aubury & Ned Letcher, Getting Started with DuckDB (Packt, 2024)
  • Marco Venturini, Data Handling: Databases (HSG, 2025): DDL, constraints, ER diagrams
  • DuckDB documentation: https://duckdb.org/docs/

Appendix

Summary: the benefits of databases

Infographics from intellspot

Constructing a database and the ETL process

  1. 🦴 Set up the skeleton of the data through a data model.

    1. Design the schema: define the tables, columns, data types, constraints, and relationships.
    2. Create the database: use SQL commands to create the tables and set up constraints
  1. Populate/Load the model with the transformed data (ETL)

    1. E xtract: gather data from various sources (e.g. CSV files, APIs, other databases).
    2. T ransform: clean and format the data to fit the schema (e.g. handle missing values, convert data types, normalize). se python/R to clean and reshape your raw data, assign IDs, and prepare it for loading.
    3. L oad: insert the transformed data into the database using SQL commands or bulk loading