Logo

4,222 Introduction to Programming

Lecture 9: Databases & SQL

Dr. Aurélien Sallin

2026-05-01

Two ways to look at data

Two ways to look at data

Two ways to look at data

Data in memory: pandas, (polars), dplyr

Data in a database: SQL

  • You load data into RAM and manipulate it with code
  • You send queries to a database, which returns only what you ask for
  • The data lives in memory, temporary, on RAM
  • The data lives in a database, persistent, on disk or in the cloud
  • Size is small and can fit in memory
  • Large datasets that don’t fit in memory
  • You work on the whole dataset at once
  • You operate on “parts/subsets” of the data through querying
  • For statistical analysis, plotting, typical data workflow in research, small projects
  • For data that needs to stay consistent over time, for data that is updated frequently, for data that is shared across multiple users/systems, for production applications

Goals for the next two weeks

CONSTRUCT a database

  • understand why databases exist and what problem they solve
  • know the relational model: tables, rows, keys, relationships

QUERY the database

  • write our first simple SQL queries with DuckDB
  • make more complex queries with JOINs, Groupings, subqueries, and CTEs (next week)

Note: If we can’t cover everything today, we will cover the rest next week. Slides marked with 👓 are slides that I will skip in the lecture if time is short, but that you should read at home.

What is a database?

What is a database?

Every app, every service, every research dataset you interact with is backed by a database.

Structured data

  • Customer records (name, address, country)
  • Trade flows (year, exporter, importer, value)
  • Survey responses (respondent_id, question, answer)

Unstructured data

  • Text, emails, news articles
  • Images, audio, video
A database is an organized collection of data.

Managing databases

A Database Management System (DBMS) is the software that sits between you and the data.
  • You send it a query: “Which countries exported more than 10B in 2022?”
  • It returns the answer.

You never touch or manipulate the raw data directly. The DBMS handles storage and access. It serves you the data “on a silver platter”.

Why use a DBMS?

Three elements are needed from your databases:

1. Data independence & integrity

Many people/systems read and write the same data. Data must stay correct regardless of who touches it.

  • Abstract view. No one touches raw storage
  • Constraints enforce consistency automatically
  • No conflicting CSV versions, no overwrites

2. Simple, efficient access

Apps, dashboards, and analysts all need to extract data quickly, without knowing where or how it is stored.

  • SQL is the standard interface across all databases
  • The query optimizer decides how to run your query. You just say what you want.
  • Multi-access: many users can query the same data at the same time, without conflicts.

3. Governance & reliability

The system cannot break down (no failed transaction, a lost record, a data breach).

  • The database protects the data in case of problems.
  • Recovery and backup mechanisms ensure data is not lost.
  • Access to the data is controlled and monitored.

Note: Check the 🔎 Self-Study for the ACID properties: the four guarantees that keep database transactions reliable.

The relational model

Data lives in tables

The fundamental idea of relational databases: data is organized into relations (tables).

Data lives in tables

The fundamental idea of relational databases: data is organized into relations (tables).

Each table/relation has:

  • A (relation) schema: the table name, column names, and their types (or domain)
  • Columns (attributes): one variable per column
  • Rows (records / tuples): one entry per observation

We can write the relation schema as:

Table(attribute_1: INT, attribute_2: VARCHAR, ...)

Check the 🔎 Self-Study for a quick note on the terminology of the relational model.

Data lives in tables

The fundamental idea of relational databases: data is organized into relations (tables).
Countries(country_id: INT, name: VARCHAR, region: VARCHAR, gdp_usd: FLOAT)
Trade_flows(flow_id: INT, exporter_id: INT, importer_id: INT, year: INT, value_usd: FLOAT)

Relational databases are built on relationships

A relational database is a collection of relations (tables) that are linked together through keys.

Relationships are built on keys

Primary key

  • Column (or set of columns) that uniquely identifies each row. If two tuples agree on the value(s) of the key, then they must be the same tuple.
  • It is NOT NULL and UNIQUE by definition
  • It allows other tables to refer to this table

Note: A good primary key is stable (never changes), unique, and meaningless on its own: a number, not the country name, which could change. If no natural attribute works as a key, you create a surrogate key (an artificial integer ID). When a single column isn’t enough, a composite key combines two or more columns.

Relationships are built on keys

Primary keys in our model?

Relationships are built on keys

Primary keys in our model?

  • country_id in Countries is the primary key. It uniquely identifies each country.
  • flow_id in Trade_Flows is the primary key for trade flows.

Relationships are built on keys

Foreign key

  • A column (or set of columns) that refers to a primary key in another table. It creates a link between the two tables.
  • Always in the referencing table (the “many” side of a one-to-many relationship)
  • A foreign key constraint means: you cannot insert a row with a foreign key value that does not exist in the referenced table.
  • This is known as referential integrity.

Relationships are built on keys

What are the foreign keys in our model?

Relationships are built on keys

What are the foreign keys in our model?

  • exporter_id and importer_id in trade_flows are foreign keys that refer to country_id in countries
  • The constraint: you cannot insert a trade flow for a country that does not exist in Countries. Every trade flow must reference an existing country.

Why split data into multiple tables?

Yes, why?

Why split data into multiple tables?

Imagine storing everything in one table:

flow_id exporter_name exporter_region importer_name importer_region year value
1 Switzerland Europe Germany Europe 2022 45M
2 Switzerland Europe France Europe 2022 38M
  • Redundant storage: "Europe" is 6 bytes; an integer ID is 4 bytes. Multiply by millions of rows, this becomes huge.
  • Update anomaly: if Switzerland changes region, one row to fix in countries. In a flat table, every trade flow row is rewritten.
  • Insertion anomaly: to add capital_city to Switzerland, you add one column to countries, and not across millions of rows.
  • Deletion anomaly: deleting all Swiss trade flows does not delete the information that Switzerland is in Europe.
Split tables eliminate redundancy and keep data consistent. This is called normalization.

An example of relational model: the star schema

The star schema ⭐

  • Most widely used to develop data warehouses (databases optimized for analytics)
  • Consists of one or more fact tables referencing any number of dimension tables.

Source: Wikipedia
  • Fact table (observations). Long data, one row per measurement
  • Dimension tables (countries, indicators). Metadata, usually wider because contains more attributes.
  • 🔑 The fact table holds foreign keys to all dimensions

The star schema ⭐

  • Moves repeated metadata into separate tables and replaces strings with integer IDs.
    -> normalization
  • Upside: consistency, compactness, flexibility. Downside: you need JOINs to get readable labels.

observations (fact table with millions of rows)

country_id indicator_id year value
3 1 2020 45 234
3 1 2021 46 100
3 2 2020 3.8
4 1 2020 38 600

countries (dimension table)

country_id name region
3 Germany Europe & Central Asia
4 France Europe & Central Asia

indicators (dimension table)

indicator_id name
1 GDP per capita
2 Unemployment rate

Note: Check the 🔎 Self-Study for a quick note on the star schema and normalization.

The star schema ⭐

We will explore a simple star schema with FRED data in the exercise session.

Create a database
Step 1: design the schema

Think in entities and relationships

Before creating a database, sketch the structure of your data. Think visually about your data structure and your schema design.

  1. Entities: a country, a firm, a survey respondent, a product.
  2. Attributes: the properties of each entity. In our example, each country has a name, a region, an income group.
  3. Relationships: how entities connect with each other.
    E.g., a trade flow links an exporter country to an importer country, or a survey response belongs to a respondent

Note: The concept of “entity” is close to the idea of “unit of observation” in statistics. It is the “thing” that you are collecting data about.

Write your schema in a simple text format first

For instance:

Countries(country_id: INT PK, name: VARCHAR, region: VARCHAR, gdp_usd: FLOAT)

Trade_flows(flow_id: INT PK, exporter_id: INT FK->Countries, importer_id: INT FK->Countries, year: INT, value_usd: FLOAT)
  • PK = primary key, FK = foreign key
  • FK→Countries means it references the Countries table
  • This is an informal shorthand for sketching schemas, not a formal standard. Different textbooks use different conventions.

Or use an ER diagram

Mermaid 🧜‍♀️ is a text-based diagram language for entity-relationship diagrams (ERDs), which renders directly in Quarto.

erDiagram
    COUNTRIES {
        int     country_id  PK
        varchar name
        varchar region
        float   gdp_usd
    }
    TRADE_FLOWS {
        int   flow_id      PK
        int   exporter_id  FK
        int   importer_id  FK
        int   year
        float value_usd
    }
    COUNTRIES ||--o{ TRADE_FLOWS : "exports"
    COUNTRIES ||--o{ TRADE_FLOWS : "imports"

erDiagram
    COUNTRIES {
        int     country_id  PK
        varchar name
        varchar region
        float   gdp_usd
    }
    TRADE_FLOWS {
        int   flow_id      PK
        int   exporter_id  FK
        int   importer_id  FK
        int   year
        float value_usd
    }
    COUNTRIES ||--o{ TRADE_FLOWS : "exports"
    COUNTRIES ||--o{ TRADE_FLOWS : "imports"

It uses crow’s foot notation 🐦‍⬛ to show relationships between tables: || = one and only one; o{ = zero or more; etc.

Thinking in entities and relationships

Once you have a sketch, translating to SQL will be easier:

  • Each entity → one table
  • Each attribute → one column, with a type
  • Each relationship → a foreign key

Create a database
Step 2: from sketch to schema using SQL

SQL — Structured Query Language

  • SQL is a query language.
  • You describe what you want, and the database figures out how to get it.
  • It is very high level, highly optimized, and has been around for decades.
  • Different “dialects” (standards and implementations, which are incompletely compatible)
  • 💪 Required skill for most data science positions in industry

Note: Originally based upon relational algebra and tuple relational calculus. We won’t go into details in this introductory course. I recommend Marco Venturini’s lecture for more details.

SQL — Structured Query Language

SQL has five main parts, or five main purposes:

  • DDL — Data Definition Language: CREATE TABLE, DROP TABLE, ALTER TABLE
  • DML — Data Manipulation Language: INSERT, UPDATE, DELETE records
  • DQL — Data Query Language: SELECT
  • DCL — Data Control Language (authorization): GRANT, REVOKE
  • TCL — Transaction Control Language (manages changes in transactions): COMMIT, ROLLBACK, SAVEPOINT, SET TRANSACTION

In this course, we’ll focus on DDL, DML and DQL.

The database universe

Hundreds of database products exist.

Source: A Primer on Databases

👓 The most important difference among the dbms is OLTP vs. OLAP

OLTP: Online Transaction Processing

  • One row at a time: inserts, updates, deletes
  • High write frequency, strict data integrity
  • Powers apps, payment systems, inventory
  • Running the business (claims, subscriptions, inventory movements)

OLAP: Online Analytical Processing

  • Millions of rows at once: aggregations, joins
  • Read-heavy, optimised for full-column scans
  • Powers dashboards, research, reporting
  • Understanding the business, tailored for analytics
  • Data warehousing
  • Both have different internal architectures, optimised for different goals.
  • Both have a SQL query interface.

Note: See Kleppmann’s Designing Data-Intensive Applications (O’Reilly, 2017, Chapter 3). The distinction boils down to three dimensions: storage layout (row-oriented vs column-oriented), indexing strategy, concurrency model.

In this course…

… we will use DuckDB
  • built for analytics and research
  • column-oriented, with a simple concurrency model (one writer, many readers)
  • optimized for analytical queries, aggregations, reads
  • embedded (runs inside Python/R, no server)
  • open source, free, and easy to set up: uv add duckdb
  • Reads CSV, Parquet, JSON directly
  • One db file, shareable

We will use DuckDB as DBMS

  • DuckDB is usable from Python and R. The project.db file can be accessed from both.
  • A typical workflow 🔄 : create a connection to a project.db file, create tables, query, get results, then close the connection.

Python

import duckdb

# Open (or create) a .db file
conn = duckdb.connect("project.db")

# Query → pandas DataFrame
df = conn.execute("""
    SELECT name, gdp_usd
    FROM   countries
    WHERE  region = 'Europe'
    ORDER BY gdp_usd DESC
""").df()

conn.close()

R

library(duckdb)
library(DBI)

# Open (or create) a .db file
con <- dbConnect(duckdb(), "project.db")

# Query → data.frame
df <- dbGetQuery(con, "
    SELECT name, gdp_usd
    FROM   countries
    WHERE  region = 'Europe'
    ORDER BY gdp_usd DESC
")

dbDisconnect(con, shutdown = TRUE)

Create a database
Step 3: ETL

Use the companion python script

It contains all the code to create the database and run queries.

OLTP, OLAP, and the ETL pipeline

Kleppmann, 2017, p. 88

Data is extracted from OLTP databases, transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse.

This process is known as Extract-Transform-Load (ETL)

Extract-Transform-Load (ETL)

A common pipeline that fits well with the type of work we do in (Econ) research.

Step What happens Tool
Extract (for Economists) Read raw data from its source, often messy and wide format pandas, fredapi, CSV, an app, a survey
Extract (production, industry) Extract from OLTP databases
Transform Clean, reshape, tidy data, assign IDs pandas
Load Insert into the database (warehouse), give it structure and queryable SQL INSERT

👓 The ETL pipeline

Why not load raw data directly into a database?

  • Raw data is often wide, messy, missing IDs
  • SQL has no melt(), no (advanced) cleaning
  • pandas or tidyverse/data.table in R are the right tool for reshaping

Why separate the raw data from the data warehouse?

  • Database is optimized for analytic access
  • No interference with the operation of the business

Conduct the Extract and Transform in python

Use the companion script.

Create a database
Step 4: create tables and load data

Building a database is a two-step process

  1. 🦴 Set up the architecture of the data through a data model.

    • Use SQL to create tables, define columns, and set constraints.
    • This is the “skeleton” of your database, the structure that holds your data together.
  1. Populate/Load the model with the transformed data (ETL)

    • Use python/R to clean and reshape your raw data, assign IDs, and prepare it for loading.
    • Use SQL INSERT statements to add rows to your tables.

CREATE our first TABLE

  • Creating a table means defining its schema: the column names, their types, and any constraints.
  • The schema is self-documenting: anyone reading it knows the rules.

It follows a simple syntax:

CREATE TABLE <table name> (
    <var name>  <var type> <constraint> ,
    <var name>  <var type> <constraint> ,
    ...
);

Atomic data types in SQL

  • The SQL standard defines a core set of data types.
  • Each dialect has its own extensions, but these are the most common:
    • Characters: VARCHAR[(length)] (Variable-length string, max n)
    • Numeric: INTEGER/INT (four-byte integers), BIGINT (64-bit integers), SMALLINT (16-bit integers), FLOAT (approximate floating-point), DECIMAL(p, s) (Exact fixed-point number with precision p and scale s)
    • Others: DATE, TIME, DATETIME/TIMESTAMP, INTERVAL, BOOLEAN

Note: FLOAT is used in this course for simplicity. For monetary values in production, prefer DECIMAL(p, s) to avoid floating-point rounding errors (e.g. DECIMAL(20, 2)). Refer to the documentation of each SQL dialect for the full list of supported types.

SQL constraints

Constraints enforce rules on your data at the database level. They are optional (but highly recommended to ensure data integrity and consistency).

Constraint What it does
NOT NULL Column must always have a value
UNIQUE No two rows can share the same value
PRIMARY KEY* NOT NULL + UNIQUE: the row’s unique identifier
FOREIGN KEY Value must exist in another table
CHECK Custom condition, e.g. CHECK (gdp_usd >= 0)
DEFAULT Auto-fills a value when none is provided

*When a single column is not enough to identify a row uniquely, use a composite primary key:

-- e.g. for an observations table: one row per (state, indicator, date)
PRIMARY KEY (state_id, indicator_id, date)

Create the table “countries”

CREATE TABLE countries (
    country_id INTEGER PRIMARY KEY,
    name       VARCHAR NOT NULL UNIQUE,
    region     VARCHAR,
    gdp_usd    FLOAT   CHECK (gdp_usd >= 0)
);

  • country_id: integer, the primary key: NOT NULL + UNIQUE by definition
  • name: text, required (NOT NULL), no duplicates (UNIQUE)
  • region: text, optional (no constraint — NULL is allowed)
  • gdp_usd: number, constraint: must be non-negative

Create the table “trade_flows”

CREATE TABLE trade_flows (
    flow_id     INTEGER PRIMARY KEY,
    exporter_id INTEGER NOT NULL REFERENCES countries(country_id),
    importer_id INTEGER NOT NULL REFERENCES countries(country_id),
    year        INTEGER NOT NULL,
    value_usd   FLOAT   CHECK (value_usd >= 0)
);

  • REFERENCES countries(country_id): foreign key constraint.
  • The value of exporter_id must match an existing country_id in the countries table. Same for importer_id.
  • Will reject any insert where exporter_id has no matching row in countries
  • Both exporter_id and importer_id point to the same table

Load using INSERT

The INSERT statement adds rows to a table. It must respect all constraints defined in the schema.

INSERT INTO <table> VALUES (1, 'Switzerland', 'Europe', 905000000000);

Or, DuckDB allows to import from CSV, Parquet, JSON directly:

INSERT INTO countries SELECT * FROM read_csv('data/countries_raw.csv');

Or from a pandas DataFrame (DuckDB-Python only: DuckDB recognises the variable df_flows in your Python scope):

INSERT INTO trade_flows SELECT * FROM df_flows

Load using INSERT in our example

  • Populate the tables with some rows using INSERT
-- Populate countries first (the referenced table must exist)
INSERT INTO countries VALUES (1, 'Switzerland', 'Europe', 905000000000);
INSERT INTO countries VALUES (2, 'Germany',     'Europe', 4072000000000);
INSERT INTO countries VALUES (3, 'Brazil',      'LATAM',  2081000000000);

-- Now populate trade flows
INSERT INTO trade_flows VALUES (1, 1, 2, 2022, 45000000);
INSERT INTO trade_flows VALUES (2, 2, 3, 2022, 120000000);

👓 What the foreign key actually prevents

1. Inserting a row with a non-existent reference

-- This works: country 1 exists in countries
INSERT INTO trade_flows VALUES (1, 1, 2, 2022, 45000000);

-- This fails: country 99 does not exist
INSERT INTO trade_flows VALUES (2, 99, 2, 2022, 10000000);
-- Error: FOREIGN KEY constraint failed

Why?

The database enforces referential integrity: every reference must point to an existing row. This prevents “orphan” records that reference non-existent entities.

Conclusion: Always insert into the referenced table first. You cannot insert a trade flow for a country that doesn’t exist yet.

👓 What the foreign key actually prevents

2. Deleting a referenced row

The constraint also works in the other direction: you cannot delete a row that is still referenced.

-- This fails: country 1 is referenced by trade_flows
DELETE FROM countries WHERE country_id = 1;
-- Error: Violates foreign key constraint
-- You must delete the child rows first
DELETE FROM trade_flows WHERE exporter_id = 1 OR importer_id = 1;
DELETE FROM countries  WHERE country_id = 1;  -- now this works
The database protects referential integrity in both directions: on INSERT and on DELETE.

Queries and basic SQL syntax

Querying a relational database

A query is a request for specific information from the database. It takes one or more tables as input and returns a new table as output.

country_id name region gdp_usd
1 Switzerland Europe 905000000000
2 Germany Europe 4072000000000
3 Brazil LATAM 2081000000000

What is Switzerland’s GDP?

SELECT gdp_usd
FROM   countries
WHERE  name = 'Switzerland';

Structure of a SQL query

  • A SQL query is made of clauses. Each clause has a specific purpose and must appear in a specific order.
  • The most common clauses are SELECT, FROM, WHERE, ORDER BY, and LIMIT.
  • Every query must have a SELECT and a FROM. The others are optional.

Source: w3

Everything starts with SELECT

Selects which columns to return:

SELECT column1, column2
    FROM   table_name
  • Use * for all.
  • List all columns you want, separated by commas.
  • You can also use expressions, e.g. SELECT gdp_usd / population AS gdp_per_capita, where AS renames the column in the output.
  • FROM the table you want.

Source: w3

Filtering with WHERE

Filters rows based on a condition:

SELECT column1, column2
    FROM   table_name
    WHERE  condition
  • You can use comparison operators (=, !=, <, >, <=, >=), logical operators (AND, OR, NOT), and functions in the condition.
  • The condition can reference any column from the FROM tables, but not aliases defined in the SELECT clause.

Source: w3

ORDER BY and LIMIT

Sort results and limit output:

SELECT gdp_usd
    FROM     countries
    WHERE    region = 'Europe'
    ORDER BY gdp_usd DESC
    LIMIT    5
  • ORDER BY sorts by column (ASC ascending, DESC descending)
  • LIMIT returns only the first N rows
  • LIMIT is useful for exploring large datasets. Always limit during exploration so you don’t wait for millions of rows.

Source: w3

The basic query structure

SQL clauses must appear in this order (💣 and not in any other order!!!)

-- Comment: the basic structure of a SQL query
SELECT   column1, column2   -- which columns to return
FROM     table_name         -- which table
WHERE    condition          -- filter rows (optional)
ORDER BY column ASC/DESC    -- sort (optional)
LIMIT    n                  -- truncate output (optional)
;
  • The database executes FROMWHERESELECTORDER BYLIMIT.
  • SELECT is written first but evaluated last.

Source: w3

👓 UPDATE, ALTER TABLE, DELETE

ALTER TABLE: modify the structure of a table

ALTER TABLE countries ADD COLUMN income_group VARCHAR;

UPDATE: modify existing rows

UPDATE countries
SET    income_group = 'High income'
WHERE  region = 'Europe';

DELETE: remove rows

DELETE FROM countries
WHERE  name = 'Switzerland';

Write SQL queries with DuckDB from Python

import duckdb

# Create a connection to a .db file
conn = duckdb.connect("week_09/trade.db")

# Query
conn.execute("""
    SQL QUERY
""").df()

# Close the connection when done
conn.close()
  • You have to open a database (or create one), using duckdb.connect("file.db"). If the file doesn’t exist, it will be created.
  • For read-only access to an existing file, use duckdb.connect("file.db", read_only=True).
  • Run queries using conn.execute("SQL QUERY").df() directly from Python (or R).
  • Queries are passed as strings, and the result is a new table.
  • Close the connection when done.
  • We usually convert the result to a pandas DataFrame for further analysis in Python with .df().

👓 Some useful operators and functions

COUNT, * and UNION ALL

  • The * operator in SELECT * means “all columns”.
  • The COUNT(*) function counts the number of rows in the result, regardless of NULLs.
  • UNION ALL combines results from two queries (keeping duplicates). UNION removes duplicates. It is like row binding in R or pandas.

Dates

  • Use EXTRACT(YEAR FROM date_column) to get the year from a date variable. Works with MONTH, DAY, etc.
  • Example: WHERE EXTRACT(YEAR FROM date) = 2020

👓 Some useful operators and functions

Strings

  • String functions: UPPER(), LOWER(), CONCAT(), SUBSTRING(), etc.
SELECT CONCAT(name, ' (', region, ')') AS label FROM countries;
-- → 'Switzerland (Europe)'

SELECT SUBSTRING(name, 1, 3) AS code FROM countries;
-- → 'Swi'

👓 Some observations on SQL syntax

Queries

  • All queries start with SELECT even if you don’t want to select any column (e.g. SELECT COUNT(*)).
  • Queries don’t modify data: only INSERT, UPDATE, DELETE do

Formatting

  • SQL ignores whitespace and line breaks: formatting is purely for readability
  • Strings use single quotes (‘states’), not double quotes
  • All keywords are case-insensitive, but it is a common convention to write them in CAPS to distinguish them from column names.
  • Column/table names are case-insensitive by default, but values are case-sensitive (‘CA’ ≠ ‘ca’)

👓 Some observations on SQL syntax

You are now a SQL user!

✅ You understand why we use databases

  • Why databases exist: memory limits, multiple users, data integrity
  • The relational model: tables, keys, relationships

✅ You can build a schema

  • Design entities and relationships with an ER diagram
  • Write CREATE TABLE with constraints (PK, FK, NOT NULL, CHECK)

✅ You can query

  • SELECT, WHERE, ORDER BY, LIMIT
  • Connect to DuckDB from Python and run queries with conn.execute()

Next time

  • This week (E09): build a star schema with FRED data and query it with SQL and DuckDB.
  • Next week: JOINs, aggregations, subqueries, and CTEs.

Sources

Self-study

A note on self-study

In lecture 9 on databases, none of the self-study sections are exam relevant.

🔎 Self-study: ACID properties

ACID is the set of four guarantees that make database transactions reliable:

Property What it means
Atomicity A transaction is all-or-nothing. If a bank transfer debits one account but crashes before crediting the other, the whole operation is rolled back.
Consistency A transaction can only bring the database from one valid state to another. It cannot violate constraints (PK, FK, NOT NULL, etc.).
Isolation Concurrent transactions don’t interfere with each other. Each sees the database as if it were running alone.
Durability Once a transaction is committed, it stays committed, even if the server crashes immediately after.

In practice for this course: the constraints you define with PRIMARY KEY, NOT NULL, REFERENCES enforce Consistency. The rest (Atomicity, Isolation, Durability) are handled automatically by the DBMS.

🔎 Self-study: Terminology

The relational model has formal names.

Formal (relational model) SQL / practical What it is
Relation Table The whole dataset/Table
Tuple / Record Row One observation
Attribute Column / Field One variable
Entry Value One cell
Relation schema Table definition The structure (name + column types)
Database schema Database definition All the tables and their relationships

You will see all these terms used interchangeably in documentation, Stack Overflow, and textbooks. They are the same concepts.

🔍 Self-study: The star schema ⭐ and normalization

A star schema sits in the middle of the normalization spectrum:

Flat table (0NF)

Everything in one table. Simple to query, but difficult to maintain. Strings are repeated everywhere, like in our example.

Star schema (partial)

Fact table is normalized: no redundancy in observations. The dimension tables are denormalized. For instance, region lives inside countries, not in a separate table.

Snowflake schema (full)

Dimension tables also split: countries → regions → continents. Maximum compactness, minimum redundancy. Cost: a huge number of JOINs, harder to query.

This slide is just a hint at the topic of normal forms and database formalizations (0NF, 1NF, 2NF, 3NF). We won’t have time to cover this in this introductory course.

🔎 Self-study: More on DuckDB

DuckDB speaks SQL, but it is an analytical database.

RDBMS (PostgreSQL, Oracle) DuckDB
Optimised for Writes, transactions Reads, aggregations
Storage layout Row-oriented Column-oriented
Setup Server process In-process (like pandas)
Concurrency Many writers One writer, many readers
Use case Production apps, OLTP Analytics, research, OLAP

🔎 Self-study: Where does each tool fit?

Two axes place every database product:

Self-hosted server Cloud / managed Embedded
OLTP Oracle, PostgreSQL, MySQL AWS Aurora, Cloud SQL SQLite
OLAP ClickHouse, Redshift Snowflake, BigQuery DuckDB
  • SQLite and DuckDB share the same architecture: both embedded, no server, single file
  • They differ on workload: SQLite stores app state (OLTP), DuckDB analyses data (OLAP)
  • DuckDB’s unique position: the only mature embedded OLAP engine

🔎 Self-study: What the foreign key actually prevents

3. Handling the deletion of a referenced row

The foreign key also controls what happens when you delete a referenced row.

CREATE TABLE trade_flows (
    flow_id     INTEGER PRIMARY KEY,
    exporter_id INTEGER NOT NULL
                REFERENCES countries(country_id) ON DELETE CASCADE,
    ...
);
Action Behaviour when you delete a country
ON DELETE RESTRICT ❌ Delete is rejected: the country is still referenced
ON DELETE CASCADE All trade flows for that country are deleted automatically
ON DELETE SET NULL exporter_id is set to NULL in all affected rows