You load data into RAM and manipulate it with code
You send queries to a database, which returns only what you ask for
The data lives in memory, temporary, on RAM
The data lives in a database, persistent, on disk or in the cloud
Size is small and can fit in memory
Large datasets that don’t fit in memory
You work on the whole dataset at once
You operate on “parts/subsets” of the data through querying
For statistical analysis, plotting, typical data workflow in research, small projects
For data that needs to stay consistent over time, for data that is updated frequently, for data that is shared across multiple users/systems, for production applications
Goals for the next two weeks
CONSTRUCT a database
understand why databases exist and what problem they solve
know the relational model: tables, rows, keys, relationships
QUERY the database
write our first simple SQL queries with DuckDB
make more complex queries with JOINs, Groupings, subqueries, and CTEs (next week)
Note: If we can’t cover everything today, we will cover the rest next week. Slides marked with 👓 are slides that I will skip in the lecture if time is short, but that you should read at home.
What is a database?
What is a database?
Every app, every service, every research dataset you interact with is backed by a database.
A relational database is a collection of relations (tables) that are linked together through keys.
Relationships are built on keys
Primary key
Column (or set of columns) that uniquely identifies each row. If two tuples agree on the value(s) of the key, then they must be the same tuple.
It is NOT NULL and UNIQUE by definition
It allows other tables to refer to this table
Note: A good primary key is stable (never changes), unique, and meaningless on its own: a number, not the country name, which could change. If no natural attribute works as a key, you create a surrogate key (an artificial integer ID). When a single column isn’t enough, a composite key combines two or more columns.
Relationships are built on keys
Primary keys in our model?
Relationships are built on keys
Primary keys in our model?
country_id in Countries is the primary key. It uniquely identifies each country.
flow_id in Trade_Flows is the primary key for trade flows.
Relationships are built on keys
Foreign key
A column (or set of columns) that refers to a primary key in another table. It creates a link between the two tables.
Always in the referencing table (the “many” side of a one-to-many relationship)
A foreign key constraint means: you cannot insert a row with a foreign key value that does not exist in the referenced table.
This is known as referential integrity.
Relationships are built on keys
What are the foreign keys in our model?
Relationships are built on keys
What are the foreign keys in our model?
exporter_id and importer_id in trade_flows are foreign keys that refer tocountry_id in countries
The constraint: you cannot insert a trade flow for a country that does not exist in Countries. Every trade flow must reference an existing country.
Why split data into multiple tables?
Yes, why?
Why split data into multiple tables?
Imagine storing everything in one table:
flow_id
exporter_name
exporter_region
importer_name
importer_region
year
value
1
Switzerland
Europe
Germany
Europe
2022
45M
2
Switzerland
Europe
France
Europe
2022
38M
Redundant storage: "Europe" is 6 bytes; an integer ID is 4 bytes. Multiply by millions of rows, this becomes huge.
Update anomaly: if Switzerland changes region, one row to fix in countries. In a flat table, every trade flow row is rewritten.
Insertion anomaly: to add capital_city to Switzerland, you add one column to countries, and not across millions of rows.
Deletion anomaly: deleting all Swiss trade flows does not delete the information that Switzerland is in Europe.
Split tables eliminate redundancy and keep data consistent. This is called normalization.
An example of relational model: the star schema
The star schema ⭐
Most widely used to develop data warehouses (databases optimized for analytics)
Consists of one or more fact tables referencing any number of dimension tables.
Fact table (observations). Long data, one row per measurement
Dimension tables (countries, indicators). Metadata, usually wider because contains more attributes.
🔑 The fact table holds foreign keys to all dimensions
The star schema ⭐
Moves repeated metadata into separate tables and replaces strings with integer IDs. -> normalization
Upside: consistency, compactness, flexibility. Downside: you need JOINs to get readable labels.
observations (fact table with millions of rows)
country_id
indicator_id
year
value
3
1
2020
45 234
3
1
2021
46 100
3
2
2020
3.8
4
1
2020
38 600
countries (dimension table)
country_id
name
region
3
Germany
Europe & Central Asia
4
France
Europe & Central Asia
indicators (dimension table)
indicator_id
name
1
GDP per capita
2
Unemployment rate
Note: Check the 🔎 Self-Study for a quick note on the star schema and normalization.
The star schema ⭐
We will explore a simple star schema with FRED data in the exercise session.
Create a database Step 1: design the schema
Think in entities and relationships
Before creating a database, sketch the structure of your data. Think visually about your data structure and your schema design.
Entities: a country, a firm, a survey respondent, a product.
Attributes: the properties of each entity. In our example, each country has a name, a region, an income group.
Relationships: how entities connect with each other. E.g., a trade flow links an exporter country to an importer country, or a survey response belongs to a respondent
Note: The concept of “entity” is close to the idea of “unit of observation” in statistics. It is the “thing” that you are collecting data about.
Write your schema in a simple text format first
For instance:
Countries(country_id: INT PK, name: VARCHAR, region: VARCHAR, gdp_usd: FLOAT)
Trade_flows(flow_id: INT PK, exporter_id: INT FK->Countries, importer_id: INT FK->Countries, year: INT, value_usd: FLOAT)
PK = primary key, FK = foreign key
FK→Countries means it references the Countries table
This is an informal shorthand for sketching schemas, not a formal standard. Different textbooks use different conventions.
Or use an ER diagram
Mermaid 🧜♀️ is a text-based diagram language for entity-relationship diagrams (ERDs), which renders directly in Quarto.
erDiagram
COUNTRIES {
int country_id PK
varchar name
varchar region
float gdp_usd
}
TRADE_FLOWS {
int flow_id PK
int exporter_id FK
int importer_id FK
int year
float value_usd
}
COUNTRIES ||--o{ TRADE_FLOWS : "exports"
COUNTRIES ||--o{ TRADE_FLOWS : "imports"
erDiagram
COUNTRIES {
int country_id PK
varchar name
varchar region
float gdp_usd
}
TRADE_FLOWS {
int flow_id PK
int exporter_id FK
int importer_id FK
int year
float value_usd
}
COUNTRIES ||--o{ TRADE_FLOWS : "exports"
COUNTRIES ||--o{ TRADE_FLOWS : "imports"
It uses crow’s foot notation 🐦⬛ to show relationships between tables: || = one and only one; o{ = zero or more; etc.
Thinking in entities and relationships
Once you have a sketch, translating to SQL will be easier:
Each entity → one table
Each attribute → one column, with a type
Each relationship → a foreign key
Create a database Step 2: from sketch to schema using SQL
SQL — Structured Query Language
SQL is a query language.
You describe what you want, and the database figures out how to get it.
It is very high level, highly optimized, and has been around for decades.
Different “dialects” (standards and implementations, which are incompletely compatible)
💪 Required skill for most data science positions in industry
Note: Originally based upon relational algebra and tuple relational calculus. We won’t go into details in this introductory course. I recommend Marco Venturini’s lecture for more details.
SQL — Structured Query Language
SQL has five main parts, or five main purposes:
DDL — Data Definition Language: CREATE TABLE, DROP TABLE, ALTER TABLE
DML — Data Manipulation Language: INSERT, UPDATE, DELETE records
DQL — Data Query Language: SELECT
DCL — Data Control Language (authorization): GRANT, REVOKE
TCL — Transaction Control Language (manages changes in transactions): COMMIT, ROLLBACK, SAVEPOINT, SET TRANSACTION
👓 The most important difference among the dbms is OLTP vs. OLAP
OLTP: Online Transaction Processing
One row at a time: inserts, updates, deletes
High write frequency, strict data integrity
Powers apps, payment systems, inventory
Running the business (claims, subscriptions, inventory movements)
OLAP: Online Analytical Processing
Millions of rows at once: aggregations, joins
Read-heavy, optimised for full-column scans
Powers dashboards, research, reporting
Understanding the business, tailored for analytics
Data warehousing
Both have different internal architectures, optimised for different goals.
Both have a SQL query interface.
Note: See Kleppmann’s Designing Data-Intensive Applications (O’Reilly, 2017, Chapter 3). The distinction boils down to three dimensions: storage layout (row-oriented vs column-oriented), indexing strategy, concurrency model.
In this course…
… we will use DuckDB
built for analytics and research
column-oriented, with a simple concurrency model (one writer, many readers)
optimized for analytical queries, aggregations, reads
embedded (runs inside Python/R, no server)
open source, free, and easy to set up: uv add duckdb
Reads CSV, Parquet, JSON directly
One db file, shareable
We will use DuckDB as DBMS
DuckDB is usable from Python and R. The project.db file can be accessed from both.
A typical workflow 🔄 : create a connection to a project.db file, create tables, query, get results, then close the connection.
Python
import duckdb# Open (or create) a .db fileconn = duckdb.connect("project.db")# Query → pandas DataFramedf = conn.execute(""" SELECT name, gdp_usd FROM countries WHERE region = 'Europe' ORDER BY gdp_usd DESC""").df()conn.close()
R
library(duckdb)library(DBI)# Open (or create) a .db filecon <-dbConnect(duckdb(), "project.db")# Query → data.framedf <-dbGetQuery(con, " SELECT name, gdp_usd FROM countries WHERE region = 'Europe' ORDER BY gdp_usd DESC")dbDisconnect(con, shutdown =TRUE)
Create a database Step 3: ETL
Use the companion python script
It contains all the code to create the database and run queries.
The SQL standard defines a core set of data types.
Each dialect has its own extensions, but these are the most common:
Characters: VARCHAR[(length)] (Variable-length string, max n)
Numeric: INTEGER/INT (four-byte integers), BIGINT (64-bit integers), SMALLINT (16-bit integers), FLOAT (approximate floating-point), DECIMAL(p, s) (Exact fixed-point number with precision p and scale s)
Others: DATE, TIME, DATETIME/TIMESTAMP, INTERVAL, BOOLEAN
Note:FLOAT is used in this course for simplicity. For monetary values in production, prefer DECIMAL(p, s) to avoid floating-point rounding errors (e.g. DECIMAL(20, 2)). Refer to the documentation of each SQL dialect for the full list of supported types.
SQL constraints
Constraints enforce rules on your data at the database level. They are optional (but highly recommended to ensure data integrity and consistency).
Constraint
What it does
NOT NULL
Column must always have a value
UNIQUE
No two rows can share the same value
PRIMARY KEY*
NOT NULL + UNIQUE: the row’s unique identifier
FOREIGN KEY
Value must exist in another table
CHECK
Custom condition, e.g. CHECK (gdp_usd >= 0)
DEFAULT
Auto-fills a value when none is provided
*When a single column is not enough to identify a row uniquely, use a composite primary key:
-- e.g. for an observations table: one row per (state, indicator, date)PRIMARYKEY (state_id, indicator_id, date)
Create the table “countries”
CREATETABLE countries ( country_id INTEGERPRIMARYKEY, name VARCHARNOTNULLUNIQUE, region VARCHAR, gdp_usd FLOATCHECK (gdp_usd >=0));
country_id: integer, the primary key: NOT NULL + UNIQUE by definition
name: text, required (NOT NULL), no duplicates (UNIQUE)
region: text, optional (no constraint — NULL is allowed)
Or, DuckDB allows to import from CSV, Parquet, JSON directly:
INSERTINTO countries SELECT*FROM read_csv('data/countries_raw.csv');
Or from a pandas DataFrame (DuckDB-Python only: DuckDB recognises the variable df_flows in your Python scope):
INSERTINTO trade_flows SELECT*FROM df_flows
Load using INSERT in our example
Populate the tables with some rows using INSERT
-- Populate countries first (the referenced table must exist)INSERTINTO countries VALUES (1, 'Switzerland', 'Europe', 905000000000);INSERTINTO countries VALUES (2, 'Germany', 'Europe', 4072000000000);INSERTINTO countries VALUES (3, 'Brazil', 'LATAM', 2081000000000);-- Now populate trade flowsINSERTINTO trade_flows VALUES (1, 1, 2, 2022, 45000000);INSERTINTO trade_flows VALUES (2, 2, 3, 2022, 120000000);
👓 What the foreign key actually prevents
1. Inserting a row with a non-existent reference
-- This works: country 1 exists in countriesINSERTINTO trade_flows VALUES (1, 1, 2, 2022, 45000000);-- This fails: country 99 does not existINSERTINTO trade_flows VALUES (2, 99, 2, 2022, 10000000);-- Error: FOREIGN KEY constraint failed
Why?
The database enforces referential integrity: every reference must point to an existing row. This prevents “orphan” records that reference non-existent entities.
Conclusion: Always insert into the referenced table first. You cannot insert a trade flow for a country that doesn’t exist yet.
👓 What the foreign key actually prevents
2. Deleting a referenced row
The constraint also works in the other direction: you cannot delete a row that is still referenced.
-- This fails: country 1 is referenced by trade_flowsDELETEFROM countries WHERE country_id =1;-- Error: Violates foreign key constraint
-- You must delete the child rows firstDELETEFROM trade_flows WHERE exporter_id =1OR importer_id =1;DELETEFROM countries WHERE country_id =1; -- now this works
The database protects referential integrity in both directions: on INSERT and on DELETE.
Queries and basic SQL syntax
Querying a relational database
A query is a request for specific information from the database. It takes one or more tables as input and returns a new table as output.
country_id
name
region
gdp_usd
1
Switzerland
Europe
905000000000
2
Germany
Europe
4072000000000
3
Brazil
LATAM
2081000000000
What is Switzerland’s GDP?
SELECT gdp_usdFROM countriesWHERE name ='Switzerland';
Structure of a SQL query
A SQL query is made of clauses. Each clause has a specific purpose and must appear in a specific order.
The most common clauses are SELECT, FROM, WHERE, ORDER BY, and LIMIT.
Every query must have a SELECT and a FROM. The others are optional.
SQL clauses must appear in this order (💣 and not in any other order!!!)
-- Comment: the basic structure of a SQL querySELECT column1, column2 -- which columns to returnFROM table_name -- which tableWHERE condition -- filter rows (optional)ORDERBYcolumnASC/DESC-- sort (optional)LIMIT n -- truncate output (optional);
The database executes FROM → WHERE → SELECT → ORDER BY → LIMIT.
ALTERTABLE countries ADDCOLUMN income_group VARCHAR;
UPDATE: modify existing rows
UPDATE countriesSET income_group ='High income'WHERE region ='Europe';
DELETE: remove rows
DELETEFROM countriesWHERE name ='Switzerland';
Write SQL queries with DuckDB from Python
import duckdb# Create a connection to a .db fileconn = duckdb.connect("week_09/trade.db")# Queryconn.execute(""" SQL QUERY""").df()# Close the connection when doneconn.close()
You have to open a database (or create one), using duckdb.connect("file.db"). If the file doesn’t exist, it will be created.
For read-only access to an existing file, use duckdb.connect("file.db", read_only=True).
Run queries using conn.execute("SQL QUERY").df() directly from Python (or R).
Queries are passed as strings, and the result is a new table.
Close the connection when done.
We usually convert the result to a pandas DataFrame for further analysis in Python with .df().
👓 Some useful operators and functions
COUNT, * and UNION ALL
The * operator in SELECT * means “all columns”.
The COUNT(*) function counts the number of rows in the result, regardless of NULLs.
UNION ALL combines results from two queries (keeping duplicates). UNION removes duplicates. It is like row binding in R or pandas.
Dates
Use EXTRACT(YEAR FROM date_column) to get the year from a date variable. Works with MONTH, DAY, etc.
Example: WHERE EXTRACT(YEAR FROM date) = 2020
👓 Some useful operators and functions
Strings
String functions: UPPER(), LOWER(), CONCAT(), SUBSTRING(), etc.
SELECTCONCAT(name, ' (', region, ')') ASlabelFROM countries;-- → 'Switzerland (Europe)'SELECT SUBSTRING(name, 1, 3) AS code FROM countries;-- → 'Swi'
👓 Some observations on SQL syntax
Queries
All queries start with SELECT even if you don’t want to select any column (e.g. SELECT COUNT(*)).
Queries don’t modify data: only INSERT, UPDATE, DELETE do
Formatting
SQL ignores whitespace and line breaks: formatting is purely for readability
Strings use single quotes (‘states’), not double quotes
All keywords are case-insensitive, but it is a common convention to write them in CAPS to distinguish them from column names.
Column/table names are case-insensitive by default, but values are case-sensitive (‘CA’ ≠ ‘ca’)
👓 Some observations on SQL syntax
You are now a SQL user!
✅ You understand why we use databases
Why databases exist: memory limits, multiple users, data integrity
The relational model: tables, keys, relationships
✅ You can build a schema
Design entities and relationships with an ER diagram
Write CREATE TABLE with constraints (PK, FK, NOT NULL, CHECK)
✅ You can query
SELECT, WHERE, ORDER BY, LIMIT
Connect to DuckDB from Python and run queries with conn.execute()
Next time
This week (E09): build a star schema with FRED data and query it with SQL and DuckDB.
Next week: JOINs, aggregations, subqueries, and CTEs.
Jacob Montiel, sql-intro for the motivation and intro structure
Marco Venturini, Data Handling: Databases (HSG, 2025). Course on relational model and schema concepts, 6 lectures going deeper in the topic of relational models and relational algebra
In lecture 9 on databases, none of the self-study sections are exam relevant.
🔎 Self-study: ACID properties
ACID is the set of four guarantees that make database transactions reliable:
Property
What it means
Atomicity
A transaction is all-or-nothing. If a bank transfer debits one account but crashes before crediting the other, the whole operation is rolled back.
Consistency
A transaction can only bring the database from one valid state to another. It cannot violate constraints (PK, FK, NOT NULL, etc.).
Isolation
Concurrent transactions don’t interfere with each other. Each sees the database as if it were running alone.
Durability
Once a transaction is committed, it stays committed, even if the server crashes immediately after.
In practice for this course: the constraints you define with PRIMARY KEY, NOT NULL, REFERENCES enforce Consistency. The rest (Atomicity, Isolation, Durability) are handled automatically by the DBMS.
🔎 Self-study: Terminology
The relational model has formal names.
Formal (relational model)
SQL / practical
What it is
Relation
Table
The whole dataset/Table
Tuple / Record
Row
One observation
Attribute
Column / Field
One variable
Entry
Value
One cell
Relation schema
Table definition
The structure (name + column types)
Database schema
Database definition
All the tables and their relationships
You will see all these terms used interchangeably in documentation, Stack Overflow, and textbooks. They are the same concepts.
🔍 Self-study: The star schema ⭐ and normalization
A star schema sits in the middle of the normalization spectrum:
Flat table (0NF)
Everything in one table. Simple to query, but difficult to maintain. Strings are repeated everywhere, like in our example.
Star schema (partial)
Fact table is normalized: no redundancy in observations. The dimension tables are denormalized. For instance, region lives inside countries, not in a separate table.
Snowflake schema (full)
Dimension tables also split: countries → regions → continents. Maximum compactness, minimum redundancy. Cost: a huge number of JOINs, harder to query.
This slide is just a hint at the topic of normal forms and database formalizations (0NF, 1NF, 2NF, 3NF). We won’t have time to cover this in this introductory course.
🔎 Self-study: More on DuckDB
DuckDB speaks SQL, but it is an analytical database.
RDBMS (PostgreSQL, Oracle)
DuckDB
Optimised for
Writes, transactions
Reads, aggregations
Storage layout
Row-oriented
Column-oriented
Setup
Server process
In-process (like pandas)
Concurrency
Many writers
One writer, many readers
Use case
Production apps, OLTP
Analytics, research, OLAP
🔎 Self-study: Where does each tool fit?
Two axes place every database product:
Self-hosted server
Cloud / managed
Embedded
OLTP
Oracle, PostgreSQL, MySQL
AWS Aurora, Cloud SQL
SQLite
OLAP
ClickHouse, Redshift
Snowflake, BigQuery
DuckDB
SQLite and DuckDB share the same architecture: both embedded, no server, single file
They differ on workload: SQLite stores app state (OLTP), DuckDB analyses data (OLAP)
DuckDB’s unique position: the only mature embedded OLAP engine
🔎 Self-study: What the foreign key actually prevents
3. Handling the deletion of a referenced row
The foreign key also controls what happens when you delete a referenced row.