import numpy as np
import pandas as pd4,222 Introduction to Programming
Week 5: Python for Data - Part I
Before We Begin:
Installing Packages
In Python, we use external “packages” to do heavy lifting—like analyzing data or drawing graphs. Since Python doesn’t come with these built-in, we need to download them.
We will use uv to download these packages and formally add them to our project. We need four core packages today:
numpy: A math package that works behind the scenes to make data calculations fast.pandas: For working with datasets (like Excel or CSV files).matplotlib: For drawing graphs and charts.ipykernel: The “engine” that allows VS Code to run Python code inside this notebook.
To install them open a new terminal, type:
uv add ipykernel pandas numpy matplotliband press Enter.
(Note: Some of these packages you have probably installed previously. Don’t worry, uv will check what is already installed and only add what isn’t.)
Recommended Folder Structure
If you want to work through this notebook yourself, test the code, and also experiment with it, we recommend extending the folder structure we set up previously. Set up a folder for week_05. Put your Jupyter notebook (my_analysis_week05.ipynb) there, and create a data subfolder where you save the dataset “countries_data.csv”.
Introduction_to_programming/
├── github_course_materials/
├── exercises/
│ ├── ...
├── group_project/
│ ├── ...
└── lectures/
└── week_05/ # <--- NEW: Create this folder
├── data/ # <--- NEW: Create this folder
│ └── countries_data.csv # NEW: Save the dataset
└── my_analysis_week05.ipynb
(Note: You can create your own folder structure if you prefer, but remember to adjust the file paths in your code later when we load or save dataframes!)
Pandas
Pandas is Python’s go-to library for working with tabular data—the kind you typically see in spreadsheets or statistical software. It gives you fast, convenient tools to
- import data (CSV, Excel, SQL),
- clean and reshape it (handle missing values, filter rows, create new variables, merge datasets), and
- summarize it (grouping, descriptive statistics)
in a way that’s both readable and reproducible. For economics work, pandas is especially useful because it makes it easy to build a transparent data pipeline: from raw data to analysis-ready datasets, and then on to figures and results you can trust and rerun.
When we import these packages, you will almost always see as np and as pd. Think of this as creating an official, time-saving nickname.
- By writing import pandas as pd, we tell Python: “Whenever I type pd, I mean pandas.”
- It saves typing: You get to use
pd.read_csv()instead of the full name. - It is the industry standard:
pdandnpare the universal conventions in the Python data science and economics community. Using these specific abbreviations ensures your code is instantly readable to anyone else sharing your work.
Numpy
NumPy (Numerical Python) is a math package that works behind the scenes to make data calculations fast. If Pandas is the spreadsheet, NumPy is the underlying calculator. In fact, Pandas is actually built on top of NumPy. For many tasks Pandas will use NumPy in the background without you needing to import it explicitly. But occasionally we need to do math that Pandas doesn’t handle on its own. Since they are the dynamic duo of Python data analysis, it is standard practice to just import both at the very top of your script so you have all your tools ready to go.
Reading Data
To get data into your workspace, we use reader functions. Since a lot of economic data comes as CSV (Comma-Separated Values) files, pd.read_csv() will likely be your most-used command.
Basic Syntax:
df = pd.read_csv("path/to/your/file.csv")When you run this, Pandas does three things:
- Finds the file on your computer (based on the path you give it).
- Splits the text into columns using a delimiter (usually a comma).
- Builds a DataFrame (a table-like object) and stores it in df (a common convention).
Real-world economic data is often messy. These options can rescue you when the default settings don’t work:
sep: If your file uses semicolons instead of commas, usesep=';'index_col: If the first column is a “Year” or “ID” and you want it to be the label for your rows, useindex_col=0.header: If your data starts on the 3rd row because there are two rows of titles/notes at the top, useheader=2.
If your data is not in csv format there are different functions to read the data:
pd.read_excel(): For excel files (.xlsx or .xls). It Unlike a CSV, an Excel file can have multiple sheets. You can specify the name (as a string) or the position (e.g., 0 for the first sheet using the argumentsheet_name).pd.read_stata(): For stata files (.dta)
Example: We’ll work with a dataset “countries_data.csv” located in the folder “data”. Read the data
# Load the dataset
df = pd.read_csv("data/countries_data.csv")After reading a dataset, always do a quick check that it worked. Datasets can be large, so you usually don’t want to print the entire table. Your safest first preview is: df.head()
# Preview of data
df.head()| iso3 | Country | year | hc | pop | rgdpe | xr | |
|---|---|---|---|---|---|---|---|
| 0 | CHE | Switzerland | 2000 | 3.528448 | 7.184007 | 368548.87500 | 1.688843 |
| 1 | CHE | Switzerland | 2001 | 3.538948 | 7.226391 | 369422.81250 | 1.687615 |
| 2 | CHE | Switzerland | 2002 | 3.549480 | 7.278752 | 366579.53125 | 1.558607 |
| 3 | CHE | Switzerland | 2003 | 3.560044 | 7.333447 | 362704.65625 | 1.346651 |
| 4 | CHE | Switzerland | 2004 | 3.570638 | 7.384194 | 370410.31250 | 1.243496 |
This shows the first few rows so you can confirm that:
- columns are aligned correctly,
- headers look right,
- and the values “make sense” at first glance.
FileNotFoundError
%xmode Minimal
df = pd.read_csv("my_data/countries_data.csv")Exception reporting mode: Minimal
FileNotFoundError: [Errno 2] No such file or directory: 'my_data/countries_data.csv'
you might get a massive block of red text ending in FileNotFoundError. Don’t panic! This simply means Python cannot find the file at the path you provided.
- Check your spelling: Ensure the file name is exactly “countries_data.csv” and the folder is exactly “data/”.
- Check your Working Directory: Python looks for the file relative to where you are
If you ever get completely lost, you can always run a cell
import os
os.getcwd()'C:\\Users\\aurel\\OneDrive\\Documents\\BEcon_4222_Introduction_Programming\\course_materials\\week_05'
to see exactly which folder Python thinks it is in right now!
Navigating folder paths:
"data/countries_data.csv": This tells Python, “Look inside my current location for a folder named data, and grab the file.”"../": This tells Python, “Take one step backward (up) out of the current folder.” So if you have a folderdatathat is not inweek05but one level above, this is how you can access it.
A First Look at Your Data
The Data
In this lecture we work with data from the Penn World Table. The Penn World Table (PWT) is a cross-country macroeconomic dataset that provides internationally comparable information on income, output, inputs, prices, and productivity. The current PWT 11.0 release covers 185 countries from 1950 to 2023 and is designed to make economic comparisons across countries and over time using purchasing power parity (PPP) methods.
The variables we use in this class to explore the functionality of Pandas are:
| Variable | Full Name / Definition |
|---|---|
| iso3 | 3-letter ISO Country Code |
| Country | Full Country Name |
| year | Observation Year |
| hc | Human capital index (based on years of schooling) |
| pop | Population (in millions) |
| rgdpe | Expenditure-side real GDP at chained PPPs (in mil. 2021US$) |
| xr | Exchange rate, national currency/USD (market+estimated) |
There are many other variables (and countries) available, you can download data from the PWT here
Inspect the DataFrame
Once your data is in a DataFrame, your first step is to inspect it. Pandas provides several built-in methods and attributes that help you quickly understand what you’re working with.
Methods (you call them)
Methods are functions attached to an object (here df), you call them using object.method(), (with parentheses!)
df.head(): shows the first rows of the DataFrame (default: 5)df.tail():shows the first rows of the DataFrame (default: 5)df.info(): shows the number of rows and columns, non-null counts (missing data), and data typesdf.describe(): for numeric columns, gives summary statistics (count, mean, std, min, quartiles, max)
Attributes (you can access them)
Attributes are stored information, so you don’t use parentheses.
df.shape: a tuple (number of rows, number of columns)df.columns: the column names (an index-like object)
# Why use it:
# To quickly verify that the data loaded correctly, headers are aligned,
# and the values make sense at first glance
df.head()| iso3 | Country | year | hc | pop | rgdpe | xr | |
|---|---|---|---|---|---|---|---|
| 0 | CHE | Switzerland | 2000 | 3.528448 | 7.184007 | 368548.87500 | 1.688843 |
| 1 | CHE | Switzerland | 2001 | 3.538948 | 7.226391 | 369422.81250 | 1.687615 |
| 2 | CHE | Switzerland | 2002 | 3.549480 | 7.278752 | 366579.53125 | 1.558607 |
| 3 | CHE | Switzerland | 2003 | 3.560044 | 7.333447 | 362704.65625 | 1.346651 |
| 4 | CHE | Switzerland | 2004 | 3.570638 | 7.384194 | 370410.31250 | 1.243496 |
# Why use it:
# Sometimes the data hasfootnotes or summary rows appended to the very bottom of the CSV.
# This helps you catch and remove that "junk" data before running your analysis.
df.tail()| iso3 | Country | year | hc | pop | rgdpe | xr | |
|---|---|---|---|---|---|---|---|
| 211 | USA | United States | 2019 | 3.749341 | 337.790067 | 23200364.0 | 1.0 |
| 212 | USA | United States | 2020 | 3.754667 | 339.436159 | 22676858.0 | 1.0 |
| 213 | USA | United States | 2021 | 3.728173 | 340.161441 | 24167636.0 | 1.0 |
| 214 | USA | United States | 2022 | 3.813643 | 341.534046 | 24861132.0 | 1.0 |
| 215 | USA | United States | 2023 | 3.829826 | 343.477335 | 25641846.0 | 1.0 |
# Why use it:
# This is your structural overview. It lets you instantly spot missing data or
# check if a numeric column like GDP was accidentally imported as text
df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216 entries, 0 to 215
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 iso3 216 non-null object
1 Country 216 non-null object
2 year 216 non-null int64
3 hc 216 non-null float64
4 pop 216 non-null float64
5 rgdpe 216 non-null float64
6 xr 216 non-null float64
dtypes: float64(4), int64(1), object(2)
memory usage: 11.9+ KB
When you run df.info(), you will see a column called Dtype. Pandas uses slightly different names for data types than standard Python. Pandas also has additional datatypes (such as datetime64) which we will cover in the next lecture.
# Why use it:
# Get a sense of the values in your DataFrame,
# Quick sanity check—for example, ensuring population values aren't negative
# or that max GDP isn't unrealistically high.
df.describe()| year | hc | pop | rgdpe | xr | |
|---|---|---|---|---|---|
| count | 216.000000 | 216.000000 | 216.000000 | 2.160000e+02 | 216.000000 |
| mean | 2011.500000 | 3.270894 | 244.145055 | 6.419584e+06 | 15.110337 |
| std | 6.938266 | 0.454432 | 404.613603 | 7.085834e+06 | 33.988611 |
| min | 2000.000000 | 2.313537 | 7.184007 | 3.627047e+05 | 0.499772 |
| 25% | 2005.750000 | 2.889263 | 60.954363 | 2.528470e+06 | 0.803922 |
| 50% | 2011.500000 | 3.508762 | 82.057528 | 3.286765e+06 | 1.000000 |
| 75% | 2017.250000 | 3.657557 | 128.089869 | 5.579467e+06 | 8.028658 |
| max | 2023.000000 | 3.846449 | 1426.437267 | 3.059032e+07 | 140.491100 |
# Why use it:
# Confirm your dataset is the size you expect
df.shape(216, 7)
# Why use it:
# Find out what variables you're working with.
df.columnsIndex(['iso3', 'Country', 'year', 'hc', 'pop', 'rgdpe', 'xr'], dtype='object')
As you learn Python, you will notice three different ways we interact with data. Understanding the difference now will save you a lot of syntax errors later!
- Functions (Standalone Tools): These are tools that live on their own. You bring your data into the tool.
- Syntax:
function_name(object) - Example:
print(df)orlen(df)
- Syntax:
- Methods (Built-in Actions): These are functions attached directly to an object (like a DataFrame). You are asking the dataset to do something to itself. Because it is an action, it always requires parentheses
().- Syntax:
object.method_name() - Example:
df.head()ordf.describe()
- Syntax:
- Attributes (Stored Facts): These are pieces of information or metadata saved inside the object. You are not asking the dataset to calculate anything new; you are just reading a fact. Because it is not an action, there are no parentheses.
- Syntax:
object.attribute_name - Example:
df.shapeordf.columns
- Syntax:
Selecting Data
Selecting Columns
- To grab one column and return a Series, use square brackets
[]with the column name (a string): - To grab multiple columns, pass a list of names inside the brackets. This is why you’ll see double brackets
[[]](outer[]for pandas selection, inner[]for the list). This returns a DataFrame (even if the list has only one column):
Quick rule: “one name” –> Series, [“list of names”] –> DataFrame.
Examples:
df['year'] # return 'year' column as a series0 2000
1 2001
2 2002
3 2003
4 2004
...
211 2019
212 2020
213 2021
214 2022
215 2023
Name: year, Length: 216, dtype: int64
df[['year', 'pop']] # returns 'year' and 'pop' column as a DataFrame| year | pop | |
|---|---|---|
| 0 | 2000 | 7.184007 |
| 1 | 2001 | 7.226391 |
| 2 | 2002 | 7.278752 |
| 3 | 2003 | 7.333447 |
| 4 | 2004 | 7.384194 |
| ... | ... | ... |
| 211 | 2019 | 337.790067 |
| 212 | 2020 | 339.436159 |
| 213 | 2021 | 340.161441 |
| 214 | 2022 | 341.534046 |
| 215 | 2023 | 343.477335 |
216 rows × 2 columns
Once you select a single column (a Series), you can use specific methods to explore it. For example,
.nunique(): Returns the number of unique values. (e.g., “Exactly how many different countries are in my dataset?”).unique(): Returns an array of the actual unique values. (e.g., “Give me a list of the names of all the countries included.”).value_counts(): Returns a count of how many times each unique value appears, sorted from most frequent to least frequent. (e.g., “How many years of data do I have for each country?”)
# How many unique countries are in the dataset?
df['Country'].nunique()9
# What are the names of those countries?
df['Country'].unique()array(['Switzerland', 'China', 'Germany', 'France', 'United Kingdom',
'Italy', 'Japan', 'Mexico', 'United States'], dtype=object)
# How many observations (years) do we have for each country?
df['Country'].value_counts()Country
Switzerland 24
China 24
Germany 24
France 24
United Kingdom 24
Italy 24
Japan 24
Mexico 24
United States 24
Name: count, dtype: int64
Selecting rows/columns by position: .iloc[]
.iloc[] selects by integer position, like standard Python indexing.
- Syntax:
df.iloc[row_position, column_position] - Counting starts at 0
- Slices are end-exclusive (just like Python): 0:3 gives rows 0, 1, 2
Examples:
df.iloc[0, 0] # first row, first column'CHE'
df.iloc[0:3, 0:2] # rows 0-2 and columns 0-1| iso3 | Country | |
|---|---|---|
| 0 | CHE | Switzerland |
| 1 | CHE | Switzerland |
| 2 | CHE | Switzerland |
Selecting rows/columns by label: .loc[]
.loc[] selects by labels (names) from the index and columns.
- Syntax:
df.loc[row_label, column_label]
df.loc[0, "rgdpe"] # if the index labels include 0368548.875
Important difference: label-based slicing with .loc is usually end-inclusive:
# first 6 rows (0,1,2,3,4,5) and column 'rgdpe'
df.loc[0:5, "rgdpe"]0 368548.87500
1 369422.81250
2 366579.53125
3 362704.65625
4 370410.31250
5 385585.37500
Name: rgdpe, dtype: float64
For both .loc and .iloc, using :(instead of label or index of columns or rows) means “everything” along that dimension:
df.loc[:,['year', 'pop']].head()| year | pop | |
|---|---|---|
| 0 | 2000 | 7.184007 |
| 1 | 2001 | 7.226391 |
| 2 | 2002 | 7.278752 |
| 3 | 2003 | 7.333447 |
| 4 | 2004 | 7.384194 |
[], .loc, or .iloc?
Pandas gives you several ways to subset your data, which can feel overwhelming. Here is the cheat sheet for what you should actually use in your research scripts:
- Use
[[]](Standard Brackets) when you ONLY need columns: If you just want to isolate a few variables this is the cleanest way.- Example:
df_small = df[['year', 'pop', 'gdp']]
- Example:
- Use
.loc[](Label-based) as your default for rows + columns:- If you need to select specific rows and columns, always prefer
.loc. Because it uses the names of the columns, your code is robust. If the dataset is updated next year and adds a new column at the beginning, your.loccode will still grab the correct variables.
- If you need to select specific rows and columns, always prefer
- Use
.iloc[](Position-based) sparingly:- Selecting data by its integer position (e.g., “give me the 4th column”) is risky in economics because datasets change. Use
.iloconly when position actually matters, such as grabbing the top 3 rows after sorting a dataset.
- Selecting data by its integer position (e.g., “give me the 4th column”) is risky in economics because datasets change. Use
Filtering
Filtering - also called Boolean indexing - is one of the most important pandas skills. In research, you almost never use the full “universe” of data. Instead you focus on specific countries, certain time periods, recession years, or groups defined by a threshold (e.g., high inflation, high debt-to-GDP).
Filtering is a two-step-process:
- The Comparison (build a mask): When you compare a column to a value, pandas doesn’t return one answer. It returns a True/False value for every row. This
True/FalseSeries is called a mask (it has the same length as the DataFrame).
- The Comparison (build a mask): When you compare a column to a value, pandas doesn’t return one answer. It returns a True/False value for every row. This
- The Selection/Filter (apply the mask): You can then use the mask in
df[]and it will only return the rows for which the mask isTrue
- The Selection/Filter (apply the mask): You can then use the mask in
Example: keep only observations from 2000 onward
# Step 1: Build a mask
mask = df["year"] >= 2010
mask0 False
1 False
2 False
3 False
4 False
...
211 True
212 True
213 True
214 True
215 True
Name: year, Length: 216, dtype: bool
# Step 2: Apply the mask
df_2010 = df[mask]
df_2010.head() | iso3 | Country | year | hc | pop | rgdpe | xr | |
|---|---|---|---|---|---|---|---|
| 10 | CHE | Switzerland | 2010 | 3.634872 | 7.825753 | 496048.96875 | 1.042906 |
| 11 | CHE | Switzerland | 2011 | 3.642345 | 7.914372 | 552769.12500 | 0.888042 |
| 12 | CHE | Switzerland | 2012 | 3.649832 | 7.998926 | 571063.31250 | 0.937684 |
| 13 | CHE | Switzerland | 2013 | 3.657335 | 8.091202 | 568455.56250 | 0.926904 |
| 14 | CHE | Switzerland | 2014 | 3.664854 | 8.190235 | 580136.00000 | 0.916151 |
Once you are comfortable with masks, you will often skip creating the ‘mask’ variable and put the condition directly inside the brackets:
df_recent = df[df["year"] >= 2015]This is perfectly fine, especially for simple filtering, but creating a mask variable explicitly can improve readability if the filtering is more complex.
Comparison tools you’ll use all the time
==: equal to!=: not equal to>/<: greater or less than>=/<=: greater than or equal to / less than or equal to.isin([...])membership in a list of values (very common for countries)
Example: Keep a set of countries
mask = df["iso3"].isin(['CHE', 'FRA', 'DEU', 'ITA']) # 1. build a mask
df_europe = df[mask] # 2. apply the mask
df_europe['Country'].value_counts() # did it work?Country
Switzerland 24
Germany 24
France 24
Italy 24
Name: count, dtype: int64
Note: if use filter your dataframe df[mask] it creates a filtered view/result, but it does not overwrite df. To keep it you either (i) assign it to a new name df_filtered = df[mask] or (ii) you overwrite df=df[mask], which makes it permanent for your session.
Usually it’s better to create a new object name (df_eurupe, df_che, etc.) so you don’t “lose” data by accident.
Filtering with multiple conditions (combining masks)
Often you want to filter using more than one criterion—for example, a set of countries and only years after 2015.
- Condition1:
df['iso3'].isin(['FRA', 'DEU', 'CHE']) - Condition2:
df['year'] > 2015
You can combine them with logical operators:
&(AND): Both conditions must be true|(OR): At least one condition must be True.~(NOT): Flips the mask, True becomes False and vice versa. So if it’s easier to define what you do not want to have in your dataframe you can use this.
Note: for lists we used and/or, but in pandas we use &/ |.
You must wrap each condition in parentheses ():
mask = (condition1) & (condition2)
Example: Filter observations for france, germany and switzerland after 2015
# example: mask for france, germany and switzerland for the years after 2015
mask = (df['iso3'].isin(['FRA', 'DEU', 'CHE'])) & (df['year'] > 2015)
df_filtered = df[mask]Example: Filter for years after 2015 for all countries except USA and China
# example: mask for years after 2015 for all countries except USA and China
countries_to_exclude = ['China', 'United States']
mask = (~df['Country'].isin(countries_to_exclude)) & (df['year'] > 2015)
# use the mask to filter, select 'Country' and check the countries remaining:
df_filtered = df[mask]
# Check whether USA and China are still in there:
df_filtered['Country'].value_counts()Country
Switzerland 8
Germany 8
France 8
United Kingdom 8
Italy 8
Japan 8
Mexico 8
Name: count, dtype: int64
Example with Error
# INCORRECT: Using 'and' instead of '&', and forgetting parentheses
mask = df['iso3'] == 'CHE' and df['year'] > 2015ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
If you get an error when you create the mask, check your filtering code:
- Did you use
and/orinstead of&/|? - Did you forget the parentheses () around each individual condition?
Filtering and Selecting Simultaneously
So far, we have used df[mask] to filter our dataset. This returns all the columns for the rows that meet our condition.
But what if you only want to see the Country name and GDP for years where the population was over 100 million?
This is where .loc is actually the best tool for the job. You pass the mask as the row selector, and a list of columns as the column selector: df.loc[mask, columns]
# 1. Build the mask (Population over 100 million in the year 2023)
mask_large_pop = (df['year'] == 2023) & (df['pop'] > 100)
# 2. Define the exact columns you want to see
cols_to_keep = ['Country', 'pop', 'rgdpe']
# 3. Use .loc to apply the mask to the rows, and select the columns simultaneously
df_large = df.loc[mask_large_pop, cols_to_keep]
# See the result
df_large.head()| Country | pop | rgdpe | |
|---|---|---|---|
| 47 | China | 1422.584933 | 30590320.00 |
| 167 | Japan | 124.370947 | 5393180.50 |
| 191 | Mexico | 129.739759 | 2742741.25 |
| 215 | United States | 343.477335 | 25641846.00 |
The Alternative
# 1. Build the mask (Population over 100 million in the year 2023)
mask_large_pop = (df['year'] == 2023) & (df['pop'] > 100)
# 2. Apply the mask, and select columns
df_large = df[mask_large_pop][['Country', 'pop', 'rgdpe']]
# See the result
df_large.head()| Country | pop | rgdpe | |
|---|---|---|---|
| 47 | China | 1422.584933 | 30590320.00 |
| 167 | Japan | 124.370947 | 5393180.50 |
| 191 | Mexico | 129.739759 | 2742741.25 |
| 215 | United States | 343.477335 | 25641846.00 |
It works and if you just want to have a quick look at the result it’s perfectly fine. However it’s not the best practice for two reasons:
- It is inefficient: Under the hood, Pandas has to do this in two separate steps. First, it creates a temporary, invisible copy of your entire dataset with just the filtered rows. Then, it runs a second operation to pluck the columns out of that temporary copy.
ChainedAssignmentError: If you ever try to change a value using chained brackets (for example, trying to fix a typo in the GDP column:df[mask]['gdp'] = 100)), it won’t work. Pandas doesn’t know if you want to modify the original dataset or the temporary invisible copy it just made. Older versions of pandas raised theSettingWithCopyWarning.
Sorting
The primary tool is .sort_values()
- Ascending (smallest to largest, default):
df.sort_values(by = 'column_name') - Descending (largest to smallest):
df.sort_values(by = 'column_name', ascending = False)
Example: Which country had the highest GDP in 2023?
latest_data = df[df["year"] == 2023]
top5_gdp = latest_data.sort_values(by="rgdpe", ascending=False)
top5_gdp[['Country', 'rgdpe']].head()| Country | rgdpe | |
|---|---|---|
| 47 | China | 30590320.00 |
| 215 | United States | 25641846.00 |
| 167 | Japan | 5393180.50 |
| 71 | Germany | 5060520.00 |
| 119 | United Kingdom | 3606369.25 |
You’ll notice that it keeps the index of the observation of the original dataframe. If you want to see the ranking clearly you can add .reset_index(drop=True) after sorting. reset_index() will set a new index from 0 to … The argument drop=True ensures that the old index is dropped, without it there would be a column that keeps track of the old index.
latest_data = df[df["year"] == 2023]
top5_gdp = latest_data.sort_values(by="rgdpe", ascending=False).reset_index(drop=True)
top5_gdp[['Country', 'rgdpe']].head()| Country | rgdpe | |
|---|---|---|
| 0 | China | 30590320.00 |
| 1 | United States | 25641846.00 |
| 2 | Japan | 5393180.50 |
| 3 | Germany | 5060520.00 |
| 4 | United Kingdom | 3606369.25 |
Sorting by multiple columns
You can pass a list to the by= argument, to sort by multiple columns, similarly you’ll have to pass a list of True/False values to ascending= if you want to overwrite the default.
# Sort the entire dataset by Country alphabetically, and then by year chronologically
df_sorted = df.sort_values(by=["Country", "year"], ascending=[True, False])
df_sorted[['Country', 'year', 'rgdpe']].head()| Country | year | rgdpe | |
|---|---|---|---|
| 47 | China | 2023 | 30590320.0 |
| 46 | China | 2022 | 29195480.0 |
| 45 | China | 2021 | 28254862.0 |
| 44 | China | 2020 | 25899170.0 |
| 43 | China | 2019 | 25296168.0 |
Does it overwrite?
.sort_values() does not overwrite the dataframe. If you want the dataframe to stay sorted permanently in your computer’s memory, you have to overwrite it: df = df.sort_values(...), or you can use the argument inplace= (False by default) and set it to True.
Aggregation and Grouping
Aggregation
Aggregation means reducing many values down to one summary number—for example the average GDP per capita, the maximum exchange rate, or the number of countries in the dataset.
Common aggregation methods
.sum(): sum of values (numeric columns).mean(): average.median(): median.min()/.max(): smallest and largest value.std()/.var(): standard deviation / variance.nunique(): Number of unique values
Example: Calculate the average GDP (‘rgdpe’) overall, and the average gdp of switzerland
# Average GDP across countries
avg_gdp = df['rgdpe'].mean()
print(avg_gdp)6419583.53978588
# Average GDP Switzerland
avg_gdp_switzerland = df[df['Country']=='Switzerland']['rgdpe'].mean()
print(avg_gdp_switzerland)535882.95703125
Note: df['rgdpe'].mean() averages across country-year observations, to get the average for a country we had to filter the data first. We’ll now look at .groupby() which enables us to compute country-specific averages for all countries at once.
Grouping
groupby() is pandas’ way to do “split → apply → combine”:
- Split the rows into groups based on one or more columns (e.g., by country or by year)
- Apply an operation within each group (mean, sum, count, etc.)
- Combine the results into a new object
Basic Syntax: df.groupby("group_column")["value_column"].agg_function()
Example: Average real GDP by country
df.groupby("Country")["rgdpe"].mean()Country
China 1.697586e+07
France 2.945856e+06
Germany 4.184951e+06
Italy 2.569521e+06
Japan 5.480990e+06
Mexico 2.265468e+06
Switzerland 5.358830e+05
United Kingdom 3.084066e+06
United States 1.973366e+07
Name: rgdpe, dtype: float64
By default, the grouping variable becomes the index of the result. If you want a regular DataFrame instead, you have two common options:
as_index=Falseinsidegroupby(...), or.reset_index()after the aggregation.
# Use as_index = False
avg_gdp_by_country = df.groupby("Country", as_index = False)["rgdpe"].mean()
avg_gdp_by_country.head()| Country | rgdpe | |
|---|---|---|
| 0 | China | 1.697586e+07 |
| 1 | France | 2.945856e+06 |
| 2 | Germany | 4.184951e+06 |
| 3 | Italy | 2.569521e+06 |
| 4 | Japan | 5.480990e+06 |
Example: Total world population by year
# Use reset_index() after the aggregation
world_pop = df.groupby("year")["pop"].sum().reset_index()
world_pop.head()| year | pop | |
|---|---|---|
| 0 | 2000 | 2043.434364 |
| 1 | 2001 | 2057.961651 |
| 2 | 2002 | 2071.574431 |
| 3 | 2003 | 2084.790630 |
| 4 | 2004 | 2098.076959 |
Group by multiple columns
# Average real GDP by Country and year
gdp_country_year = df.groupby(["Country", "year"])["rgdpe"].mean().reset_index()
gdp_country_year.head()| Country | year | rgdpe | |
|---|---|---|---|
| 0 | China | 2000 | 5985309.5 |
| 1 | China | 2001 | 6520996.5 |
| 2 | China | 2002 | 7119989.0 |
| 3 | China | 2003 | 7792220.0 |
| 4 | China | 2004 | 8640414.0 |
Does this make sense here? –> No, for every country, year we only have one observation so aggregation doesn’t make sense
But in general it’s very useful, suppose you have monthly data for different countries and different years, you can group by country and year to create a dataframe with yearly data.
Multiple aggregations at once
Often you want more than just the mean. For example, looking at the minimum, average, and maximum within each group gives a quick sense of the spread of values. Use .agg() for this. You can pass a list to .agg(); for example ["mean", "std"] to get the mean and standard deviation for the variable you select (after grouping).
By default, the grouping variable becomes the index of the result (you can turn it back into a normal column with .reset_index() if you want).
Example: Create a summary of the min, mean and max of human capital (hc) for every country
# Summary of human capital (hc) by country
hc_stats = df.groupby("Country")["hc"].agg(["min", "mean", "max"])
hc_stats.head()| min | mean | max | |
|---|---|---|---|
| Country | |||
| China | 2.313537 | 2.522415 | 2.749776 |
| France | 2.891696 | 3.091695 | 3.312360 |
| Germany | 3.566319 | 3.644270 | 3.705208 |
| Italy | 2.771634 | 3.015264 | 3.265062 |
| Japan | 3.346039 | 3.504695 | 3.656504 |
# Access one country
hc_stats.loc["Switzerland"]min 3.528448
mean 3.639921
max 3.745038
Name: Switzerland, dtype: float64
# Uf you prefer a normal DataFrame (Country as a column):
hc_stats.reset_index().head()| Country | min | mean | max | |
|---|---|---|---|---|
| 0 | China | 2.313537 | 2.522415 | 2.749776 |
| 1 | France | 2.891696 | 3.091695 | 3.312360 |
| 2 | Germany | 3.566319 | 3.644270 | 3.705208 |
| 3 | Italy | 2.771634 | 3.015264 | 3.265062 |
| 4 | Japan | 3.346039 | 3.504695 | 3.656504 |
While passing a list [] to .agg() is great for applying multiple math functions to a single variable, you might want to calculate different statistics for different variables simultaneously.
To do this, we pass a dictionary {} to the .agg() method. The “Key” is the column name, and the “Value” is the function you want to apply.
Basic Syntax
df.groupby('grouping_column').agg({
'variable_1': 'mean',
'variable_2': 'sum',
'variable_3': 'max'
})Example: Aggregate different variables with different methods
year_summary = df.groupby('year').agg({
'pop': 'sum', # Total world population
'hc': 'mean' # Average human capital level
}).reset_index()
year_summary.head()| year | pop | hc | |
|---|---|---|---|
| 0 | 2000 | 2043.434364 | 3.102620 |
| 1 | 2001 | 2057.961651 | 3.117717 |
| 2 | 2002 | 2071.574431 | 3.132900 |
| 3 | 2003 | 2084.790630 | 3.148169 |
| 4 | 2004 | 2098.076959 | 3.163525 |
Changing / Creating Variables
Renaming Variables
Datasets often come with short or cryptic column names. Renaming makes your code easier to read and reduces mistakes.
rename(columns={...})lets you rename one or more columns using a dictionary.- It returns a new DataFrame, so it’s common to reassign:
df = df.rename(...).
Example:
# Rename columns to clearer names
df = df.rename(columns={
"rgdpe": "gdp", # GDP measured in millions
})
df.head()| iso3 | Country | year | hc | pop | gdp | xr | |
|---|---|---|---|---|---|---|---|
| 0 | CHE | Switzerland | 2000 | 3.528448 | 7.184007 | 368548.87500 | 1.688843 |
| 1 | CHE | Switzerland | 2001 | 3.538948 | 7.226391 | 369422.81250 | 1.687615 |
| 2 | CHE | Switzerland | 2002 | 3.549480 | 7.278752 | 366579.53125 | 1.558607 |
| 3 | CHE | Switzerland | 2003 | 3.560044 | 7.333447 | 362704.65625 | 1.346651 |
| 4 | CHE | Switzerland | 2004 | 3.570638 | 7.384194 | 370410.31250 | 1.243496 |
Creating New Variables
A derived variable is a new column computed from existing columns. This is how you change units, create ratios, and build features for analysis. Pandas uses vectorized operations. This means you don’t need to write a loop to calculate math for each row. If you divide one column by another, Pandas automatically does the math row-by-row.
Example: ‘gdp’ is measured in millions USD, create a new variable ‘gdp_bn’ which is in billions
df["gdp_bn"] = df["gdp"] / 1000
df.head()| iso3 | Country | year | hc | pop | gdp | xr | gdp_bn | |
|---|---|---|---|---|---|---|---|---|
| 0 | CHE | Switzerland | 2000 | 3.528448 | 7.184007 | 368548.87500 | 1.688843 | 368.548875 |
| 1 | CHE | Switzerland | 2001 | 3.538948 | 7.226391 | 369422.81250 | 1.687615 | 369.422813 |
| 2 | CHE | Switzerland | 2002 | 3.549480 | 7.278752 | 366579.53125 | 1.558607 | 366.579531 |
| 3 | CHE | Switzerland | 2003 | 3.560044 | 7.333447 | 362704.65625 | 1.346651 | 362.704656 |
| 4 | CHE | Switzerland | 2004 | 3.570638 | 7.384194 | 370410.31250 | 1.243496 | 370.410312 |
Example: create a variable ‘gdp_pc’ (gdp per capita) by dividing ‘gdp’ by population ‘pop’
df["gdp_pc"] = df["gdp"] / df["pop"]
df.head()| iso3 | Country | year | hc | pop | gdp | xr | gdp_bn | gdp_pc | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CHE | Switzerland | 2000 | 3.528448 | 7.184007 | 368548.87500 | 1.688843 | 368.548875 | 51301.296755 |
| 1 | CHE | Switzerland | 2001 | 3.538948 | 7.226391 | 369422.81250 | 1.687615 | 369.422813 | 51121.342936 |
| 2 | CHE | Switzerland | 2002 | 3.549480 | 7.278752 | 366579.53125 | 1.558607 | 366.579531 | 50362.964867 |
| 3 | CHE | Switzerland | 2003 | 3.560044 | 7.333447 | 362704.65625 | 1.346651 | 362.704656 | 49458.959238 |
| 4 | CHE | Switzerland | 2004 | 3.570638 | 7.384194 | 370410.31250 | 1.243496 | 370.410312 | 50162.592221 |
Saving a DataFrame
After spending time cleaning, filtering, and transforming your data, you don’t want to lose that work. Saving a DataFrame lets you:
- share the cleaned dataset with someone else
- open it in another program (Excel, Stata, R, etc.)
- reload it later without rerunning all your code
In pandas, saving is done with “export” methods. The pattern is: df.to_...("filename")
Save as CSV (most common)
You can save your data frame as a csv file using .to_csv().
CSV is the most universal format—it works in Excel, Google Sheets, R, Stata, and more.
By default, pandas will also save the DataFrame’s index (0, 1, 2, …) as an extra column. Most of the time you don’t want that, so use index=False.
df.head() # Check again if it looks the way it should| iso3 | Country | year | hc | pop | gdp | xr | gdp_bn | gdp_pc | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CHE | Switzerland | 2000 | 3.528448 | 7.184007 | 368548.87500 | 1.688843 | 368.548875 | 51301.296755 |
| 1 | CHE | Switzerland | 2001 | 3.538948 | 7.226391 | 369422.81250 | 1.687615 | 369.422813 | 51121.342936 |
| 2 | CHE | Switzerland | 2002 | 3.549480 | 7.278752 | 366579.53125 | 1.558607 | 366.579531 | 50362.964867 |
| 3 | CHE | Switzerland | 2003 | 3.560044 | 7.333447 | 362704.65625 | 1.346651 | 362.704656 | 49458.959238 |
| 4 | CHE | Switzerland | 2004 | 3.570638 | 7.384194 | 370410.31250 | 1.243496 | 370.410312 | 50162.592221 |
# Save the cleaned dataframe to a CSV file
df.to_csv("data/cleaned_countries_data.csv", index=False)You can save to other formats as well - to_excel() - to_stata()
Save to Pickle
In Python, pickling is the process of saving a Python object (like a pandas DataFrame) as a file by converting it into a stream of bytes. When you load it back, Python reconstructs the object exactly.
Why use Pickle?
- Speed: Reading/writing is often much faster than CSV for larger datasets.
- Preserves metadata: It keeps your DataFrame exactly as it is (data types, datetime columns, index, etc.).
- File size: It’s often smaller than CSV, and you can also save a compressed pickle if you want.
Syntax:
- Save to pickle syntax:
df.to_pickle("file.pkl") - When you read it back later syntax: `pd.read_pickle(“file.pkl”)
Warning: Only open pickle files from people you trust. A malicious pickle can execute code when it is loaded. For sharing data widely, prefer formats like CSV (or other non-executable data formats).
# Save the dataframe as a Pickle file
df.to_pickle("data/cleaned_countries_data.pkl")# Read it back later
test_pkl_df = pd.read_pickle("data/cleaned_countries_data.pkl")
test_pkl_df.head()| iso3 | Country | year | hc | pop | gdp | xr | gdp_bn | gdp_pc | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CHE | Switzerland | 2000 | 3.528448 | 7.184007 | 368548.87500 | 1.688843 | 368.548875 | 51301.296755 |
| 1 | CHE | Switzerland | 2001 | 3.538948 | 7.226391 | 369422.81250 | 1.687615 | 369.422813 | 51121.342936 |
| 2 | CHE | Switzerland | 2002 | 3.549480 | 7.278752 | 366579.53125 | 1.558607 | 366.579531 | 50362.964867 |
| 3 | CHE | Switzerland | 2003 | 3.560044 | 7.333447 | 362704.65625 | 1.346651 | 362.704656 | 49458.959238 |
| 4 | CHE | Switzerland | 2004 | 3.570638 | 7.384194 | 370410.31250 | 1.243496 | 370.410312 | 50162.592221 |
Visualization with Matplotlib
Installation / Import
- Installation: If you’re using it for the first time you need to install it. Run
uv add matplotlibin the terminal - Import the package
import matplotlib.pyplot as plt
matplotlib: The main plotting library. It provides the core machinery for creating figures and rendering graphics.matplotlib.pyplot: A user-friendly module that gives a simple “plotting commands” interface (it manages the current figure/axes for you).plt: The conventional nickname
import matplotlib.pyplot as pltYour First Plot
Let’s make a simple line plot of GDP per capita (gdp_pc) over time for Switzerland.
Basic Syntax:
plt.plot(x, y)to draw the plotplt.show()to display it
Matplotlib plots whatever x–y values you give it, so we first create a smaller DataFrame that only contains Switzerland. x values will be ‘year’ and the y values will be ‘gdp_pc’. For a line plot, it’s important that the x-axis (time) is sorted.
# Filter to Switzerland and sort by year
df_switzerland = df[df["Country"] == "Switzerland"]
df_switzerland = df_switzerland.sort_values("year")
# df_switzerland.head()# Plot year (x-axis) vs gdp_pc (y-axis)
plt.plot(df_switzerland["year"], df_switzerland["gdp_pc"])
# Show the plot
plt.show()Without labels, a plot is just a mystery line on a grid. You should add the relevant context
plt.xlabel('some text'): Add a label for the x axis, for example ‘Year’plt.ylabel('some text'): Add a label for the y axis, for example ‘GDP per Capita (mio. USD)’plt.title('some text'): Add a title: the title should be descriptive enough that a reader can understand the chart
# Create the plot
# Plot year (x-axis) vs gdp_pc (y-axis)
plt.plot(df_switzerland["year"], df_switzerland["gdp_pc"])
# NEW: Add labels so the plot is readable
plt.title("GDP per capita over time — Switzerland")
plt.xlabel("Year")
plt.ylabel("GDP per capita (mio. USD)")
# Show the plot
plt.show()Some useful layout methods:
- Size: You can initizalize the figure with
plt.figure()and determine the size using the argumentfigsize = (width, height) - Grid: You can add a grid, which can help the reader to map a specific year to a specific value using
plt.grid() - Layout: Use
plt.tight_layout()before you show the plot. It automatically adjusts the spacing in your figure so that things like the title, axis labels, and tick labels don’t overlap or get cut off.
# NEW: initialize and set a size
plt.figure(figsize=(8, 5))
# Plot year (x-axis) vs gdp_pc (y-axis)
plt.plot(df_switzerland["year"], df_switzerland["gdp_pc"])
# Add labels so the plot is readable
plt.title("GDP per capita over time — Switzerland")
plt.xlabel("Year")
plt.ylabel("GDP per capita (mio. USD)")
# NEW: Add a grid
plt.grid()
# Show the plot
# NEW: tight layout
plt.tight_layout()
plt.show()Types of Plots
Here are some of the most common plot types in Matplotlib, what they’re for, and the basic syntax
- Line plot (trend over time or another ordered variable)
Syntax:plt.plot(x, y) - Scatter plot (relationship between two numeric variables; each point is one observation)
Syntax:plt.scatter(x, y) - Bar chart (compare values across categories)
Syntax:plt.bar(categories, values)for vertical bars
Syntax:plt.barh(categories, values)for horizontal bars. - Histogram (distribution of a single numeric variable)
Syntax:plt.hist(x, bins=...) - Box plot (distribution + median + spread; easy comparison across groups)
Syntax:plt.boxplot(data) - Pie chart (shares of a whole; generally best for a small number of categories)
Syntax:plt.pie(values, labels=labels)
Bar Chart
A bar chart is used when you want to compare a single numeric value across different categorie (for example, a value per country, sector, or group).
- Each bar represents one category
- The length/height of the bar represents the value
- Bar charts are easiest to read when categories are sorted.
- For long category names (like country names), a horizontal bar chart (plt.barh) is often more readable than a vertical one.
Syntax:
- Vertical:
plt.bar(x_categories, y_values) - Horizontal:
plt.barh(y_categories, x_values)
Example: Human capital index ‘hc’ of the different countries in 2023
# 1) Filter the data for the year 2023 and sort countries by hc
df_2023 = df[df["year"] == 2023].sort_values("hc", ascending=True)
# 2) Initialize the figure (controls plot size)
plt.figure(figsize=(8, 4.5))
# 3) Create a horizontal bar chart
plt.barh(df_2023["Country"], df_2023["hc"])
# 4) Add labels so the chart is interpretable
plt.title("Human Capital Index by Country (2023)")
plt.xlabel("Human Capital Index")
plt.ylabel("") # optional: remove label if it's obvious from the country names
# 5) Final layout and display
plt.tight_layout()
plt.show()Scatterplot
A scatter plot is used to visualize the relationship between two numeric variables. Each dot represents one observation (for example: one country in one year).
Scatter plots are useful for spotting: - Positive or negative relationships (do values tend to rise together?) - Clusters (groups of points that look similar) - Outliers (unusual observations)
Basic Syntax: plt.scatter(x, y)
Example: Create a scatterplot for human capital index (‘hc’) and GDP per capita (‘gdp_pc’) for all countries and years
plt.figure(figsize=(8, 6))
# Use the full dataframe 'df'
plt.scatter(df['hc'], df['gdp_pc'])
# Add Context
plt.title('Human Capital vs. GDP per Capita: All Countries, All Years (2000-2023)')
plt.xlabel('Human Capital Index')
plt.ylabel('GDP per Capita (USD)')
plt.grid(True, linestyle='--', alpha=0.2)
plt.tight_layout()
plt.show()Histogram
A Histogram is the go-to tool for understanding the “shape” of your data. A histogram shows the distribution of values across your data. We use histograms to see the frequency of certain outcomes. It takes a continuous variable (like GDP growth) and splits it into “bins” (ranges). It then counts how many data points fall into each bin.
Syntax: plt.hist(data, bins=10)
data: The column of your dataframe for which you want to see the distribution of valuesbins: The number of bins.
Useful additional arguments in plt.hist(): - color: define the color that fills the bar - edgecolor: define the color of the border, usually white or black. Helps to separate the bars - …
plt.figure(figsize=(8, 5))
# Use the 'gdp_growth' column
# .dropna() is important because histograms can't handle missing values
plt.hist(df['gdp_pc'].dropna(), bins=30, color='steelblue', edgecolor='white')
# Add Context
plt.title('Distribution of GDP per Capita (2000-2023)', fontsize=14)
plt.xlabel('GDP per Capita', fontsize=12)
plt.ylabel('Frequency (Number of Occurrences)', fontsize=12)
plt.grid(axis='y', alpha=0.3)
plt.show()Next Week
Next week we’ll build on what we learned today and go a step further:
- Time series data: working with dates and time, and plotting trends over time more effectively
- Messier real-world data: handling missing values and other common data issues
- Customizing plots: improving readability with better labels, formatting, scales, and styling options
- Reusable plotting functions: writing Python functions that create figures so you can make the same plot quickly for different countries/variables