4,222 Introduction to Programming

Week 5: Python for Data - Part I

Author

Franziska Bender

Published

March 14, 2026

Before We Begin:

Installing Packages

In Python, we use external “packages” to do heavy lifting—like analyzing data or drawing graphs. Since Python doesn’t come with these built-in, we need to download them.

We will use uv to download these packages and formally add them to our project. We need four core packages today:

  • numpy: A math package that works behind the scenes to make data calculations fast.
  • pandas: For working with datasets (like Excel or CSV files).
  • matplotlib: For drawing graphs and charts.
  • ipykernel: The “engine” that allows VS Code to run Python code inside this notebook.

To install them open a new terminal, type:

uv add ipykernel pandas numpy matplotlib

and press Enter.

(Note: Some of these packages you have probably installed previously. Don’t worry, uv will check what is already installed and only add what isn’t.)

Recommended Folder Structure

If you want to work through this notebook yourself, test the code, and also experiment with it, we recommend extending the folder structure we set up previously. Set up a folder for week_05. Put your Jupyter notebook (my_analysis_week05.ipynb) there, and create a data subfolder where you save the dataset “countries_data.csv”.

Introduction_to_programming/
├── github_course_materials/ 
├── exercises/              
│   ├── ...
├── group_project/
│   ├── ...
└── lectures/                
    └── week_05/             # <--- NEW: Create this folder
        ├── data/            # <--- NEW: Create this folder
        │   └── countries_data.csv      # NEW: Save the dataset
        └── my_analysis_week05.ipynb

(Note: You can create your own folder structure if you prefer, but remember to adjust the file paths in your code later when we load or save dataframes!)

Pandas

Pandas is Python’s go-to library for working with tabular data—the kind you typically see in spreadsheets or statistical software. It gives you fast, convenient tools to

  • import data (CSV, Excel, SQL),
  • clean and reshape it (handle missing values, filter rows, create new variables, merge datasets), and
  • summarize it (grouping, descriptive statistics)

in a way that’s both readable and reproducible. For economics work, pandas is especially useful because it makes it easy to build a transparent data pipeline: from raw data to analysis-ready datasets, and then on to figures and results you can trust and rerun.

import numpy as np
import pandas as pd

When we import these packages, you will almost always see as np and as pd. Think of this as creating an official, time-saving nickname.

  • By writing import pandas as pd, we tell Python: “Whenever I type pd, I mean pandas.”
  • It saves typing: You get to use pd.read_csv() instead of the full name.
  • It is the industry standard: pd and np are the universal conventions in the Python data science and economics community. Using these specific abbreviations ensures your code is instantly readable to anyone else sharing your work.

Numpy

NumPy (Numerical Python) is a math package that works behind the scenes to make data calculations fast. If Pandas is the spreadsheet, NumPy is the underlying calculator. In fact, Pandas is actually built on top of NumPy. For many tasks Pandas will use NumPy in the background without you needing to import it explicitly. But occasionally we need to do math that Pandas doesn’t handle on its own. Since they are the dynamic duo of Python data analysis, it is standard practice to just import both at the very top of your script so you have all your tools ready to go.

Reading Data

To get data into your workspace, we use reader functions. Since a lot of economic data comes as CSV (Comma-Separated Values) files, pd.read_csv() will likely be your most-used command.

Basic Syntax:

df = pd.read_csv("path/to/your/file.csv")

When you run this, Pandas does three things:

  • Finds the file on your computer (based on the path you give it).
  • Splits the text into columns using a delimiter (usually a comma).
  • Builds a DataFrame (a table-like object) and stores it in df (a common convention).

Real-world economic data is often messy. These options can rescue you when the default settings don’t work:

  • sep: If your file uses semicolons instead of commas, use sep=';'
  • index_col: If the first column is a “Year” or “ID” and you want it to be the label for your rows, use index_col=0.
  • header: If your data starts on the 3rd row because there are two rows of titles/notes at the top, use header=2.

If your data is not in csv format there are different functions to read the data:

  • pd.read_excel(): For excel files (.xlsx or .xls). It Unlike a CSV, an Excel file can have multiple sheets. You can specify the name (as a string) or the position (e.g., 0 for the first sheet using the argument sheet_name).
  • pd.read_stata(): For stata files (.dta)

Example: We’ll work with a dataset “countries_data.csv” located in the folder “data”. Read the data

# Load the dataset
df = pd.read_csv("data/countries_data.csv")

After reading a dataset, always do a quick check that it worked. Datasets can be large, so you usually don’t want to print the entire table. Your safest first preview is: df.head()

# Preview of data
df.head()
iso3 Country year hc pop rgdpe xr
0 CHE Switzerland 2000 3.528448 7.184007 368548.87500 1.688843
1 CHE Switzerland 2001 3.538948 7.226391 369422.81250 1.687615
2 CHE Switzerland 2002 3.549480 7.278752 366579.53125 1.558607
3 CHE Switzerland 2003 3.560044 7.333447 362704.65625 1.346651
4 CHE Switzerland 2004 3.570638 7.384194 370410.31250 1.243496

This shows the first few rows so you can confirm that:

  • columns are aligned correctly,
  • headers look right,
  • and the values “make sense” at first glance.

FileNotFoundError

%xmode Minimal
df = pd.read_csv("my_data/countries_data.csv")
Exception reporting mode: Minimal
FileNotFoundError: [Errno 2] No such file or directory: 'my_data/countries_data.csv'

you might get a massive block of red text ending in FileNotFoundError. Don’t panic! This simply means Python cannot find the file at the path you provided.

  1. Check your spelling: Ensure the file name is exactly “countries_data.csv” and the folder is exactly “data/”.
  2. Check your Working Directory: Python looks for the file relative to where you are

If you ever get completely lost, you can always run a cell

import os          
os.getcwd()
'C:\\Users\\aurel\\OneDrive\\Documents\\BEcon_4222_Introduction_Programming\\course_materials\\week_05'

to see exactly which folder Python thinks it is in right now!

Navigating folder paths:

  • "data/countries_data.csv": This tells Python, “Look inside my current location for a folder named data, and grab the file.”
  • "../": This tells Python, “Take one step backward (up) out of the current folder.” So if you have a folder data that is not in week05 but one level above, this is how you can access it.

A First Look at Your Data

The Data

In this lecture we work with data from the Penn World Table. The Penn World Table (PWT) is a cross-country macroeconomic dataset that provides internationally comparable information on income, output, inputs, prices, and productivity. The current PWT 11.0 release covers 185 countries from 1950 to 2023 and is designed to make economic comparisons across countries and over time using purchasing power parity (PPP) methods.

The variables we use in this class to explore the functionality of Pandas are:

Variable Full Name / Definition
iso3 3-letter ISO Country Code
Country Full Country Name
year Observation Year
hc Human capital index (based on years of schooling)
pop Population (in millions)
rgdpe Expenditure-side real GDP at chained PPPs (in mil. 2021US$)
xr Exchange rate, national currency/USD (market+estimated)

There are many other variables (and countries) available, you can download data from the PWT here

Inspect the DataFrame

Once your data is in a DataFrame, your first step is to inspect it. Pandas provides several built-in methods and attributes that help you quickly understand what you’re working with.

Methods (you call them)

Methods are functions attached to an object (here df), you call them using object.method(), (with parentheses!)

  • df.head(): shows the first rows of the DataFrame (default: 5)
  • df.tail():shows the first rows of the DataFrame (default: 5)
  • df.info(): shows the number of rows and columns, non-null counts (missing data), and data types
  • df.describe(): for numeric columns, gives summary statistics (count, mean, std, min, quartiles, max)

Attributes (you can access them)

Attributes are stored information, so you don’t use parentheses.

  • df.shape: a tuple (number of rows, number of columns)
  • df.columns: the column names (an index-like object)
# Why use it: 
# To quickly verify that the data loaded correctly, headers are aligned, 
# and the values make sense at first glance
df.head()
iso3 Country year hc pop rgdpe xr
0 CHE Switzerland 2000 3.528448 7.184007 368548.87500 1.688843
1 CHE Switzerland 2001 3.538948 7.226391 369422.81250 1.687615
2 CHE Switzerland 2002 3.549480 7.278752 366579.53125 1.558607
3 CHE Switzerland 2003 3.560044 7.333447 362704.65625 1.346651
4 CHE Switzerland 2004 3.570638 7.384194 370410.31250 1.243496
# Why use it:
# Sometimes the data hasfootnotes or summary rows appended to the very bottom of the CSV. 
# This helps you catch and remove that "junk" data before running your analysis.
df.tail()
iso3 Country year hc pop rgdpe xr
211 USA United States 2019 3.749341 337.790067 23200364.0 1.0
212 USA United States 2020 3.754667 339.436159 22676858.0 1.0
213 USA United States 2021 3.728173 340.161441 24167636.0 1.0
214 USA United States 2022 3.813643 341.534046 24861132.0 1.0
215 USA United States 2023 3.829826 343.477335 25641846.0 1.0
# Why use it: 
# This is your structural overview. It lets you instantly spot missing data or 
# check if a numeric column like GDP was accidentally imported as text
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216 entries, 0 to 215
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   iso3     216 non-null    object 
 1   Country  216 non-null    object 
 2   year     216 non-null    int64  
 3   hc       216 non-null    float64
 4   pop      216 non-null    float64
 5   rgdpe    216 non-null    float64
 6   xr       216 non-null    float64
dtypes: float64(4), int64(1), object(2)
memory usage: 11.9+ KB

When you run df.info(), you will see a column called Dtype. Pandas uses slightly different names for data types than standard Python. Pandas also has additional datatypes (such as datetime64) which we will cover in the next lecture.

# Why use it:
# Get a sense of the values in your DataFrame, 
# Quick sanity check—for example, ensuring population values aren't negative 
# or that max GDP isn't unrealistically high.
df.describe()
year hc pop rgdpe xr
count 216.000000 216.000000 216.000000 2.160000e+02 216.000000
mean 2011.500000 3.270894 244.145055 6.419584e+06 15.110337
std 6.938266 0.454432 404.613603 7.085834e+06 33.988611
min 2000.000000 2.313537 7.184007 3.627047e+05 0.499772
25% 2005.750000 2.889263 60.954363 2.528470e+06 0.803922
50% 2011.500000 3.508762 82.057528 3.286765e+06 1.000000
75% 2017.250000 3.657557 128.089869 5.579467e+06 8.028658
max 2023.000000 3.846449 1426.437267 3.059032e+07 140.491100
# Why use it: 
# Confirm your dataset is the size you expect
df.shape
(216, 7)
# Why use it: 
# Find out what variables you're working with.
df.columns
Index(['iso3', 'Country', 'year', 'hc', 'pop', 'rgdpe', 'xr'], dtype='object')

As you learn Python, you will notice three different ways we interact with data. Understanding the difference now will save you a lot of syntax errors later!

  • Functions (Standalone Tools): These are tools that live on their own. You bring your data into the tool.
    • Syntax: function_name(object)
    • Example: print(df) or len(df)
  • Methods (Built-in Actions): These are functions attached directly to an object (like a DataFrame). You are asking the dataset to do something to itself. Because it is an action, it always requires parentheses ().
    • Syntax: object.method_name()
    • Example: df.head() or df.describe()
  • Attributes (Stored Facts): These are pieces of information or metadata saved inside the object. You are not asking the dataset to calculate anything new; you are just reading a fact. Because it is not an action, there are no parentheses.
    • Syntax: object.attribute_name
    • Example: df.shape or df.columns

Selecting Data

Selecting Columns

  • To grab one column and return a Series, use square brackets [] with the column name (a string):
  • To grab multiple columns, pass a list of names inside the brackets. This is why you’ll see double brackets [[]] (outer [] for pandas selection, inner [] for the list). This returns a DataFrame (even if the list has only one column):

Quick rule: “one name” –> Series, [“list of names”] –> DataFrame.

Examples:

df['year']              # return 'year' column as a series
0      2000
1      2001
2      2002
3      2003
4      2004
       ... 
211    2019
212    2020
213    2021
214    2022
215    2023
Name: year, Length: 216, dtype: int64
df[['year', 'pop']]     # returns 'year' and 'pop' column as a DataFrame
year pop
0 2000 7.184007
1 2001 7.226391
2 2002 7.278752
3 2003 7.333447
4 2004 7.384194
... ... ...
211 2019 337.790067
212 2020 339.436159
213 2021 340.161441
214 2022 341.534046
215 2023 343.477335

216 rows × 2 columns

Once you select a single column (a Series), you can use specific methods to explore it. For example,

  • .nunique(): Returns the number of unique values. (e.g., “Exactly how many different countries are in my dataset?”)
  • .unique(): Returns an array of the actual unique values. (e.g., “Give me a list of the names of all the countries included.”)
  • .value_counts(): Returns a count of how many times each unique value appears, sorted from most frequent to least frequent. (e.g., “How many years of data do I have for each country?”)
# How many unique countries are in the dataset?
df['Country'].nunique()
9
# What are the names of those countries?
df['Country'].unique()
array(['Switzerland', 'China', 'Germany', 'France', 'United Kingdom',
       'Italy', 'Japan', 'Mexico', 'United States'], dtype=object)
# How many observations (years) do we have for each country?
df['Country'].value_counts()
Country
Switzerland       24
China             24
Germany           24
France            24
United Kingdom    24
Italy             24
Japan             24
Mexico            24
United States     24
Name: count, dtype: int64

Selecting rows/columns by position: .iloc[]

.iloc[] selects by integer position, like standard Python indexing.

  • Syntax: df.iloc[row_position, column_position]
  • Counting starts at 0
  • Slices are end-exclusive (just like Python): 0:3 gives rows 0, 1, 2

Examples:

df.iloc[0, 0]      # first row, first column
'CHE'
df.iloc[0:3, 0:2]  # rows 0-2 and columns 0-1
iso3 Country
0 CHE Switzerland
1 CHE Switzerland
2 CHE Switzerland

Selecting rows/columns by label: .loc[]

.loc[] selects by labels (names) from the index and columns.

  • Syntax: df.loc[row_label, column_label]
df.loc[0, "rgdpe"]                 # if the index labels include 0
368548.875

Important difference: label-based slicing with .loc is usually end-inclusive:

# first 6 rows (0,1,2,3,4,5) and column 'rgdpe'
df.loc[0:5, "rgdpe"]
0    368548.87500
1    369422.81250
2    366579.53125
3    362704.65625
4    370410.31250
5    385585.37500
Name: rgdpe, dtype: float64

For both .loc and .iloc, using :(instead of label or index of columns or rows) means “everything” along that dimension:

df.loc[:,['year', 'pop']].head()
year pop
0 2000 7.184007
1 2001 7.226391
2 2002 7.278752
3 2003 7.333447
4 2004 7.384194

Pandas gives you several ways to subset your data, which can feel overwhelming. Here is the cheat sheet for what you should actually use in your research scripts:

  • Use [[]] (Standard Brackets) when you ONLY need columns: If you just want to isolate a few variables this is the cleanest way.
    • Example: df_small = df[['year', 'pop', 'gdp']]
  • Use .loc[] (Label-based) as your default for rows + columns:
    • If you need to select specific rows and columns, always prefer .loc. Because it uses the names of the columns, your code is robust. If the dataset is updated next year and adds a new column at the beginning, your .loc code will still grab the correct variables.
  • Use .iloc[] (Position-based) sparingly:
    • Selecting data by its integer position (e.g., “give me the 4th column”) is risky in economics because datasets change. Use .iloc only when position actually matters, such as grabbing the top 3 rows after sorting a dataset.

Filtering

Filtering - also called Boolean indexing - is one of the most important pandas skills. In research, you almost never use the full “universe” of data. Instead you focus on specific countries, certain time periods, recession years, or groups defined by a threshold (e.g., high inflation, high debt-to-GDP).

Filtering is a two-step-process:

    1. The Comparison (build a mask): When you compare a column to a value, pandas doesn’t return one answer. It returns a True/False value for every row. This True/False Series is called a mask (it has the same length as the DataFrame).
    1. The Selection/Filter (apply the mask): You can then use the mask in df[] and it will only return the rows for which the mask is True

Example: keep only observations from 2000 onward

# Step 1: Build a mask
mask = df["year"] >= 2010  
mask
0      False
1      False
2      False
3      False
4      False
       ...  
211     True
212     True
213     True
214     True
215     True
Name: year, Length: 216, dtype: bool
# Step 2: Apply the mask
df_2010 = df[mask]        
df_2010.head()              
iso3 Country year hc pop rgdpe xr
10 CHE Switzerland 2010 3.634872 7.825753 496048.96875 1.042906
11 CHE Switzerland 2011 3.642345 7.914372 552769.12500 0.888042
12 CHE Switzerland 2012 3.649832 7.998926 571063.31250 0.937684
13 CHE Switzerland 2013 3.657335 8.091202 568455.56250 0.926904
14 CHE Switzerland 2014 3.664854 8.190235 580136.00000 0.916151

Once you are comfortable with masks, you will often skip creating the ‘mask’ variable and put the condition directly inside the brackets:

df_recent = df[df["year"] >= 2015]

This is perfectly fine, especially for simple filtering, but creating a mask variable explicitly can improve readability if the filtering is more complex.

Comparison tools you’ll use all the time

  • ==: equal to
  • !=: not equal to
  • >/ <: greater or less than
  • >= / <=: greater than or equal to / less than or equal to
  • .isin([...]) membership in a list of values (very common for countries)

Example: Keep a set of countries

mask = df["iso3"].isin(['CHE', 'FRA', 'DEU', 'ITA'])    # 1. build a mask
df_europe = df[mask]                                    # 2. apply the mask
df_europe['Country'].value_counts()                     # did it work?
Country
Switzerland    24
Germany        24
France         24
Italy          24
Name: count, dtype: int64

Note: if use filter your dataframe df[mask] it creates a filtered view/result, but it does not overwrite df. To keep it you either (i) assign it to a new name df_filtered = df[mask] or (ii) you overwrite df=df[mask], which makes it permanent for your session.

Usually it’s better to create a new object name (df_eurupe, df_che, etc.) so you don’t “lose” data by accident.

Filtering with multiple conditions (combining masks)

Often you want to filter using more than one criterion—for example, a set of countries and only years after 2015.

  • Condition1: df['iso3'].isin(['FRA', 'DEU', 'CHE'])
  • Condition2: df['year'] > 2015

You can combine them with logical operators:

  • & (AND): Both conditions must be true
  • | (OR): At least one condition must be True.
  • ~ (NOT): Flips the mask, True becomes False and vice versa. So if it’s easier to define what you do not want to have in your dataframe you can use this.

Note: for lists we used and/or, but in pandas we use &/ |.

You must wrap each condition in parentheses ():

mask = (condition1) & (condition2)

Example: Filter observations for france, germany and switzerland after 2015

# example: mask for france, germany and switzerland for the years after 2015
mask = (df['iso3'].isin(['FRA', 'DEU', 'CHE'])) & (df['year'] > 2015)
df_filtered = df[mask]

Example: Filter for years after 2015 for all countries except USA and China

# example: mask for years after 2015 for all countries except USA and China
countries_to_exclude = ['China', 'United States']
mask = (~df['Country'].isin(countries_to_exclude)) & (df['year'] > 2015)
# use the mask to filter, select 'Country' and check the countries remaining: 
df_filtered = df[mask]

# Check whether USA and China are still in there: 
df_filtered['Country'].value_counts()
Country
Switzerland       8
Germany           8
France            8
United Kingdom    8
Italy             8
Japan             8
Mexico            8
Name: count, dtype: int64

Example with Error

# INCORRECT: Using 'and' instead of '&', and forgetting parentheses
mask = df['iso3'] == 'CHE' and df['year'] > 2015
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

If you get an error when you create the mask, check your filtering code:

  • Did you use and/or instead of &/|?
  • Did you forget the parentheses () around each individual condition?

Filtering and Selecting Simultaneously

So far, we have used df[mask] to filter our dataset. This returns all the columns for the rows that meet our condition.

But what if you only want to see the Country name and GDP for years where the population was over 100 million?

This is where .loc is actually the best tool for the job. You pass the mask as the row selector, and a list of columns as the column selector: df.loc[mask, columns]

# 1. Build the mask (Population over 100 million in the year 2023)
mask_large_pop = (df['year'] == 2023) & (df['pop'] > 100)

# 2. Define the exact columns you want to see
cols_to_keep = ['Country', 'pop', 'rgdpe']

# 3. Use .loc to apply the mask to the rows, and select the columns simultaneously
df_large = df.loc[mask_large_pop, cols_to_keep]

# See the result
df_large.head()
Country pop rgdpe
47 China 1422.584933 30590320.00
167 Japan 124.370947 5393180.50
191 Mexico 129.739759 2742741.25
215 United States 343.477335 25641846.00

The Alternative

# 1. Build the mask (Population over 100 million in the year 2023)
mask_large_pop = (df['year'] == 2023) & (df['pop'] > 100)

# 2. Apply the mask, and select columns 
df_large = df[mask_large_pop][['Country', 'pop', 'rgdpe']]

# See the result
df_large.head()
Country pop rgdpe
47 China 1422.584933 30590320.00
167 Japan 124.370947 5393180.50
191 Mexico 129.739759 2742741.25
215 United States 343.477335 25641846.00

It works and if you just want to have a quick look at the result it’s perfectly fine. However it’s not the best practice for two reasons:

  1. It is inefficient: Under the hood, Pandas has to do this in two separate steps. First, it creates a temporary, invisible copy of your entire dataset with just the filtered rows. Then, it runs a second operation to pluck the columns out of that temporary copy.
  2. ChainedAssignmentError: If you ever try to change a value using chained brackets (for example, trying to fix a typo in the GDP column: df[mask]['gdp'] = 100)), it won’t work. Pandas doesn’t know if you want to modify the original dataset or the temporary invisible copy it just made. Older versions of pandas raised the SettingWithCopyWarning.

Sorting

The primary tool is .sort_values()

  • Ascending (smallest to largest, default): df.sort_values(by = 'column_name')
  • Descending (largest to smallest): df.sort_values(by = 'column_name', ascending = False)

Example: Which country had the highest GDP in 2023?

latest_data = df[df["year"] == 2023]

top5_gdp = latest_data.sort_values(by="rgdpe", ascending=False)

top5_gdp[['Country', 'rgdpe']].head()
Country rgdpe
47 China 30590320.00
215 United States 25641846.00
167 Japan 5393180.50
71 Germany 5060520.00
119 United Kingdom 3606369.25

You’ll notice that it keeps the index of the observation of the original dataframe. If you want to see the ranking clearly you can add .reset_index(drop=True) after sorting. reset_index() will set a new index from 0 to … The argument drop=True ensures that the old index is dropped, without it there would be a column that keeps track of the old index.

latest_data = df[df["year"] == 2023]

top5_gdp = latest_data.sort_values(by="rgdpe", ascending=False).reset_index(drop=True)

top5_gdp[['Country', 'rgdpe']].head()
Country rgdpe
0 China 30590320.00
1 United States 25641846.00
2 Japan 5393180.50
3 Germany 5060520.00
4 United Kingdom 3606369.25

Sorting by multiple columns

You can pass a list to the by= argument, to sort by multiple columns, similarly you’ll have to pass a list of True/False values to ascending= if you want to overwrite the default.

# Sort the entire dataset by Country alphabetically, and then by year chronologically
df_sorted = df.sort_values(by=["Country", "year"], ascending=[True, False])
df_sorted[['Country', 'year', 'rgdpe']].head()
Country year rgdpe
47 China 2023 30590320.0
46 China 2022 29195480.0
45 China 2021 28254862.0
44 China 2020 25899170.0
43 China 2019 25296168.0

Does it overwrite?

.sort_values() does not overwrite the dataframe. If you want the dataframe to stay sorted permanently in your computer’s memory, you have to overwrite it: df = df.sort_values(...), or you can use the argument inplace= (False by default) and set it to True.

Aggregation and Grouping

Aggregation

Aggregation means reducing many values down to one summary number—for example the average GDP per capita, the maximum exchange rate, or the number of countries in the dataset.

Common aggregation methods

  • .sum(): sum of values (numeric columns)
  • .mean(): average
  • .median(): median
  • .min()/ .max(): smallest and largest value
  • .std() / .var(): standard deviation / variance
  • .nunique(): Number of unique values

Example: Calculate the average GDP (‘rgdpe’) overall, and the average gdp of switzerland

# Average GDP across countries
avg_gdp = df['rgdpe'].mean()
print(avg_gdp)
6419583.53978588
# Average GDP Switzerland
avg_gdp_switzerland = df[df['Country']=='Switzerland']['rgdpe'].mean()
print(avg_gdp_switzerland)
535882.95703125

Note: df['rgdpe'].mean() averages across country-year observations, to get the average for a country we had to filter the data first. We’ll now look at .groupby() which enables us to compute country-specific averages for all countries at once.

Grouping

groupby() is pandas’ way to do “split → apply → combine”:

  1. Split the rows into groups based on one or more columns (e.g., by country or by year)
  2. Apply an operation within each group (mean, sum, count, etc.)
  3. Combine the results into a new object

Basic Syntax: df.groupby("group_column")["value_column"].agg_function()

Example: Average real GDP by country

df.groupby("Country")["rgdpe"].mean()
Country
China             1.697586e+07
France            2.945856e+06
Germany           4.184951e+06
Italy             2.569521e+06
Japan             5.480990e+06
Mexico            2.265468e+06
Switzerland       5.358830e+05
United Kingdom    3.084066e+06
United States     1.973366e+07
Name: rgdpe, dtype: float64

By default, the grouping variable becomes the index of the result. If you want a regular DataFrame instead, you have two common options:

  • as_index=False inside groupby(...), or
  • .reset_index() after the aggregation.
# Use as_index = False
avg_gdp_by_country = df.groupby("Country", as_index = False)["rgdpe"].mean()
avg_gdp_by_country.head()
Country rgdpe
0 China 1.697586e+07
1 France 2.945856e+06
2 Germany 4.184951e+06
3 Italy 2.569521e+06
4 Japan 5.480990e+06

Example: Total world population by year

# Use reset_index() after the aggregation
world_pop = df.groupby("year")["pop"].sum().reset_index()
world_pop.head()
year pop
0 2000 2043.434364
1 2001 2057.961651
2 2002 2071.574431
3 2003 2084.790630
4 2004 2098.076959

Group by multiple columns

# Average real GDP by Country and year
gdp_country_year = df.groupby(["Country", "year"])["rgdpe"].mean().reset_index()
gdp_country_year.head()
Country year rgdpe
0 China 2000 5985309.5
1 China 2001 6520996.5
2 China 2002 7119989.0
3 China 2003 7792220.0
4 China 2004 8640414.0

Does this make sense here? –> No, for every country, year we only have one observation so aggregation doesn’t make sense

But in general it’s very useful, suppose you have monthly data for different countries and different years, you can group by country and year to create a dataframe with yearly data.

Multiple aggregations at once

Often you want more than just the mean. For example, looking at the minimum, average, and maximum within each group gives a quick sense of the spread of values. Use .agg() for this. You can pass a list to .agg(); for example ["mean", "std"] to get the mean and standard deviation for the variable you select (after grouping).

By default, the grouping variable becomes the index of the result (you can turn it back into a normal column with .reset_index() if you want).

Example: Create a summary of the min, mean and max of human capital (hc) for every country

# Summary of human capital (hc) by country
hc_stats = df.groupby("Country")["hc"].agg(["min", "mean", "max"])
hc_stats.head()
min mean max
Country
China 2.313537 2.522415 2.749776
France 2.891696 3.091695 3.312360
Germany 3.566319 3.644270 3.705208
Italy 2.771634 3.015264 3.265062
Japan 3.346039 3.504695 3.656504
# Access one country
hc_stats.loc["Switzerland"]
min     3.528448
mean    3.639921
max     3.745038
Name: Switzerland, dtype: float64
# Uf you prefer a normal DataFrame (Country as a column):
hc_stats.reset_index().head()
Country min mean max
0 China 2.313537 2.522415 2.749776
1 France 2.891696 3.091695 3.312360
2 Germany 3.566319 3.644270 3.705208
3 Italy 2.771634 3.015264 3.265062
4 Japan 3.346039 3.504695 3.656504

While passing a list [] to .agg() is great for applying multiple math functions to a single variable, you might want to calculate different statistics for different variables simultaneously.

To do this, we pass a dictionary {} to the .agg() method. The “Key” is the column name, and the “Value” is the function you want to apply.

Basic Syntax

df.groupby('grouping_column').agg({
    'variable_1': 'mean',
    'variable_2': 'sum',
    'variable_3': 'max'
})

Example: Aggregate different variables with different methods

year_summary = df.groupby('year').agg({
    'pop': 'sum',        # Total world population
    'hc': 'mean'         # Average human capital level
}).reset_index()
year_summary.head()
year pop hc
0 2000 2043.434364 3.102620
1 2001 2057.961651 3.117717
2 2002 2071.574431 3.132900
3 2003 2084.790630 3.148169
4 2004 2098.076959 3.163525

Changing / Creating Variables

Renaming Variables

Datasets often come with short or cryptic column names. Renaming makes your code easier to read and reduces mistakes.

  • rename(columns={...}) lets you rename one or more columns using a dictionary.
  • It returns a new DataFrame, so it’s common to reassign: df = df.rename(...).

Example:

# Rename columns to clearer names
df = df.rename(columns={
    "rgdpe": "gdp",   # GDP measured in millions
})
df.head()
iso3 Country year hc pop gdp xr
0 CHE Switzerland 2000 3.528448 7.184007 368548.87500 1.688843
1 CHE Switzerland 2001 3.538948 7.226391 369422.81250 1.687615
2 CHE Switzerland 2002 3.549480 7.278752 366579.53125 1.558607
3 CHE Switzerland 2003 3.560044 7.333447 362704.65625 1.346651
4 CHE Switzerland 2004 3.570638 7.384194 370410.31250 1.243496

Creating New Variables

A derived variable is a new column computed from existing columns. This is how you change units, create ratios, and build features for analysis. Pandas uses vectorized operations. This means you don’t need to write a loop to calculate math for each row. If you divide one column by another, Pandas automatically does the math row-by-row.

Example: ‘gdp’ is measured in millions USD, create a new variable ‘gdp_bn’ which is in billions

df["gdp_bn"] = df["gdp"] / 1000
df.head()
iso3 Country year hc pop gdp xr gdp_bn
0 CHE Switzerland 2000 3.528448 7.184007 368548.87500 1.688843 368.548875
1 CHE Switzerland 2001 3.538948 7.226391 369422.81250 1.687615 369.422813
2 CHE Switzerland 2002 3.549480 7.278752 366579.53125 1.558607 366.579531
3 CHE Switzerland 2003 3.560044 7.333447 362704.65625 1.346651 362.704656
4 CHE Switzerland 2004 3.570638 7.384194 370410.31250 1.243496 370.410312

Example: create a variable ‘gdp_pc’ (gdp per capita) by dividing ‘gdp’ by population ‘pop’

df["gdp_pc"] = df["gdp"] / df["pop"]
df.head()
iso3 Country year hc pop gdp xr gdp_bn gdp_pc
0 CHE Switzerland 2000 3.528448 7.184007 368548.87500 1.688843 368.548875 51301.296755
1 CHE Switzerland 2001 3.538948 7.226391 369422.81250 1.687615 369.422813 51121.342936
2 CHE Switzerland 2002 3.549480 7.278752 366579.53125 1.558607 366.579531 50362.964867
3 CHE Switzerland 2003 3.560044 7.333447 362704.65625 1.346651 362.704656 49458.959238
4 CHE Switzerland 2004 3.570638 7.384194 370410.31250 1.243496 370.410312 50162.592221

Saving a DataFrame

After spending time cleaning, filtering, and transforming your data, you don’t want to lose that work. Saving a DataFrame lets you:

  • share the cleaned dataset with someone else
  • open it in another program (Excel, Stata, R, etc.)
  • reload it later without rerunning all your code

In pandas, saving is done with “export” methods. The pattern is: df.to_...("filename")

Save as CSV (most common)

You can save your data frame as a csv file using .to_csv().

CSV is the most universal format—it works in Excel, Google Sheets, R, Stata, and more.

By default, pandas will also save the DataFrame’s index (0, 1, 2, …) as an extra column. Most of the time you don’t want that, so use index=False.

df.head() # Check again if it looks the way it should
iso3 Country year hc pop gdp xr gdp_bn gdp_pc
0 CHE Switzerland 2000 3.528448 7.184007 368548.87500 1.688843 368.548875 51301.296755
1 CHE Switzerland 2001 3.538948 7.226391 369422.81250 1.687615 369.422813 51121.342936
2 CHE Switzerland 2002 3.549480 7.278752 366579.53125 1.558607 366.579531 50362.964867
3 CHE Switzerland 2003 3.560044 7.333447 362704.65625 1.346651 362.704656 49458.959238
4 CHE Switzerland 2004 3.570638 7.384194 370410.31250 1.243496 370.410312 50162.592221
# Save the cleaned dataframe to a CSV file
df.to_csv("data/cleaned_countries_data.csv", index=False)

You can save to other formats as well - to_excel() - to_stata()

Save to Pickle

In Python, pickling is the process of saving a Python object (like a pandas DataFrame) as a file by converting it into a stream of bytes. When you load it back, Python reconstructs the object exactly.

Why use Pickle?

  1. Speed: Reading/writing is often much faster than CSV for larger datasets.
  2. Preserves metadata: It keeps your DataFrame exactly as it is (data types, datetime columns, index, etc.).
  3. File size: It’s often smaller than CSV, and you can also save a compressed pickle if you want.

Syntax:

  • Save to pickle syntax: df.to_pickle("file.pkl")
  • When you read it back later syntax: `pd.read_pickle(“file.pkl”)

Warning: Only open pickle files from people you trust. A malicious pickle can execute code when it is loaded. For sharing data widely, prefer formats like CSV (or other non-executable data formats).

# Save the dataframe as a Pickle file
df.to_pickle("data/cleaned_countries_data.pkl")
# Read it back later
test_pkl_df = pd.read_pickle("data/cleaned_countries_data.pkl")
test_pkl_df.head()
iso3 Country year hc pop gdp xr gdp_bn gdp_pc
0 CHE Switzerland 2000 3.528448 7.184007 368548.87500 1.688843 368.548875 51301.296755
1 CHE Switzerland 2001 3.538948 7.226391 369422.81250 1.687615 369.422813 51121.342936
2 CHE Switzerland 2002 3.549480 7.278752 366579.53125 1.558607 366.579531 50362.964867
3 CHE Switzerland 2003 3.560044 7.333447 362704.65625 1.346651 362.704656 49458.959238
4 CHE Switzerland 2004 3.570638 7.384194 370410.31250 1.243496 370.410312 50162.592221

Visualization with Matplotlib

Installation / Import

  1. Installation: If you’re using it for the first time you need to install it. Run uv add matplotlib in the terminal
  2. Import the package import matplotlib.pyplot as plt
  • matplotlib: The main plotting library. It provides the core machinery for creating figures and rendering graphics.
  • matplotlib.pyplot: A user-friendly module that gives a simple “plotting commands” interface (it manages the current figure/axes for you).
  • plt: The conventional nickname
import matplotlib.pyplot as plt

Your First Plot

Let’s make a simple line plot of GDP per capita (gdp_pc) over time for Switzerland.

Basic Syntax:

  1. plt.plot(x, y) to draw the plot
  2. plt.show() to display it

Matplotlib plots whatever x–y values you give it, so we first create a smaller DataFrame that only contains Switzerland. x values will be ‘year’ and the y values will be ‘gdp_pc’. For a line plot, it’s important that the x-axis (time) is sorted.

# Filter to Switzerland and sort by year
df_switzerland = df[df["Country"] == "Switzerland"]
df_switzerland = df_switzerland.sort_values("year")
# df_switzerland.head()
# Plot year (x-axis) vs gdp_pc (y-axis)
plt.plot(df_switzerland["year"], df_switzerland["gdp_pc"])

# Show the plot
plt.show()

Without labels, a plot is just a mystery line on a grid. You should add the relevant context

  • plt.xlabel('some text'): Add a label for the x axis, for example ‘Year’
  • plt.ylabel('some text'): Add a label for the y axis, for example ‘GDP per Capita (mio. USD)’
  • plt.title('some text'): Add a title: the title should be descriptive enough that a reader can understand the chart
# Create the plot
# Plot year (x-axis) vs gdp_pc (y-axis)
plt.plot(df_switzerland["year"], df_switzerland["gdp_pc"])

# NEW: Add labels so the plot is readable
plt.title("GDP per capita over time — Switzerland")
plt.xlabel("Year")
plt.ylabel("GDP per capita (mio. USD)")

# Show the plot
plt.show()

Some useful layout methods:

  • Size: You can initizalize the figure with plt.figure() and determine the size using the argument figsize = (width, height)
  • Grid: You can add a grid, which can help the reader to map a specific year to a specific value using plt.grid()
  • Layout: Use plt.tight_layout() before you show the plot. It automatically adjusts the spacing in your figure so that things like the title, axis labels, and tick labels don’t overlap or get cut off.
# NEW: initialize and set a size
plt.figure(figsize=(8, 5))

# Plot year (x-axis) vs gdp_pc (y-axis)
plt.plot(df_switzerland["year"], df_switzerland["gdp_pc"])

# Add labels so the plot is readable
plt.title("GDP per capita over time — Switzerland")
plt.xlabel("Year")
plt.ylabel("GDP per capita (mio. USD)")

# NEW: Add a grid
plt.grid()

# Show the plot
# NEW: tight layout
plt.tight_layout()
plt.show()

Types of Plots

Here are some of the most common plot types in Matplotlib, what they’re for, and the basic syntax

  • Line plot (trend over time or another ordered variable)
    Syntax: plt.plot(x, y)
  • Scatter plot (relationship between two numeric variables; each point is one observation)
    Syntax: plt.scatter(x, y)
  • Bar chart (compare values across categories)
    Syntax: plt.bar(categories, values) for vertical bars
    Syntax: plt.barh(categories, values) for horizontal bars.
  • Histogram (distribution of a single numeric variable)
    Syntax: plt.hist(x, bins=...)
  • Box plot (distribution + median + spread; easy comparison across groups)
    Syntax: plt.boxplot(data)
  • Pie chart (shares of a whole; generally best for a small number of categories)
    Syntax: plt.pie(values, labels=labels)

Bar Chart

A bar chart is used when you want to compare a single numeric value across different categorie (for example, a value per country, sector, or group).

  • Each bar represents one category
  • The length/height of the bar represents the value
  • Bar charts are easiest to read when categories are sorted.
  • For long category names (like country names), a horizontal bar chart (plt.barh) is often more readable than a vertical one.

Syntax:

  • Vertical: plt.bar(x_categories, y_values)
  • Horizontal: plt.barh(y_categories, x_values)

Example: Human capital index ‘hc’ of the different countries in 2023

# 1) Filter the data for the year 2023 and sort countries by hc
df_2023 = df[df["year"] == 2023].sort_values("hc", ascending=True)

# 2) Initialize the figure (controls plot size)
plt.figure(figsize=(8, 4.5))

# 3) Create a horizontal bar chart
plt.barh(df_2023["Country"], df_2023["hc"])

# 4) Add labels so the chart is interpretable
plt.title("Human Capital Index by Country (2023)")
plt.xlabel("Human Capital Index")
plt.ylabel("")  # optional: remove label if it's obvious from the country names

# 5) Final layout and display
plt.tight_layout()
plt.show()

Scatterplot

A scatter plot is used to visualize the relationship between two numeric variables. Each dot represents one observation (for example: one country in one year).

Scatter plots are useful for spotting: - Positive or negative relationships (do values tend to rise together?) - Clusters (groups of points that look similar) - Outliers (unusual observations)

Basic Syntax: plt.scatter(x, y)

Example: Create a scatterplot for human capital index (‘hc’) and GDP per capita (‘gdp_pc’) for all countries and years

plt.figure(figsize=(8, 6))

# Use the full dataframe 'df'
plt.scatter(df['hc'], df['gdp_pc'])

# Add Context
plt.title('Human Capital vs. GDP per Capita: All Countries, All Years (2000-2023)')
plt.xlabel('Human Capital Index')
plt.ylabel('GDP per Capita (USD)')

plt.grid(True, linestyle='--', alpha=0.2)
plt.tight_layout()
plt.show()

Histogram

A Histogram is the go-to tool for understanding the “shape” of your data. A histogram shows the distribution of values across your data. We use histograms to see the frequency of certain outcomes. It takes a continuous variable (like GDP growth) and splits it into “bins” (ranges). It then counts how many data points fall into each bin.

Syntax: plt.hist(data, bins=10)

  • data: The column of your dataframe for which you want to see the distribution of values
  • bins: The number of bins.

Useful additional arguments in plt.hist(): - color: define the color that fills the bar - edgecolor: define the color of the border, usually white or black. Helps to separate the bars - …

plt.figure(figsize=(8, 5))

# Use the 'gdp_growth' column
# .dropna() is important because histograms can't handle missing values
plt.hist(df['gdp_pc'].dropna(), bins=30, color='steelblue', edgecolor='white')

# Add Context
plt.title('Distribution of GDP per Capita (2000-2023)', fontsize=14)
plt.xlabel('GDP per Capita', fontsize=12)
plt.ylabel('Frequency (Number of Occurrences)', fontsize=12)

plt.grid(axis='y', alpha=0.3)
plt.show()

Next Week

Next week we’ll build on what we learned today and go a step further:

  • Time series data: working with dates and time, and plotting trends over time more effectively
  • Messier real-world data: handling missing values and other common data issues
  • Customizing plots: improving readability with better labels, formatting, scales, and styling options
  • Reusable plotting functions: writing Python functions that create figures so you can make the same plot quickly for different countries/variables