import pandas as pd--------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) Cell In[1], line 1 ----> 1 import pandas as pd ModuleNotFoundError: No module named 'pandas'
Franziska Bender
March 14, 2026
In this exercise, we move from our simplified “teaching” dataset to a broader download from the Penn World Table (PWT). Download the data pwt_data_ex5.csv and save it in a folder data. This version contains more countries and a wider range of macroeconomic variables:
| Variable | Full Name / Definition |
|---|---|
| iso3 | 3-letter ISO Country Code |
| Country | Full Country Name |
| year | Observation Year |
| pop | Population (in millions) |
| rgdpna | Real GDP at constant 2021 national prices (in mil. 2021USD) |
| rnna | Capital stock at constant 2021 national prices (in mil. 2021USD) |
| emp | Number of persons engaged (in millions) |
| avh | Average annual hours worked by persons engaged |
| hc | Human capital index (based on years of schooling) |
| ctfp | TFP level at current PPPs (USA = 1) |
| labsh | Share of labor compensation in GDP |
| pl_con | Price level of household consumption |
Import Packages
--------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) Cell In[1], line 1 ----> 1 import pandas as pd ModuleNotFoundError: No module named 'pandas'
In this exercise, you will practice loading a comprehensive macroeconomic dataset into a DataFrame and performing an initial inspection to understand its dimensionality, variable types, and the unique entities it contains.
Your Tasks:
data/pwt_data_ex5.csv into a DataFrame named df.--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[2], line 2 1 # 1. Load the data ----> 2 df = pd.read_csv("data/pwt_data_ex5.csv") 3 df.head() NameError: name 'pd' is not defined
You’ll notice immediately the NaN entries in the dataframe. In pandas, NaN stands for “Not a Number” and is commonly used to represent missing or undefined data in a Series or DataFrame.
For this exercise we’re going to ignore them. Handling missing data is something we’ll cover in week 6.
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[3], line 2 1 # 2. Size Check ----> 2 print(f"Dataset Shape: {df.shape}") 3 # df.shape returns (rows, columns) NameError: name 'df' is not defined
# 3. Variable Types and Non-Null counts
# .info() is perfect for seeing Dtypes and identifying missing data at a glance
df.info()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[4], line 3 1 # 3. Variable Types and Non-Null counts 2 # .info() is perfect for seeing Dtypes and identifying missing data at a glance ----> 3 df.info() NameError: name 'df' is not defined
# 4. Country Count
# We use .nunique() on the 'Country' column to count unique entries
num_countries = df['Country'].nunique()
print(f"There are {num_countries} countries in this dataset.")--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[5], line 3 1 # 4. Country Count 2 # We use .nunique() on the 'Country' column to count unique entries ----> 3 num_countries = df['Country'].nunique() 4 print(f"There are {num_countries} countries in this dataset.") NameError: name 'df' is not defined
# 5. Summary Statistics
# .describe() gives us the mean, std, and quartiles for numeric columns
df.describe()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[6], line 3 1 # 5. Summary Statistics 2 # .describe() gives us the mean, std, and quartiles for numeric columns ----> 3 df.describe() NameError: name 'df' is not defined
In this exercise, you will practice extracting specific subsets of data from your main DataFrame df.
Your Tasks:
df_2023 that contains only observations for the year 2023.df_switzerland that contains all years of data for Switzerland.'pop') was greater than 1,000 million (1 billion), which countries remain?df_comparison that contains only the data for Germany and France for the years 2000 to 2010 (inclusive). Hint: You will need to use the & operator and parentheses for multiple conditions.--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[8], line 2 1 # 2. The Country Focus ----> 2 df_switzerland = df[df['Country'] == 'Switzerland'] NameError: name 'df' is not defined
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[9], line 2 1 # 3. High-Population Economies ----> 2 high_pop = df[df['pop'] > 1000] 3 high_pop['Country'].unique() NameError: name 'df' is not defined
# 4. Specific Comparison
# We use .isin() for the countries and logical operators for the years
countries = ['Germany', 'France']
mask = (df['Country'].isin(countries)) & (df['year'] >= 2000) & (df['year'] <= 2010)
df_comparison = df[mask]
print(f"Size of comparison group: {len(df_comparison)}")--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[10], line 4 1 # 4. Specific Comparison 2 # We use .isin() for the countries and logical operators for the years 3 countries = ['Germany', 'France'] ----> 4 mask = (df['Country'].isin(countries)) & (df['year'] >= 2000) & (df['year'] <= 2010) 5 df_comparison = df[mask] 6 print(f"Size of comparison group: {len(df_comparison)}") NameError: name 'df' is not defined
In this exercise, you will create derived variables to help analyze productivity and living standards across different countries.
Your Tasks:
GDP per Capita: Create a new column called gdp_pc by dividing Real GDP (rgdpna) by Population (pop).
Capital Intensity: Create a column called k_labor which represents the amount of Capital Stock (rnna) available per person engaged (emp).
Labor Productivity: Create a column called labor_prod by dividing Real GDP (rgdpna) by the total number of hours worked (emp x avh).
Note: avh has many missing values, which will result in NaN for those specific rows in your new column. We’ll learn how to deal with missing values next week.
Display Results: Show the first 5 rows of the newly created variables, country and year.
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[11], line 3 1 # 1. GDP per Capita 2 # Formula: Real GDP / Population ----> 3 df['gdp_pc'] = df['rgdpna'] / df['pop'] NameError: name 'df' is not defined
# 2. Capital Intensity (Capital per worker)
# Formula: Capital Stock / Employment
df['k_labor'] = df['rnna'] / df['emp']--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[12], line 3 1 # 2. Capital Intensity (Capital per worker) 2 # Formula: Capital Stock / Employment ----> 3 df['k_labor'] = df['rnna'] / df['emp'] NameError: name 'df' is not defined
# 3. Labor Productivity (Output per hour)
# We multiply employment by average hours to get total labor hours
df['labor_prod'] = df['rgdpna'] / (df['emp'] * df['avh'])--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[13], line 3 1 # 3. Labor Productivity (Output per hour) 2 # We multiply employment by average hours to get total labor hours ----> 3 df['labor_prod'] = df['rgdpna'] / (df['emp'] * df['avh']) NameError: name 'df' is not defined
# Display the first few rows to verify the new columns
df[['Country', 'year', 'gdp_pc', 'k_labor', 'labor_prod']].head()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[14], line 2 1 # Display the first few rows to verify the new columns ----> 2 df[['Country', 'year', 'gdp_pc', 'k_labor', 'labor_prod']].head() NameError: name 'df' is not defined
In this exercise, you will practice summarizing your data. Instead of looking at individual rows, we want to understand broader trends by country or by year.
Your Tasks:
'gdp_pc') and their average labor share ('labsh') across all available years.'labor_prod').country_stats that shows the mean, min, and max of the Human Capital index ('hc') for every country.# We group by 'year' and sum population, then average TFP
global_trends = df.groupby('year').agg({
'pop': 'sum',
'ctfp': 'mean'
}).reset_index()
global_trends.head()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[15], line 2 1 # We group by 'year' and sum population, then average TFP ----> 2 global_trends = df.groupby('year').agg({ 3 'pop': 'sum', 4 'ctfp': 'mean' 5 }).reset_index() 6 global_trends.head() NameError: name 'df' is not defined
# Find the peak wealth and typical labor share for each nation
country_profiles = df.groupby('Country').agg({
'gdp_pc': 'max',
'labsh': 'mean'
}).reset_index()
country_profiles.head()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[16], line 2 1 # Find the peak wealth and typical labor share for each nation ----> 2 country_profiles = df.groupby('Country').agg({ 3 'gdp_pc': 'max', 4 'labsh': 'mean' 5 }).reset_index() 6 country_profiles.head() NameError: name 'df' is not defined
# Labor Productivity over time
productivity_trend = df.groupby('year')['labor_prod'].mean()
productivity_trend.head()
# Optional: Quick visualization to check the trend
# productivity_trend.plot(title="Global Labor Productivity Over Time")--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[17], line 2 1 # Labor Productivity over time ----> 2 productivity_trend = df.groupby('year')['labor_prod'].mean() 3 productivity_trend.head() 4 # Optional: Quick visualization to check the trend 5 # productivity_trend.plot(title="Global Labor Productivity Over Time") NameError: name 'df' is not defined
# Using .agg() with a list of functions for a single column
country_stats = df.groupby('Country')['hc'].agg(['mean', 'min', 'max']).reset_index()
country_stats.head()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[18], line 2 1 # Using .agg() with a list of functions for a single column ----> 2 country_stats = df.groupby('Country')['hc'].agg(['mean', 'min', 'max']).reset_index() 3 country_stats.head() NameError: name 'df' is not defined
In this exercise, you will use matplotlib to create figures that tell a story about global development and productivity.
Your Tasks:
plt.yscale('log').# 1. Calculate the average human capital for all countries by year
hc_trend = df.groupby('year')['hc'].mean()
hc_trend = hc_trend.reset_index()
# 2. Create the figure
# Initialize figure and set a figsize
plt.figure(figsize=(8, 5))
# plot year on x-axis and 'hc' on y axis
plt.plot(hc_trend['year'], hc_trend['hc'])
# Set a title and labels
plt.title("Global Average Human Capital Index (2000–2023)")
plt.xlabel("Year")
plt.ylabel("Average Human Capital Index")
# Add a grid
plt.grid(True)
# Show the figure
plt.show()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[20], line 2 1 # 1. Calculate the average human capital for all countries by year ----> 2 hc_trend = df.groupby('year')['hc'].mean() 3 hc_trend = hc_trend.reset_index() 5 # 2. Create the figure 6 7 # Initialize figure and set a figsize NameError: name 'df' is not defined
# Filter your data for 2023
df_2023 = df[df['year'] == 2023]
# Create the scatterplot
plt.figure(figsize=(8, 6))
plt.scatter(df_2023['hc'], df_2023['gdp_pc'])
plt.title("Human Capital vs. GDP per Capita (2023)")
plt.xlabel("Human Capital Index")
plt.ylabel("GDP per Capita (USD)")
plt.show()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[21], line 2 1 # Filter your data for 2023 ----> 2 df_2023 = df[df['year'] == 2023] 5 # Create the scatterplot 6 plt.figure(figsize=(8, 6)) NameError: name 'df' is not defined
# Create the same plot with a log scale for the y-axis
plt.figure(figsize=(8, 6))
plt.scatter(df_2023['hc'], df_2023['gdp_pc'])
plt.yscale('log') # Log scale for GDP is standard in macroeconomics
plt.title("Human Capital vs. GDP per Capita (2023)")
plt.xlabel("Human Capital Index")
plt.ylabel("GDP per Capita (USD, Log Scale)")
plt.show()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[22], line 2 1 # Create the same plot with a log scale for the y-axis ----> 2 plt.figure(figsize=(8, 6)) 3 plt.scatter(df_2023['hc'], df_2023['gdp_pc']) 4 plt.yscale('log') # Log scale for GDP is standard in macroeconomics NameError: name 'plt' is not defined
# Filter the 2023 data from previously for the countries, and sort values
selection = ['United States', 'China', 'India', 'Germany','Switzerland', 'Brazil', 'Nigeria']
df_sel = df_2023[df_2023['Country'].isin(selection)]
df_sel = df_sel.sort_values('ctfp')
# Create horizontal barchart
plt.figure(figsize=(8, 5))
plt.barh(df_sel['Country'], df_sel['ctfp'])
plt.title("Total Factor Productivity Levels in 2023 (USA = 1)")
plt.xlabel("TFP Level (Relative to USA)")
plt.show()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[23], line 3 1 # Filter the 2023 data from previously for the countries, and sort values 2 selection = ['United States', 'China', 'India', 'Germany','Switzerland', 'Brazil', 'Nigeria'] ----> 3 df_sel = df_2023[df_2023['Country'].isin(selection)] 4 df_sel = df_sel.sort_values('ctfp') 6 # Create horizontal barchart NameError: name 'df_2023' is not defined
plt.figure(figsize=(8, 5))
plt.hist(df['labsh'].dropna(), bins=30, edgecolor='white')
plt.title("Distribution of Labor Shares Across All Observations")
plt.xlabel("Labor Share of GDP")
plt.ylabel("Frequency")
plt.show()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[24], line 1 ----> 1 plt.figure(figsize=(8, 5)) 2 plt.hist(df['labsh'].dropna(), bins=30, edgecolor='white') 3 plt.title("Distribution of Labor Shares Across All Observations") NameError: name 'plt' is not defined
In this independent deep dive, you will apply the ‘Solow Residual’ method to decompose GDP growth into its fundamental drivers—capital, labor, and technology—transforming a core macroeconomic theory into a practical data analysis.
In this exercise, we assume a Cobb-Douglas production function:
\[Y_t = A_t K_t^{\alpha}L_t^{1-\alpha}\]
Where \(Y\) is output, \(K\) is capital, \(L\) is labor, and \(A\) is technology (TFP). Our goal is to calculate the growth rate of \(A\). We want to understand how much of a country’s GDP growth comes from the three components (i) labor, (ii) capital and (iii) technological progress.
ga_df that includes only the United States and Switzerland# 1. Filter and Sort
countries = ['United States', 'Switzerland']
ga_df = df[df['Country'].isin(countries)].copy()
ga_df = ga_df.sort_values(['Country', 'year'])
ga_df.head()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[25], line 3 1 # 1. Filter and Sort 2 countries = ['United States', 'Switzerland'] ----> 3 ga_df = df[df['Country'].isin(countries)].copy() 4 ga_df = ga_df.sort_values(['Country', 'year']) 5 ga_df.head() NameError: name 'df' is not defined
g_y: GDP growth (growth rate of rgdpna)g_l: Labor growth (growth rate of emp)g_k: Capital growth (growth rate of rnna)You can calculate growth rates with .pct_change(). It will calculate the percentage change of a value and the previous value in your dataframe. This is why sorting is important, we want to calculate the change from one year to the next.
Important: We don’t want to accidentally compare the last value of switzerland to the first value of the United States. To avoid this you should group by ‘Country’ before you calculate the growth rates.
ga_df = ga_df.sort_values(['Country', 'year'])
ga_df['g_y'] = ga_df.groupby('Country')['rgdpna'].pct_change()
ga_df['g_k'] = ga_df.groupby('Country')['rnna'].pct_change()
ga_df['g_l'] = ga_df.groupby('Country')['emp'].pct_change()
ga_df.head()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[26], line 1 ----> 1 ga_df = ga_df.sort_values(['Country', 'year']) 2 ga_df['g_y'] = ga_df.groupby('Country')['rgdpna'].pct_change() 3 ga_df['g_k'] = ga_df.groupby('Country')['rnna'].pct_change() NameError: name 'ga_df' is not defined
A crucial parameter in the production function is the capital share \(\alpha\). We assume that this is constant.
capital_share_annual, which is 1 minus the labor share labshalpha_che for switzerland, alpha_usa for the United States.locSo far, you have used .loc to find specific data (e.g., df.loc[row, column]). However, .loc is also one of the most powerful tools for creating or updating variables based on a condition. When you use the assignment operator (=) with .loc, you are telling Python: “Find all the rows that meet my condition, look at this specific column, and write this value there.”
# create variable cap_share_annual
ga_df['cap_share_annual'] = 1 - ga_df['labsh']
# Calculate the average alpha for each country
alpha_usa = ga_df[ga_df['Country'] == 'United States']['cap_share_annual'].mean()
alpha_che = ga_df[ga_df['Country']=='Switzerland']['cap_share_annual'].mean()
# Assign the average alpha back to the dataframe
ga_df.loc[ga_df['Country'] == 'United States', 'alpha'] = alpha_usa
ga_df.loc[ga_df['Country'] == 'Switzerland', 'alpha'] = alpha_che
ga_df.head()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[27], line 2 1 # create variable cap_share_annual ----> 2 ga_df['cap_share_annual'] = 1 - ga_df['labsh'] 4 # Calculate the average alpha for each country 5 alpha_usa = ga_df[ga_df['Country'] == 'United States']['cap_share_annual'].mean() NameError: name 'ga_df' is not defined
We want to calculate TFP growth (i.e. the growth of \(A\) in the production function).
\[g_A = g_Y - [\alpha g_K+(1−\alpha)g_L]\]
g_lg_kg_yalphaIf you want to derive this formula: apply \(ln\) to the production function, then use the fact that the change in the natural log of a variable is approximately equal to its growth rate (g)
# 4. Calculate TFP Growth
ga_df['g_tfp'] = ga_df['g_y'] - (ga_df['alpha'] * ga_df['g_k'] + (1 - ga_df['alpha']) * ga_df['g_l'])
ga_df.head()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[28], line 2 1 # 4. Calculate TFP Growth ----> 2 ga_df['g_tfp'] = ga_df['g_y'] - (ga_df['alpha'] * ga_df['g_k'] + (1 - ga_df['alpha']) * ga_df['g_l']) 3 ga_df.head() NameError: name 'ga_df' is not defined
First we want to create a summary table
summary with the mean of the growth variables and alpha ['g_y', 'g_k', 'g_l', 'alpha', 'g_tfp'] by countryg_k) and Labor (g_l) don’t tell the whole story. We must multiply them by their respective “shares” in the economy to see how much they actually contributed to total GDP growth:
contrib_k: The contribution of capital which is \(\alpha g_K\)contrib_l: The contribution of labor which is \((1-\alpha)g_L\)# 1. Average the components
summary = ga_df.groupby('Country')[['g_y', 'g_k', 'g_l', 'alpha', 'g_tfp']].mean()
# 2. Calculate the final contributions
summary['contrib_k'] = summary['alpha'] * summary['g_k']
summary['contrib_l'] = (1 - summary['alpha']) * summary['g_l']
# 3. Show the summary table
summary = summary.reset_index()
summary--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[29], line 2 1 # 1. Average the components ----> 2 summary = ga_df.groupby('Country')[['g_y', 'g_k', 'g_l', 'alpha', 'g_tfp']].mean() 4 # 2. Calculate the final contributions 5 summary['contrib_k'] = summary['alpha'] * summary['g_k'] NameError: name 'ga_df' is not defined
Then we create a stacked barchart to visualize our results.
The code below creates a figure that decomposes average annual GDP growth into the specific contributions of capital accumulation, labor input, and Total Factor Productivity (TFP) to illustrate the underlying drivers of economic growth for each country.
We haven’t covered “stacked” charts in class, deduce how this works by looking at the three plt.bar() calls.
# 1. Setup the data from the columns
countries = summary['Country']
tfp_part = summary['g_tfp']
k_part = summary['contrib_k']
l_part = summary['contrib_l']
# 2. Create the figure
plt.figure(figsize=(8, 4.5))
# Plot Layer 1: TFP (The base)
plt.bar(countries, tfp_part, label='TFP Growth')
# Plot Layer 2: Capital (Stacked on top of TFP)
plt.bar(countries, k_part, bottom=tfp_part, label='Capital Contribution')
# Plot Layer 3: Labor (Stacked on top of TFP + Capital)
plt.bar(countries, l_part, bottom=tfp_part + k_part, label='Labor Contribution')
# 3. Add styling
plt.ylabel("Average Annual Growth Rate")
plt.title("Growth Accounting Decomposition")
plt.legend()
plt.tight_layout()
plt.show()--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[30], line 2 1 # 1. Setup the data from the columns ----> 2 countries = summary['Country'] 3 tfp_part = summary['g_tfp'] 4 k_part = summary['contrib_k'] NameError: name 'summary' is not defined
We call plt.bar() three separate times. Each call tells Matplotlib to draw a set of bars on the same figure.
The bottom Argument: This is the secret to “stacking.” By default, Matplotlib starts every bar at zero.
bottom=tfp_part. This tells Python: “Don’t start the Capital bars at zero; start them at the height where the TFP bars ended.”bottom=tfp_part + k_part. This stacks the Labor bars on top of both the TFP and Capital layers combined.Labels and Legend:
plt.bar() function, we provide a label (e.g., ‘TFP Growth’).plt.legend() at the end. This command looks at all the labels we’ve defined and creates the “Key” or “Legend” in the corner of the chart so the reader knows which color represents which economic component.