Exercise 5: Pandas

Author

Franziska Bender

Published

March 14, 2026

In this exercise, we move from our simplified “teaching” dataset to a broader download from the Penn World Table (PWT). Download the data pwt_data_ex5.csv and save it in a folder data. This version contains more countries and a wider range of macroeconomic variables:

Variable Full Name / Definition
iso3 3-letter ISO Country Code
Country Full Country Name
year Observation Year
pop Population (in millions)
rgdpna Real GDP at constant 2021 national prices (in mil. 2021USD)
rnna Capital stock at constant 2021 national prices (in mil. 2021USD)
emp Number of persons engaged (in millions)
avh Average annual hours worked by persons engaged
hc Human capital index (based on years of schooling)
ctfp TFP level at current PPPs (USA = 1)
labsh Share of labor compensation in GDP
pl_con Price level of household consumption

Import Packages

import pandas as pd
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 1
----> 1 import pandas as pd

ModuleNotFoundError: No module named 'pandas'

Exercise 1: Data Loading and Initial Inspection

In this exercise, you will practice loading a comprehensive macroeconomic dataset into a DataFrame and performing an initial inspection to understand its dimensionality, variable types, and the unique entities it contains.

Your Tasks:

  1. Load the data: Read the file data/pwt_data_ex5.csv into a DataFrame named df.
  2. Size Check: How many observations (rows) and variables (columns) does this dataset have?
  3. Variable Types: Use a method to see which columns are categorical (strings) and which are numeric.
  4. Country Count: How many unique countries are represented in this dataset?
  5. Summary Statistics: Display the mean and standard deviation for the numeric variables.
# 1. Load the data
df = pd.read_csv("data/pwt_data_ex5.csv")
df.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[2], line 2
      1 # 1. Load the data
----> 2 df = pd.read_csv("data/pwt_data_ex5.csv")
      3 df.head()

NameError: name 'pd' is not defined

You’ll notice immediately the NaN entries in the dataframe. In pandas, NaN stands for “Not a Number” and is commonly used to represent missing or undefined data in a Series or DataFrame.

For this exercise we’re going to ignore them. Handling missing data is something we’ll cover in week 6.

# 2. Size Check
print(f"Dataset Shape: {df.shape}")
# df.shape returns (rows, columns)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[3], line 2
      1 # 2. Size Check
----> 2 print(f"Dataset Shape: {df.shape}")
      3 # df.shape returns (rows, columns)

NameError: name 'df' is not defined
# 3. Variable Types and Non-Null counts
# .info() is perfect for seeing Dtypes and identifying missing data at a glance
df.info()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[4], line 3
      1 # 3. Variable Types and Non-Null counts
      2 # .info() is perfect for seeing Dtypes and identifying missing data at a glance
----> 3 df.info()

NameError: name 'df' is not defined
# 4. Country Count
# We use .nunique() on the 'Country' column to count unique entries
num_countries = df['Country'].nunique()
print(f"There are {num_countries} countries in this dataset.")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[5], line 3
      1 # 4. Country Count
      2 # We use .nunique() on the 'Country' column to count unique entries
----> 3 num_countries = df['Country'].nunique()
      4 print(f"There are {num_countries} countries in this dataset.")

NameError: name 'df' is not defined
# 5. Summary Statistics
# .describe() gives us the mean, std, and quartiles for numeric columns
df.describe()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[6], line 3
      1 # 5. Summary Statistics
      2 # .describe() gives us the mean, std, and quartiles for numeric columns
----> 3 df.describe()

NameError: name 'df' is not defined

Exercise 2: Basic Filtering

In this exercise, you will practice extracting specific subsets of data from your main DataFrame df.

Your Tasks:

  1. The Recent Slice: Create a new DataFrame called df_2023 that contains only observations for the year 2023.
  2. The Country Focus: Create a new DataFrame called df_switzerland that contains all years of data for Switzerland.
  3. High-Population Economies: Filter the dataset to find all observations where the population ('pop') was greater than 1,000 million (1 billion), which countries remain?
  4. Specific Comparison: Create a DataFrame called df_comparison that contains only the data for Germany and France for the years 2000 to 2010 (inclusive). Hint: You will need to use the & operator and parentheses for multiple conditions.
# 1. The Recent Slice
df_2023 = df[df['year'] == 2023]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[7], line 2
      1 # 1. The Recent Slice
----> 2 df_2023 = df[df['year'] == 2023]

NameError: name 'df' is not defined
# 2. The Country Focus
df_switzerland = df[df['Country'] == 'Switzerland']
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[8], line 2
      1 # 2. The Country Focus
----> 2 df_switzerland = df[df['Country'] == 'Switzerland']

NameError: name 'df' is not defined
# 3. High-Population Economies
high_pop = df[df['pop'] > 1000]
high_pop['Country'].unique()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[9], line 2
      1 # 3. High-Population Economies
----> 2 high_pop = df[df['pop'] > 1000]
      3 high_pop['Country'].unique()

NameError: name 'df' is not defined
# 4. Specific Comparison
# We use .isin() for the countries and logical operators for the years
countries = ['Germany', 'France']
mask = (df['Country'].isin(countries)) & (df['year'] >= 2000) & (df['year'] <= 2010)
df_comparison = df[mask]
print(f"Size of comparison group: {len(df_comparison)}")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[10], line 4
      1 # 4. Specific Comparison
      2 # We use .isin() for the countries and logical operators for the years
      3 countries = ['Germany', 'France']
----> 4 mask = (df['Country'].isin(countries)) & (df['year'] >= 2000) & (df['year'] <= 2010)
      5 df_comparison = df[mask]
      6 print(f"Size of comparison group: {len(df_comparison)}")

NameError: name 'df' is not defined

Exercise 3: Creating New Variables

In this exercise, you will create derived variables to help analyze productivity and living standards across different countries.

Your Tasks:

  1. GDP per Capita: Create a new column called gdp_pc by dividing Real GDP (rgdpna) by Population (pop).

  2. Capital Intensity: Create a column called k_labor which represents the amount of Capital Stock (rnna) available per person engaged (emp).

  3. Labor Productivity: Create a column called labor_prod by dividing Real GDP (rgdpna) by the total number of hours worked (emp x avh).

    Note: avh has many missing values, which will result in NaN for those specific rows in your new column. We’ll learn how to deal with missing values next week.

  4. Display Results: Show the first 5 rows of the newly created variables, country and year.

# 1. GDP per Capita
# Formula: Real GDP / Population
df['gdp_pc'] = df['rgdpna'] / df['pop']
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[11], line 3
      1 # 1. GDP per Capita
      2 # Formula: Real GDP / Population
----> 3 df['gdp_pc'] = df['rgdpna'] / df['pop']

NameError: name 'df' is not defined
# 2. Capital Intensity (Capital per worker)
# Formula: Capital Stock / Employment
df['k_labor'] = df['rnna'] / df['emp']
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[12], line 3
      1 # 2. Capital Intensity (Capital per worker)
      2 # Formula: Capital Stock / Employment
----> 3 df['k_labor'] = df['rnna'] / df['emp']

NameError: name 'df' is not defined
# 3. Labor Productivity (Output per hour)
# We multiply employment by average hours to get total labor hours
df['labor_prod'] = df['rgdpna'] / (df['emp'] * df['avh'])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[13], line 3
      1 # 3. Labor Productivity (Output per hour)
      2 # We multiply employment by average hours to get total labor hours
----> 3 df['labor_prod'] = df['rgdpna'] / (df['emp'] * df['avh'])

NameError: name 'df' is not defined
# Display the first few rows to verify the new columns
df[['Country', 'year', 'gdp_pc', 'k_labor', 'labor_prod']].head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[14], line 2
      1 # Display the first few rows to verify the new columns
----> 2 df[['Country', 'year', 'gdp_pc', 'k_labor', 'labor_prod']].head()

NameError: name 'df' is not defined

Exercise 4: Aggregation and Grouping

In this exercise, you will practice summarizing your data. Instead of looking at individual rows, we want to understand broader trends by country or by year.

Your Tasks:

  1. Global Trends: Calculate the total world population and the average TFP level for every year in the dataset.
  2. Country Profiles: For each country, find their maximum GDP per capita ('gdp_pc') and their average labor share ('labsh') across all available years.
  3. Productivity Trend: Group the data by year and calculate the mean Labor Productivity ('labor_prod').
  4. Summary Table: Create a new DataFrame called country_stats that shows the mean, min, and max of the Human Capital index ('hc') for every country.
# We group by 'year' and sum population, then average TFP
global_trends = df.groupby('year').agg({
    'pop': 'sum',
    'ctfp': 'mean'
}).reset_index()
global_trends.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 2
      1 # We group by 'year' and sum population, then average TFP
----> 2 global_trends = df.groupby('year').agg({
      3     'pop': 'sum',
      4     'ctfp': 'mean'
      5 }).reset_index()
      6 global_trends.head()

NameError: name 'df' is not defined
# Find the peak wealth and typical labor share for each nation
country_profiles = df.groupby('Country').agg({
    'gdp_pc': 'max',
    'labsh': 'mean'
}).reset_index()
country_profiles.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[16], line 2
      1 # Find the peak wealth and typical labor share for each nation
----> 2 country_profiles = df.groupby('Country').agg({
      3     'gdp_pc': 'max',
      4     'labsh': 'mean'
      5 }).reset_index()
      6 country_profiles.head()

NameError: name 'df' is not defined
# Labor Productivity over time
productivity_trend = df.groupby('year')['labor_prod'].mean()
productivity_trend.head()
# Optional: Quick visualization to check the trend
# productivity_trend.plot(title="Global Labor Productivity Over Time")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[17], line 2
      1 # Labor Productivity over time
----> 2 productivity_trend = df.groupby('year')['labor_prod'].mean()
      3 productivity_trend.head()
      4 # Optional: Quick visualization to check the trend
      5 # productivity_trend.plot(title="Global Labor Productivity Over Time")

NameError: name 'df' is not defined
# Using .agg() with a list of functions for a single column
country_stats = df.groupby('Country')['hc'].agg(['mean', 'min', 'max']).reset_index()
country_stats.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[18], line 2
      1 # Using .agg() with a list of functions for a single column
----> 2 country_stats = df.groupby('Country')['hc'].agg(['mean', 'min', 'max']).reset_index()
      3 country_stats.head()

NameError: name 'df' is not defined

Exercise 5: Visualization

In this exercise, you will use matplotlib to create figures that tell a story about global development and productivity.

Your Tasks:

  1. Import matplotlib
  2. The Rise of Human Capital: Create a line plot showing the average human capital (hc) for all countries over time. Requirement: Add a title, label your axes, and include a grid.
  3. Productivity vs. Wealth: Create a scatter plot for the year 2023 where the x-axis is hc (Human Capital) and the y-axis is gdp_pc (GDP per capita).
    Create the same figure again but use plt.yscale('log').
  4. The TFP Frontier: Use a horizontal bar chart (plt.barh) to show the ctfp (Total Factor Productivity) for a selection of countries in 2019: United States, China, India, Germany, Switzerland, Brazil, and Nigeria.
  5. Distribution of Labor Shares: Create a histogram of the labsh (Labor Share) variable for the entire dataset. Use 30 bins to see the shape of the distribution.
import matplotlib.pyplot as plt
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[19], line 1
----> 1 import matplotlib.pyplot as plt

ModuleNotFoundError: No module named 'matplotlib'
# 1. Calculate the average human capital for all countries by year
hc_trend = df.groupby('year')['hc'].mean()
hc_trend = hc_trend.reset_index()

# 2. Create the figure

# Initialize figure and set a figsize
plt.figure(figsize=(8, 5))
# plot year on x-axis and 'hc' on y axis
plt.plot(hc_trend['year'], hc_trend['hc'])
# Set a title and labels
plt.title("Global Average Human Capital Index (2000–2023)")
plt.xlabel("Year")
plt.ylabel("Average Human Capital Index")
# Add a grid
plt.grid(True)
# Show the figure
plt.show()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[20], line 2
      1 # 1. Calculate the average human capital for all countries by year
----> 2 hc_trend = df.groupby('year')['hc'].mean()
      3 hc_trend = hc_trend.reset_index()
      5 # 2. Create the figure
      6 
      7 # Initialize figure and set a figsize

NameError: name 'df' is not defined
# Filter your data for 2023
df_2023 = df[df['year'] == 2023]


# Create the scatterplot 
plt.figure(figsize=(8, 6))
plt.scatter(df_2023['hc'], df_2023['gdp_pc'])
plt.title("Human Capital vs. GDP per Capita (2023)")
plt.xlabel("Human Capital Index")
plt.ylabel("GDP per Capita (USD)")
plt.show()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[21], line 2
      1 # Filter your data for 2023
----> 2 df_2023 = df[df['year'] == 2023]
      5 # Create the scatterplot 
      6 plt.figure(figsize=(8, 6))

NameError: name 'df' is not defined
# Create the same plot with a log scale for the y-axis
plt.figure(figsize=(8, 6))
plt.scatter(df_2023['hc'], df_2023['gdp_pc'])
plt.yscale('log') # Log scale for GDP is standard in macroeconomics
plt.title("Human Capital vs. GDP per Capita (2023)")
plt.xlabel("Human Capital Index")
plt.ylabel("GDP per Capita (USD, Log Scale)")
plt.show()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[22], line 2
      1 # Create the same plot with a log scale for the y-axis
----> 2 plt.figure(figsize=(8, 6))
      3 plt.scatter(df_2023['hc'], df_2023['gdp_pc'])
      4 plt.yscale('log') # Log scale for GDP is standard in macroeconomics

NameError: name 'plt' is not defined
# Filter the 2023 data from previously for the countries, and sort values
selection = ['United States', 'China', 'India', 'Germany','Switzerland', 'Brazil', 'Nigeria']
df_sel = df_2023[df_2023['Country'].isin(selection)]
df_sel = df_sel.sort_values('ctfp')

# Create horizontal barchart
plt.figure(figsize=(8, 5))
plt.barh(df_sel['Country'], df_sel['ctfp'])
plt.title("Total Factor Productivity Levels in 2023 (USA = 1)")
plt.xlabel("TFP Level (Relative to USA)")
plt.show()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[23], line 3
      1 # Filter the 2023 data from previously for the countries, and sort values
      2 selection = ['United States', 'China', 'India', 'Germany','Switzerland', 'Brazil', 'Nigeria']
----> 3 df_sel = df_2023[df_2023['Country'].isin(selection)]
      4 df_sel = df_sel.sort_values('ctfp')
      6 # Create horizontal barchart

NameError: name 'df_2023' is not defined
plt.figure(figsize=(8, 5))
plt.hist(df['labsh'].dropna(), bins=30, edgecolor='white')
plt.title("Distribution of Labor Shares Across All Observations")
plt.xlabel("Labor Share of GDP")
plt.ylabel("Frequency")
plt.show()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[24], line 1
----> 1 plt.figure(figsize=(8, 5))
      2 plt.hist(df['labsh'].dropna(), bins=30, edgecolor='white')
      3 plt.title("Distribution of Labor Shares Across All Observations")

NameError: name 'plt' is not defined

Exercise 6: Growth Accounting - Self-Study

In this independent deep dive, you will apply the ‘Solow Residual’ method to decompose GDP growth into its fundamental drivers—capital, labor, and technology—transforming a core macroeconomic theory into a practical data analysis.

In this exercise, we assume a Cobb-Douglas production function:

\[Y_t = A_t K_t^{\alpha}L_t^{1-\alpha}\]

Where \(Y\) is output, \(K\) is capital, \(L\) is labor, and \(A\) is technology (TFP). Our goal is to calculate the growth rate of \(A\). We want to understand how much of a country’s GDP growth comes from the three components (i) labor, (ii) capital and (iii) technological progress.

1. Preparation

  • Create a new dataframe ga_df that includes only the United States and Switzerland
  • Sort values by country and year (important for growth calculations later)
# 1. Filter and Sort 
countries = ['United States', 'Switzerland']
ga_df = df[df['Country'].isin(countries)].copy()
ga_df = ga_df.sort_values(['Country', 'year'])
ga_df.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[25], line 3
      1 # 1. Filter and Sort 
      2 countries = ['United States', 'Switzerland']
----> 3 ga_df = df[df['Country'].isin(countries)].copy()
      4 ga_df = ga_df.sort_values(['Country', 'year'])
      5 ga_df.head()

NameError: name 'df' is not defined

2. Calculate Growth Rates

  • Create new variables:
    • g_y: GDP growth (growth rate of rgdpna)
    • g_l: Labor growth (growth rate of emp)
    • g_k: Capital growth (growth rate of rnna)

You can calculate growth rates with .pct_change(). It will calculate the percentage change of a value and the previous value in your dataframe. This is why sorting is important, we want to calculate the change from one year to the next.

Important: We don’t want to accidentally compare the last value of switzerland to the first value of the United States. To avoid this you should group by ‘Country’ before you calculate the growth rates.

ga_df = ga_df.sort_values(['Country', 'year'])
ga_df['g_y'] = ga_df.groupby('Country')['rgdpna'].pct_change()
ga_df['g_k'] = ga_df.groupby('Country')['rnna'].pct_change()
ga_df['g_l'] = ga_df.groupby('Country')['emp'].pct_change()
ga_df.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[26], line 1
----> 1 ga_df = ga_df.sort_values(['Country', 'year'])
      2 ga_df['g_y'] = ga_df.groupby('Country')['rgdpna'].pct_change()
      3 ga_df['g_k'] = ga_df.groupby('Country')['rnna'].pct_change()

NameError: name 'ga_df' is not defined

3. Define Capital Share (\(\alpha\)):

A crucial parameter in the production function is the capital share \(\alpha\). We assume that this is constant.

  • Create a new variable capital_share_annual, which is 1 minus the labor share labsh
  • Calculate the average capital share for both countries, alpha_che for switzerland, alpha_usa for the United States
  • Assign the average capital share back to the dataframe. Use .loc

So far, you have used .loc to find specific data (e.g., df.loc[row, column]). However, .loc is also one of the most powerful tools for creating or updating variables based on a condition. When you use the assignment operator (=) with .loc, you are telling Python: “Find all the rows that meet my condition, look at this specific column, and write this value there.”

# Example 
#            [1. The Rows]               [2. The Column]    [3. The Value]
ga_df.loc[df['Country'] == 'United States', 'alpha']        =  0.1
  1. The Rows (The Filter): This identifies which rows you want to target (e.g., only the ones where the country is the United States).
  2. The Column (The Target): This is the name of the variable. If the column ‘alpha’ doesn’t exist yet, Pandas will create it for you automatically.
  3. The Value (The Assignment): This is the data you want to “paste” into those specific rows.
# create variable cap_share_annual
ga_df['cap_share_annual'] = 1 - ga_df['labsh']

# Calculate the average alpha for each country
alpha_usa = ga_df[ga_df['Country'] == 'United States']['cap_share_annual'].mean()
alpha_che = ga_df[ga_df['Country']=='Switzerland']['cap_share_annual'].mean()

# Assign the average alpha back to the dataframe
ga_df.loc[ga_df['Country'] == 'United States', 'alpha'] = alpha_usa
ga_df.loc[ga_df['Country'] == 'Switzerland', 'alpha'] = alpha_che
ga_df.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[27], line 2
      1 # create variable cap_share_annual
----> 2 ga_df['cap_share_annual'] = 1 - ga_df['labsh']
      4 # Calculate the average alpha for each country
      5 alpha_usa = ga_df[ga_df['Country'] == 'United States']['cap_share_annual'].mean()

NameError: name 'ga_df' is not defined

4. The Solow Residual:

We want to calculate TFP growth (i.e. the growth of \(A\) in the production function).

  • Calculate TFP growth using the following formula:

\[g_A​ = g_Y​ - [\alpha g_K​+(1−\alpha)g_L]​\]

  • \(g_L\): growth of labour g_l
  • \(g_K\): growth of capital g_k
  • \(g_Y\): growth of GDP g_y
  • \(\alpha\): capital share alpha

If you want to derive this formula: apply \(ln\) to the production function, then use the fact that the change in the natural log of a variable is approximately equal to its growth rate (g)

# 4. Calculate TFP Growth
ga_df['g_tfp'] = ga_df['g_y'] - (ga_df['alpha'] * ga_df['g_k'] + (1 - ga_df['alpha']) * ga_df['g_l'])
ga_df.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[28], line 2
      1 # 4. Calculate TFP Growth
----> 2 ga_df['g_tfp'] = ga_df['g_y'] - (ga_df['alpha'] * ga_df['g_k'] + (1 - ga_df['alpha']) * ga_df['g_l'])
      3 ga_df.head()

NameError: name 'ga_df' is not defined

5. Summarize and Plot

First we want to create a summary table

  • Create a table summary with the mean of the growth variables and alpha ['g_y', 'g_k', 'g_l', 'alpha', 'g_tfp'] by country
  • The raw growth rates of Capital (g_k) and Labor (g_l) don’t tell the whole story. We must multiply them by their respective “shares” in the economy to see how much they actually contributed to total GDP growth:
    • Create a variable contrib_k: The contribution of capital which is \(\alpha g_K\)
    • Create a variable contrib_l: The contribution of labor which is \((1-\alpha)g_L\)
# 1. Average the components 
summary = ga_df.groupby('Country')[['g_y', 'g_k', 'g_l', 'alpha', 'g_tfp']].mean()

# 2. Calculate the final contributions
summary['contrib_k'] = summary['alpha'] * summary['g_k']
summary['contrib_l'] = (1 - summary['alpha']) * summary['g_l']

# 3. Show the summary table
summary = summary.reset_index()
summary
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[29], line 2
      1 # 1. Average the components 
----> 2 summary = ga_df.groupby('Country')[['g_y', 'g_k', 'g_l', 'alpha', 'g_tfp']].mean()
      4 # 2. Calculate the final contributions
      5 summary['contrib_k'] = summary['alpha'] * summary['g_k']

NameError: name 'ga_df' is not defined

Then we create a stacked barchart to visualize our results.

The code below creates a figure that decomposes average annual GDP growth into the specific contributions of capital accumulation, labor input, and Total Factor Productivity (TFP) to illustrate the underlying drivers of economic growth for each country.

We haven’t covered “stacked” charts in class, deduce how this works by looking at the three plt.bar() calls.

# 1. Setup the data from the columns
countries = summary['Country']
tfp_part = summary['g_tfp']
k_part = summary['contrib_k']
l_part = summary['contrib_l']

# 2. Create the figure
plt.figure(figsize=(8, 4.5))

# Plot Layer 1: TFP (The base)
plt.bar(countries, tfp_part, label='TFP Growth')

# Plot Layer 2: Capital (Stacked on top of TFP)
plt.bar(countries, k_part, bottom=tfp_part, label='Capital Contribution')

# Plot Layer 3: Labor (Stacked on top of TFP + Capital)
plt.bar(countries, l_part, bottom=tfp_part + k_part, label='Labor Contribution')

# 3. Add styling
plt.ylabel("Average Annual Growth Rate")
plt.title("Growth Accounting Decomposition")
plt.legend()

plt.tight_layout()
plt.show()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[30], line 2
      1 # 1. Setup the data from the columns
----> 2 countries = summary['Country']
      3 tfp_part = summary['g_tfp']
      4 k_part = summary['contrib_k']

NameError: name 'summary' is not defined

We call plt.bar() three separate times. Each call tells Matplotlib to draw a set of bars on the same figure.

  • The first call draws the TFP bars.
  • The second call draws the Capital bars.
  • The third call draws the Labor bars.

The bottom Argument: This is the secret to “stacking.” By default, Matplotlib starts every bar at zero.

  • In the second call, we set bottom=tfp_part. This tells Python: “Don’t start the Capital bars at zero; start them at the height where the TFP bars ended.”
  • In the third call, we set bottom=tfp_part + k_part. This stacks the Labor bars on top of both the TFP and Capital layers combined.

Labels and Legend:

  • Inside each plt.bar() function, we provide a label (e.g., ‘TFP Growth’).
  • However, these labels won’t show up on the chart by themselves. We must call plt.legend() at the end. This command looks at all the labels we’ve defined and creates the “Key” or “Legend” in the corner of the chart so the reader knows which color represents which economic component.