Exercise 5: Pandas

Author

Franziska Bender

Published

March 14, 2026

In this exercise, we move from our simplified “teaching” dataset to a broader download from the Penn World Table (PWT). Download the file pwt_data_ex5.csv and save it in a folder named data. This version contains more countries and a wider range of macroeconomic variables:

Variable   Full Name / Definition
iso3       3-letter ISO Country Code
Country    Full Country Name
year       Observation Year
pop        Population (in millions)
rgdpna     Real GDP at constant 2021 national prices (in millions of 2021 USD)
rnna       Capital stock at constant 2021 national prices (in millions of 2021 USD)
emp        Number of persons engaged (in millions)
avh        Average annual hours worked by persons engaged
hc         Human capital index (based on years of schooling)
ctfp       TFP level at current PPPs (USA = 1)
labsh      Share of labor compensation in GDP
pl_con     Price level of household consumption

Import Packages

import pandas as pd

Exercise 1: Data Loading and Initial Inspection

In this exercise, you will practice loading a comprehensive macroeconomic dataset into a DataFrame and performing an initial inspection to understand its dimensionality, variable types, and the unique entities it contains.

Your Tasks:

  1. Load the data: Read the file data/pwt_data_ex5.csv into a DataFrame named df.
  2. Size Check: How many observations (rows) and variables (columns) does this dataset have?
  3. Variable Types: Use a method to see which columns are categorical (strings) and which are numeric.
  4. Country Count: How many unique countries are represented in this dataset?
  5. Summary Statistics: Display the mean and standard deviation for the numeric variables.
# 1. Load the data
df = pd.read_csv("data/pwt_data_ex5.csv")
df.head()
iso3 Country year avh ctfp emp hc labsh pl_con pop rgdpna rnna
0 ABW Aruba 2000 NaN NaN 0.041918 NaN 0.645106 0.503648 0.088761 3982.696289 11437.366211
1 ABW Aruba 2001 NaN NaN 0.042579 NaN 0.645106 0.518854 0.090305 3918.971436 11965.093750
2 ABW Aruba 2002 NaN NaN 0.043016 NaN 0.645106 0.532409 0.091379 3923.781006 12594.469727
3 ABW Aruba 2003 NaN NaN 0.043385 NaN 0.645106 0.538941 0.092310 3944.353027 13318.708008
4 ABW Aruba 2004 NaN NaN 0.043739 NaN 0.645106 0.552747 0.093213 4243.611328 14101.028320

You’ll immediately notice the NaN entries in the DataFrame. In pandas, NaN stands for “Not a Number” and is commonly used to represent missing or undefined data in a Series or DataFrame.

For this exercise we’re going to ignore them. Handling missing data is something we’ll cover in week 6.

# 2. Size Check
print(f"Dataset Shape: {df.shape}")
# df.shape returns (rows, columns)
Dataset Shape: (4422, 12)
# 3. Variable Types and Non-Null counts
# .info() is perfect for seeing Dtypes and identifying missing data at a glance
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4422 entries, 0 to 4421
Data columns (total 12 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   iso3     4422 non-null   object 
 1   Country  4422 non-null   object 
 2   year     4422 non-null   int64  
 3   avh      2812 non-null   float64
 4   ctfp     2880 non-null   float64
 5   emp      4302 non-null   float64
 6   hc       3480 non-null   float64
 7   labsh    3350 non-null   float64
 8   pl_con   4422 non-null   float64
 9   pop      4422 non-null   float64
 10  rgdpna   4422 non-null   float64
 11  rnna     4320 non-null   float64
dtypes: float64(9), int64(1), object(2)
memory usage: 414.7+ KB
# 4. Country Count
# We use .nunique() on the 'Country' column to count unique entries
num_countries = df['Country'].nunique()
print(f"There are {num_countries} countries in this dataset.")
There are 185 countries in this dataset.
# 5. Summary Statistics
# .describe() gives us the mean, std, and quartiles for numeric columns
df.describe()
year avh ctfp emp hc labsh pl_con pop rgdpna rnna
count 4422.000000 2812.000000 2880.000000 4302.000000 3480.000000 3350.000000 4422.000000 4422.000000 4.422000e+03 4.320000e+03
mean 2011.535957 1955.190185 0.661380 16.943444 2.545427 0.503176 0.528296 38.285502 6.300825e+05 2.883834e+06
std 6.912838 313.353559 0.262749 68.968464 0.692214 0.122610 1.265803 141.094358 2.181730e+06 9.505130e+06
min 2000.000000 1313.570000 0.051811 0.001760 1.069451 0.084364 0.065918 0.004420 6.765967e+01 7.886124e+02
25% 2006.000000 1694.320000 0.458319 0.829782 1.976296 0.433389 0.305083 2.041515 1.942828e+04 6.750847e+04
50% 2012.000000 1947.600000 0.646390 3.390750 2.631051 0.516634 0.410670 8.092902 6.798322e+04 2.887282e+05
75% 2018.000000 2178.300000 0.844167 10.190027 3.113224 0.591901 0.584906 26.208164 3.798980e+05 1.654450e+06
max 2023.000000 2706.890000 2.382832 774.418213 3.986023 0.911995 56.723629 1438.069596 3.104758e+07 1.710951e+08
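
Since the task only asks for the mean and standard deviation, the output of .describe() can be narrowed to those two rows. The following is a minimal sketch that uses a small made-up DataFrame (toy, with invented values) in place of the PWT data:

```python
import pandas as pd

# Hypothetical toy data, not taken from the PWT file
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "pop": [1.0, 2.0, 3.0, 6.0],
})

# .describe() returns a DataFrame indexed by statistic name,
# so .loc can pick out just the rows we need
mean_std = toy.describe().loc[["mean", "std"]]
print(mean_std)
```

The same pattern should work on df directly, since .describe() summarizes only the numeric columns by default when a DataFrame mixes numeric and string columns.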

Exercise 2: Basic Filtering

In this exercise, you will practice extracting specific subsets of data from your main DataFrame df.

Your Tasks:

  1. The Recent Slice: Create a new DataFrame called df_2023 that contains only observations for the year 2023.
  2. The Country Focus: Create a new DataFrame called df_switzerland that contains all years of data for Switzerland.
  3. High-Population Economies: Filter the dataset to find all observations where the population ('pop') was greater than 1,000 million (1 billion). Which countries remain?
  4. Specific Comparison: Create a DataFrame called df_comparison that contains only the data for Germany and France for the years 2000 to 2010 (inclusive). Hint: You will need to use the & operator and parentheses for multiple conditions.
# 1. The Recent Slice
df_2023 = df[df['year'] == 2023]
# 2. The Country Focus
df_switzerland = df[df['Country'] == 'Switzerland']
# 3. High-Population Economies
high_pop = df[df['pop'] > 1000]
high_pop['Country'].unique()
array(['China', 'India'], dtype=object)
# 4. Specific Comparison
# We use .isin() for the countries and logical operators for the years
countries = ['Germany', 'France']
mask = (df['Country'].isin(countries)) & (df['year'] >= 2000) & (df['year'] <= 2010)
df_comparison = df[mask]
print(f"Size of comparison group: {len(df_comparison)}")
Size of comparison group: 22
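
The year range in task 4 can also be written with Series.between(), which includes both endpoints by default. The sketch below compares the two spellings on a small made-up DataFrame (toy is hypothetical and only stands in for df):

```python
import pandas as pd

# Hypothetical toy data standing in for the PWT DataFrame
toy = pd.DataFrame({
    "Country": ["Germany", "Germany", "France", "France", "Italy"],
    "year": [1999, 2005, 2010, 2011, 2005],
})

# Explicit version: each comparison needs its own parentheses,
# because & binds more tightly than >= and <=
mask_explicit = (
    toy["Country"].isin(["Germany", "France"])
    & (toy["year"] >= 2000)
    & (toy["year"] <= 2010)
)

# Equivalent version: .between() includes both endpoints by default
mask_between = toy["Country"].isin(["Germany", "France"]) & toy["year"].between(2000, 2010)

print(toy[mask_between])
```

Both masks select the same rows here; which one to use is mostly a matter of readability.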

Exercise 3: Creating New Variables

In this exercise, you will create derived variables to help analyze productivity and living standards across different countries.

Your Tasks:

  1. GDP per Capita: Create a new column called gdp_pc by dividing Real GDP (rgdpna) by Population (pop).

  2. Capital Intensity: Create a column called k_labor which represents the amount of Capital Stock (rnna) available per person engaged (emp).

  3. Labor Productivity: Create a column called labor_prod by dividing Real GDP (rgdpna) by the total number of hours worked (emp × avh).

    Note: avh has many missing values, which will result in NaN for those specific rows in your new column. We’ll learn how to deal with missing values next week.

  4. Display Results: Show the first 5 rows of the newly created variables, together with Country and year.

# 1. GDP per Capita
# Formula: Real GDP / Population
df['gdp_pc'] = df['rgdpna'] / df['pop']
# 2. Capital Intensity (Capital per worker)
# Formula: Capital Stock / Employment
df['k_labor'] = df['rnna'] / df['emp']
# 3. Labor Productivity (Output per hour)
# We multiply employment by average hours to get total labor hours
df['labor_prod'] = df['rgdpna'] / (df['emp'] * df['avh'])
# Display the first few rows to verify the new columns
df[['Country', 'year', 'gdp_pc', 'k_labor', 'labor_prod']].head()
Country year gdp_pc k_labor labor_prod
0 Aruba 2000 44869.889806 272850.962900 NaN
1 Aruba 2001 43397.059250 281009.204154 NaN
2 Aruba 2002 42939.636086 292782.673191 NaN
3 Aruba 2003 42729.422894 306988.439240 NaN
4 Aruba 2004 45525.960200 322389.519304 NaN
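
The NaN values in labor_prod follow from how pandas arithmetic treats missing data: if any input in a row is NaN, the result for that row is NaN. A small sketch with invented numbers (toy is hypothetical, not PWT data) illustrates this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row has no hours-worked figure
toy = pd.DataFrame({
    "rgdpna": [100.0, 200.0],
    "emp": [2.0, 4.0],
    "avh": [1000.0, np.nan],
})

# Any arithmetic involving NaN yields NaN for that row
toy["labor_prod"] = toy["rgdpna"] / (toy["emp"] * toy["avh"])
print(toy)

# Aggregations such as .mean() skip NaN by default (skipna=True)
mean_prod = toy["labor_prod"].mean()
print(mean_prod)
```

This is also why the group means in the next exercise can still be computed even though many individual rows are missing.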

Exercise 4: Aggregation and Grouping

In this exercise, you will practice summarizing your data. Instead of looking at individual rows, we want to understand broader trends by country or by year.

Your Tasks:

  1. Global Trends: Calculate the total world population and the average TFP level for every year in the dataset.
  2. Country Profiles: For each country, find their maximum GDP per capita ('gdp_pc') and their average labor share ('labsh') across all available years.
  3. Productivity Trend: Group the data by year and calculate the mean Labor Productivity ('labor_prod').
  4. Summary Table: Create a new DataFrame called country_stats that shows the mean, min, and max of the Human Capital index ('hc') for every country.
# We group by 'year' and sum population, then average TFP
global_trends = df.groupby('year').agg({
    'pop': 'sum',
    'ctfp': 'mean'
}).reset_index()
global_trends.head()
year pop ctfp
0 2000 6085.879992 0.663150
1 2001 6168.092024 0.664814
2 2002 6248.875966 0.672702
3 2003 6329.244726 0.662922
4 2004 6410.522290 0.663620
# Find the peak wealth and typical labor share for each nation
country_profiles = df.groupby('Country').agg({
    'gdp_pc': 'max',
    'labsh': 'mean'
}).reset_index()
country_profiles.head()
Country gdp_pc labsh
0 Albania 16316.157654 0.806560
1 Algeria 14541.418972 NaN
2 Angola 8631.597586 0.284984
3 Anguilla 31024.327373 NaN
4 Antigua and Barbuda 32240.787743 NaN
# Labor Productivity over time
productivity_trend = df.groupby('year')['labor_prod'].mean()
productivity_trend.head()
# Optional: Quick visualization to check the trend
# productivity_trend.plot(title="Global Labor Productivity Over Time")
year
2000    33.359983
2001    33.657953
2002    34.408001
2003    35.299476
2004    36.264415
Name: labor_prod, dtype: float64
# Using .agg() with a list of functions for a single column
country_stats = df.groupby('Country')['hc'].agg(['mean', 'min', 'max']).reset_index()
country_stats.head()
Country mean min max
0 Albania 2.929273 2.801147 3.018473
1 Algeria 2.145182 1.885080 2.508918
2 Angola 1.419252 1.296941 1.544955
3 Anguilla NaN NaN NaN
4 Antigua and Barbuda NaN NaN NaN
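
The resulting columns are simply called mean, min, and max, which can become ambiguous once the table is combined with other statistics. One possible alternative is pandas' named aggregation, sketched here on a made-up DataFrame (toy and the hc_* column names are illustrative choices, not part of the exercise):

```python
import pandas as pd

# Hypothetical toy data with two countries
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "hc": [2.0, 3.0, 1.0, 1.5],
})

# Named aggregation: new_column_name=(source_column, function)
country_stats = toy.groupby("Country").agg(
    hc_mean=("hc", "mean"),
    hc_min=("hc", "min"),
    hc_max=("hc", "max"),
).reset_index()
print(country_stats)
```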

Exercise 5: Visualization

In this exercise, you will use matplotlib to create figures that tell a story about global development and productivity.

Your Tasks:

  1. Import matplotlib
  2. The Rise of Human Capital: Create a line plot showing the average human capital (hc) for all countries over time. Requirement: Add a title, label your axes, and include a grid.
  3. Human Capital vs. Wealth: Create a scatter plot for the year 2023 where the x-axis is hc (Human Capital) and the y-axis is gdp_pc (GDP per capita).
    Create the same figure again but use plt.yscale('log').
  4. The TFP Frontier: Use a horizontal bar chart (plt.barh) to show the ctfp (Total Factor Productivity) for a selection of countries in 2023: United States, China, India, Germany, Switzerland, Brazil, and Nigeria.
  5. Distribution of Labor Shares: Create a histogram of the labsh (Labor Share) variable for the entire dataset. Use 30 bins to see the shape of the distribution.
import matplotlib.pyplot as plt
# 1. Calculate the average human capital for all countries by year
hc_trend = df.groupby('year')['hc'].mean()
hc_trend = hc_trend.reset_index()

# 2. Create the figure

# Initialize figure and set a figsize
plt.figure(figsize=(8, 5))
# plot year on x-axis and 'hc' on y axis
plt.plot(hc_trend['year'], hc_trend['hc'])
# Set a title and labels
plt.title("Global Average Human Capital Index (2000–2023)")
plt.xlabel("Year")
plt.ylabel("Average Human Capital Index")
# Add a grid
plt.grid(True)
# Show the figure
plt.show()

# Filter your data for 2023
df_2023 = df[df['year'] == 2023]


# Create the scatterplot 
plt.figure(figsize=(8, 6))
plt.scatter(df_2023['hc'], df_2023['gdp_pc'])
plt.title("Human Capital vs. GDP per Capita (2023)")
plt.xlabel("Human Capital Index")
plt.ylabel("GDP per Capita (USD)")
plt.show()

# Create the same plot with a log scale for the y-axis
plt.figure(figsize=(8, 6))
plt.scatter(df_2023['hc'], df_2023['gdp_pc'])
plt.yscale('log') # Log scale for GDP is standard in macroeconomics
plt.title("Human Capital vs. GDP per Capita (2023)")
plt.xlabel("Human Capital Index")
plt.ylabel("GDP per Capita (USD, Log Scale)")
plt.show()

# Filter the 2023 data from above for the selected countries, then sort by ctfp
selection = ['United States', 'China', 'India', 'Germany', 'Switzerland', 'Brazil', 'Nigeria']
df_sel = df_2023[df_2023['Country'].isin(selection)]
df_sel = df_sel.sort_values('ctfp')

# Create horizontal barchart
plt.figure(figsize=(8, 5))
plt.barh(df_sel['Country'], df_sel['ctfp'])
plt.title("Total Factor Productivity Levels in 2023 (USA = 1)")
plt.xlabel("TFP Level (Relative to USA)")
plt.show()

plt.figure(figsize=(8, 5))
plt.hist(df['labsh'].dropna(), bins=30, edgecolor='white')
plt.title("Distribution of Labor Shares Across All Observations")
plt.xlabel("Labor Share of GDP")
plt.ylabel("Frequency")
plt.show()

Exercise 6: Growth Accounting - Self-Study

In this independent deep dive, you will apply the ‘Solow Residual’ method to decompose GDP growth into its fundamental drivers—capital, labor, and technology—transforming a core macroeconomic theory into a practical data analysis.

In this exercise, we assume a Cobb-Douglas production function:

\[Y_t = A_t K_t^{\alpha}L_t^{1-\alpha}\]

Where \(Y\) is output, \(K\) is capital, \(L\) is labor, and \(A\) is technology (TFP). Our goal is to calculate the growth rate of \(A\). We want to understand how much of a country’s GDP growth comes from the three components (i) labor, (ii) capital and (iii) technological progress.

1. Preparation

  • Create a new dataframe ga_df that includes only the United States and Switzerland
  • Sort values by country and year (important for growth calculations later)
# 1. Filter and Sort 
countries = ['United States', 'Switzerland']
ga_df = df[df['Country'].isin(countries)].copy()
ga_df = ga_df.sort_values(['Country', 'year'])
ga_df.head()
iso3 Country year avh ctfp emp hc labsh pl_con pop rgdpna rnna gdp_pc k_labor labor_prod
744 CHE Switzerland 2000 1712.74 0.914894 3.944163 3.528448 0.654686 0.774927 7.184007 529301.6875 2352489.00 73677.780033 596448.279214 78.353252
745 CHE Switzerland 2001 1672.85 0.933410 4.013566 3.538948 0.670151 0.778998 7.226391 537641.6875 2405543.00 74399.750512 599352.970383 80.076572
746 CHE Switzerland 2002 1651.26 0.960231 4.042648 3.549480 0.684949 0.837000 7.278752 537248.0625 2456137.25 73810.464005 607556.497216 80.481013
747 CHE Switzerland 2003 1664.54 0.929821 4.029146 3.560044 0.677805 1.008451 7.333447 537074.0000 2503626.50 73236.228475 621378.990849 80.080526
748 CHE Switzerland 2004 1694.95 0.933693 4.041221 3.570638 0.666347 1.112374 7.384194 551584.1250 2557653.75 74697.946045 632891.311870 80.527136

2. Calculate Growth Rates

  • Create new variables:
    • g_y: GDP growth (growth rate of rgdpna)
    • g_l: Labor growth (growth rate of emp)
    • g_k: Capital growth (growth rate of rnna)

You can calculate growth rates with .pct_change(), which computes the relative change between each value and the previous value in the column, expressed as a fraction (0.02 means 2%). This is why sorting matters: we want the change from one year to the next.

Important: We don’t want to accidentally compare the last value for Switzerland with the first value for the United States. To avoid this, group by ‘Country’ before you calculate the growth rates.

ga_df = ga_df.sort_values(['Country', 'year'])
ga_df['g_y'] = ga_df.groupby('Country')['rgdpna'].pct_change()
ga_df['g_k'] = ga_df.groupby('Country')['rnna'].pct_change()
ga_df['g_l'] = ga_df.groupby('Country')['emp'].pct_change()
ga_df.head()
iso3 Country year avh ctfp emp hc labsh pl_con pop rgdpna rnna gdp_pc k_labor labor_prod g_y g_k g_l
744 CHE Switzerland 2000 1712.74 0.914894 3.944163 3.528448 0.654686 0.774927 7.184007 529301.6875 2352489.00 73677.780033 596448.279214 78.353252 NaN NaN NaN
745 CHE Switzerland 2001 1672.85 0.933410 4.013566 3.538948 0.670151 0.778998 7.226391 537641.6875 2405543.00 74399.750512 599352.970383 80.076572 0.015757 0.022552 0.017597
746 CHE Switzerland 2002 1651.26 0.960231 4.042648 3.549480 0.684949 0.837000 7.278752 537248.0625 2456137.25 73810.464005 607556.497216 80.481013 -0.000732 0.021032 0.007246
747 CHE Switzerland 2003 1664.54 0.929821 4.029146 3.560044 0.677805 1.008451 7.333447 537074.0000 2503626.50 73236.228475 621378.990849 80.080526 -0.000324 0.019335 -0.003340
748 CHE Switzerland 2004 1694.95 0.933693 4.041221 3.570638 0.666347 1.112374 7.384194 551584.1250 2557653.75 74697.946045 632891.311870 80.527136 0.027017 0.021580 0.002997
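
To see what grouping protects against, the sketch below computes the growth rate both ways on a tiny made-up panel (toy and the column name g_wrong are hypothetical, and the GDP values are invented):

```python
import pandas as pd

# Hypothetical two-country panel with invented GDP values
toy = pd.DataFrame({
    "Country": ["Switzerland", "Switzerland", "United States", "United States"],
    "year": [2000, 2001, 2000, 2001],
    "rgdpna": [100.0, 110.0, 1000.0, 1050.0],
}).sort_values(["Country", "year"])

# Without grouping, the first US row is compared with the last Swiss row
toy["g_wrong"] = toy["rgdpna"].pct_change()

# With grouping, the first row of each country is NaN, as it should be
toy["g_y"] = toy.groupby("Country")["rgdpna"].pct_change()
print(toy)
```

In the ungrouped column, the United States appears to grow by several hundred percent in 2000, which is purely an artifact of the row order.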

3. Define Capital Share (\(\alpha\)):

A crucial parameter in the production function is the capital share \(\alpha\). We assume that this is constant.

  • Create a new variable cap_share_annual, which is 1 minus the labor share labsh
  • Calculate the average capital share for both countries: alpha_che for Switzerland, alpha_usa for the United States
  • Assign the average capital share back to the dataframe. Use .loc

So far, you have used .loc to find specific data (e.g., df.loc[row, column]). However, .loc is also one of the most powerful tools for creating or updating variables based on a condition. When you use the assignment operator (=) with .loc, you are telling Python: “Find all the rows that meet my condition, look at this specific column, and write this value there.”

# Example 
#            [1. The Rows]               [2. The Column]    [3. The Value]
ga_df.loc[ga_df['Country'] == 'United States', 'alpha']     =  0.1
  1. The Rows (The Filter): This identifies which rows you want to target (e.g., only the ones where the country is the United States).
  2. The Column (The Target): This is the name of the variable. If the column ‘alpha’ doesn’t exist yet, Pandas will create it for you automatically.
  3. The Value (The Assignment): This is the data you want to “paste” into those specific rows.
# create variable cap_share_annual
ga_df['cap_share_annual'] = 1 - ga_df['labsh']

# Calculate the average alpha for each country
alpha_usa = ga_df[ga_df['Country'] == 'United States']['cap_share_annual'].mean()
alpha_che = ga_df[ga_df['Country']=='Switzerland']['cap_share_annual'].mean()

# Assign the average alpha back to the dataframe
ga_df.loc[ga_df['Country'] == 'United States', 'alpha'] = alpha_usa
ga_df.loc[ga_df['Country'] == 'Switzerland', 'alpha'] = alpha_che
ga_df.head()
iso3 Country year avh ctfp emp hc labsh pl_con pop rgdpna rnna gdp_pc k_labor labor_prod g_y g_k g_l cap_share_annual alpha
744 CHE Switzerland 2000 1712.74 0.914894 3.944163 3.528448 0.654686 0.774927 7.184007 529301.6875 2352489.00 73677.780033 596448.279214 78.353252 NaN NaN NaN 0.345314 0.341327
745 CHE Switzerland 2001 1672.85 0.933410 4.013566 3.538948 0.670151 0.778998 7.226391 537641.6875 2405543.00 74399.750512 599352.970383 80.076572 0.015757 0.022552 0.017597 0.329849 0.341327
746 CHE Switzerland 2002 1651.26 0.960231 4.042648 3.549480 0.684949 0.837000 7.278752 537248.0625 2456137.25 73810.464005 607556.497216 80.481013 -0.000732 0.021032 0.007246 0.315051 0.341327
747 CHE Switzerland 2003 1664.54 0.929821 4.029146 3.560044 0.677805 1.008451 7.333447 537074.0000 2503626.50 73236.228475 621378.990849 80.080526 -0.000324 0.019335 -0.003340 0.322195 0.341327
748 CHE Switzerland 2004 1694.95 0.933693 4.041221 3.570638 0.666347 1.112374 7.384194 551584.1250 2557653.75 74697.946045 632891.311870 80.527136 0.027017 0.021580 0.002997 0.333653 0.341327
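
With only two countries, one .loc assignment per country is easy to follow. As a possible alternative that scales to many countries, groupby(...).transform('mean') broadcasts each group's mean back to every row of that group. A minimal sketch on made-up labor shares (toy is hypothetical):

```python
import pandas as pd

# Hypothetical labor shares for two countries
toy = pd.DataFrame({
    "Country": ["Switzerland", "Switzerland", "United States", "United States"],
    "labsh": [0.66, 0.68, 0.60, 0.62],
})
toy["cap_share_annual"] = 1 - toy["labsh"]

# transform('mean') returns one value per row: the mean of that row's group
toy["alpha"] = toy.groupby("Country")["cap_share_annual"].transform("mean")
print(toy)
```

The result has the same shape as the .loc approach: a constant alpha within each country.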

4. The Solow Residual:

We want to calculate TFP growth (i.e. the growth of \(A\) in the production function).

  • Calculate TFP growth using the following formula:

\[g_A = g_Y - [\alpha g_K + (1-\alpha) g_L]\]

  • \(g_L\): growth of labour g_l
  • \(g_K\): growth of capital g_k
  • \(g_Y\): growth of GDP g_y
  • \(\alpha\): capital share alpha

If you want to derive this formula: apply \(\ln\) to the production function, then use the fact that, for small changes, the change in the natural log of a variable is approximately equal to its growth rate (g).
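
A sketch of that derivation, assuming \(\alpha\) is constant over time (as stated in step 3) and writing \(\Delta \ln X_t = \ln X_t - \ln X_{t-1}\):

```latex
\begin{align*}
\ln Y_t &= \ln A_t + \alpha \ln K_t + (1-\alpha)\ln L_t \\
\Delta \ln Y_t &= \Delta \ln A_t + \alpha\,\Delta \ln K_t + (1-\alpha)\,\Delta \ln L_t \\
g_Y &\approx g_A + \alpha g_K + (1-\alpha) g_L \\
g_A &\approx g_Y - [\alpha g_K + (1-\alpha) g_L]
\end{align*}
```

The second line subtracts the first line evaluated at \(t-1\) from the same line at \(t\); the third line replaces each log difference with the corresponding growth rate.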

# 4. Calculate TFP Growth
ga_df['g_tfp'] = ga_df['g_y'] - (ga_df['alpha'] * ga_df['g_k'] + (1 - ga_df['alpha']) * ga_df['g_l'])
ga_df.head()
iso3 Country year avh ctfp emp hc labsh pl_con pop ... rnna gdp_pc k_labor labor_prod g_y g_k g_l cap_share_annual alpha g_tfp
744 CHE Switzerland 2000 1712.74 0.914894 3.944163 3.528448 0.654686 0.774927 7.184007 ... 2352489.00 73677.780033 596448.279214 78.353252 NaN NaN NaN 0.345314 0.341327 NaN
745 CHE Switzerland 2001 1672.85 0.933410 4.013566 3.538948 0.670151 0.778998 7.226391 ... 2405543.00 74399.750512 599352.970383 80.076572 0.015757 0.022552 0.017597 0.329849 0.341327 -0.003532
746 CHE Switzerland 2002 1651.26 0.960231 4.042648 3.549480 0.684949 0.837000 7.278752 ... 2456137.25 73810.464005 607556.497216 80.481013 -0.000732 0.021032 0.007246 0.315051 0.341327 -0.012684
747 CHE Switzerland 2003 1664.54 0.929821 4.029146 3.560044 0.677805 1.008451 7.333447 ... 2503626.50 73236.228475 621378.990849 80.080526 -0.000324 0.019335 -0.003340 0.322195 0.341327 -0.004724
748 CHE Switzerland 2004 1694.95 0.933693 4.041221 3.570638 0.666347 1.112374 7.384194 ... 2557653.75 74697.946045 632891.311870 80.527136 0.027017 0.021580 0.002997 0.333653 0.341327 0.017677

5 rows × 21 columns

5. Summarize and Plot

First, we want to create a summary table:

  • Create a table summary with the mean of the growth variables and alpha ['g_y', 'g_k', 'g_l', 'alpha', 'g_tfp'] by country
  • The raw growth rates of Capital (g_k) and Labor (g_l) don’t tell the whole story. We must multiply them by their respective “shares” in the economy to see how much they actually contributed to total GDP growth:
    • Create a variable contrib_k: The contribution of capital which is \(\alpha g_K\)
    • Create a variable contrib_l: The contribution of labor which is \((1-\alpha)g_L\)
# 1. Average the components 
summary = ga_df.groupby('Country')[['g_y', 'g_k', 'g_l', 'alpha', 'g_tfp']].mean()

# 2. Calculate the final contributions
summary['contrib_k'] = summary['alpha'] * summary['g_k']
summary['contrib_l'] = (1 - summary['alpha']) * summary['g_l']

# 3. Show the summary table
summary = summary.reset_index()
summary
Country g_y g_k g_l alpha g_tfp contrib_k contrib_l
0 Switzerland 0.018271 0.020339 0.012318 0.341327 0.003215 0.006942 0.008113
1 United States 0.021028 0.019545 0.008397 0.400359 0.008168 0.007825 0.005035
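
By construction, the three parts add back up to average GDP growth. In the table above, Switzerland gives 0.003215 + 0.006942 + 0.008113 ≈ 0.01827, which matches g_y up to rounding. The sketch below repeats that check on invented numbers (row is a hypothetical single-country summary):

```python
import pandas as pd

# Hypothetical average growth rates and capital share for one country
row = pd.Series({"g_y": 0.020, "g_k": 0.030, "g_l": 0.010, "alpha": 0.40})

contrib_k = row["alpha"] * row["g_k"]          # capital contribution
contrib_l = (1 - row["alpha"]) * row["g_l"]    # labor contribution
g_tfp = row["g_y"] - (contrib_k + contrib_l)   # Solow residual

# The three parts add back up to GDP growth
total = g_tfp + contrib_k + contrib_l
print(contrib_k, contrib_l, g_tfp, total)
```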

Then we create a stacked barchart to visualize our results.

The code below creates a figure that decomposes average annual GDP growth into the specific contributions of capital accumulation, labor input, and Total Factor Productivity (TFP) to illustrate the underlying drivers of economic growth for each country.

We haven’t covered “stacked” charts in class. Try to deduce how this works by looking at the three plt.bar() calls.

# 1. Setup the data from the columns
countries = summary['Country']
tfp_part = summary['g_tfp']
k_part = summary['contrib_k']
l_part = summary['contrib_l']

# 2. Create the figure
plt.figure(figsize=(8, 4.5))

# Plot Layer 1: TFP (The base)
plt.bar(countries, tfp_part, label='TFP Growth')

# Plot Layer 2: Capital (Stacked on top of TFP)
plt.bar(countries, k_part, bottom=tfp_part, label='Capital Contribution')

# Plot Layer 3: Labor (Stacked on top of TFP + Capital)
plt.bar(countries, l_part, bottom=tfp_part + k_part, label='Labor Contribution')

# 3. Add styling
plt.ylabel("Average Annual Growth Rate")
plt.title("Growth Accounting Decomposition")
plt.legend()

plt.tight_layout()
plt.show()

We call plt.bar() three separate times. Each call tells Matplotlib to draw a set of bars on the same figure.

  • The first call draws the TFP bars.
  • The second call draws the Capital bars.
  • The third call draws the Labor bars.

The bottom Argument: This is the secret to “stacking.” By default, Matplotlib starts every bar at zero.

  • In the second call, we set bottom=tfp_part. This tells Python: “Don’t start the Capital bars at zero; start them at the height where the TFP bars ended.”
  • In the third call, we set bottom=tfp_part + k_part. This stacks the Labor bars on top of both the TFP and Capital layers combined.
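
Put differently, each layer's bottom is the running total of the layers beneath it. The sketch below computes those starting heights for all layers at once, using invented values (parts and bottoms are hypothetical names, and no plot is drawn):

```python
import pandas as pd

# Hypothetical layer heights: one row per country, one column per layer
parts = pd.DataFrame(
    {"tfp": [0.003, 0.008], "capital": [0.007, 0.008], "labor": [0.008, 0.005]},
    index=["Switzerland", "United States"],
)

# Running total across layers minus the layer's own height:
# what remains is the combined height of the layers beneath it
bottoms = parts.cumsum(axis=1) - parts
print(bottoms)
```

The first layer starts at zero, the second at the TFP height, and the third at TFP plus capital. Note that this simple stacking reads cleanly only when all layers are non-negative; a negative layer would make the bars overlap.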

Labels and Legend:

  • Inside each plt.bar() function, we provide a label (e.g., ‘TFP Growth’).
  • However, these labels won’t show up on the chart by themselves. We must call plt.legend() at the end. This command looks at all the labels we’ve defined and creates the “Key” or “Legend” in the corner of the chart so the reader knows which color represents which economic component.