Exercise 5: Pandas

Author

Franziska Bender

Published

March 14, 2026

In this exercise, we move from our simplified “teaching” dataset to a broader download from the Penn World Table (PWT). Download the file pwt_data_ex5.csv and save it in a folder named data. This version contains more countries and a wider range of macroeconomic variables:

Variable	Full Name / Definition
iso3	3-letter ISO Country Code
Country	Full Country Name
year	Observation Year
pop	Population (in millions)
rgdpna	Real GDP at constant 2021 national prices (in millions of 2021 USD)
rnna	Capital stock at constant 2021 national prices (in millions of 2021 USD)
emp	Number of persons engaged (in millions)
avh	Average annual hours worked by persons engaged
hc	Human capital index (based on years of schooling)
ctfp	TFP level at current PPPs (USA = 1)
labsh	Share of labor compensation in GDP
pl_con	Price level of household consumption

Import Packages

import pandas as pd

Exercise 1: Data Loading and Initial Inspection

In this exercise, you will practice loading a comprehensive macroeconomic dataset into a DataFrame and performing an initial inspection to understand its dimensionality, variable types, and the unique entities it contains.

Your Tasks:

Load the data: Read the file data/pwt_data_ex5.csv into a DataFrame named df.
Size Check: How many observations (rows) and variables (columns) does this dataset have?
Variable Types: Use a method to see which columns are categorical (strings) and which are numeric.
Country Count: How many unique countries are represented in this dataset?
Summary Statistics: Display the mean and standard deviation for the numeric variables.

Solution 1.1: Load the Data

# 1. Load the data
df = pd.read_csv("data/pwt_data_ex5.csv")
df.head()

	iso3	Country	year	avh	ctfp	emp	hc	labsh	pl_con	pop	rgdpna	rnna
0	ABW	Aruba	2000	NaN	NaN	0.041918	NaN	0.645106	0.503648	0.088761	3982.696289	11437.366211
1	ABW	Aruba	2001	NaN	NaN	0.042579	NaN	0.645106	0.518854	0.090305	3918.971436	11965.093750
2	ABW	Aruba	2002	NaN	NaN	0.043016	NaN	0.645106	0.532409	0.091379	3923.781006	12594.469727
3	ABW	Aruba	2003	NaN	NaN	0.043385	NaN	0.645106	0.538941	0.092310	3944.353027	13318.708008
4	ABW	Aruba	2004	NaN	NaN	0.043739	NaN	0.645106	0.552747	0.093213	4243.611328	14101.028320

You will immediately notice the NaN entries in the DataFrame. In pandas, NaN stands for “Not a Number” and is commonly used to represent missing or undefined data in a Series or DataFrame.

For this exercise we’re going to ignore them. Handling missing data is something we’ll cover in week 6.

Solution 1.2: Size Check

# 2. Size Check
print(f"Dataset Shape: {df.shape}")
# df.shape returns (rows, columns)

Dataset Shape: (4422, 12)

Solution 1.3: Variable Types

# 3. Variable Types and Non-Null counts
# .info() is perfect for seeing Dtypes and identifying missing data at a glance
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4422 entries, 0 to 4421
Data columns (total 12 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   iso3     4422 non-null   object 
 1   Country  4422 non-null   object 
 2   year     4422 non-null   int64  
 3   avh      2812 non-null   float64
 4   ctfp     2880 non-null   float64
 5   emp      4302 non-null   float64
 6   hc       3480 non-null   float64
 7   labsh    3350 non-null   float64
 8   pl_con   4422 non-null   float64
 9   pop      4422 non-null   float64
 10  rgdpna   4422 non-null   float64
 11  rnna     4320 non-null   float64
dtypes: float64(9), int64(1), object(2)
memory usage: 414.7+ KB

Solution 1.4: Country Count

# 4. Country Count
# We use .nunique() on the 'Country' column to count unique entries
num_countries = df['Country'].nunique()
print(f"There are {num_countries} countries in this dataset.")

There are 185 countries in this dataset.

Solution 1.5: Summary Statistics

# 5. Summary Statistics
# .describe() gives us the mean, std, and quartiles for numeric columns
df.describe()

	year	avh	ctfp	emp	hc	labsh	pl_con	pop	rgdpna	rnna
count	4422.000000	2812.000000	2880.000000	4302.000000	3480.000000	3350.000000	4422.000000	4422.000000	4.422000e+03	4.320000e+03
mean	2011.535957	1955.190185	0.661380	16.943444	2.545427	0.503176	0.528296	38.285502	6.300825e+05	2.883834e+06
std	6.912838	313.353559	0.262749	68.968464	0.692214	0.122610	1.265803	141.094358	2.181730e+06	9.505130e+06
min	2000.000000	1313.570000	0.051811	0.001760	1.069451	0.084364	0.065918	0.004420	6.765967e+01	7.886124e+02
25%	2006.000000	1694.320000	0.458319	0.829782	1.976296	0.433389	0.305083	2.041515	1.942828e+04	6.750847e+04
50%	2012.000000	1947.600000	0.646390	3.390750	2.631051	0.516634	0.410670	8.092902	6.798322e+04	2.887282e+05
75%	2018.000000	2178.300000	0.844167	10.190027	3.113224	0.591901	0.584906	26.208164	3.798980e+05	1.654450e+06
max	2023.000000	2706.890000	2.382832	774.418213	3.986023	0.911995	56.723629	1438.069596	3.104758e+07	1.710951e+08
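Since the task asks only for the mean and standard deviation, one possible shortcut is to pass a list of statistics to .agg() instead of reading them off the full .describe() table. The sketch below runs on a tiny made-up DataFrame (toy and its values are invented for illustration, they are not PWT data).

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the PWT file)
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "pop": [1.0, 3.0, 10.0, 14.0],
    "hc": [2.0, 2.5, 3.0, 3.5],
})

# Keep only the numeric columns, then compute just the two requested statistics
mean_std = toy.select_dtypes(include="number").agg(["mean", "std"])
print(mean_std)
```

The same pattern should work on df, because select_dtypes drops the string columns iso3 and Country before aggregating.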

Exercise 2: Basic Filtering

In this exercise, you will practice extracting specific subsets of data from your main DataFrame df.

Your Tasks:

The Recent Slice: Create a new DataFrame called df_2023 that contains only observations for the year 2023.
The Country Focus: Create a new DataFrame called df_switzerland that contains all years of data for Switzerland.
High-Population Economies: Filter the dataset to find all observations where the population ('pop') was greater than 1,000 million (1 billion). Which countries remain?
Specific Comparison: Create a DataFrame called df_comparison that contains only the data for Germany and France for the years 2000 to 2010 (inclusive). Hint: You will need to use the & operator and parentheses for multiple conditions.
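To illustrate the hint about & and parentheses, here is a minimal sketch on made-up data (toy is a hypothetical DataFrame, not the PWT file). Each comparison is wrapped in its own parentheses because & binds more tightly than == or >=. The .between() variant is an alternative that is inclusive on both ends by default.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "year": [1999, 2005, 2011, 1999, 2005, 2011],
})

# Combine three conditions; every comparison sits inside its own parentheses
mask = (toy["Country"] == "A") & (toy["year"] >= 2000) & (toy["year"] <= 2010)
subset = toy[mask]

# Alternative: .between() covers both year bounds in one call
subset_between = toy[(toy["Country"] == "A") & toy["year"].between(2000, 2010)]
print(subset)
```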

Solution 2.1: The Recent Slice

# 1. The Recent Slice
df_2023 = df[df['year'] == 2023]

Solution 2.2: The Country Focus

# 2. The Country Focus
df_switzerland = df[df['Country'] == 'Switzerland']

Solution 2.3: High-Population Economies

# 3. High-Population Economies
high_pop = df[df['pop'] > 1000]
high_pop['Country'].unique()

array(['China', 'India'], dtype=object)

Solution 2.4: Specific Comparison

# 4. Specific Comparison
# We use .isin() for the countries and logical operators for the years
countries = ['Germany', 'France']
mask = (df['Country'].isin(countries)) & (df['year'] >= 2000) & (df['year'] <= 2010)
df_comparison = df[mask]
print(f"Size of comparison group: {len(df_comparison)}")

Size of comparison group: 22

Exercise 3: Creating New Variables

In this exercise, you will create derived variables to help analyze productivity and living standards across different countries.

Your Tasks:

GDP per Capita: Create a new column called gdp_pc by dividing Real GDP (rgdpna) by Population (pop).
Capital Intensity: Create a column called k_labor which represents the amount of Capital Stock (rnna) available per person engaged (emp).
Labor Productivity: Create a column called labor_prod by dividing Real GDP (rgdpna) by the total number of hours worked (emp x avh).

Note: avh has many missing values, which will result in NaN for those specific rows in your new column. We’ll learn how to deal with missing values next week.
Display Results: Show the first 5 rows of the newly created variables, together with Country and year.

Solution 3.1: GDP per Capita

# 1. GDP per Capita
# Formula: Real GDP / Population
df['gdp_pc'] = df['rgdpna'] / df['pop']

Solution 3.2: Capital Intensity

# 2. Capital Intensity (Capital per worker)
# Formula: Capital Stock / Employment
df['k_labor'] = df['rnna'] / df['emp']

Solution 3.3: Labor Productivity

# 3. Labor Productivity (Output per hour)
# We multiply employment by average hours to get total labor hours
df['labor_prod'] = df['rgdpna'] / (df['emp'] * df['avh'])

Solution 3.4: Display Results

# Display the first few rows to verify the new columns
df[['Country', 'year', 'gdp_pc', 'k_labor', 'labor_prod']].head()

	Country	year	gdp_pc	k_labor	labor_prod
0	Aruba	2000	44869.889806	272850.962900	NaN
1	Aruba	2001	43397.059250	281009.204154	NaN
2	Aruba	2002	42939.636086	292782.673191	NaN
3	Aruba	2003	42729.422894	306988.439240	NaN
4	Aruba	2004	45525.960200	322389.519304	NaN
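The NaN values in labor_prod come from the missing avh entries. As a small sketch of how this works, the made-up example below (toy and its numbers are invented for illustration) shows that arithmetic with NaN yields NaN for the affected row only, and that .mean() skips NaN by default.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: one missing value in avh
toy = pd.DataFrame({
    "rgdpna": [100.0, 200.0, 300.0],
    "emp": [1.0, 2.0, 3.0],
    "avh": [2000.0, np.nan, 1500.0],
})

# Any calculation involving NaN returns NaN for that row
toy["labor_prod"] = toy["rgdpna"] / (toy["emp"] * toy["avh"])

# Count the missing results, then average the remaining ones
n_missing = toy["labor_prod"].isna().sum()
avg_prod = toy["labor_prod"].mean()  # NaN rows are ignored by default
print(n_missing, avg_prod)
```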

Exercise 4: Aggregation and Grouping

In this exercise, you will practice summarizing your data. Instead of looking at individual rows, we want to understand broader trends by country or by year.

Your Tasks:

Global Trends: Calculate the total world population and the average TFP level for every year in the dataset.
Country Profiles: For each country, find their maximum GDP per capita ('gdp_pc') and their average labor share ('labsh') across all available years.
Productivity Trend: Group the data by year and calculate the mean Labor Productivity ('labor_prod').
Summary Table: Create a new DataFrame called country_stats that shows the mean, min, and max of the Human Capital index ('hc') for every country.

Solution 4.1: Global Trends

# We group by 'year' and sum population, then average TFP
global_trends = df.groupby('year').agg({
    'pop': 'sum',
    'ctfp': 'mean'
}).reset_index()
global_trends.head()

	year	pop	ctfp
0	2000	6085.879992	0.663150
1	2001	6168.092024	0.664814
2	2002	6248.875966	0.672702
3	2003	6329.244726	0.662922
4	2004	6410.522290	0.663620

Solution 4.2: Country Profiles

# Find the peak wealth and typical labor share for each nation
country_profiles = df.groupby('Country').agg({
    'gdp_pc': 'max',
    'labsh': 'mean'
}).reset_index()
country_profiles.head()

	Country	gdp_pc	labsh
0	Albania	16316.157654	0.806560
1	Algeria	14541.418972	NaN
2	Angola	8631.597586	0.284984
3	Anguilla	31024.327373	NaN
4	Antigua and Barbuda	32240.787743	NaN

Solution 4.3: Productivity Trend

# Labor Productivity over time
productivity_trend = df.groupby('year')['labor_prod'].mean()
productivity_trend.head()
# Optional: Quick visualization to check the trend
# productivity_trend.plot(title="Global Labor Productivity Over Time")

year
2000    33.359983
2001    33.657953
2002    34.408001
2003    35.299476
2004    36.264415
Name: labor_prod, dtype: float64

Solution 4.4: Summary Table

# Using .agg() with a list of functions for a single column
country_stats = df.groupby('Country')['hc'].agg(['mean', 'min', 'max']).reset_index()
country_stats.head()

	Country	mean	min	max
0	Albania	2.929273	2.801147	3.018473
1	Algeria	2.145182	1.885080	2.508918
2	Angola	1.419252	1.296941	1.544955
3	Anguilla	NaN	NaN	NaN
4	Antigua and Barbuda	NaN	NaN	NaN
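The columns of this table are called mean, min, and max, which no longer say which variable they describe. One possible alternative is named aggregation, sketched below on made-up data (toy and the column names hc_mean, hc_min, hc_max are hypothetical choices for illustration).

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "hc": [2.0, 3.0, 1.0, 1.5],
})

# Named aggregation: new_column_name=(source_column, function)
stats = toy.groupby("Country").agg(
    hc_mean=("hc", "mean"),
    hc_min=("hc", "min"),
    hc_max=("hc", "max"),
).reset_index()
print(stats)
```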

Exercise 5: Visualization

In this exercise, you will use matplotlib to create figures that tell a story about global development and productivity.

Your Tasks:

Import matplotlib
The Rise of Human Capital: Create a line plot showing the average human capital (hc) for all countries over time. Requirement: Add a title, label your axes, and include a grid.
Productivity vs. Wealth: Create a scatter plot for the year 2023 where the x-axis is hc (Human Capital) and the y-axis is gdp_pc (GDP per capita).
Create the same figure again but use plt.yscale('log').
The TFP Frontier: Use a horizontal bar chart (plt.barh) to show the ctfp (Total Factor Productivity) for a selection of countries in 2023: United States, China, India, Germany, Switzerland, Brazil, and Nigeria.
Distribution of Labor Shares: Create a histogram of the labsh (Labor Share) variable for the entire dataset. Use 30 bins to see the shape of the distribution.

Solution 5.1: Import Matplotlib

import matplotlib.pyplot as plt

Solution 5.2: The Rise in Human Capital

# 1. Calculate the average human capital for all countries by year
hc_trend = df.groupby('year')['hc'].mean()
hc_trend = hc_trend.reset_index()

# 2. Create the figure

# Initialize figure and set a figsize
plt.figure(figsize=(8, 5))
# plot year on x-axis and 'hc' on y axis
plt.plot(hc_trend['year'], hc_trend['hc'])
# Set a title and labels
plt.title("Global Average Human Capital Index (2000–2023)")
plt.xlabel("Year")
plt.ylabel("Average Human Capital Index")
# Add a grid
plt.grid(True)
# Show the figure
plt.show()

Solution 5.3: Productivity vs. Wealth

# Filter your data for 2023
df_2023 = df[df['year'] == 2023]


# Create the scatterplot 
plt.figure(figsize=(8, 6))
plt.scatter(df_2023['hc'], df_2023['gdp_pc'])
plt.title("Human Capital vs. GDP per Capita (2023)")
plt.xlabel("Human Capital Index")
plt.ylabel("GDP per Capita (USD)")
plt.show()

# Create the same plot with a log scale for the y-axis
plt.figure(figsize=(8, 6))
plt.scatter(df_2023['hc'], df_2023['gdp_pc'])
plt.yscale('log') # Log scale for GDP is standard in macroeconomics
plt.title("Human Capital vs. GDP per Capita (2023)")
plt.xlabel("Human Capital Index")
plt.ylabel("GDP per Capita (USD, Log Scale)")
plt.show()

Solution 5.4: The TFP Frontier

# Reuse the 2023 data from above, keep the selected countries, and sort by TFP
selection = ['United States', 'China', 'India', 'Germany', 'Switzerland', 'Brazil', 'Nigeria']
df_sel = df_2023[df_2023['Country'].isin(selection)]
df_sel = df_sel.sort_values('ctfp')

# Create horizontal barchart
plt.figure(figsize=(8, 5))
plt.barh(df_sel['Country'], df_sel['ctfp'])
plt.title("Total Factor Productivity Levels in 2023 (USA = 1)")
plt.xlabel("TFP Level (Relative to USA)")
plt.show()

Solution 5.5: Distribution of Labor Shares

plt.figure(figsize=(8, 5))
plt.hist(df['labsh'].dropna(), bins=30, edgecolor='white')
plt.title("Distribution of Labor Shares Across All Observations")
plt.xlabel("Labor Share of GDP")
plt.ylabel("Frequency")
plt.show()

Exercise 6: Growth Accounting - Self-Study

In this self-study exercise, you will apply the “Solow residual” method to decompose GDP growth into its fundamental drivers (capital, labor, and technology), turning a core macroeconomic theory into a practical data analysis.

In this exercise, we assume a Cobb-Douglas production function:

\[Y_t = A_t K_t^{\alpha}L_t^{1-\alpha}\]

Where \(Y\) is output, \(K\) is capital, \(L\) is labor, and \(A\) is technology (TFP). Our goal is to calculate the growth rate of \(A\) and to understand how much of a country’s GDP growth comes from each of three components: (i) labor, (ii) capital, and (iii) technological progress.

1. Preparation

Create a new dataframe ga_df that includes only the United States and Switzerland
Sort values by country and year (important for growth calculations later)

Solution 6.1: Preparation

# 1. Filter and Sort 
countries = ['United States', 'Switzerland']
ga_df = df[df['Country'].isin(countries)].copy()
ga_df = ga_df.sort_values(['Country', 'year'])
ga_df.head()

	iso3	Country	year	avh	ctfp	emp	hc	labsh	pl_con	pop	rgdpna	rnna	gdp_pc	k_labor	labor_prod
744	CHE	Switzerland	2000	1712.74	0.914894	3.944163	3.528448	0.654686	0.774927	7.184007	529301.6875	2352489.00	73677.780033	596448.279214	78.353252
745	CHE	Switzerland	2001	1672.85	0.933410	4.013566	3.538948	0.670151	0.778998	7.226391	537641.6875	2405543.00	74399.750512	599352.970383	80.076572
746	CHE	Switzerland	2002	1651.26	0.960231	4.042648	3.549480	0.684949	0.837000	7.278752	537248.0625	2456137.25	73810.464005	607556.497216	80.481013
747	CHE	Switzerland	2003	1664.54	0.929821	4.029146	3.560044	0.677805	1.008451	7.333447	537074.0000	2503626.50	73236.228475	621378.990849	80.080526
748	CHE	Switzerland	2004	1694.95	0.933693	4.041221	3.570638	0.666347	1.112374	7.384194	551584.1250	2557653.75	74697.946045	632891.311870	80.527136

2. Calculate Growth Rates

Create new variables:
- g_y: GDP growth (growth rate of rgdpna)
- g_l: Labor growth (growth rate of emp)
- g_k: Capital growth (growth rate of rnna)

You can calculate growth rates with .pct_change(). It computes the relative change between each value and the previous one in the DataFrame, expressed as a fraction (0.02 means 2%). This is why sorting matters: we want the change from one year to the next.

Important: We don’t want to accidentally compare the last value for Switzerland with the first value for the United States. To avoid this, group by 'Country' before calculating the growth rates.
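To see why the grouping matters, here is a minimal sketch on made-up data (toy, gdp, g_naive, and g_grouped are hypothetical names chosen for illustration). Without grouping, the first row of country B is compared with the last row of country A. With grouping, each country's first year is NaN, as it should be.

```python
import pandas as pd

# Toy data, invented for illustration: two countries, two years each
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "gdp": [100.0, 110.0, 50.0, 55.0],
}).sort_values(["Country", "year"])

# Ungrouped: row 2 compares B in 2000 with A in 2001, which is meaningless
toy["g_naive"] = toy["gdp"].pct_change()

# Grouped: the calculation restarts for every country
toy["g_grouped"] = toy.groupby("Country")["gdp"].pct_change()
print(toy)
```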

Solution 6.2: Calculate Growth Rates

ga_df = ga_df.sort_values(['Country', 'year'])
ga_df['g_y'] = ga_df.groupby('Country')['rgdpna'].pct_change()
ga_df['g_k'] = ga_df.groupby('Country')['rnna'].pct_change()
ga_df['g_l'] = ga_df.groupby('Country')['emp'].pct_change()
ga_df.head()

	iso3	Country	year	avh	ctfp	emp	hc	labsh	pl_con	pop	rgdpna	rnna	gdp_pc	k_labor	labor_prod	g_y	g_k	g_l
744	CHE	Switzerland	2000	1712.74	0.914894	3.944163	3.528448	0.654686	0.774927	7.184007	529301.6875	2352489.00	73677.780033	596448.279214	78.353252	NaN	NaN	NaN
745	CHE	Switzerland	2001	1672.85	0.933410	4.013566	3.538948	0.670151	0.778998	7.226391	537641.6875	2405543.00	74399.750512	599352.970383	80.076572	0.015757	0.022552	0.017597
746	CHE	Switzerland	2002	1651.26	0.960231	4.042648	3.549480	0.684949	0.837000	7.278752	537248.0625	2456137.25	73810.464005	607556.497216	80.481013	-0.000732	0.021032	0.007246
747	CHE	Switzerland	2003	1664.54	0.929821	4.029146	3.560044	0.677805	1.008451	7.333447	537074.0000	2503626.50	73236.228475	621378.990849	80.080526	-0.000324	0.019335	-0.003340
748	CHE	Switzerland	2004	1694.95	0.933693	4.041221	3.570638	0.666347	1.112374	7.384194	551584.1250	2557653.75	74697.946045	632891.311870	80.527136	0.027017	0.021580	0.002997

3. Define Capital Share (\(\alpha\)):

A crucial parameter in the production function is the capital share \(\alpha\). We assume that this is constant.

Create a new variable cap_share_annual, which is 1 minus the labor share labsh
Calculate the average capital share for both countries: alpha_che for Switzerland and alpha_usa for the United States
Assign the average capital share back to the dataframe. Use .loc

So far, you have used .loc to find specific data (e.g., df.loc[row, column]). However, .loc is also one of the most powerful tools for creating or updating variables based on a condition. When you use the assignment operator (=) with .loc, you are telling Python: “Find all the rows that meet my condition, look at this specific column, and write this value there.”

# Example 
#         [1. The Rows]                        [2. The Column]  [3. The Value]
ga_df.loc[ga_df['Country'] == 'United States', 'alpha']         = 0.1

The Rows (The Filter): This identifies which rows you want to target (e.g., only the ones where the country is the United States).
The Column (The Target): This is the name of the variable. If the column ‘alpha’ doesn’t exist yet, Pandas will create it for you automatically.
The Value (The Assignment): This is the data you want to “paste” into those specific rows.

Solution 6.3: Define the Capital Share

# create variable cap_share_annual
ga_df['cap_share_annual'] = 1 - ga_df['labsh']

# Calculate the average alpha for each country
alpha_usa = ga_df[ga_df['Country'] == 'United States']['cap_share_annual'].mean()
alpha_che = ga_df[ga_df['Country']=='Switzerland']['cap_share_annual'].mean()

# Assign the average alpha back to the dataframe
ga_df.loc[ga_df['Country'] == 'United States', 'alpha'] = alpha_usa
ga_df.loc[ga_df['Country'] == 'Switzerland', 'alpha'] = alpha_che
ga_df.head()

	iso3	Country	year	avh	ctfp	emp	hc	labsh	pl_con	pop	rgdpna	rnna	gdp_pc	k_labor	labor_prod	g_y	g_k	g_l	cap_share_annual	alpha
744	CHE	Switzerland	2000	1712.74	0.914894	3.944163	3.528448	0.654686	0.774927	7.184007	529301.6875	2352489.00	73677.780033	596448.279214	78.353252	NaN	NaN	NaN	0.345314	0.341327
745	CHE	Switzerland	2001	1672.85	0.933410	4.013566	3.538948	0.670151	0.778998	7.226391	537641.6875	2405543.00	74399.750512	599352.970383	80.076572	0.015757	0.022552	0.017597	0.329849	0.341327
746	CHE	Switzerland	2002	1651.26	0.960231	4.042648	3.549480	0.684949	0.837000	7.278752	537248.0625	2456137.25	73810.464005	607556.497216	80.481013	-0.000732	0.021032	0.007246	0.315051	0.341327
747	CHE	Switzerland	2003	1664.54	0.929821	4.029146	3.560044	0.677805	1.008451	7.333447	537074.0000	2503626.50	73236.228475	621378.990849	80.080526	-0.000324	0.019335	-0.003340	0.322195	0.341327
748	CHE	Switzerland	2004	1694.95	0.933693	4.041221	3.570638	0.666347	1.112374	7.384194	551584.1250	2557653.75	74697.946045	632891.311870	80.527136	0.027017	0.021580	0.002997	0.333653	0.341327
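The .loc approach needs one line per country. As an alternative that is not part of the exercise, groupby(...).transform('mean') can write each group's average back to every row of that group in a single step. The sketch below uses made-up data (toy and its values are invented for illustration).

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "cap_share_annual": [0.30, 0.40, 0.35, 0.45],
})

# transform returns a Series aligned with the original rows,
# so each row receives the mean of its own country
toy["alpha"] = toy.groupby("Country")["cap_share_annual"].transform("mean")
print(toy)
```

With many countries this avoids repeating the assignment for each one, at the cost of being less explicit than the .loc version.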

4. The Solow Residual:

We want to calculate TFP growth (i.e. the growth of \(A\) in the production function).

Calculate TFP growth using the following formula:

\[g_A = g_Y - [\alpha g_K + (1-\alpha) g_L]\]

\(g_L\): growth of labor g_l
\(g_K\): growth of capital g_k
\(g_Y\): growth of GDP g_y
\(\alpha\): capital share alpha

If you want to derive this formula: take the natural logarithm \(\ln\) of the production function, then use the fact that the change in the natural log of a variable is approximately equal to its growth rate \(g\).
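The derivation can be sketched in three steps, using the same symbols as the production function and treating \(\alpha\) as constant over time. Taking logs gives

\[\ln Y_t = \ln A_t + \alpha \ln K_t + (1-\alpha) \ln L_t\]

Subtracting the same equation for year \(t-1\) gives

\[\Delta \ln Y_t = \Delta \ln A_t + \alpha \Delta \ln K_t + (1-\alpha) \Delta \ln L_t\]

Using the approximation \(\Delta \ln X_t \approx g_X\) for each variable and solving for \(g_A\):

\[g_Y \approx g_A + \alpha g_K + (1-\alpha) g_L \quad \Rightarrow \quad g_A \approx g_Y - [\alpha g_K + (1-\alpha) g_L]\]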

Solution 6.4: TFP Growth

# 4. Calculate TFP Growth
ga_df['g_tfp'] = ga_df['g_y'] - (ga_df['alpha'] * ga_df['g_k'] + (1 - ga_df['alpha']) * ga_df['g_l'])
ga_df.head()

	iso3	Country	year	avh	ctfp	emp	hc	labsh	pl_con	pop	...	rnna	gdp_pc	k_labor	labor_prod	g_y	g_k	g_l	cap_share_annual	alpha	g_tfp
744	CHE	Switzerland	2000	1712.74	0.914894	3.944163	3.528448	0.654686	0.774927	7.184007	...	2352489.00	73677.780033	596448.279214	78.353252	NaN	NaN	NaN	0.345314	0.341327	NaN
745	CHE	Switzerland	2001	1672.85	0.933410	4.013566	3.538948	0.670151	0.778998	7.226391	...	2405543.00	74399.750512	599352.970383	80.076572	0.015757	0.022552	0.017597	0.329849	0.341327	-0.003532
746	CHE	Switzerland	2002	1651.26	0.960231	4.042648	3.549480	0.684949	0.837000	7.278752	...	2456137.25	73810.464005	607556.497216	80.481013	-0.000732	0.021032	0.007246	0.315051	0.341327	-0.012684
747	CHE	Switzerland	2003	1664.54	0.929821	4.029146	3.560044	0.677805	1.008451	7.333447	...	2503626.50	73236.228475	621378.990849	80.080526	-0.000324	0.019335	-0.003340	0.322195	0.341327	-0.004724
748	CHE	Switzerland	2004	1694.95	0.933693	4.041221	3.570638	0.666347	1.112374	7.384194	...	2557653.75	74697.946045	632891.311870	80.527136	0.027017	0.021580	0.002997	0.333653	0.341327	0.017677

5 rows × 21 columns

5. Summarize and Plot

First, we create a summary table:

Create a table summary with the mean of the growth variables and alpha ['g_y', 'g_k', 'g_l', 'alpha', 'g_tfp'] by country
The raw growth rates of Capital (g_k) and Labor (g_l) don’t tell the whole story. We must multiply them by their respective “shares” in the economy to see how much they actually contributed to total GDP growth:
- Create a variable contrib_k: The contribution of capital which is \(\alpha g_K\)
- Create a variable contrib_l: The contribution of labor which is \((1-\alpha)g_L\)

Solution 6.5: Summary Table

# 1. Average the components 
summary = ga_df.groupby('Country')[['g_y', 'g_k', 'g_l', 'alpha', 'g_tfp']].mean()

# 2. Calculate the final contributions
summary['contrib_k'] = summary['alpha'] * summary['g_k']
summary['contrib_l'] = (1 - summary['alpha']) * summary['g_l']

# 3. Show the summary table
summary = summary.reset_index()
summary

	Country	g_y	g_k	g_l	alpha	g_tfp	contrib_k	contrib_l
0	Switzerland	0.018271	0.020339	0.012318	0.341327	0.003215	0.006942	0.008113
1	United States	0.021028	0.019545	0.008397	0.400359	0.008168	0.007825	0.005035
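As a sanity check, the three components should add up to average GDP growth. The sketch below re-enters the rounded values from the summary table above into a separate DataFrame (summary_check, gap, and tfp_share are hypothetical names for illustration) and allows a small tolerance for rounding.

```python
import pandas as pd

# Values copied from the summary table above (rounded to six decimals)
summary_check = pd.DataFrame({
    "Country": ["Switzerland", "United States"],
    "g_y": [0.018271, 0.021028],
    "g_tfp": [0.003215, 0.008168],
    "contrib_k": [0.006942, 0.007825],
    "contrib_l": [0.008113, 0.005035],
})

# Sum of the three components per country
parts = summary_check[["g_tfp", "contrib_k", "contrib_l"]].sum(axis=1)

# Difference between GDP growth and the sum of its parts
summary_check["gap"] = summary_check["g_y"] - parts
adds_up = (summary_check["gap"].abs() < 5e-6).all()

# Share of average GDP growth attributed to TFP
summary_check["tfp_share"] = summary_check["g_tfp"] / summary_check["g_y"]
print(adds_up)
print(summary_check[["Country", "tfp_share"]])
```

In the exercise itself, the same check could be run directly on summary instead of retyping the numbers.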

Then we create a stacked barchart to visualize our results.

The code below creates a figure that decomposes average annual GDP growth into the specific contributions of capital accumulation, labor input, and Total Factor Productivity (TFP) to illustrate the underlying drivers of economic growth for each country.

We haven’t covered “stacked” charts in class. Try to deduce how they work by looking at the three plt.bar() calls.

# 1. Setup the data from the columns
countries = summary['Country']
tfp_part = summary['g_tfp']
k_part = summary['contrib_k']
l_part = summary['contrib_l']

# 2. Create the figure
plt.figure(figsize=(8, 4.5))

# Plot Layer 1: TFP (The base)
plt.bar(countries, tfp_part, label='TFP Growth')

# Plot Layer 2: Capital (Stacked on top of TFP)
plt.bar(countries, k_part, bottom=tfp_part, label='Capital Contribution')

# Plot Layer 3: Labor (Stacked on top of TFP + Capital)
plt.bar(countries, l_part, bottom=tfp_part + k_part, label='Labor Contribution')

# 3. Add styling
plt.ylabel("Average Annual Growth Rate")
plt.title("Growth Accounting Decomposition")
plt.legend()

plt.tight_layout()
plt.show()

How does the stacked barchart work?

We call plt.bar() three separate times. Each call tells Matplotlib to draw a set of bars on the same figure.

The first call draws the TFP bars.
The second call draws the Capital bars.
The third call draws the Labor bars.

The bottom Argument: This is the secret to “stacking.” By default, Matplotlib starts every bar at zero.

In the second call, we set bottom=tfp_part. This tells Python: “Don’t start the Capital bars at zero; start them at the height where the TFP bars ended.”
In the third call, we set bottom=tfp_part + k_part. This stacks the Labor bars on top of both the TFP and Capital layers combined.

Labels and Legend:

Inside each plt.bar() function, we provide a label (e.g., ‘TFP Growth’).
However, these labels won’t show up on the chart by themselves. We must call plt.legend() at the end. This command looks at all the labels we’ve defined and creates the “Key” or “Legend” in the corner of the chart so the reader knows which color represents which economic component.
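The bottom values can also be computed rather than typed by hand. As a sketch (layers and bottoms are hypothetical names, and the numbers are copied from the summary table), each layer starts at the cumulative sum of all layers before it. Note that this simple stacking reads cleanly only when all components are non-negative, as they are here.

```python
import pandas as pd

# Layer heights copied from the summary table, columns in stacking order
layers = pd.DataFrame(
    {
        "g_tfp": [0.003215, 0.008168],
        "contrib_k": [0.006942, 0.007825],
        "contrib_l": [0.008113, 0.005035],
    },
    index=["Switzerland", "United States"],
)

# Running total across the columns gives the top of each layer
tops = layers.cumsum(axis=1)

# The bottom of a layer is its top minus its own height
bottoms = tops - layers
print(bottoms)
```

The first column of bottoms is zero, the second equals the TFP layer, and the third equals TFP plus capital, matching the bottom arguments in the plotting code.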