In this exercise, we will practice working with real-world, imperfect data. We will cover how to identify and handle missing observations, use pandas’ DatetimeIndex to manage time series, and apply aggregation and resampling techniques. Along the way, we will apply these skills to practical economic scenarios, such as analyzing competitive dynamics in the tech sector, and conclude by building our own reusable Python tools.
Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
Data
Download the dataset firm_data_ex06.csv and save it in a folder named data.
The data we use in this exercise is from Compustat North America Fundamentals Quarterly, a dataset of publicly listed North American companies’ quarterly financial statement fundamentals. It provides comparable items from the income statement, balance sheet, and cash flow statement, along with firm identifiers and reporting dates for time-series analysis. In this exercise we use only a small number of firms and variables. If you want to practice more, you have access to the full data through the university.
Variables in the Dataset
| Variable Name | Description | Context |
|---|---|---|
| ticker | Stock ticker symbol | A unique identifier for the firm on a stock exchange (e.g., AAPL, NVDA, F) |
| date | Observation date | The end date of the reporting period. |
| gvkey | Global Company Key | The unique permanent identifier for the company in the Compustat database. |
| comp_name | Company Name | The full legal name of the company. |
| fiscal_year | Fiscal Year | |
| fiscal_quarter | Fiscal Quarter | |
| assets | Total Assets | The total value of everything the firm owns; a common proxy for firm size. |
| net_income | Net Income | The profit or loss after all expenses, interest, and taxes have been paid. |
| ebitda | EBITDA | Earnings Before Interest, Taxes, Depreciation, and Amortization; measures core operating profit. |
| revenue | Total Revenue | The income from sales before any expenses are subtracted. |
| rd_expense | R&D Expense | Spending on Research and Development. |
| cap_ex_ytd | CapEx (YTD) | Capital Expenditures reported on a year-to-date basis. |
| sector_code | Sector Code | GICS sector classification code. |
Exercise 0: Initial Data Exploration
This exercise focuses on loading the dataset and performing a high-level inspection of the sample to understand which variables are in the dataset, and which firms and time periods are included in the analysis.
Your Tasks:
Data Loading and Inspection: Load the file data/firm_data_ex06.csv into a DataFrame named df and display the first few rows.
Firm Identification: Calculate the total number of unique tickers and print the full names of all companies in the sample.
Observation Count: Count how many data points are available for each company to check if the dataset is balanced (i.e., whether we have the exact same number of quarterly observations for every single firm).
Calendar Range: Identify the earliest and latest dates in the dataset.
Solution 0.1
# Task 1: Data Loading and Inspection
df = pd.read_csv("data/firm_data_ex06.csv")
df.head()
Solution 0.2
# Task 2: Find number of companies in the dataset
n_companies = df['ticker'].nunique()
print(f"Unique Firms in Sample: {n_companies}")

company_names = df['comp_name'].unique().tolist()
print("Full Company Names:")
print(company_names)
Solution 0.3
# Task 3: Count observations per firm
# .value_counts() tells us how many rows exist for each unique entry
print("\nObservations per company:")
print(df['comp_name'].value_counts())
Solution 0.4
# Task 4: Check the calendar range
# Note: Because the dates are currently loaded as strings, pandas finds the
# min/max by sorting them alphabetically. This only works correctly if
# the strings are perfectly formatted as YYYY-MM-DD!
print(f"\nData starts on: {df['date'].min()}")
print(f"Data ends on: {df['date'].max()}")
Exercise 1: Identifying and Handling Missing Data
In the lecture, we discussed how the 2025 federal shutdown caused a gap in official unemployment statistics. Corporate financial data is often much messier and contains structural missing values (NaNs) because different industries/firms follow different reporting standards. This exercise focuses on identifying these gaps and applying appropriate strategies to handle them.
Part A: Identifying Gaps
Before fixing data, you must understand where it is broken. We start by quantifying the missing values in our dataset.
Your Tasks
Overview: Use the .info() method to get a high-level summary of the dataset. Which columns appear to have the most missing values?
Quantification: Calculate the exact number of NaN entries for every variable.
Identify Affected Firms: Pinpoint exactly which companies have missing data for a specific variable (e.g., assets). Create a subset of the data containing only the rows where that variable is NaN, then extract a list of the unique company names.
Solution 1.A.1: Overview
df.info()
Solution 1.A.2: Quantification
# .isna() returns True for missing values; .sum() counts those Trues
missing_counts = df.isna().sum()
missing_counts
Solution 1.A.3: Identify Firms with Missing Values
var_name = 'assets'  # Change the variable here to inspect a different column
missing_rows = df[df[var_name].isna()]
firms_nan = missing_rows['comp_name'].unique().tolist()
print(firms_nan)
Part B: Strategies for Handling Missing Data
Now that we know where the gaps are, we can think about strategies to handle them. There are three strategies we consider:
Dropping: Removing incomplete rows.
Filling: Replacing NaN with a constant (like 0 for R&D).
Interpolation: Estimating values by “drawing a line” between known points.
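As a toy illustration (the numbers here are invented and not from the dataset), the three strategies applied to one small Series look like this:

```python
import pandas as pd

s = pd.Series([100.0, None, 140.0, None, 180.0])

dropped = s.dropna()                     # strategy 1: remove incomplete entries
filled = s.fillna(0)                     # strategy 2: replace NaN with a constant
interp = s.interpolate(method='linear')  # strategy 3: "draw a line" between known points

print(dropped.tolist())  # [100.0, 140.0, 180.0]
print(filled.tolist())   # [100.0, 0.0, 140.0, 0.0, 180.0]
print(interp.tolist())   # [100.0, 120.0, 140.0, 160.0, 180.0]
```

Note how each strategy gives a different series: dropping shortens it, filling inserts zeros, and interpolation invents values halfway between the known neighbors.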
Your Tasks
The Aggressive Approach: Create a new DataFrame called df_dropped by dropping all rows where there are ANY missing values. How many observations did you lose? How can you drop observations in a less aggressive way?
Solution 1.B.1: The Aggressive Approach
# Use .dropna() to drop all rows where there are ANY missing values
df_dropped = df.dropna()

# Find the number of observations with len(), compare df and df_dropped:
print(f"Original observations: {len(df)}")
print(f"Observations after dropna: {len(df_dropped)}")
print(f"Rows lost: {len(df) - len(df_dropped)}")
A less aggressive approach: only drop a row if a SPECIFIC crucial variable is missing. For example, if we cannot do our analysis when ‘assets’ is missing but do not care whether ‘rd_expense’ is, we can use the subset argument:
df_dropped_subset = df.dropna(subset=['assets'])
Economic Logic (Filling): For many firms, a missing value in rd_expense means they simply spent $0 on research that quarter. Create a new column called rd_filled where all NaN values in rd_expense are replaced with 0. Check whether it worked.
Solution 1.B.2: Economic Logic (Filling)
# We assume no report equals no spending for this variable, use .fillna()
df['rd_filled'] = df['rd_expense'].fillna(0)

# Did it work? Filter for observations where rd_expense is missing and check whether rd_filled is 0
df[df['rd_expense'].isna()][['date', 'rd_expense', 'rd_filled']].head()
The “Naive” Interpolation Trap: For missing values of ‘assets’ it might be sensible to interpolate linearly. Run the following code to create an interpolated column 'assets_interp':
df['assets_interp'] = df['assets'].interpolate(method='linear')
Show the first 10 rows of the data for Tesla (ticker: ‘TSLA’) for the variables ['date', 'comp_name', 'assets', 'assets_interp'].
Explain why applying interpolate() to the entire DataFrame at once might lead to incorrect data points.
Solution 1.B.3: Naive Interpolation
# Create assets_interp according to the suggestion:
df['assets_interp'] = df['assets'].interpolate(method='linear')

# Select the data for Tesla, select columns of interest, show first 10 rows
df[df['ticker'] == 'TSLA'][['date', 'comp_name', 'assets', 'assets_interp']].head(10)
The first time Tesla reports assets is 2008 Q4; the interpolated values before that come from another company that appears before Tesla in the data.
A cautionary tale: Applying interpolate() to the dataframe like this was a solution suggested by AI. It runs without an error, but it is not correct. AI is very helpful for programming, but it is still crucial to have a good understanding of the code and to come up with ways to check whether it is actually correct. Just because something runs without an error doesn’t mean it’s doing what was intended.
The (almost) Correct Way: Use the following “Group-Transform” command to interpolate correctly.
Group the data by firms (use ticker, gvkey or any firm identifier)
Pick the column we want to change
Use .transform()
Inside .transform() we use a lambda function to tell pandas what to do: lambda x: x.interpolate(method='linear'). Read this as: “For each group (x), calculate the linear interpolation and broadcast it back so it fits our original table.”
Solution 1.B.4: The (almost) Correct Way
df['assets_interp'] = (
    df.groupby('ticker')['assets']          # group by firm, pick column
    .transform(                             # apply .transform
        lambda x:                           # transform every group x
        x.interpolate(method='linear')      # by interpolating linearly
    )
)

# Check for Tesla if it worked:
df[df['ticker'] == 'TSLA'][['date', 'comp_name', 'assets', 'assets_interp']].head(10)
There is one potential issue still with this, can you catch it?
The potential issue and how to fix it
The dates in this dataframe happen to be sorted already. If they weren’t, this code would run without an error but might draw a linear trend between two dates that don’t follow each other. For interpolation to work correctly, we should sort the dataframe by ‘date’ (or by firm and date, e.g. by ticker and date).
# First step: Sort dataframe by firms and date!
df = df.sort_values(['ticker', 'date'])

# Then you can interpolate
df['assets_interp'] = df.groupby('ticker')['assets'].transform(lambda x: x.interpolate(method='linear'))
To Interpolate or Not? Copy and run the following function in your notebook. It will allow you to quickly visualize the difference between the original data (with gaps) and our interpolated version.
Run plot_firm_assets('TSLA'). Looking at the plot, argue whether interpolation is a sensible choice for filling the gaps in Tesla’s assets.
Run plot_firm_assets('GM'). Looking at the plot, argue again whether interpolation is a sensible choice for filling the gaps.
def plot_firm_assets(ticker_name):
    """
    Plots original vs. interpolated assets for a specific ticker.
    """
    # 1. Filter for the specific firm
    firm_data = df[df['ticker'] == ticker_name]

    # 2. Create the plot
    plt.figure(figsize=(8, 5))

    # Plot interpolated data (continuous line)
    plt.plot(firm_data['date'], firm_data['assets_interp'],
             label='Interpolated Assets', color='orange',
             linestyle='-', marker='o', alpha=0.8)

    # Plot original data (points with gaps)
    plt.plot(firm_data['date'], firm_data['assets'],
             label='Original Assets', color='blue', marker='o', markersize=4)

    # 3. Formatting
    plt.title(f"{ticker_name}: Original vs. Interpolated Assets")
    plt.xlabel("Date")
    plt.ylabel("Total Assets (Millions)")
    plt.legend()
    plt.xticks(rotation=45)
    plt.gca().xaxis.set_major_locator(plt.MaxNLocator(10))
    plt.grid(True, linestyle=':', alpha=0.6)
    plt.tight_layout()
    plt.show()
Example Tesla
plot_firm_assets('TSLA')
Tesla’s assets show a relatively smooth trajectory, so linear interpolation seems reasonable. The growth trajectory could be exponential, which we could account for by using a different method in interpolate(). But the orange interpolated segments are short, meaning we are only “guessing” for a few quarters at a time. The closer the known data points are to each other, the more accurate linear interpolation tends to be.
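If we suspected exponential growth, one option (a sketch with invented numbers, not Tesla’s actual assets) is to interpolate in logs and transform back. For a series that doubles between observations, linear interpolation gives the arithmetic midpoint while log-linear interpolation gives the geometric one:

```python
import numpy as np
import pandas as pd

# Invented series: the firm quadruples between the two known observations
assets = pd.Series([100.0, None, 400.0])

# Naive linear fill: midpoint between 100 and 400
linear = assets.interpolate(method='linear')

# Log-linear fill: interpolate log(assets), then exponentiate back
log_linear = np.exp(np.log(assets).interpolate(method='linear'))

print(linear.iloc[1])      # 250.0 (arithmetic mean)
print(log_linear.iloc[1])  # 200.0 (geometric mean, consistent with constant growth)
```

For a firm growing at a roughly constant rate, the log-linear guess (200) respects that growth process better than the straight-line guess (250).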
Example General Motors
plot_firm_assets('GM')
In GM’s case linear interpolation may not be reasonable. The “missing/odd” asset value in 2009 Q2 lines up with its bankruptcy filing (June 1, 2009) and the rapid sale of most operating assets into a newly created “New GM” (completed July 10, 2009). In other words, this isn’t a normal, smooth evolution of the same balance sheet—it’s a break in the reporting entity and accounting perimeter (assets and liabilities were split between “Old GM”/liquidation and the reorganized company). Because linear interpolation assumes the underlying process is continuous and comparable quarter to quarter, it’s not economically meaningful here: the “true” path wasn’t a gradual transition, it was a discrete restructuring event.
A Note on how to handle Missing Data
Handling missing data is often presented as a “pre-processing” step, something you do once at the beginning to get a “clean” dataset. In reality, the best way to handle a gap depends on your specific research question.
Filling or Interpolating is useful when you need a continuous series for visualizations or for metrics where a missing value has a clear economic meaning (e.g., a missing R&D field often means $0 spending).
Leaving it as NaN is often the most “honest” approach. In many statistical models, it is better to lose an observation than to feed the model a “guessed” or “fabricated” number that might bias your results.
Rule of Thumb: Don’t clean just for the sake of having a full table. Clean when you understand the economic logic behind the gap and when you understand that for your particular question a specific method like dropping missing values, filling or interpolating is needed and makes sense.
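A tiny invented example of why a “fabricated” fill can bias results: pandas skips NaN when computing statistics, but a filled-in zero enters the calculation as if it were real data.

```python
import pandas as pd

rev = pd.Series([100.0, None, 120.0, 110.0])

print(rev.mean())            # 110.0 -- NaN is skipped; mean of the 3 real values
print(rev.fillna(0).mean())  # 82.5  -- the invented zero drags the mean down
```

Here leaving the gap as NaN gives the honest average of what we actually observed, while filling with 0 silently changes the answer.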
Exercise 2: DateTime Index in Panel Data
Your Tasks
Datetime Conversion: Check the data type of ‘date’ with .dtype, then convert the date column into a datetime64 object. Verify the change by printing the column’s data type.
Setting the Index: Set this newly converted date column as the DataFrame’s index to create a DatetimeIndex. Use .sort_index() to sort the dataframe by its new index, what do you observe?
Slicing Historical Windows: Use the .loc accessor to create a subset called df_gfc that contains all firms but only for the years 2008 through 2010 (the Great Financial Crisis period). Make sure the index is sorted. (Hint: You can test whether the index is sorted using df.index.is_monotonic_increasing)
Solution 2.1: Datetime Conversion
# 1. Convert to datetime
print(df['date'].dtype)
df['date'] = pd.to_datetime(df['date'])
print(f"New Data Type: {df['date'].dtype}")
Solution 2.2: Setting the Index
# 2. Set as index
df = df.set_index('date')
print(f"Index Type: {type(df.index)}")

# Display the first 10 rows of the dataframe sorted by its index
df.sort_index().head(10)
In the lecture we have seen unemployment rates where we had one observation for every ‘date’. Here we are working with panel data. When you set a DatetimeIndex on panel data the index is not unique. For every ‘date’ we have the observations of multiple firms that will each have a row for this date.
The ‘date’ column ('datadate' in compustat) is the end of the fiscal period (quarter) for that observation. It is common that the fiscal period aligns with the calendar period, i.e. the first quarter is January-March, and the fiscal year ends in December. But this is not the case for all firms, there are many exemptions.
For Nvidia, the 2004 fiscal year ends on 2005-01-31. That’s why the observation for 2005-01-31 belongs to the ‘fiscal_year’ 2004 and ‘fiscal_quarter’ 4.
Apple’s fiscal year ends in September, so its 2005 fiscal year already starts in September 2004, which is why the date ‘2005-03-31’ corresponds to the second quarter of fiscal year 2005.
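The non-unique index is easy to see in a miniature invented panel: selecting one date with .loc returns one row per firm, not a single row.

```python
import pandas as pd

# Invented mini-panel: two firms observed at the same two quarter-ends
mini = pd.DataFrame({
    'date': pd.to_datetime(['2005-03-31', '2005-03-31', '2005-06-30', '2005-06-30']),
    'ticker': ['AAPL', 'MSFT', 'AAPL', 'MSFT'],
    'revenue': [1.0, 2.0, 3.0, 4.0],
}).set_index('date')

print(mini.index.is_unique)    # False -- several firms share each date
print(mini.loc['2005-03-31'])  # returns BOTH rows for that quarter-end
```

This is exactly the panel-data situation described above: the date alone is not enough to identify an observation; you also need the firm.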
Solution 2.3: Slicing Historical Windows
# First sort the index
print(f"Is the index sorted? {df.index.is_monotonic_increasing}")
df = df.sort_index()
print(f"Is the index sorted? {df.index.is_monotonic_increasing}")

# Slice the Great Financial Crisis window
# Note: DatetimeIndex allows us to slice by year strings directly
df_gfc = df.loc["2008":"2010"]
df_gfc.head()
Why sorting is necessary: When you ask pandas for a range (e.g., .loc["2008":"2010"]), pandas needs to find the “start” point and the “end” point. If the index is not sorted (for example, if your dataframe is sorted by firms), this will raise an error. You can check whether your index is sorted with df.index.is_monotonic_increasing, which returns True if it is and False otherwise.
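A minimal sketch of this behavior, using a small invented series with deliberately unsorted dates:

```python
import pandas as pd

# Deliberately unsorted DatetimeIndex
idx = pd.to_datetime(['2010-03-31', '2008-03-31', '2009-03-31'])
s = pd.Series([3.0, 1.0, 2.0], index=idx)

print(s.index.is_monotonic_increasing)  # False

raised = False
try:
    s.loc['2008':'2009']                # partial-string slicing needs a sorted index
except KeyError as e:
    raised = True
    print("Slicing failed:", e)

s = s.sort_index()
print(s.loc['2008':'2009'].tolist())    # now works: [1.0, 2.0]
```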
Exercise 3: Aggregation and Resampling
In this exercise, we will “downsample” our data from a quarterly frequency to an annual frequency. Because we are working with panel data, we must always remember to group by our firm identifier (ticker or gvkey) before resampling; otherwise, pandas will aggregate all companies together (which can be exactly what we want, but not in this exercise).
Your Tasks
Annual Revenue: Calculate the total annual revenue for each firm. (Use groupby('ticker') before you resample, and choose an appropriate aggregation method.)
Advanced Aggregation: Create an annual dataframe df_annual with revenue, net_income and assets. You can do so by resampling, using .agg() and specifying in a dictionary which aggregation method you want to use for each of the variables.
Handling the MultiIndex: You will notice that df_annual has a “MultiIndex” (rows are nested by Ticker and then Date).
Selection: Use .loc to select all annual data for Apple (AAPL) from your new df_annual DataFrame.
Full Reset: Use .reset_index() to turn the entire MultiIndex back into regular columns.
Partial Reset: Start again from the MultiIndexed df_annual and use .reset_index(level=...) to move only the ticker back to a column while keeping the date as the index.
Solution 3.1: Annual Revenue
# 1. Total Annual Revenue
# We use 'YE' for Year-End to resample
# Aggregation method: .sum()
annual_rev = df.groupby('ticker')['revenue'].resample('YE').sum()
annual_rev.head()
Solution 3.2: Advanced Aggregation
# Define which math to apply to which column
df_annual = df.groupby('ticker').resample('YE').agg({
    'revenue': 'sum',
    'net_income': 'sum',
    'assets': 'last'
})
df_annual.head()
Why these methods?
In our example, we used .sum() for revenue and net_income because these are “Flow” variables—they represent the total amount of money earned or spent throughout the four quarters of the year. Conversely, we used .last() for assets because it is a “Stock” variable. A company’s assets are a snapshot of what they own at a specific moment; summing the assets from March, June, September, and December would double-count the same buildings and cash, leading to a nonsensical result.
Solution 3.3: Handling the MultiIndex
# Subtask A: Selection
# On a MultiIndex (Ticker, Date), .loc picks the first level by default
aapl_annual = df_annual.loc['AAPL']
print("Apple Annual Data:")
aapl_annual.head()
# Subtask B: Full Reset
# This turns everything into a 'flat' table, ideal for plotting or exporting to CSV
df_flat = df_annual.reset_index()
df_flat.head()
# Subtask C: Partial Reset
# This moves the 'ticker' to a column but keeps the 'date' as the index
df_partial_reset = df_annual.reset_index(level='ticker')
df_partial_reset.head()
In economics and finance, we often work with panel data, where we track multiple entities (firms, countries, or individuals) over several time periods. This is mathematically represented as \(y_{it}\), where \(y\) is a variable like revenues, \(i\) denotes the specific entity (e.g., Apple) and \(t\) represents the time (e.g., Q1 2024). A MultiIndex is the way Python’s pandas library handles this two-dimensional relationship within a single table. By using both the entity and the date as the index, the DataFrame effectively “stacks” these dimensions, ensuring that every row is uniquely identified by its “Who” and its “When.”
A MultiIndex is sometimes very useful and sometimes very annoying. You can easily go back and forth between the two forms, using .set_index() with a list (e.g. ['ticker', 'date']) to create a MultiIndex and .reset_index() to flatten it again, depending on what’s better suited for your analysis.
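The round trip in one short sketch (invented rows): build the (who, when) MultiIndex, select by the first level, then flatten again.

```python
import pandas as pd

flat = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'MSFT'],
    'date': pd.to_datetime(['2024-03-31', '2024-06-30', '2024-03-31']),
    'revenue': [90.0, 85.0, 61.0],
})

panel = flat.set_index(['ticker', 'date'])  # "stack" the Who and the When
print(panel.loc['AAPL'])                    # first level selects one firm

back = panel.reset_index()                  # and back to a flat table
print(back.columns.tolist())                # ['ticker', 'date', 'revenue']
```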
Exercise 4: Analyzing Competitive Dynamics in the Tech Sector
In any competitive industry, market leadership is rarely permanent. Structural breaks—specific moments in time where a new technology or market shift fundamentally changes the landscape—can shift things rapidly, altering the growth trajectories of even the most established firms. In our sample data, we have the following technology-focused firms:
NVIDIA CORP: A leader in GPUs and AI hardware. (ticker = ‘NVDA’)
APPLE INC: Consumer electronics and software. (ticker = ‘AAPL’)
INTEL CORP: Semiconductor manufacturing and design. (ticker = ‘INTC’)
MICROSOFT CORP: Software, cloud services, and hardware. (ticker = ‘MSFT’)
We will explore how these firms performed with the emergence of Large Language Models (LLMs) by calculating growth rates and visualizing the potential decoupling of market leaders.
Your Tasks:
Revenue Growth Analysis: Create a new column called rev_growth that represents the year-over-year percentage growth of revenue for each firm.
Baseline Visualization: As an example we’ll compare NVIDIA and Intel. Create two temporary DataFrames, nvda and intc, with the data for the respective company. Then generate a baseline plot comparing their revenue growth rates, with a title, labels, and a basic grid. You can do so by calling plt.plot() twice; use the label= argument and add plt.legend() before you show the figure.
Refining the Aesthetics: Improve the professional look of your chart.
Add a horizontal line at y=0 using plt.axhline().
Make the grid less dominant (e.g., using alpha=0.3).
Change the colors and linestyles to distinguish the firms clearly.
Change the title’s fontsize and set its location (loc) to the left.
Adding Context: Highlight the “AI Era” to tell a clearer story by using plt.axvspan() to shade the period from November 2022 (the release of ChatGPT) to the end of the sample. Remember that your x-axis is datetime, so your start and end of the span should be datetime as well.
Solution 4.1: Revenue Growth Analysis
# To calculate YoY growth correctly in a panel, we must group by ticker.
# This prevents 'leaking' data between different firms.
# A lag of 4 is used for quarterly data to compare the same quarter across years.
df['rev_growth'] = df.groupby('ticker')['revenue'].pct_change(4) * 100

# Inspecting the result
df[df['ticker'] == 'NVDA'].head()
Solution 4.2: Baseline Visualization
# 1. Filter the data
nvda = df[df['ticker'] == 'NVDA']
intc = df[df['ticker'] == 'INTC']

# 2. Set up the figure
plt.figure(figsize=(9, 4.5))

# 3. Plot the growth rates
plt.plot(nvda.index, nvda['rev_growth'], label='NVIDIA (AI Focus)')
plt.plot(intc.index, intc['rev_growth'], label='Intel (PC/Server Focus)')

# 4. Basic styling
plt.title("Tech Sector Dynamics: Revenue Growth in the AI Era")
plt.ylabel("YoY Revenue Growth (%)")
plt.legend()
plt.grid()
plt.show()
It’s always best to start with a simple baseline plot to see whether the figure is even interesting. Then you can think about how to improve it visually. Here, for example, we could:
Reduce Visual Noise: The current grid is very dominant; lightening it ensures the focus remains on the data lines.
Establish a Baseline: Adding a line at zero would help the reader immediately see when a company is shrinking (negative growth) versus just growing slowly.
Visual Hierarchy: Moving the title to the left and increasing its size makes it feel less like a default output and more like a professional publication.
Distinct Styles: Using specific colors and different line textures (like dashed lines) helps differentiate the firms even if the chart is printed in black and white.
Solution 4.3: Refining the Aesthetics
```python
nvda = df[df['ticker'] == 'NVDA']
intc = df[df['ticker'] == 'INTC']

plt.figure(figsize=(9, 4.5))

# New: change color, linestyle, linewidth
plt.plot(nvda.index, nvda['rev_growth'], label='NVIDIA (AI Focus)', color='#76b900', lw=2)
plt.plot(intc.index, intc['rev_growth'], label='Intel (PC/Server Focus)', color='#0071c5', lw=2, linestyle='--')

# New: change location and fontsize of title
plt.title("Tech Sector Dynamics: Revenue Growth in the AI Era", loc='left', fontsize=14)
plt.ylabel("YoY Revenue Growth (%)")
plt.legend()

# New: add baseline
plt.axhline(0, color='black', lw=1, alpha=0.5)

# New: make grid more subtle
plt.grid(alpha=0.3)
plt.show()
```
The function plt.axvspan(xmin, xmax, ...) creates a vertical rectangle (span) across the axes. You should label it and you can change the appearance with additional arguments, most common are color and alpha.
Note that we work with a DatetimeIndex, so the x-axis of our figure is composed of internal datetime64 values. That’s why we need to convert the string ‘2022-11-01’ to datetime first.
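As a minimal sketch of this step (using a small hand-made stand-in series rather than the exercise dataset), the shading can be added like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data with a DatetimeIndex (not the Compustat data)
idx = pd.to_datetime(['2021-12-31', '2022-06-30', '2022-12-31', '2023-06-30', '2023-12-31'])
growth = pd.Series([10, 15, 40, 120, 200], index=idx)

plt.figure(figsize=(9, 4.5))
plt.plot(growth.index, growth, label='NVIDIA (AI Focus)', color='#76b900', lw=2)

# Convert the string to datetime so it matches the datetime64 x-axis,
# then shade from the release of ChatGPT to the end of the sample
ai_start = pd.to_datetime('2022-11-01')
plt.axvspan(ai_start, growth.index.max(), color='gray', alpha=0.15, label='AI Era')

plt.legend()
plt.show()
```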
Exercise 5: From Scripts to Reusable Tools
In Exercise 4, we created a visualization for the comparison of two firms to illustrate a competitive shift. We might want to analyze different firms in the same way; instead of copying and pasting the code every time, it is better to wrap our logic into a function. This makes our workflow more efficient, readable, and less prone to errors.
Your Tasks:
Create a Reusable Plotting Function: Write a function called compare_growth(df, ticker_1, ticker_2) that takes the dataframe as well as two stock tickers as arguments and plots the figure created in Exercise 4 for any company.
Hints for Task 1
If you are stuck on how to turn your script into a function, follow these steps:
Define the Function: Start with def compare_growth(df, ticker_1, ticker_2):. Remember that everything inside the function must be indented.
Copy and Paste: Move your successful plotting code from Exercise 4 into the body of the function.
Replace Hard-Coded Values: Look for where you specifically wrote 'NVDA' or 'INTC' in your filters and replace them with the arguments ticker_1 and ticker_2.
Automate with f-strings: Use f-strings in your title and labels to make them dynamic. For example: plt.title(f"Growth Comparison: {ticker_1} vs {ticker_2}", ...) or label=f"{ticker_1}".
Call the Function: Don’t forget to actually run the function at the end to test it.
Discussion: Enhancement and Robustness: Reflect on your function and answer the following:
Feature Expansion: What are additional features you could add to this function?
Robustness: How could you make this function “robust” to errors?
Solution 5.1: Create a Reusable Plotting Function
```python
def compare_growth(df, ticker_1, ticker_2):
    # Filter the data
    firm_a = df[df['ticker'] == ticker_1]
    firm_b = df[df['ticker'] == ticker_2]

    # Setup the figure
    plt.figure(figsize=(9, 4.5))

    # Plot lines with the styling from Ex 4
    plt.plot(firm_a.index, firm_a['rev_growth'], label=f'{ticker_1}', color='#76b900', lw=2)
    plt.plot(firm_b.index, firm_b['rev_growth'], label=f'{ticker_2}', color='#0071c5', lw=2, linestyle='--')

    # Add AI Era shading (using pd.to_datetime for axis compatibility)
    ai_start = pd.to_datetime('2022-11-01')
    plt.axvspan(ai_start, df.index.max(), color='gray', alpha=0.15, label='AI Era')

    # Professional styling
    plt.axhline(0, color='black', lw=1, alpha=0.5)
    plt.grid(axis='y', alpha=0.3)
    plt.title(f"Tech Sector Dynamics: {ticker_1} vs {ticker_2}", loc='left', fontsize=14)
    plt.ylabel("YoY Revenue Growth (%)")
    plt.legend()
    plt.show()
```
1. Filter the data using the input provided by the user, ticker_1 / ticker_2.
2. The labels should adjust automatically depending on the user input; use an f-string.
3. The title should also adjust automatically depending on the user input; use an f-string.
```python
# Test the function with different tickers, e.g. Apple and Google
compare_growth(df, 'AAPL', 'GOOGL')
```
Solution 5.2: Discussion
(i) Feature Expansion
Metric Selection: Add an argument to choose between net_income, revenue, or profit margins when comparing the companies.
Flexible Shading: Add a boolean argument (show_ai_era=True) to toggle the shading on or off.
Benchmarking: Add a line representing the industry average to see if a firm is “beating the market”.
Company Name Mapping: Use a dictionary (e.g., names = {'NVDA': 'NVIDIA Corp', ...}) so the legend/title displays the full name instead of the ticker symbol.
…
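As a rough sketch of the first two ideas (the extra parameters and their defaults are illustrative choices, not part of the exercise solution):

```python
import pandas as pd
import matplotlib.pyplot as plt

def compare_growth(df, ticker_1, ticker_2, metric='rev_growth', show_ai_era=True):
    """Compare two firms on any column, with optional AI-era shading."""
    firm_a = df[df['ticker'] == ticker_1]
    firm_b = df[df['ticker'] == ticker_2]

    plt.figure(figsize=(9, 4.5))
    plt.plot(firm_a.index, firm_a[metric], label=ticker_1, lw=2)
    plt.plot(firm_b.index, firm_b[metric], label=ticker_2, lw=2, linestyle='--')

    # The boolean flag toggles the shading on or off
    if show_ai_era:
        plt.axvspan(pd.to_datetime('2022-11-01'), df.index.max(),
                    color='gray', alpha=0.15, label='AI Era')

    plt.title(f"{ticker_1} vs {ticker_2}: {metric}", loc='left', fontsize=14)
    plt.legend()
    plt.show()
```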
(ii) Robustness
Ticker Verification: Use an if statement to check that the input tickers exist in your dataset. If one doesn’t, print a helpful message and exit.
Frequency Verification: Ensure the data is actually quarterly. If there is a “hole” in the time series (e.g., a missing year of filings), pct_change(4) will return incorrect results.
Outlier Warning: Financial data often contains “fat tails” or one-time accounting adjustments (like a massive legal settlement). You could add a check that warns the user if a growth rate exceeds a certain threshold (e.g., >500%), but plots the figure anyway.
Overlap Check: Verify that the two firms have overlapping date ranges so the comparison is actually possible on a single timeline.
…
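The first and third checks could be sketched as a small validation helper (the function name `validate_inputs` and the return-a-bool design are hypothetical choices; you would call it at the top of `compare_growth` and return early on False):

```python
import pandas as pd

def validate_inputs(df, ticker_1, ticker_2, threshold=500):
    """Hypothetical helper: check tickers exist and flag extreme growth rates.

    Returns True if the comparison can proceed, False otherwise.
    """
    available = set(df['ticker'].unique())
    for t in (ticker_1, ticker_2):
        if t not in available:
            # Helpful message instead of a cryptic empty plot
            print(f"Ticker '{t}' not found. Available: {sorted(available)}")
            return False

    # Warn about implausible growth (one-off events or holes in the series),
    # but still let the figure be drawn
    subset = df[df['ticker'].isin([ticker_1, ticker_2])]
    extreme = subset[subset['rev_growth'].abs() > threshold]
    if not extreme.empty:
        print(f"Warning: {len(extreme)} observations exceed ±{threshold}% growth.")
    return True
```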
The figure could also be improved visually: for example, the legend moves around depending on the data. We could fix its location using loc=, but we have to choose an option that works irrespective of the data.
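One data-independent option is to pin the legend outside the plotting area (a small sketch; the bbox_to_anchor coordinates are one possible choice):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(9, 4.5))
ax.plot([0, 1], [0, 1], label='NVDA')
ax.plot([0, 1], [1, 0], label='INTC')

# Anchor the legend just outside the upper-right corner of the axes,
# so it never overlaps the lines regardless of where the data runs
ax.legend(loc='upper left', bbox_to_anchor=(1.02, 1), borderaxespad=0)
fig.tight_layout()
```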