Homework 2: MFIN 7037
In this assignment, we’ll learn the basics of building a strategy. You’ll need data from our Dropbox.
What to submit
1. You may work in groups of 3 (no working in groups of 4, sorry).
a. Due date: March 8, 2024, morning.
2. What to submit: A writeup and code. The writeup may be an .ipynb file with embedded output and commentary.
3. You must use code. Python, R, Julia are acceptable. NO VBA, Excel etc.
4. How to submit: Moodle
5. At the top, please write
· Each group member’s name
· Each group member’s HKU email
· Each group member’s HKU Student ID
· What percentage of the work each group member did
A diversity-based trading strategy?
Hong Kong is ostensibly facing a brain drain. While the overall skill of the population has increased over time, Hong Kong is, worryingly, becoming less international. International experience brings unique expertise and connections. Elsewhere in the developed world, immigrants and expatriates make up an increasing share of the workforce. Finally, given the global move toward ESG (environmental, social, governance), there is increasing demand for workers to behave in a socially responsible way, and one aspect of social / governance is increasing board and overall worker diversity (diversity, equity and inclusion).
A few people have looked at this. The Wall Street Journal, for instance, explored this idea by looking at the top industries for DEI and their relationship to performance. The article is not rigorous. In her job market paper, “Does the Market Mis-Value Non-Executive Employee Diversity?”, Shiyi Zhang explores the relationship between diversity and performance. I thought this was very interesting.
In this assignment, we are going to replicate Shiyi’s paper (roughly).
1. Argue theoretically why diversity could be good for performance.
2. Argue theoretically why diversity could be bad for firm performance.
3. What are the issues with the WSJ backtest? Does it have look-ahead bias? Verbalize the issue.
4. Join my dataset “hw2_linkedin_diversity_data_corrected.parquet”. The SQL queries I used to generate this data are on Confluence (alankwan.atlassian.net). The key point is that the data records the months in which each worker listed working at the company on LinkedIn, and I formed statistics from all workers at a company as of each month. Since time t is by convention the month of the return observation, we need data from before t so that we can trade by the end of t-1. Join the data with the monthly stock file (crsp.msf_delisting_adjusted.parquet) with a six-month gap; that is, you will need to shift the date in this dataset forward before matching it to the monthly stock file.
Clarifications: the data that I have is at the gvkey level, which is Compustat’s identifier. There is a link table called monthly_gvkey_permno_link.parquet that gives, for every month, the appropriate permno for a gvkey. This is a common practice. Alternatively, you can do a range join using crsp.ccmxpf_linktable where usedflag=1, making sure to treat a missing linkenddt as still current.
· Why is the six-month gap desirable?
· What are some reasons why we might want a shorter gap? What are some problems with the data-generating process from LinkedIn that might make a shorter gap more difficult to implement?
· Do we ever want the gap to be negative? That is, why do we not want data from t+N where N>0 and stock returns from time t?
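The lagged join described in item 4 can be sketched as follows. This is only a minimal illustration, assuming pandas; the function names are hypothetical, and the column names `date` and `ret` in the monthly stock file and `date` in the link table are assumptions (only gvkey, permno, and date_of_signal come from the documentation). Dates are mapped to an integer month count so that a month-end calendar date and a last-trading-day date in the same month still match.

```python
import pandas as pd

def month_index(dates):
    """Map dates to an integer month count (year * 12 + month)."""
    d = pd.to_datetime(dates)
    return d.dt.year * 12 + d.dt.month

def join_signal_to_returns(signal, link, msf, gap_months=6):
    """Match a signal dated month s to the return observed in month s + gap_months."""
    # If the signal file already carries a permno, drop it and rely on the link table.
    sig = signal.drop(columns=["permno"], errors="ignore").copy()
    # The month in which this signal is allowed to earn a return.
    sig["ret_month"] = month_index(sig["date_of_signal"]) + gap_months
    # Assumption: the link is evaluated in the return month (the permno we would trade).
    lk = link.assign(ret_month=month_index(link["date"]))[["gvkey", "ret_month", "permno"]]
    sig = sig.merge(lk, on=["gvkey", "ret_month"], how="inner")
    ret = msf.assign(ret_month=month_index(msf["date"]))
    return ret.merge(sig, on=["permno", "ret_month"], how="inner")

# Toy data (made up) to show the timing: a January signal meets the July return.
signal = pd.DataFrame(
    {"gvkey": ["001"], "date_of_signal": ["2020-01-31"], "n_employees": [100]}
)
link = pd.DataFrame(
    {"gvkey": ["001", "001"], "date": ["2020-06-30", "2020-07-31"], "permno": [10001, 10001]}
)
msf = pd.DataFrame(
    {"permno": [10001, 10001], "date": ["2020-06-30", "2020-07-31"], "ret": [0.01, 0.02]}
)
joined = join_signal_to_returns(signal, link, msf)
print(joined[["permno", "date", "ret", "date_of_signal", "n_employees"]])
```

Setting gap_months to a negative number would attach future information to past returns, which is exactly the look-ahead problem the last bullet asks about.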
Now, sort portfolios every month on the fraction of non-white (Black, Hispanic, Asian) employees out of all employees. Filter out all stocks with a t-1 price below $5 and any stock in the bottom twenty percent of market cap at t-1. Also require at least 10 employees. Sort into five portfolios (equal- and value-weighted). To make the strategy reasonably implementable, we’ll use the diversity data from t-6 to trade at time t. Now compute these statistics.
a. Explain the logic of the filters on price, number of employees, and market capitalization.
b. Barplot of returns across time
c. P&L curve of the long short portfolio
d. Factor loadings. What are the portfolio alphas and factor loadings? Based on the factor loadings, what type of strategy does this behave like?
f. Assess the sensitivity of the performance of the strategy to liquidity filters. Specifically, what happens if you raise the hurdle for liquidity filters?
It would also be nice to sharpen the signal by sorting into 10 portfolios, etc.
g. Revelio Labs was formed in 2015 and has collected the data since its inception. Its data source is LinkedIn, which was founded in the early 2000s. LinkedIn has nearly 1 billion profiles, some of which are fake, and the profiles are difficult to scrape on a regular basis due to things like privacy, sheer enormity / latency, and the fact that it is difficult to know the names of everyone on LinkedIn.
Explain what the issues with this back-test are in terms of implementation. There is no right or wrong answer, but do you believe the results?
h. Extra credit (1 point): Use the variation in the data in the cross-section or time series to come up with an argument for why you think this diversity premium exists. Is it due to diversity?
i. Extra credit (1 point): What are some things you can do to build confidence in the implementability of the strategy? Think it through: what would you have to do to convince yourself?
j. Extra credit (1 point): Economists emphasize causality. Oft-outspoken machine learning practitioner Marcos López de Prado calls for “causal” factor investing to discipline machine learning models. How can we test the causal impact of diversity on stock returns?
k. Extra credit (3 points): For giggles, let’s create a portfolio. Improve the strategy in some way. You can economically motivate a strategy or construct various factors and create a regularized regression. Do not worry about offending me with respect to this paper or gender or opinions about diversity and race (although please be reasonable and remain scientific). Here you may consider taking into account the alternative diversity metrics (breakdown of race/gender by different levels of education, for example).
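As a starting point for the sorting, long-short, and factor-loading steps in the list above, here is a minimal sketch, assuming pandas and numpy. It is not the method from Shiyi's paper. The function names and the columns `date`, `ret`, `signal`, `lag_price`, `lag_mktcap`, and the factor column `mkt` are all hypothetical, and the toy numbers are made up. It also assumes each month has more than one surviving stock, and that the size cutoff is taken across all stocks in the month.

```python
import numpy as np
import pandas as pd

def sort_portfolios(df, n_port=5, min_price=5.0, min_emp=10, size_pct=0.20):
    """Return (equal-weighted, value-weighted) portfolio returns, one row per month."""
    out = df.copy()
    # Size cutoff computed within each month from lagged market caps.
    size_cut = out.groupby("date")["lag_mktcap"].transform(lambda x: x.quantile(size_pct))
    keep = (
        (out["lag_price"].abs() >= min_price)  # abs(): CRSP marks bid/ask midpoints as negative
        & (out["lag_mktcap"] > size_cut)
        & (out["n_employees"] >= min_emp)
    )
    out = out[keep].dropna(subset=["signal", "ret"]).copy()
    # Rank first so that tied signals cannot create duplicate bin edges in qcut.
    out["port"] = out.groupby("date")["signal"].transform(
        lambda x: pd.qcut(x.rank(method="first"), n_port, labels=False) + 1
    ).astype(int)
    ew = out.groupby(["date", "port"])["ret"].mean().unstack("port")
    out["w_ret"] = out["ret"] * out["lag_mktcap"]
    sums = out.groupby(["date", "port"])[["w_ret", "lag_mktcap"]].sum()
    vw = (sums["w_ret"] / sums["lag_mktcap"]).unstack("port")
    return ew, vw

def factor_loadings(returns, factors):
    """Plain OLS of a return series on factor returns (rows must already be aligned)."""
    X = np.column_stack([np.ones(len(factors)), factors.to_numpy()])
    coef, *_ = np.linalg.lstsq(X, returns.to_numpy(), rcond=None)
    return pd.Series(coef, index=["alpha", *factors.columns])

# Toy month with 15 stocks: one fails the price filter, one the employee filter,
# three the size filter, leaving 10 stocks (2 per quintile).
i = np.arange(15)
toy = pd.DataFrame({
    "date": pd.Timestamp("2020-07-31"),
    "permno": 10000 + i,
    "ret": i / 100,
    "signal": i / 100,
    "lag_price": np.where(i == 3, 3.0, 20.0),
    "lag_mktcap": 100.0 * (i + 1),
    "n_employees": np.where(i == 4, 5, 50),
})
ew, vw = sort_portfolios(toy)
ls_ew = ew[5] - ew[1]           # long the top quintile, short the bottom quintile
pnl = (1 + ls_ew).cumprod()     # one common way to draw a P&L curve

# Toy factor regression: a series built as 0.002 + 0.5 * mkt recovers those values.
factors = pd.DataFrame({"mkt": [0.01, -0.02, 0.03, 0.00]})
fake_ls = 0.002 + 0.5 * factors["mkt"]
loadings = factor_loadings(fake_ls, factors)
```

Passing n_port=10 gives the decile version, and raising min_price or size_pct is one way to tighten the liquidity filters. A real factor regression would use actual factor returns (for example market, size, and value) in place of the made-up `mkt` column.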
Documentation
· permno: Security's unique identifier.
· gvkey: Company's unique identifier.
· date_of_signal: Date as of which LinkedIn profiles are counted. For example, if the date is Feb 2023, then I take all employees whose enddate is missing or after Feb 2023.
· n_employees: Total employees count.
· n_api_employees: Asian/Pacific Islander employees.
· n_hispanic_employees: Hispanic employees.
· n_black_employees: Black employees.
· n_female_employees: Female employees.
· n_api_female_employees: Asian/Pacific Islander female employees.
· n_hispanic_female_employees: Hispanic female employees.
· n_black_female_employees: Black female employees.
· n_male_employees: Male employees.
· n_api_male_employees: Asian/Pacific Islander male employees.
· n_hispanic_male_employees: Hispanic male employees.
· n_black_male_employees: Black male employees.
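The sorting variable can be built from the columns documented above. The following is a minimal sketch, assuming pandas, assuming the three race/ethnicity counts do not overlap, and assuming n_employees is the right denominator. The function name and the output columns frac_nonwhite and frac_female are hypothetical.

```python
import pandas as pd

def add_diversity_signal(df):
    """Add the non-white and female fractions of total headcount."""
    out = df.copy()
    # Treat firm-months with zero headcount as missing rather than dividing by zero.
    headcount = out["n_employees"].where(out["n_employees"] > 0)
    nonwhite = (
        out["n_api_employees"]
        + out["n_hispanic_employees"]
        + out["n_black_employees"]
    )
    out["frac_nonwhite"] = nonwhite / headcount
    out["frac_female"] = out["n_female_employees"] / headcount
    return out

# Made-up rows: a 100-person firm and a firm-month with no matched profiles.
toy = pd.DataFrame({
    "n_employees": [100, 0],
    "n_api_employees": [10, 0],
    "n_hispanic_employees": [5, 0],
    "n_black_employees": [5, 0],
    "n_female_employees": [40, 0],
})
scored = add_diversity_signal(toy)
print(scored[["frac_nonwhite", "frac_female"]])
```

The same pattern extends to the gender-by-race columns if you want alternative diversity metrics for the extra credit.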
Files:
· '/quant_trading_2024_public/hw2_linkedin_diversity_data_corrected.parquet' – this is the headcount based on all people on LinkedIn
· '../quant_trading_2024_public/hw2_linkedin_diversity_data_by_education_level.parquet' – perhaps useful for your extra credit