HerdSecuritiesModel
Overview
This is a fun RNN model I was developing while I was working at Amazon. My ML team was doing a team-building project: developing a model to forecast equity options that move in response to (or are at least forecastable from) social media chatter. The project was a response to GameStonk, which was heavily driven by social contagion. I wrote the following as a template for a simple baseline model that we could use to check whether deeper development was warranted.
RNN Input Feature Matrix
My hypothesis here is that the observations are inherently sequential: no useful information can propagate backward through time. So we need not use a CNN or another approach that takes in values at different times simultaneously.
For each time T, the RNN (LSTM, GRU, etc) takes an input matrix that looks like the following.
Conditions are things like price, volume, T90D price, whatever.
While the RNN has time flowing strictly in one direction, the rows of scalars for market conditions or underlying, call, or put conditions could contain historical values at different times (T7D, T30D, T90D, T12M). So you could embed representations of multiple times that are intended to be processed in parallel within the temporal frames that are processed sequentially.
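As a rough sketch of that idea, here is one way a single "underlying" row could carry the latest price alongside a few trailing prices. The function name `underlying_row`, the daily-close layout (oldest first), and the specific lags are assumptions for illustration, not something the project has settled on.

```python
def underlying_row(closes, lags=(1, 7, 30)):
    """Build one feature row: the latest close, then the close `lag` days earlier.

    `closes` is a list of daily closing prices, oldest first (an assumed layout).
    With daily data, lag 1 stands in for T24H, lag 7 for T7D, lag 30 for T30D.
    """
    if len(closes) <= max(lags):
        raise ValueError("not enough history for the largest lag")
    latest = closes[-1]
    return [latest] + [closes[-1 - lag] for lag in lags]


# Made-up prices for illustration: 100.0, 101.0, ..., 139.0
closes = [float(100 + day) for day in range(40)]
row = underlying_row(closes)
```

The RNN still sees one frame per time step; the trailing values simply ride along inside that frame.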
Side Note: Nothing makes full rows special. As long as we're consistent, we could have 11 Market observations and 15 underlying observations, for example. Presenting it as a matrix is just a convenient visual representation.
Other Side Note: RNNs typically have vectors, not matrices, of input features. That's fine - just unroll the matrix into a big vector.
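A minimal sketch of the unrolling, which also shows that rows of different lengths are no problem. The row contents and the helper name `unroll` are made up for illustration.

```python
from itertools import chain


def unroll(rows):
    """Concatenate feature rows (possibly of different lengths) into one flat vector."""
    return list(chain.from_iterable(rows))


# Hypothetical frame for a single time step; all values are invented.
frame = [
    [0.1, -0.3, 0.7],           # Redditor 1 sentiment embedding
    [0.0, 0.2, -0.5],           # Redditor 2 sentiment embedding
    [2021.0, 1.0, 28.0, 3.0],   # market conditions (year, month, day, day of week)
    [325.0, 148.0],             # underlying conditions (price, T24H price)
]
vector = unroll(frame)
```

The only requirement is that every time step uses the same row order and row lengths, so each position in the vector always means the same thing.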
How Do We Build It
For each time T1, grab the comments from each of the top M Redditors, from T0 to T1, that are related to any Security S. For each Redditor, pass their set of comments since T0 through an NLP model that generates an embedding of their sentiment toward the equity. That embedding is the row Rm1 through Rm8 (or 16, or 32, or whatever, but think about the memory and CPU cost).
For the market, underlying, and option stats, just gather them from the data source, transform them, and stick them in.
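Roughly, assembling one frame could look like the sketch below. Everything here is an assumption for illustration: `embed_sentiment` is a toy placeholder for the real NLP model, comments are taken to be `(timestamp, author, text)` tuples already filtered to the security of interest, and the window is treated as `(t0, t1]`.

```python
from collections import defaultdict


def embed_sentiment(comments, dim=8):
    """Toy stand-in for the NLP model (hypothetical): returns a `dim`-length vector.

    A real version would call a sentiment or embedding model instead.
    """
    score = sum(c.lower().count("moon") - c.lower().count("sell") for c in comments)
    return [float(score)] + [0.0] * (dim - 1)


def build_frame(comments, top_redditors, t0, t1, market, underlying, option, dim=8):
    """Assemble the rows for one time step.

    comments: iterable of (timestamp, author, text), assumed pre-filtered to one security.
    top_redditors: ordered list of the M authors to track.
    """
    by_author = defaultdict(list)
    for ts, author, text in comments:
        if t0 < ts <= t1 and author in top_redditors:
            by_author[author].append(text)
    # One embedding row per Redditor, in a fixed order; silent authors get a zero row.
    rows = [embed_sentiment(by_author[a], dim) for a in top_redditors]
    rows += [market, underlying, option]
    return rows


comments = [(1, "a", "to the moon"), (2, "b", "sell sell"), (5, "a", "too late")]
frame = build_frame(comments, ["a", "b", "c"], t0=0, t1=3,
                    market=[1.0], underlying=[2.0], option=[3.0])
```

Keeping the Redditor order fixed matters: if the same author lands in a different row at each time step, the RNN cannot learn anything author-specific.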
Repeat training for each Security S, test against a reserved set of data, tune hyperparameters.
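One way to reserve the test data, sketched under the assumption that the frames are in time order: hold out the most recent slice rather than sampling at random, so the model is never tested on data older than what it trained on. The helper name and the 20% default are illustrative choices, not project decisions.

```python
def chronological_split(frames, targets, test_fraction=0.2):
    """Split time-ordered samples into (train, test), keeping the latest ones for test."""
    if len(frames) != len(targets):
        raise ValueError("frames and targets must have the same length")
    n_test = max(1, round(len(frames) * test_fraction))
    cut = len(frames) - n_test
    return (frames[:cut], targets[:cut]), (frames[cut:], targets[cut:])


# Ten dummy time steps, oldest first.
frames = list(range(10))
targets = [step * 0.1 for step in range(10)]
(train_x, train_y), (test_x, test_y) = chronological_split(frames, targets)
```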
Be careful of overindexing on outliers. While major outliers are the most powerful examples, and may have the strongest input feature signals, they are also rare. The limited number of samples presents a significant risk of memorization.
Quantitative Features
Market Conditions
- Year
- Month
- Day
- Day of Week
- Market Index 1
- Market Index 2
- more...
Underlying Conditions
- Price
- T24H Price
- T7D Price
- T30D Price
- Volume
- Recent Volume
- more...
Option Conditions
- Price
- Volume
- Recent Price
- Recent Volume
- more...
Target?
What is the objective? What are we going to train against?
I'm interested in looking at the price change over the next 24 hours on Puts, Calls, and Straddles as potential targets. I want to train models for each and see which one has the best performance under simulated market conditions.
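A small sketch of what those three targets could look like. It assumes "price change" means fractional change (it could just as well be the absolute change), and it prices the straddle as one call plus one put at the same strike and expiration. The function names are hypothetical.

```python
def fractional_change(now, later):
    """Fractional price change from `now` to `later` (0.5 means +50%)."""
    return (later - now) / now


def targets_24h(call_now, call_later, put_now, put_later):
    """24-hour targets for a call, a put, and the straddle built from both."""
    straddle_now = call_now + put_now
    straddle_later = call_later + put_later
    return {
        "call": fractional_change(call_now, call_later),
        "put": fractional_change(put_now, put_later),
        "straddle": fractional_change(straddle_now, straddle_later),
    }


# Invented prices: the call gains, the put loses, the straddle nets out slightly up.
example = targets_24h(call_now=2.0, call_later=3.0, put_now=2.0, put_later=1.5)
```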
While making a prediction on options should allow us to make a variety of trades, I will be focusing on buying Puts, Calls, or Straddles at first.
I think the strike should be close to the underlying price, but I don't know what "close" means and I don't have data to support that opinion yet. Need a data analysis for that.
I want to stick with 24H so that I can get data fast and also keep the stress manageable. If I'm always exiting at 24H, I don't have to think about whether I can time the market.
That doesn't mean that I actually sell at 24H all the time. If I have Calls and 24H arrives and the model says, "Buy Calls", I would just keep the ones I have.
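The hold-or-swap rule can be sketched as a tiny function. The labels and the order format are assumptions for illustration; `None` stands for "hold nothing" or "no signal".

```python
def rebalance(held, signal):
    """Return the orders needed at the 24H mark.

    held:   what is currently held ("call", "put", "straddle", or None).
    signal: what the model says to buy now (same labels, or None).
    """
    if held == signal:
        return []  # Model agrees with the current position: keep it.
    orders = []
    if held is not None:
        orders.append(("sell", held))
    if signal is not None:
        orders.append(("buy", signal))
    return orders
```

So holding Calls when the model says "Buy Calls" produces no orders, while holding Calls when it says "Buy Puts" produces a sell followed by a buy.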
OK, so if I'm going to exit as soon after 24H as I can, then I need two things:
- Sufficient volume to get out quickly.
- A relatively short expiration (maybe 10 days at the most). Options with a short expiration are more sensitive to market changes, so we're increasing the sensitivity to the predictions of the model.
Conclusion
- The target should be the 24H change in price.
- We should target options with an expiration between 3 and 9 days.
- We should target options with a median of at least 20 sales over the T28D (trailing 28 days). I have seen evidence for this: roughly speaking, a median of 20 sales is the knee of the curve for guaranteeing that there will be at least 10 sales on the target sale day.
- To decide whether we should target Puts, Calls, or Straddles, we should build the model and see which predictions give the best ROI.
- To decide what strike range we should target, we should build the model and see what range gives the best ROI (presumably for each option type or structure type under consideration).
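The expiration and liquidity rules above can be sketched as a simple filter. It assumes `daily_sales` is a list of per-day sale counts, oldest first, with one entry per day in the trailing window; whether those 28 days are calendar days or trading days is left open here. The function name and parameter names are hypothetical.

```python
from statistics import median


def is_candidate(daily_sales, days_to_expiration,
                 min_median_sales=20, min_days=3, max_days=9, window=28):
    """True if an option contract passes the expiration and liquidity screens."""
    trailing = daily_sales[-window:]
    if len(trailing) < window:
        return False  # Not enough history to judge liquidity.
    in_range = min_days <= days_to_expiration <= max_days
    liquid = median(trailing) >= min_median_sales
    return in_range and liquid
```

For example, a contract with 25 sales every day and 5 days to expiration passes, while one with 20 days of 5 sales followed by 8 days of 30 sales fails, because its median is still 5.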