A resource for learning about deep learning techniques from regression to LSTM and Reinforcement Learning using financial data and the fitness functions of algorithmic trading

Overview

A tour through tensorflow with financial data

I present several models ranging in complexity from simple regression to LSTM and policy networks. The series can be used as an educational resource for tensorflow or deep learning, a reference aid, or a source of ideas on how to apply deep learning techniques to problems that are outside of the usual deep learning fields (vision, natural language).

Not all of the examples will work. Some of them are far to simple to even be considered viable trading strategies and are only presented for educational purposes. Others, in the notebook form I present, have not been trained for the proper amount of time. Perhaps with a bit of rented GPU time they will be more promising and I leave that as an excercise for the reader. Hopefully this project inspires some to try using deep learning techniques for some more interesting problems. Contact me if interested in learning more or if you have suggestions for additions or improvements.

The algorithms increase in complexity and introduce new concepts as they progress:

Simple Regression (notebook)

Here we regress the prices from the last 100 days to the next day's price, training W and b in the equation y = Wx + b where y is the next day's price, x is a vector of dimension 100, W is a 100x1 matrix and b is a 1x1 matrix. We run the gradient descent algorithm to minimize the mean squared error of the predicted price and the actual next day price. Congratulations, you passed highschool stats. But hopefully this simple and naive example helps demonstrate the idea of a tensor graph, as well as showing a great example of extreme overfitting.

Simple Regression on Multiple Symbols (notebook)

Things get a little more interesting as soon as we introduce more than one symbol. What is the best way to model our eventual investment strategy? We start to realize that our model only vaguely implies a policy (investment actions) by predicting the actual movement in price. The implied policy is simple: buy if the the predicted price movement is positive, sell if it is negative. But that doesnt sound realistic at all. How much do we buy? And will optimizing this, even if we are very careful to avoid overfitting, even produce results that allign with our goals? We havent actaully defined our goals explicitly, but for those who are not familiar with investment metrics, common goals include:

  • maximize risk adjusted return (like the Sharpe ratio)
  • consistency of returns over time
  • low market exposure
  • long/short equity

If markets were easy to figure out and we could accurately predict the next day's return then it wouldn't matter. Our implied policy would fit with some goals (not long/short equity though) and the strategy would be viable. The reality is that our model cannot accurately predict this, nor will our strategy ever be perfect. Our best case scenario is always just winning slightly more than losing. When operating on these margins it is much more important that we consider the policy explicitly, thus moving to 'Policy Based' deep learning.

Policy Gradient Training (notebook)

Our policy will remain simple. We will chose a position, long/neutral/short, for each symbol in our portfolio. But now, instead of letting our estimation of the future return inform our decision, we train our network to choose the best position. Thus, instead of having an implied policy, it is explicit and trained directly. Even thought the policy is simple in this case, training it is a bit more involved. I did my best to interpret Andrej Karpathy's excelent article on Reinforcement Learning when writing this code. It might be worth reading his explanation, but I'll do my best to summarize what I did.

We update our regression engine so that its output, y, is a vector in dimension [batch_size, number_positions x number_symbols] (a long, short, neutral bucket for each symbol).

For each symbol in our portfolio, we sample the probability distribution of our three position buckets to get our policy decision (a position long/short/neutral), we multiply our decision (element of {-1,0,1}) by the target value to get a daily return for the symbol. Then we add that value for all the symbols to get a full daily return. We can also get other metrics like the total return and sharpe ratio since we actually are feeding this through as a batch (more on that later). As Karpathy points out, we are only interested in the gradients of the positions we sampled, so we select the appropriate columns from the output and combine them into a new tensor.

This part of the code was a bit tricky for several reasons. First off, we have a loop through and isolate each symbol since I need one position per symbol. I also am using multinomial probability distributions, so I need to take a softmax of those values. Softmax pushes values so that they sum to 1, and therefore can represent a probability distribution. In pseudocode: softmax[i, j] = exp(logits[i, j]) / sum(exp(logits[i])).

for i in range(len(symbol_list)):
    symbol_probs = y[:,i*num_positions:(i+1)*num_positions]
    symbol_probs_softmax = tf.nn.softmax(symbol_probs)

Next, we sample that probability distribution. Even though the code is a nice one-liner due to tensorflow's multinomial function, the function is NOT DIFFERENTIABLE, meaning that we will not be able to "move through" this step durring back propogation. We calcultate the position vector simply by subtracting 1 from the column indices that we got from the sample so that we get and element of {-1,0,1}.

pos = {}
for i in range(len(symbol_list)):
    # ISOLATE from before
    # ...
    sample = tf.multinomial(tf.log(symbol_probs_softmax), 1)
    pos[i] = tf.reshape(sample, [-1]) - 1   # choose(-1,0,1)

Then we multiply that position by the target (future return) for each day. This gives us our return. It already looks like a cost function but remember that it's not differentiable.

symbol_returns = {}
for i in range(len(symbol_list)):
    # ISOLATE and SAMPLE from before
    # ...
    symbol_returns[i] = tf.mul(tf.cast(pos[i], float32),  y_[:,i])

Finally, we isolate the relevant column (the one we chose in our sample) from our probability distribution. The idea isn't very difficult but the code was a bit tough, remember that we are dealing with a whole batch of outputs at a time. This step NEEDS to be differentiable since we will use this tensor to compute our gradients. Unfortunately tensorflow is still developing a function that does it by itself, here is the discussion. I actually think that my solution is the best and I suggest it in the discussion, but I'm really not an expert at efficient computation. If anyone thinks of a more efficient solution or if tensorflow finishes theirs please let me know.

relevant_target_column = {}
for i in range(len(symbol_list)):
    # ...
    # ...
    sample_mask = tf.reshape(tf.one_hot(sample, 3), [-1,3])
    relevant_target_column[i] = tf.reduce_sum(symbol_probs_softmax * sample_mask,1)

So here is all of that together:

# loop through symbols, taking the buckets for one symbol at a time
pos = {}
symbol_returns = {}
relevant_target_column = {}
for i in range(len(symbol_list)):
    # ISOLATE the buckets relevant to the symbol and get a softmax as well
    symbol_probs = y[:,i*num_positions:(i+1)*num_positions]
    symbol_probs_softmax = tf.nn.softmax(symbol_probs)
    # SAMPLE probability to chose our policy's action
    sample = tf.multinomial(tf.log(symbol_probs_softmax), 1)
    pos[i] = tf.reshape(sample, [-1]) - 1   # choose(-1,0,1)
    # GET RETURNS by multiplying the policy (position taken) by the target return (y_) for that day
    symbol_returns[i] = tf.mul(tf.cast(pos[i], float32),  y_[:,i])
    # isolate the output probability the selected policy (for use in calculating gradient)
    sample_mask = tf.reshape(tf.one_hot(sample, 3), [-1,3])
    relevant_target_column[i] = tf.reduce_sum(symbol_probs_softmax * sample_mask,1)

So now we have a tensor with the regression's probability for the chosen (sampled) action for each symbol and each day. We also have a few performance metrics like daily and total return to choose from, but they're not differentiable because we sampled the probability so we cant just "gradient descent maximize" the profit...unfortunately. Instead, we find the sigmoid cross entropy (a sort of distance function) between the first table (the probabilities we chose/sampled) and an all-ones tensor of the same shape. We get a table of cross entropies of the same size (number of symbols by batch size) This is basically equivalent to saying, how do I do MORE of what I'm already doing, for every decision that I made.

training_target_cols = tf.concat(1, [tf.reshape(t, [-1,1]) for t in relevant_target_column.values()])
ones = tf.ones_like(training_target_cols)
gradient = tf.nn.sigmoid_cross_entropy_with_logits(training_target_cols, ones)

Now we dont necessarilly want MORE of what we're doing, but the opposite of it is definitely LESS of it, which is useful. We multiply that tensor by our fitness function (the daily or aggregate return) and we use the gradient descent optimizer to minimize the cost. So you see? If the fitness function is negative, it will train the weights of the regression to NOT do what it just did. Its a pretty cool idea and it can be applied to a lot of problems that are much more interesting. I give some examples in the notebook about which different fitness functions you can apply which I think is better explained by seeing it. Here is that code:

cost = tf.mul(gradient , returns)        #returns are some reshaped version of symbol_returns (from above)
optimizer = tf.train.GradientDescentOptimizer(0.1).minimize(cost)

Stochastic Gradient Descent (notebook)

As you saw in the notebook, the policy gradient doesnt train very well when we are grading it on the return over the entire dataset, but it trains very well when it uses each day's return or the position on each symbol every day. This makes sense, if we just take the total return over several years and its slightly positive then we tell our machine to do more of that. That will do almost nothing since so many of those decisions were actually losing money. The problem, as we have it set up now, needs to be broken down into smaller units. Fortunately there is some mathematical proof that this is legal and even faster. Score!

Stochastic Gradient Descent is basically just breaking your data into smaller batches and doing gradient descent on each one. It will have slightly less accurate gradients WRT to the entire dataset's cost function, but since you are able to iterate faster with smaller batches you can run way more of them. There might even be more advantages to SGD that I'm not even mentioning, so you can read the wikipedia or hear Andrew Ng talk about it. Or just use it since it works and its faster. If you're going on a wikipedia learning binge you might as well also learn about Momentum and Adagrad which are just variations. The latter two only really useful for people doing much bigger projects. If you are working on a huge project and your twobucksanhour AWS GPU instance is too slow then you should definitely be using them (and not be reading this introductory tutorial). At the higher level, the problem of tuning the optimizer and overall efficiency have been thoroughly researched.

Multi Sampling (notebook)

Since we are sampling the policy, we can sample repeatedly in order to compute better. Karpathy's article summarizes the math behind this nicely and this paper is worth reading. The concept is intuitive and simple, but getting the math to work out and the tensor graph in order is very involved. One realizes that a mastery of numpy and a solid understanding of linear algebra are very important to tensorflow once the problems get...deeper, I guess is the word.

Multi sampling adds a useful computational kick that lets the network train much more efficiently. The results are already impressive. Using batches of less than 75 days and only training on the total return over that timeframe, we are able to "overfit" our network. Keep in mind that all we are doing is telling the network to do more of what it is doing when it does well, and less when it does poorly! Sure, we are still far away from having anything worthwhile out of sample, but that is because we are still using linear regression.

By now you are probably either wondering does this guy even know what deep learning is? I havent seen a single neural network! or you completely forgot we were still using the same linear regression that 16 year olds learn in math class. Well, we'll get to neural networks next but I wanted to talk about other things before neural networks to show how much tensorflow can be used for before neural networks even get mentioned, and to show how much important math exists in deep learning that has nothing to do with neural networks. Tensorflow makes neural nets so easy that you barely even notice that they're part of the picture and its definitely not worth getting bogged down by their math if you dont have a solid understanding of the math behind cross entropy, policy gradients and the like. They probably even distract from where the true difficulty is. Maybe I'll try to get a regression to play pong so that everyone shuts about neural networks and starts talking about policy learning...

Neural Networks (notebook)

So we finanlly get to it. Here's the same thing with a neural network. Its the simplest kind of net that there is but it is still very powerful. Way more powerful that our puny regression ever was becuase it has nonlinearities (RELU layers) between the other layers (which are basically just regressions by themselves). A sequence of linear regressions is still obviously linear1. But if we put a nonlinearity between the layers, then our net can do anything. Thats what gets people excited about neural networks, becuase they can hold enourmous amounts of information when trained well. Fortunatly, we just learned a bunch of cool ways to train them in steps 1-5. Now putting in the network is very easy. I really changed nothing except the Variable and the equation really is basically still y = Wx + b except non-linear W.

The reason I introduced the networks so late is becuase they can be a bit difficult to tune. Chosing the right size of your network and the right training step can be difficult and sometimes it is helpful to start out simple until you have all the bells and whistles in place.

In steps 3-5, we spent a lot of time figuring out tricks to do with the training step, which is a widely researched area at the moment and is probably more relevant to algorithmic trading that anything else. Now we are starting to demonstrate some of the techniques used in the prediction engine (regression before, neural network now). I believe this is a much more researched area and TensorFlow is better equiped for it. Many people describe the types of neural networks that we will learn as cells or legos. You dont need to think that much about how it works as long as you know what it does. If you noticed, thats what I did with the neural network. There is a lot more to learn and its worth learning, but when you're actually building with it, you dont think about RELU layers as much as input/output and a black box in the middle. Or at least I do...there are a bunch of people in image processing who look inside networks and do very cool things.

Regularization and Modularization (notebook)

From Wikipedia)), "regularization is the introduction of additional information in order to solve an ill-posed problem or to prevent overfitting." For our purposes, it is any technique used to reduce overfitting. One of the simplest yet most important examples is early stopping, that is, ending gradient descent before there is no significant gain in evaluation performance. The idea is that once the model stops improving, it will start to overfit the training data. Most overfitting can be avoided with just this technique.

We have more overfitting problems. Financial data is very noisey with faint signals that are very complex. Markets are zero sum and there are already an enormous number of very smart people getting paid a very large amount of money to develop trading strategies. There are many different patterns that can be learned but we can assume that all of the simple ones have been found (and therefore traded out of the market). Also, the market structure and properties change with time so even if we found a great pattern for the past five years, it might not work a month from now.

We therefore need even more strategies to prevent overfitting. One strategy is called L2 regularization. Basically, we punish a network for having very large weights by adding the L2 norm of the weights, 1/2(||W||^2_2), times a constant B (to determine how much we want to regularize). It allows us to control the level of model complexity.

Another method is called dropout regularization, a technique developed by Geoffrey Hinton. Overfitting is avoided by randomly ommitting half of the neurons during each training iteration. It sounds kooky but the idea is that it avoids the co-adaptation of features in the neural network, so that recognizing a certain set of features does not imply another set, as is the case in many overtrained nets. This way, each neuron will learn to detect a feature that is generally helpful. In validation and normal use, the neurons are not dropped so as to use all informaation. The technique makes the model more robust, avoids overfitting, and essentially turns your network into an ensemble that reaches consensus. Tensorflow's dropout layer includes the scaling required to safely ensemble when it comes time for validation.

Since we want the same model with two different configurations--one with dropout active and one without--we will now pursue our long-overdue duty of refactoring our code. Although the code is presented in a single notebook, keep in mind that what I am essentially doing is modularizing. I won't claim that I am the most organized with my code, but I tried to keep it consistent with the best practices that I have observed others using with their open projects. Moving forward we will keep this structure becuase it is clearly superior to the organization that I had used before.

LSTM (notebook)

My favorite neural network, and a true stepping stone into real deep learning is the long short-term memory network, or LSTM. Colah wrote an incredibly clear explanation of LSTM and there is really no substitute to reading his post. To describe the setup as briefly as possible, you input the data one timestep at a time to the LSTM cell. And each timestep the cell not only recieves the new input, but it recieves the last timestep's output and what is called the cell state, a vector that carries information about what happened in the past. Within the cell you have trained gates (basically small neural nets) that decide, based on the three inputs, what to forget from the past cell state, what to remember (or add) to the new state, and what to output this timestep. It is a very powerful tool and fascinating in how effective it is.

I am now pretty far into this series and I have a pretty good idea of where it will go from here. These are the problems that I must tackle in no particular order:

  • new policies such as
    • long/short equality amongs two symbols and more
    • spread trading (if that is different from above)
    • minimize correlation/ new risk meaesure that is appropriate for large number of symbols
  • migrating to AWS and using GPU computing power
  • ensebling large numbers of strategies that are generated with the same code
    • policy grads find local maxima so no reason not to use that to my advantage
  • testing suite to be able to test if the strategies are viable objectively
  • convolution nets, especially among larger groups of symbols we can expect that some patterns are fractal
  • turning it into a more formal project or web app

And of course we can start moving to other sources of data:

  • text
  • scraping
  • games

Stay tuned for some articles that I will write about the algorithms used here and a discussion of the difficulties of using these techniques for algorithmic trading developement.

1: I hope you didnt sleep through Linear Algebra class! I owe all my LA skills to Comrade Otto Bretscher of Colby College whose class I did sleep through but whose text book is worth its weight in gold.

Implementation of CSRL from the AAAI2022 paper: Constraint Sampling Reinforcement Learning: Incorporating Expertise For Faster Learning

CSRL Implementation of CSRL from the AAAI2022 paper: Constraint Sampling Reinforcement Learning: Incorporating Expertise For Faster Learning Python: 3

4 Apr 14, 2022
End-to-end Temporal Action Detection with Transformer. [Under review]

TadTR: End-to-end Temporal Action Detection with Transformer By Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Song Bai, Xiang Bai. This repo holds the c

Xiaolong Liu 105 Dec 25, 2022
atmaCup #11 の Public 4th / Pricvate 5th Solution のリポジトリです。

#11 atmaCup 2021-07-09 ~ 2020-07-21 に行われた #11 [初心者歓迎! / 画像編] atmaCup のリポジトリです。結果は Public 4th / Private 5th でした。 フレームワークは PyTorch で、実装は pytorch-image-m

Tawara 12 Apr 07, 2022
The aim of this project is to build an AI bot that can play the Wordle game, or more generally Squabble

Wordle RL The aim of this project is to build an AI bot that can play the Wordle game, or more generally Squabble I know there are more deterministic

Aditya Arora 3 Feb 22, 2022
Generalized and Efficient Blackbox Optimization System.

OpenBox Doc | OpenBox中文文档 OpenBox: Generalized and Efficient Blackbox Optimization System OpenBox is an efficient and generalized blackbox optimizatio

DAIR Lab 238 Dec 29, 2022
A PyTorch implementation of the Relational Graph Convolutional Network (RGCN).

Torch-RGCN Torch-RGCN is a PyTorch implementation of the RGCN, originally proposed by Schlichtkrull et al. in Modeling Relational Data with Graph Conv

Thiviyan Singam 66 Nov 30, 2022
Data Augmentation Using Keras and Python

Data-Augmentation-Using-Keras-and-Python Data augmentation is the process of increasing the number of training dataset. Keras library offers a simple

Happy N. Monday 3 Feb 15, 2022
Guiding evolutionary strategies by (inaccurate) differentiable robot simulators @ NeurIPS, 4th Robot Learning Workshop

Guiding Evolutionary Strategies by Differentiable Robot Simulators In recent years, Evolutionary Strategies were actively explored in robotic tasks fo

Vladislav Kurenkov 4 Dec 14, 2021
Visualize Camera's Pose Using Extrinsic Parameter by Plotting Pyramid Model on 3D Space

extrinsic2pyramid Visualize Camera's Pose Using Extrinsic Parameter by Plotting Pyramid Model on 3D Space Intro A very simple and straightforward modu

JEONG HYEONJIN 106 Dec 28, 2022
NeurIPS-2021: Neural Auto-Curricula in Two-Player Zero-Sum Games.

NAC Official PyTorch implementation of NAC from the paper: Neural Auto-Curricula in Two-Player Zero-Sum Games. We release code for: Gradient based ora

Xidong Feng 19 Nov 11, 2022
PyTorch implementation for our NeurIPS 2021 Spotlight paper "Long Short-Term Transformer for Online Action Detection".

Long Short-Term Transformer for Online Action Detection Introduction This is a PyTorch implementation for our NeurIPS 2021 Spotlight paper "Long Short

77 Dec 16, 2022
Image segmentation with private İstanbul Dataset

Image Segmentation This repo was created for academic research and test result. Repo will update after academic article online. This repo contains wei

İrem KÖMÜRCÜ 9 Dec 11, 2022
Seach Losses of our paper 'Loss Function Discovery for Object Detection via Convergence-Simulation Driven Search', accepted by ICLR 2021.

CSE-Autoloss Designing proper loss functions for vision tasks has been a long-standing research direction to advance the capability of existing models

Peidong Liu(刘沛东) 54 Dec 17, 2022
ONNX Runtime Web demo is an interactive demo portal showing real use cases running ONNX Runtime Web in VueJS.

ONNX Runtime Web demo is an interactive demo portal showing real use cases running ONNX Runtime Web in VueJS. It currently supports four examples for you to quickly experience the power of ONNX Runti

Microsoft 58 Dec 18, 2022
TensorFlow implementation of the algorithm in the paper "Decoupled Low-light Image Enhancement"

Decoupled Low-light Image Enhancement Shijie Hao1,2*, Xu Han1,2, Yanrong Guo1,2 & Meng Wang1,2 1Key Laboratory of Knowledge Engineering with Big Data

17 Apr 25, 2022
classification task on dataset-CIFAR10,by using Tensorflow/keras

CIFAR10-Tensorflow classification task on dataset-CIFAR10,by using Tensorflow/keras 在这一个库中,我使用Tensorflow与keras框架搭建了几个卷积神经网络模型,针对CIFAR10数据集进行了训练与测试。分别使

3 Oct 17, 2021
👨‍💻 run nanosaur in simulation with Gazebo/Ingnition

🦕 👨‍💻 nanosaur_gazebo nanosaur The smallest NVIDIA Jetson dinosaur robot, open-source, fully 3D printable, based on ROS2 & Isaac ROS. Designed & ma

nanosaur 9 Jul 19, 2022
University of Rochester 2021 Summer REU focusing on music sentiment transfer using CycleGAN

Music-Sentiment-Transfer University of Rochester 2021 Summer REU focusing on music sentiment transfer using CycleGAN Poster: Music Sentiment Transfer

Miles Sigel 2 Jan 24, 2022
Show-attend-and-tell - TensorFlow Implementation of "Show, Attend and Tell"

Show, Attend and Tell Update (December 2, 2016) TensorFlow implementation of Show, Attend and Tell: Neural Image Caption Generation with Visual Attent

Yunjey Choi 902 Nov 29, 2022
DenseNet Implementation in Keras with ImageNet Pretrained Models

DenseNet-Keras with ImageNet Pretrained Models This is an Keras implementation of DenseNet with ImageNet pretrained weights. The weights are converted

Felix Yu 568 Oct 31, 2022