Backtesting an algorithmic trading strategy using Machine Learning and Sentiment Analysis.

Overview

Trading Tesla with Machine Learning and Sentiment Analysis

An interactive program to train a Random Forest Classifier to predict Tesla daily prices using technical indicators and sentiment scores of Twitter posts, backtesting the trading strategy and producing performance metrics.

The project leverages techniques, paradigms and data structures such as:

  • Functional and Object-Oriented Programming
  • Machine Learning
  • Sentiment Analysis
  • Concurrency and Parallel Processing
  • Direct Acyclic Graph (D.A.G.)
  • Data Pipeline
  • Idempotence

Scope

The intention behind this project was to implement the end-to-end workflow of the backtesting of an Algorithmic Trading strategy in a program with a sleek interface, and with a level of automation such that the user is able to tailor the details of the strategy and the output of the program by entering a minimal amount of data, partly even in an interactive way. This should make the program reusable, meaning that it's easy to carry out the backtesting of the trading strategy on a different asset. Furthermore, the modularity of the software design should facilitate changes to adapt the program to different requirements (i.e. different data or ML models).

Strategy Backtesting Results

The Random Forest classifier model was trained and optimised with the scikit-learn GridSearchCV module. After computing the trading signals predictions and backtesting the strategy, the following performances were recorded:

Performance Indicators Summary
Return Buy and Hold (%) 273.94
Return Buy and Hold Ann. (%) 91.5
Return Trading Strategy (%) 1555.54
Return Trading Strategy Ann. (%) 298.53
Sharpe Ratio 0.85
Hit Ratio (%) 93.0
Average Trades Profit (%) 3.99
Average Trades Loss (%) -1.15
Max Drawdown (%) -7.69
Days Max Drawdown Recovery 2

drawdown

returns

Running the Program

This is straightforward. There are very few variables and methods to initialise and call in order to run the whole program.

Let me illustrate it in the steps below:

  1. Provide the variables in download_params, a dictionary containing all the strategy and data downloading details.

    download_params = {'ticker' : 'TSLA',
                       'since' : '2010-06-29', 
                       'until' : '2021-06-02',
                       'twitter_scrape_by_account' : {'elonmusk': {'search_keyword' : '',
                                                                   'by_hashtag' : False},
                                                      'tesla': {'search_keyword' : '',
                                                                'by_hashtag' : False},
                                                      'WSJ' : {'search_keyword' : 'Tesla',
                                                               'by_hashtag' : False},
                                                      'Reuters' : {'search_keyword' : 'Tesla',
                                                                   'by_hashtag' : False},
                                                      'business': {'search_keyword' : 'Tesla',
                                                                   'by_hashtag' : False},
                                                      'CNBC': {'search_keyword' : 'Tesla',
                                                               'by_hashtag' : False},
                                                      'FinancialTimes' : {'search_keyword' : 'Tesla',
                                                                          'by_hashtag' : True}},
                       'twitter_scrape_by_most_popular' : {'all_twitter_1': {'search_keyword' : 'Tesla',
                                                                           'max_tweets_per_day' : 30,
                                                                           'by_hashtag' : True}},
                       'language' : 'en'                                      
                       }
  2. Initialise an instance of the Pipeline class:

    TSLA_data_pipeline = Pipeline()
  3. Call the run method on the Pipeline instance:

    TSLA_pipeline_outputs = TSLA_data_pipeline.run()

    This will return a dictionary with the Pipeline functions outputs, which in this example has been assigned to TSLA_pipeline_outputs. It will also print messages about the status and operations of the data downloading and manipulation process.

  4. Retrieve the path to the aggregated data to feed into the Backtest_Strategy class:

    data = glob.glob('data/prices_TI_sentiment_scores/*')[0]
  5. Initialise an instance of the Backtest_Strategy class with the data variable assigned in the previous step.

    TSLA_backtest_strategy = Backtest_Strategy(data)
  6. Call the preprocess_data method on the Backtest_Strategy instance:

    TSLA_backtest_strategy.preprocess_data()

    This method will show a summary of the data preprocessing results such as missing values, infinite values and features statistics.

From this point the program becomes interactive, and the user is able to input data, save and delete files related to the training and testing of the Random Forest model, and proceed to display the strategy backtesting summary and graphs.

  1. Call the train_model method on the Backtest_Strategy instance:

    TSLA_backtest_strategy.train_model()

    Here you will be able to train the model with the scikit-learn GridSearchCV, creating your own parameters grid, save and delete files containing the parameters grid and the best set of parameters found.

  2. Call the test_model method on the Backtest_Strategy instance:

    TSLA_backtest_strategy.test_model()

    This method will allow you to test the model by selecting one of the model's best parameters files saved during the training process (or the "default_best_param.json" file created by default by the program, if no other file was saved by the user).

    Once the process is complete, it will display the testing summary metrics and graphs.

    If you are satisfied with the testing results, from here you can display the backtesting summary, which equates to call the next and last method below. In this case, the program will also save a csv file with the data to compute the strategy performance metrics.

  3. Call the strategy_performance method on the Backtest_Strategy instance:

    TSLA_backtest_strategy.strategy_performance()

    This is the method to display the backtesting summary shown above in this document. Assuming a testing session has been completed and there is a csv file for computing the performance metrics, the program will display the backtesting results straight away using the existing csv file, which in turn is overwritten every time a testing process is completed. Otherwise, it will prompt you to run a training/testing session first.

Tips

If the required data (historical prices and Twitter posts) have been already downloaded, the only long execution time you may encounter is during the model training: the larger the parameters grid search, the longer the time. I recommend that you start getting confident with the program by using the data already provided within the repo (backtesting on Tesla stock).

This is because any downloading of new data on a significantly large period of time such to be reliable for the model training will likely require many hours, essentially due to the Twitter scraping process.

That said, please be also aware that as soon as you change the variables in the download_params dictionary and run the Pipeline instance, all the existing data files will be overwritten. This is because the program recognise on its own the relevant data that need to be downloaded according to the parameters passed into download_params, and this is a deliberate choice behind the program design.

That's all! Clone the repository and play with it. Any feedback welcome.

Disclaimer

Please be aware that the content and results of this project do not represent financial advice. You should conduct your own research before trading or investing in the markets. Your capital is at risk.

References

Owner
Renato Votto
Renato Votto
Apache Liminal is an end-to-end platform for data engineers & scientists, allowing them to build, train and deploy machine learning models in a robust and agile way

Apache Liminals goal is to operationalise the machine learning process, allowing data scientists to quickly transition from a successful experiment to an automated pipeline of model training, validat

The Apache Software Foundation 121 Dec 28, 2022
A model to predict steering torque fully end-to-end

torque_model The torque model is a spiritual successor to op-smart-torque, which was a project to train a neural network to control a car's steering f

Shane Smiskol 4 Jun 03, 2022
#30DaysOfStreamlit is a 30-day social challenge for you to build and deploy Streamlit apps.

30 Days Of Streamlit 🎈 This is the official repo of #30DaysOfStreamlit — a 30-day social challenge for you to learn, build and deploy Streamlit apps.

Streamlit 53 Jan 02, 2023
Transform ML models into a native code with zero dependencies

m2cgen (Model 2 Code Generator) - is a lightweight library which provides an easy way to transpile trained statistical models into a native code

Bayes' Witnesses 2.3k Jan 03, 2023
虚拟货币(BTC、ETH)炒币量化系统项目。在一版本的基础上加入了趋势判断

🎉 第二版本 🎉 (现货趋势网格) 介绍 在第一版本的基础上 趋势判断,不在固定点位开单,选择更优的开仓点位 优势: 🎉 简单易上手 安全(不用将api_secret告诉他人) 如何启动 修改app目录下的authorization文件

幸福村的码农 250 Jan 07, 2023
MLR - Machine Learning Research

Machine Learning Research 1. Project Topic 1.1. Exsiting research Benmark: https://paperswithcode.com/sota ACL anthology for NLP papers: http://www.ac

Charles 69 Oct 20, 2022
A unified framework for machine learning with time series

Welcome to sktime A unified framework for machine learning with time series We provide specialized time series algorithms and scikit-learn compatible

The Alan Turing Institute 6k Jan 06, 2023
Bayesian Modeling and Computation in Python

Bayesian Modeling and Computation in Python Open access and Code This repository contains the open access version of the text and the code examples in

Bayesian Modeling and Computation in Python 339 Jan 02, 2023
Responsible AI Workshop: a series of tutorials & walkthroughs to illustrate how put responsible AI into practice

Responsible AI Workshop Responsible innovation is top of mind. As such, the tech industry as well as a growing number of organizations of all kinds in

Microsoft 9 Sep 14, 2022
Automatically build ARIMA, SARIMAX, VAR, FB Prophet and XGBoost Models on Time Series data sets with a Single Line of Code. Now updated with Dask to handle millions of rows.

Auto_TS: Auto_TimeSeries Automatically build multiple Time Series models using a Single Line of Code. Now updated with Dask. Auto_timeseries is a comp

AutoViz and Auto_ViML 519 Jan 03, 2023
A comprehensive repository containing 30+ notebooks on learning machine learning!

A comprehensive repository containing 30+ notebooks on learning machine learning!

Jean de Dieu Nyandwi 3.8k Jan 09, 2023
Stock Price Prediction Bank Jago Using Facebook Prophet Machine Learning & Python

Stock Price Prediction Bank Jago Using Facebook Prophet Machine Learning & Python Overview Bank Jago has attracted investors' attention since the end

Najibulloh Asror 3 Feb 10, 2022
Credit Card Fraud Detection, used the credit card fraud dataset from Kaggle

Credit Card Fraud Detection, used the credit card fraud dataset from Kaggle

Sean Zahller 1 Feb 04, 2022
Temporal Alignment Prediction for Supervised Representation Learning and Few-Shot Sequence Classification

Temporal Alignment Prediction for Supervised Representation Learning and Few-Shot Sequence Classification Introduction. This package includes the pyth

5 Dec 06, 2022
A Python library for detecting patterns and anomalies in massive datasets using the Matrix Profile

matrixprofile-ts matrixprofile-ts is a Python 2 and 3 library for evaluating time series data using the Matrix Profile algorithms developed by the Keo

Target 696 Dec 26, 2022
Graphsignal is a machine learning model monitoring platform.

Graphsignal is a machine learning model monitoring platform. It helps ML engineers, MLOps teams and data scientists to quickly address issues with data and models as well as proactively analyze model

Graphsignal 143 Dec 05, 2022
PyHarmonize: Adding harmony lines to recorded melodies in Python

PyHarmonize: Adding harmony lines to recorded melodies in Python About To use this module, the user provides a wav file containing a melody, the key i

Julian Kappler 2 May 20, 2022
MasTrade is a trading bot in baselines3,pytorch,gym

mastrade MasTrade is a trading bot in baselines3,pytorch,gym idea we have for example 1 btc and we buy a crypto with it with market option to trade in

Masoud Azizi 18 May 24, 2022
Machine-learning-dell - Repositório com as atividades desenvolvidas no curso de Machine Learning

📚 Descrição Neste curso da Dell aprofundamos nossos conhecimentos em Machine Learning. 🖥️ Aulas (Em curso) 1.1 - Python aplicado a Data Science 1.2

Claudia dos Anjos 1 Jan 05, 2022
This repo includes some graph-based CTR prediction models and other representative baselines.

Graph-based CTR prediction This is a repository designed for graph-based CTR prediction methods, it includes our graph-based CTR prediction methods: F

Big Data and Multi-modal Computing Group, CRIPAC 47 Dec 30, 2022