Backtesting an algorithmic trading strategy using Machine Learning and Sentiment Analysis.

Overview

Trading Tesla with Machine Learning and Sentiment Analysis

An interactive program that trains a Random Forest classifier to predict Tesla daily prices from technical indicators and sentiment scores of Twitter posts, backtests the resulting trading strategy and produces performance metrics.
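To make the inputs more concrete, here is a minimal sketch of how one technical indicator and a daily sentiment score could be turned into training rows. It is not the project's code: the function names, the toy dates and prices, and the choice of label (1 if the next close is higher) are assumptions made for illustration, since the text does not spell out the classifier's target.

```python
from collections import defaultdict
from statistics import mean


def daily_sentiment(scored_posts):
    """Average per-post sentiment scores into one score per calendar day."""
    by_day = defaultdict(list)
    for day, score in scored_posts:
        by_day[day].append(score)
    return {day: mean(scores) for day, scores in by_day.items()}


def simple_moving_average(closes, window):
    """Trailing SMA; None until `window` observations are available."""
    out = []
    for i in range(len(closes)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(mean(closes[i + 1 - window:i + 1]))
    return out


def build_rows(days, closes, sentiment, window=3):
    """One row per day: (day, features, label); label is 1 if the next close is higher."""
    sma = simple_moving_average(closes, window)
    rows = []
    for i in range(len(days) - 1):  # the last day has no next-day label
        if sma[i] is None:
            continue  # not enough history for the indicator yet
        features = {
            "close_over_sma": closes[i] / sma[i],
            "sentiment": sentiment.get(days[i], 0.0),  # assumption: no posts means neutral
        }
        label = 1 if closes[i + 1] > closes[i] else 0
        rows.append((days[i], features, label))
    return rows


# Toy data, invented for this example only.
days = ["2021-05-24", "2021-05-25", "2021-05-26", "2021-05-27", "2021-05-28"]
closes = [100.0, 102.0, 101.0, 105.0, 104.0]
posts = [("2021-05-26", 0.5), ("2021-05-26", -0.1), ("2021-05-27", 0.8)]

rows = build_rows(days, closes, daily_sentiment(posts))
for row in rows:
    print(row)
```

Each feature only uses information available up to the day it describes, while the label looks one day ahead, which is what keeps a backtest free of look-ahead bias.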

The project leverages techniques, paradigms and data structures such as:

  • Functional and Object-Oriented Programming
  • Machine Learning
  • Sentiment Analysis
  • Concurrency and Parallel Processing
  • Directed Acyclic Graph (DAG)
  • Data Pipeline
  • Idempotence
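As an illustration of how a DAG, a data pipeline and idempotence fit together, the sketch below runs toy steps in dependency order and skips any step whose output is already available. It uses the standard library's graphlib (Python 3.9+); the step names and the run_pipeline helper are hypothetical and are not taken from the project's Pipeline class.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(steps, dependencies, cache=None):
    """Run each step once, after all of the steps it depends on."""
    cache = {} if cache is None else cache
    for name in TopologicalSorter(dependencies).static_order():
        if name in cache:  # idempotence: a finished step is not repeated
            continue
        inputs = [cache[dep] for dep in dependencies.get(name, ())]
        cache[name] = steps[name](*inputs)
    return cache


calls = []  # records which steps actually executed


def record(name, func):
    """Wrap a step so that every execution is logged in `calls`."""
    def wrapper(*args):
        calls.append(name)
        return func(*args)
    return wrapper


# Hypothetical steps with toy outputs.
steps = {
    "prices": record("prices", lambda: [100.0, 102.0, 101.0]),
    "tweets": record("tweets", lambda: [0.2, -0.1, 0.4]),
    "merge": record("merge", lambda prices, tweets: list(zip(prices, tweets))),
}
# Each key maps to the steps it depends on.
dependencies = {"prices": (), "tweets": (), "merge": ("prices", "tweets")}

outputs = run_pipeline(steps, dependencies)
run_pipeline(steps, dependencies, cache=outputs)  # second run executes nothing new
print(outputs["merge"], calls)
```

Running the pipeline twice leaves the outputs unchanged and executes each step only once, which is the property idempotence refers to.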

Scope

This project implements the end-to-end workflow for backtesting an algorithmic trading strategy in a program with a sleek interface. It is automated enough that the user can tailor the details of the strategy and the output of the program by entering a minimal amount of data, partly in an interactive way. This makes the program reusable: it is easy to backtest the trading strategy on a different asset. Furthermore, the modular software design should make it easier to adapt the program to different requirements (e.g. different data or ML models).

Strategy Backtesting Results

The Random Forest classifier was trained and optimised with scikit-learn's GridSearchCV. After computing the trading signal predictions and backtesting the strategy, the following performance was recorded:

Performance Indicators Summary

Metric                              Value
Return Buy and Hold (%)             273.94
Return Buy and Hold Ann. (%)        91.5
Return Trading Strategy (%)         1555.54
Return Trading Strategy Ann. (%)    298.53
Sharpe Ratio                        0.85
Hit Ratio (%)                       93.0
Average Trades Profit (%)           3.99
Average Trades Loss (%)             -1.15
Max Drawdown (%)                    -7.69
Days Max Drawdown Recovery          2
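For readers unfamiliar with these indicators, the sketch below computes a few of them from a list of daily returns using common textbook definitions. These definitions, the 252-day year and the toy returns are assumptions; the project may compute its figures differently, so the code is not expected to reproduce the numbers above.

```python
import math
from statistics import mean, stdev

TRADING_DAYS = 252  # assumed number of trading days per year


def equity_curve(daily_returns, start=1.0):
    """Compound daily returns into the value of one unit invested."""
    curve = [start]
    for r in daily_returns:
        curve.append(curve[-1] * (1.0 + r))
    return curve


def total_return_pct(daily_returns):
    """Compounded return over the whole period, in percent."""
    return (equity_curve(daily_returns)[-1] - 1.0) * 100.0


def annualised_return_pct(daily_returns):
    """Compounded return rescaled to one year of trading days, in percent."""
    growth = equity_curve(daily_returns)[-1]
    return (growth ** (TRADING_DAYS / len(daily_returns)) - 1.0) * 100.0


def sharpe_ratio(daily_returns, risk_free_daily=0.0):
    """Annualised mean excess return divided by its standard deviation."""
    excess = [r - risk_free_daily for r in daily_returns]
    return mean(excess) / stdev(excess) * math.sqrt(TRADING_DAYS)


def max_drawdown_pct(daily_returns):
    """Largest peak-to-trough fall of the equity curve, in percent (negative)."""
    peak, worst = 1.0, 0.0
    for value in equity_curve(daily_returns):
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst * 100.0


def hit_ratio_pct(trade_returns):
    """Share of trades that closed with a profit, in percent."""
    wins = sum(1 for r in trade_returns if r > 0)
    return wins / len(trade_returns) * 100.0


# Toy returns, invented for this example only.
returns = [0.10, -0.05, 0.02, -0.10, 0.05]
print(round(total_return_pct(returns), 2))
print(round(max_drawdown_pct(returns), 2))
print(round(hit_ratio_pct(returns), 1))
```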

[Figure: drawdown]

[Figure: returns]

Running the Program

Running the whole program takes only a few variables to initialise and a few methods to call.

The steps below illustrate it:

  1. Provide the variables in download_params, a dictionary containing all the strategy and data downloading details.

    download_params = {'ticker' : 'TSLA',
                       'since' : '2010-06-29', 
                       'until' : '2021-06-02',
                       'twitter_scrape_by_account' : {'elonmusk': {'search_keyword' : '',
                                                                   'by_hashtag' : False},
                                                      'tesla': {'search_keyword' : '',
                                                                'by_hashtag' : False},
                                                      'WSJ' : {'search_keyword' : 'Tesla',
                                                               'by_hashtag' : False},
                                                      'Reuters' : {'search_keyword' : 'Tesla',
                                                                   'by_hashtag' : False},
                                                      'business': {'search_keyword' : 'Tesla',
                                                                   'by_hashtag' : False},
                                                      'CNBC': {'search_keyword' : 'Tesla',
                                                               'by_hashtag' : False},
                                                      'FinancialTimes' : {'search_keyword' : 'Tesla',
                                                                          'by_hashtag' : True}},
                       'twitter_scrape_by_most_popular' : {'all_twitter_1': {'search_keyword' : 'Tesla',
                                                                           'max_tweets_per_day' : 30,
                                                                           'by_hashtag' : True}},
                       'language' : 'en'                                      
                       }

  2. Initialise an instance of the Pipeline class:

    TSLA_data_pipeline = Pipeline()

  3. Call the run method on the Pipeline instance:

    TSLA_pipeline_outputs = TSLA_data_pipeline.run()

    This returns a dictionary with the outputs of the Pipeline functions, which in this example is assigned to TSLA_pipeline_outputs. It also prints messages about the status and operations of the data downloading and manipulation process.

  4. Retrieve the path to the aggregated data to feed into the Backtest_Strategy class:

    import glob

    data = glob.glob('data/prices_TI_sentiment_scores/*')[0]

  5. Initialise an instance of the Backtest_Strategy class with the data variable assigned in the previous step.

    TSLA_backtest_strategy = Backtest_Strategy(data)

  6. Call the preprocess_data method on the Backtest_Strategy instance:

    TSLA_backtest_strategy.preprocess_data()

    This method shows a summary of the data preprocessing results, such as missing values, infinite values and feature statistics.
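Step 4 takes the first match of a glob pattern, which raises a bare IndexError when the folder is empty (for example, if the pipeline has not been run yet). A small helper like the one below gives a clearer message. The name first_data_file is hypothetical and not part of the project.

```python
import glob


def first_data_file(pattern):
    """Return the first path matching `pattern`, or raise a readable error."""
    matches = sorted(glob.glob(pattern))  # sorted, because glob order is not guaranteed
    if not matches:
        raise FileNotFoundError(
            f"No data file matches {pattern!r}; run the pipeline first."
        )
    return matches[0]
```

With this helper, step 4 would read data = first_data_file('data/prices_TI_sentiment_scores/*').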

From this point the program becomes interactive, and the user is able to input data, save and delete files related to the training and testing of the Random Forest model, and proceed to display the strategy backtesting summary and graphs.

  1. Call the train_model method on the Backtest_Strategy instance:

    TSLA_backtest_strategy.train_model()

    Here you can train the model with scikit-learn's GridSearchCV using your own parameters grid, and save or delete the files containing the parameters grid and the best set of parameters found.

  2. Call the test_model method on the Backtest_Strategy instance:

    TSLA_backtest_strategy.test_model()

    This method will allow you to test the model by selecting one of the model's best parameters files saved during the training process (or the "default_best_param.json" file created by default by the program, if no other file was saved by the user).

    Once the process is complete, it will display the testing summary metrics and graphs.

    If you are satisfied with the testing results, you can display the backtesting summary from here, which is equivalent to calling the next and last method below. In this case, the program will also save a csv file with the data needed to compute the strategy performance metrics.

  3. Call the strategy_performance method on the Backtest_Strategy instance:

    TSLA_backtest_strategy.strategy_performance()

    This method displays the backtesting summary shown earlier in this document. If a testing session has been completed and the csv file for computing the performance metrics exists, the program displays the backtesting results straight away from that file, which is overwritten every time a testing process completes. Otherwise, it prompts you to run a training/testing session first.
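The parameters grid entered in train_model determines how many models are fitted, since an exhaustive grid search fits every combination once per cross-validation fold. The sketch below counts those fits before you commit to a grid. The helper names, the example values and the number of folds are assumptions for illustration; n_estimators, max_depth and min_samples_leaf are standard Random Forest parameters in scikit-learn.

```python
from itertools import product


def expand_grid(param_grid):
    """List every parameter combination in a {name: [values]} grid."""
    names = list(param_grid)
    return [dict(zip(names, values)) for values in product(*param_grid.values())]


def count_fits(param_grid, cv_folds):
    """Number of fits in an exhaustive search (excluding any final refit)."""
    return len(expand_grid(param_grid)) * cv_folds


# Example grid, invented for this illustration.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}

print(count_fits(param_grid, cv_folds=5))
```

Adding one more value to any parameter multiplies the number of combinations, so the grid grows much faster than it looks on paper.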

Tips

If the required data (historical prices and Twitter posts) have already been downloaded, the only long execution time you may encounter is during model training: the larger the parameters grid, the longer the search takes. I recommend that you get familiar with the program first by using the data already provided in the repo (backtesting on Tesla stock).

This is because downloading new data over a period long enough to be reliable for model training will likely take many hours, mainly because of the Twitter scraping process.

That said, please also be aware that as soon as you change the variables in the download_params dictionary and run the Pipeline instance, all the existing data files will be overwritten. The program works out on its own which data need to be downloaded from the parameters passed into download_params, and overwriting is a deliberate design choice.
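How the program detects a change in the parameters is not described here. One hypothetical way to check for yourself whether a run would use different parameters (and therefore overwrite the data) is to compare a fingerprint of the dictionary with one saved from the previous run. The function names below are assumptions and not part of the project.

```python
import hashlib
import json


def params_fingerprint(params):
    """Stable short hash of a JSON-serialisable parameters dictionary."""
    # sort_keys makes the hash independent of the order in which keys were written
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def params_changed(params, stored_fingerprint):
    """True if `params` differs from the run that produced `stored_fingerprint`."""
    return params_fingerprint(params) != stored_fingerprint


previous = {"ticker": "TSLA", "since": "2010-06-29", "until": "2021-06-02"}
stored = params_fingerprint(previous)

same_reordered = {"until": "2021-06-02", "ticker": "TSLA", "since": "2010-06-29"}
shorter_period = dict(previous, since="2015-01-01")

print(params_changed(same_reordered, stored))
print(params_changed(shorter_period, stored))
```

If the fingerprint has changed, copying the data folder elsewhere before running the Pipeline keeps the previous download available.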

That's all! Clone the repository and play with it. Any feedback is welcome.

Disclaimer

Please be aware that the content and results of this project do not represent financial advice. You should conduct your own research before trading or investing in the markets. Your capital is at risk.

Owner
Renato Votto