Both social media sentiment and stock market data are crucial for stock price prediction

Overview

Relating Social Media to Stock Movements_DA-31st-December

Both social media sentiment and stock market data are crucial for stock price prediction. So, in this project we analyzed the dynamics of stock markets based on both social media news (text data) and stock prices (numerical data).

Understanding the Dataset

The dataset we are working on is a combination of Wallstreetbets-Reddit news and the Standard & Poor’s 500 (S&p 500) stock price from 2013 to 2018.

  • The news dataset contains the top 25 news from Reddit on each day from 2013 to 2018.

  • The S&P 500 contains the core stock market information for each day such as Open, Close, and Volume.

  • The SCORE of the dataset is whether the stock price is increase (labeled as 1) or decrease (labeled as 0) on that day.

EDA

Introduction:

  • data dataset comprises 5698 rows and 8 columns.
  • Dataset consists of continuous variable and float data type.
  • Dataset column variables 'Open', 'Close', 'High', 'Low', 'Volume', are the stock variables from historical dataset and other variables are showing polarity of news which are the derived variables using sentiment analysis as discussed in the above section.

Descriptive Statistics:

Using describe() we could get the following result for the numerical features

open high low close volume count 5697.000000 5697.000000 5697.000000 5698.000000 5.698000e+03 mean 88.139399 89.012936 87.245609 88.146015 1.718703e+06 std 32.666995 32.960833 32.363413 32.660301 1.248357e+06 min 30.380000 31.090000 29.730000 29.940000 1.000000e+02 25% 64.650000 65.310000 64.053300 64.672500 9.880475e+05 50% 80.750000 81.490000 79.990000 80.750000 1.460298e+06 75% 105.270000 106.270000 104.350000 105.345000 2.135991e+06 max 201.240000 201.240000 198.160000 200.380000 3.378024e+07

Preprocessing and Sentiment Analysis

We filled out the NaN values in the missed three topics. And got the polarity and subjectivity for the news' topics. Polarity is of 'float' type and lies in the range of -1, 1, where 1

means a high positive sentiment, and -1 means a high negative sentiment.

So, they will be very helpful in determining the increase or decrease of the stock market.

Then we checked the missing values in the stock market information, it was complete. Then we merged the sentiment information (polarity ) by date with the stock market information (Open, High, Low, Close, Volume, Adj Close) in merged_data dataframe.

Before modeling and after splitting we scaled the data using standardization to shift the distribution to have a mean of zero and a standard deviation of one.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
rescaledValidationX = scaler.transform(X_valid)

fit_transform() is used on the training data so that we can scale the training data and also learn the scaling parameters of that data. Here, the model built by us will learn the mean and variance of the features of the training set. These learned parameters are then used to scale our test data.

transform() uses the same mean and variance as it is calculated from our training data to transform our test data. Thus, the parameters learned by our model using the training data will help us to transform our test data. As we do not want to be biased with our model, but we want our test data to be completely new and a surprise set for our model.

Preprocessing Again

Now, after observing the outliers in polarity of a lot of topics, we decided to concatenate all the 14 topics in one paragraph, then we can get only one column for polarity. So, we merged these data again with the stock market numerical information and got merged_data dataframe, then scaled it.

Model Building

Metrics considered for Model Evaluation

Accuracy , Precision , Recall and F1 Score

  • Accuracy: What proportion of actual positives and negatives is correctly classified?
  • Precision: What proportion of predicted positives are truly positive ?
  • Recall: What proportion of actual positives is correctly classified ?
  • F1 Score : Harmonic mean of Precision and Recall

Logistic Regression

  • Logistic Regression helps find how probabilities are changed with actions.
  • The function is defined as P(y) = 1 / 1+e^-(A+Bx)
  • Logistic regression involves finding the best fit S-curve where A is the intercept and B is the regression coefficient. The output of logistic regression is a probability score.

Choosing the features

After choosing model based on confusion matrix here where choose the features taking in consideration the deployment phase.

We know from the EDA that all the features are highly correlated and almost follow the same trend among the time. So, along with polarity and subjectivity we choose the open price with the assumption that the user knows the open price but not the close price and wants to figure out if the stock price will increase or decrease.

When we apply the logistic regression model accuracy dropped from 80% to 55%. So, we will use both Open and Close and exclude High, Low, Volume, Adj Close.

precision    recall  f1-score   support

           0       1.00      1.00      1.00   2563950
           1       0.00      0.00      0.00       968

    accuracy                           1.00   2564918
   macro avg       0.50      0.50      0.50   2564918
weighted avg       1.00      1.00      1.00   2564918







Owner
Vishal Singh Parmar
I am Vishal Singh Parmar, I have been pursuing B.Tech in Computer Science Engineering from Shivaji Rao Kadam Institute of Technology,
Vishal Singh Parmar
slim-python is a package to learn customized scoring systems for decision-making problems.

slim-python is a package to learn customized scoring systems for decision-making problems. These are simple decision aids that let users make yes-no p

Berk Ustun 37 Nov 02, 2022
The easy way to combine mlflow, hydra and optuna into one machine learning pipeline.

mlflow_hydra_optuna_the_easy_way The easy way to combine mlflow, hydra and optuna into one machine learning pipeline. Objective TODO Usage 1. build do

shibuiwilliam 9 Sep 09, 2022
Visualize classified time series data with interactive Sankey plots in Google Earth Engine

sankee Visualize changes in classified time series data with interactive Sankey plots in Google Earth Engine Contents Description Installation Using P

Aaron Zuspan 76 Dec 15, 2022
Pytools is an open source library containing general machine learning and visualisation utilities for reuse

pytools is an open source library containing general machine learning and visualisation utilities for reuse, including: Basic tools for API developmen

BCG Gamma 26 Nov 06, 2022
🚪✊Knock Knock: Get notified when your training ends with only two additional lines of code

Knock Knock A small library to get a notification when your training is complete or when it crashes during the process with two additional lines of co

Hugging Face 2.5k Jan 07, 2023
Time series changepoint detection

changepy Changepoint detection in time series in pure python Install pip install changepy Examples from changepy import pelt from cha

Rui Gil 92 Nov 08, 2022
50% faster, 50% less RAM Machine Learning. Numba rewritten Sklearn. SVD, NNMF, PCA, LinearReg, RidgeReg, Randomized, Truncated SVD/PCA, CSR Matrices all 50+% faster

[Due to the time taken @ uni, work + hell breaking loose in my life, since things have calmed down a bit, will continue commiting!!!] [By the way, I'm

Daniel Han-Chen 1.4k Jan 01, 2023
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light Gradient Boosting Machine LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed a

Microsoft 14.5k Jan 07, 2023
Practical Time-Series Analysis, published by Packt

Practical Time-Series Analysis This is the code repository for Practical Time-Series Analysis, published by Packt. It contains all the supporting proj

Packt 325 Dec 23, 2022
Markov bot - A Writing bot based on Markov Chain for Data Structure Lab

基于马尔可夫链的写作机器人 前端 用html/css完成 Demo展示(已给出文本的相应展示) 用户提供相关的语料库后训练的成果 后端 要完成的几个接口 解析文

DysprosiumDy 9 May 05, 2022
使用数学和计算机知识投机倒把

偷鸡不成项目集锦 坦率地讲,涉及金融市场的好策略如果公开,必然导致使用的人多,最后策略变差。所以这个仓库只收集我目前失败了的案例。 加密货币组合套利 中国体育彩票预测 我赚不上钱的项目,也许可以帮助更有能力的人去赚钱。

Roy 28 Dec 29, 2022
Python 3.6+ toolbox for submitting jobs to Slurm

Submit it! What is submitit? Submitit is a lightweight tool for submitting Python functions for computation within a Slurm cluster. It basically wraps

Facebook Incubator 768 Jan 03, 2023
Programming assignments and quizzes from all courses within the Machine Learning Engineering for Production (MLOps) specialization offered by deeplearning.ai

Machine Learning Engineering for Production (MLOps) Specialization on Coursera (offered by deeplearning.ai) Programming assignments from all courses i

Aman Chadha 173 Jan 05, 2023
icepickle is to allow a safe way to serialize and deserialize linear scikit-learn models

icepickle It's a cooler way to store simple linear models. The goal of icepickle is to allow a safe way to serialize and deserialize linear scikit-lea

vincent d warmerdam 24 Dec 09, 2022
Uplift modeling and causal inference with machine learning algorithms

Disclaimer This project is stable and being incubated for long-term support. It may contain new experimental code, for which APIs are subject to chang

Uber Open Source 3.7k Jan 07, 2023
A pure-python implementation of the UpSet suite of visualisation methods by Lex, Gehlenborg et al.

pyUpSet A pure-python implementation of the UpSet suite of visualisation methods by Lex, Gehlenborg et al. Contents Purpose How to install How it work

288 Jan 04, 2023
Fourier-Bayesian estimation of stochastic volatility models

fourier-bayesian-sv-estimation Fourier-Bayesian estimation of stochastic volatility models Code used to run the numerical examples of "Bayesian Approa

15 Jun 20, 2022
ETNA – time series forecasting framework

ETNA Time Series Library Predict your time series the easiest way Homepage | Documentation | Tutorials | Contribution Guide | Release Notes ETNA is an

Tinkoff.AI 675 Jan 08, 2023
A comprehensive repository containing 30+ notebooks on learning machine learning!

A comprehensive repository containing 30+ notebooks on learning machine learning!

Jean de Dieu Nyandwi 3.8k Jan 09, 2023
ThunderGBM: Fast GBDTs and Random Forests on GPUs

Documentations | Installation | Parameters | Python (scikit-learn) interface What's new? ThunderGBM won 2019 Best Paper Award from IEEE Transactions o

Xtra Computing Group 648 Dec 16, 2022