Project ECON 427

1. Predicting Stock Price Movements
The goal of this project is to predict stock prices by applying machine learning techniques
to data from StockTwits, a social media platform for investors. We extract
features from textual data and formulate price prediction as both a regression and a
classification problem. We then present the results and analyze them.
(a) Familiarize yourself with the StockTwits platform: https://stocktwits.com/.
(b) The goal is to perform analysis on the component stocks of the Dow Jones Industrial
Average. Data were collected for the period December 2013 to December
2016, totaling 756 trading days. Two main datasets are used:
i. StockTwits Data: The data were collected and downloaded in raw JSON
format, totaling over 540,000 messages. Sentiment polarity was also extracted
from user-generated “bullish”/ “bearish” tags.
A. For each stock on each day, subtract the number of bearish tags from the
number of bullish tags and divide the difference by the total number of
tagged messages. This gives a polarity for each stock on each day. Calculate
a moving average of this ratio and call it s_t. Use a 3-point moving average
initially, but you can change the window size of the moving average to see
whether you get better results when you are training models.
B. Calculate the number of messages for each stock in each day, which we
call message volume.
C. Calculate the percentage 1-day message volume change, which is the difference
between today’s message volume and yesterday’s message volume
divided by yesterday’s message volume, and call it mv_{1,t}.
D. Calculate today’s message volume divided by the average message volume
in the previous 10 days, and call it mv_{10,t}.
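The StockTwits features described in items A to D can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed implementation: the function names and the toy counts are hypothetical, the moving average is taken as a trailing window, a day with no tagged messages is given polarity 0, and undefined values are returned as None.

```python
def polarity(bullish, bearish):
    """(bullish - bearish) / (bullish + bearish); 0.0 if nothing is tagged (assumption)."""
    total = bullish + bearish
    return (bullish - bearish) / total if total else 0.0


def moving_average(values, window=3):
    """Trailing moving average; the first window - 1 entries are None (assumption)."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def volume_change_1d(volume):
    """mv_{1,t}: (today - yesterday) / yesterday; None when undefined."""
    out = [None]
    for prev, cur in zip(volume, volume[1:]):
        out.append((cur - prev) / prev if prev else None)
    return out


def volume_ratio(volume, lookback=10):
    """mv_{10,t}: today's volume over the mean volume of the previous `lookback` days."""
    out = []
    for i, cur in enumerate(volume):
        prev = volume[max(0, i - lookback):i]
        if len(prev) < lookback or sum(prev) == 0:
            out.append(None)
        else:
            out.append(cur / (sum(prev) / lookback))
    return out


# Hypothetical daily counts for one stock.
bullish = [6, 3, 5, 8]
bearish = [2, 3, 5, 0]
volume = [10, 20, 15, 30]

ratios = [polarity(b, s) for b, s in zip(bullish, bearish)]
s_t = moving_average(ratios, window=3)
mv1 = volume_change_1d(volume)
mv2 = volume_ratio(volume, lookback=2)  # short lookback only because the toy series is short
```

Changing `window` in `moving_average` is the natural place to experiment with other window sizes.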
ii. Price Data: Daily split-adjusted stock price data were collected via the Yahoo
Finance API. You can focus only on the closing price data for the
purposes of this project, but you are welcome to test your algorithms on
other prices in the data set as well.
iii. Prediction Target: We focus on the forward T-day return, calculated as the
percentage change between today’s trading price and the price T days ahead,
i.e.:
r_t(T) = \frac{p_{t+T} - p_t}{p_t}
where p_{t+T} is the price at time period t + T, i.e. T days ahead. Calculate
r_t(3) and r_t(5) from the data for each company. Later, we will try to predict
them using various techniques.
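As a small worked example, the forward return can be computed from a list of closing prices. The sketch reads the definition as (p_{t+T} - p_t) / p_t; the function name and the prices are hypothetical, and the last T days are left as None because their future price is not available.

```python
def forward_returns(prices, horizon):
    """r_t(T) = (p_{t+T} - p_t) / p_t; None for the last `horizon` days."""
    n = len(prices)
    return [
        (prices[t + horizon] - prices[t]) / prices[t] if t + horizon < n else None
        for t in range(n)
    ]


closes = [100.0, 102.0, 101.0, 105.0, 110.0, 99.0]  # hypothetical closing prices
r3 = forward_returns(closes, 3)
r5 = forward_returns(closes, 5)
```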
(c) Pre-Processing and Exploratory data analysis:
i. There is an exceedingly large number of posts about AAPL. You can remove
AAPL from your analysis if the computational burden is too much for your
computer.
ii. Look up what stop words are and remove them from the data.
iii. Remove company names from the data.
iv. Remove posts mentioning/tagging multiple stocks (e.g. “$AAPL $FB $GOOG”).
v. Aggregate posts by date. For each date in the period December 2013 to
December 2016, you should have a set of tweets for each company on that
date.
vi. Use 70% of the data for training and 30% for testing. Remember not to select
training and test data randomly. Use the first 70% of the days for training
and the last 30% for testing (January 2016 to December 2016). Explain why
this is a correct way of splitting the data.
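A chronological split can be sketched in a few lines. The function name is hypothetical and days are represented by integers for brevity; any sortable date type would work the same way.

```python
def chronological_split(days, train_fraction=0.7):
    """Split days in time order: the earliest fraction for training, the rest for testing."""
    ordered = sorted(days)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


trading_days = list(range(756))  # stand-in for the 756 trading days
train_days, test_days = chronological_split(trading_days)
```

With 756 trading days this gives 529 training days and 227 test days, and every training day precedes every test day.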
(d) Bag of Words Features
i. Calculate the frequencies of the words in the data.
ii. Only keep words that occurred at least 25 times in the dataset. This should
give you more than 6800 words.
iii. For each of the words in 1(d)ii, calculate the TF-IDF metric with Laplace
smoothing. Those metrics are used as features in your classification models.
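The vocabulary filter and the smoothed TF-IDF can be sketched as below. This is one possible reading, not necessarily the intended formula: it assumes add-one smoothing of the document frequencies, idf(w) = ln((1 + N) / (1 + df(w))) + 1, which is the form scikit-learn uses when smooth_idf is enabled. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(documents, min_count=25):
    """Keep words whose total count over all documents is at least min_count."""
    counts = Counter(word for doc in documents for word in doc)
    return sorted(w for w, c in counts.items() if c >= min_count)


def tfidf(documents, vocabulary):
    """Raw term frequency times a smoothed IDF (assumed form, see lead-in)."""
    n_docs = len(documents)
    vocab = set(vocabulary)
    doc_freq = Counter(w for doc in documents for w in set(doc) if w in vocab)
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    rows = []
    for doc in documents:
        tf = Counter(doc)
        rows.append([tf[w] * idf[w] for w in vocabulary])
    return rows


# Each document is the tokenized text for one stock on one day (toy data).
docs = [["buy", "calls", "buy"], ["sell", "puts"], ["buy", "puts"]]
vocab = build_vocabulary(docs, min_count=2)  # tiny threshold for the toy corpus
features = tfidf(docs, vocab)
```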
(e) Chi-Squared Statistics
i. Since the number of features is very large, we use a preliminary feature selection
method that measures the dependence between each feature and the class label.
Use the chi-squared test to select the 1000 features with the highest
chi-squared scores.
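To illustrate the selection step, the sketch below scores each word by the chi-squared statistic of a 2x2 table (word present or absent versus up or down label) and keeps the top k. Treating features as presence indicators is a simplifying assumption; library implementations may instead work on the feature values themselves. All names and data are hypothetical.

```python
def chi_squared(present, labels):
    """Chi-squared statistic of the 2x2 table: feature presence vs. binary label."""
    n = len(labels)
    observed = [[0, 0], [0, 0]]
    for f, y in zip(present, labels):
        observed[int(bool(f))][int(bool(y))] += 1
    score = 0.0
    for i in range(2):
        for j in range(2):
            row_total = sum(observed[i])
            col_total = observed[0][j] + observed[1][j]
            expected = row_total * col_total / n
            if expected > 0:
                score += (observed[i][j] - expected) ** 2 / expected
    return score


def top_k_features(columns, labels, k):
    """Names of the k features with the highest chi-squared scores."""
    scores = {name: chi_squared(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


labels = [1, 1, 0, 0]                                   # 1 = price went up (toy labels)
columns = {"moon": [1, 1, 0, 0], "the": [1, 0, 1, 0]}   # word presence per document
selected = top_k_features(columns, labels, k=1)
```

Here "moon" co-occurs perfectly with the positive label and is selected, while "the" is independent of the label and scores zero.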
(f) Classification
i. Explain how prediction of rt(T) can be converted into a binary classification
problem and convert the responses to binary labels.
ii. Naïve Bayes Binary Classifier
A. Train a Naïve Bayes classifier using bag of words features.
B. Report train and test accuracy for this model.
C. Build a confusion matrix for both training and test data.
D. Report AUC, precision, recall, and F1-scores for both training and testing
data.
iii. Logistic Regression
A. Apply Recursive Feature elimination on the chi-squared features to train
a Logistic Regression model for binary classification.
B. Train an L1-penalized Logistic Regression using the chi-squared features
as well as s_t, mv_{1,t}, and mv_{10,t}. Use 5-fold cross validation to find the best
hyper-parameter.
C. Report train and test accuracy for both models.
D. Build a confusion matrix for both training and test data for both models.
Project ECON 427, Instructor: Mohammad Reza Rajati
E. Report AUC, precision, recall, and F1-scores for both training and testing
data.
iv. Random Forests and Extra Trees
A. Use as many of the 1000 chi-squared features as you can (at least the top
20) along with s_t, mv_{1,t}, and mv_{10,t} to train a random forest model for
binary classification.
B. Repeat 1(f)ivA using Extra Trees.
C. Report train and test accuracy for both models.
D. Build a confusion matrix for both training and test data for both models.
E. Report AUC, precision, recall, and F1-scores for both training and testing
data.
v. Support Vector Machines
A. Train an L1-penalized SVM using the chi-squared features as well as
s_t, mv_{1,t}, and mv_{10,t}. Use 5-fold cross validation to find the best hyperparameter.
B. Report train and test accuracy for this model.
C. Build a confusion matrix for both training and test data for this model.
D. Report AUC, precision, recall, and F1-scores for both training and testing
data.
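The evaluation metrics requested for each classifier can be computed directly from the labels and scores. The sketch below is a plain-Python illustration with hypothetical names and toy data; AUC is computed as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half.

```python
def confusion_matrix(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (0.0 when undefined)."""
    tp, fp, fn, _ = confusion_matrix(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def auc(y_true, scores):
    """Pairwise AUC; assumes both classes are present."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 1, 0]               # toy labels
scores = [0.9, 0.4, 0.3, 0.8, 0.6]     # toy predicted probabilities of "up"
y_pred = [1 if s >= 0.5 else 0 for s in scores]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```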
(g) Regression
i. KNN Regression
A. Use the chi-squared features along with s_t, mv_{1,t}, and mv_{10,t} to perform
KNN regression on the data. Use 5-fold cross validation to determine the
value of k ∈ {5, 6, . . . , 30}. You are welcome to test the effect of larger
k’s.
B. Map any predicted \hat{r}(T) whose absolute value is bigger than a reasonable
threshold into a positive or negative signal, according to its sign. The
suggested threshold is 0.5%, but you are welcome to try other thresholds as
well. Obviously, if the threshold is set to 0%, there is no "no action" signal.
C. Report train and test accuracy for this model.
D. Build a confusion matrix for both training and test data for this model.
E. Report AUC, precision, recall, and F1-scores for both training and testing
data.
F. Note: If you have a no action signal, the cases that are detected as no
action should not be considered in evaluating classification metrics.
ii. Support Vector Regression (see footnote 1)
A. Use the chi-squared features along with s_t, mv_{1,t}, and mv_{10,t} to train a
Support Vector regression model on the data. Use L2 regularization. Use
5-fold cross validation to determine the hyperparameters of the algorithm.
Footnote 1: https://medium.com/coinmonks/support-vector-regression-or-svr-8eb3acf6d0ff
B. Map any predicted \hat{r}(T) whose absolute value is bigger than a reasonable
threshold into a positive or negative signal, according to its sign. The
suggested threshold is 0.5%, but you are welcome to try other thresholds as
well. Obviously, if the threshold is set to 0%, there is no "no action" signal.
C. Report train and test accuracy for this model.
D. Build a confusion matrix for both training and test data for this model.
E. Report AUC, precision, recall, and F1-scores for both training and testing
data.
F. Note: If you have a no action signal, the cases that are detected as no
action should not be considered in evaluating classification metrics.
iii. Random Forest and Extra Tree Regression
A. Use the chi-squared features along with s_t, mv_{1,t}, and mv_{10,t} to train a
Random Forest regression model and an Extra Trees regression model
on the data.
B. Map any predicted \hat{r}(T) whose absolute value is bigger than a reasonable
threshold into a positive or negative signal, according to its sign. The
suggested threshold is 0.5%, but you are welcome to try other thresholds as
well. Obviously, if the threshold is set to 0%, there is no "no action" signal.
C. Report train and test accuracy for both models.
D. Build a confusion matrix for both training and test data for both models.
E. Report AUC, precision, recall, and F1-scores for both training and testing
data.
F. Note: If you have a no action signal, the cases that are detected as no
action should not be considered in evaluating classification metrics.
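The mapping from predicted returns to trade signals, and the exclusion of no action cases from the classification metrics, can be sketched as follows. The names are hypothetical, returns are assumed to be stored as fractions (so 0.5% is 0.005), and an actual return of exactly zero is counted as a down move, which is an arbitrary choice.

```python
def to_signal(predicted_return, threshold=0.005):
    """+1 (long), -1 (short) or 0 (no action) from a predicted return."""
    if predicted_return > threshold:
        return 1
    if predicted_return < -threshold:
        return -1
    return 0


def accuracy_excluding_no_action(signals, actual_returns):
    """Share of acted-on days where the signal direction matched the actual move."""
    pairs = [(s, r) for s, r in zip(signals, actual_returns) if s != 0]
    if not pairs:
        return None
    hits = sum(1 for s, r in pairs if (r > 0) == (s > 0))
    return hits / len(pairs)


predicted = [0.012, -0.002, -0.020, 0.004, 0.007]   # toy regression outputs
actual = [0.030, 0.010, 0.015, -0.010, 0.002]       # toy realized returns
signals = [to_signal(p) for p in predicted]
accuracy = accuracy_excluding_no_action(signals, actual)
```

Two of the five predictions fall inside the threshold and are dropped, so accuracy is computed over the remaining three days only.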
(h) Improving The Models: Use any method you know, including ensemble methods,
to yield the best classifier and the best regression model you can. You may
want to reduce the number of features you use using recursive feature elimination.
In that case, use recursive feature elimination inside your cross validation loops.
You are free to use any technique, for example a Recurrent Neural Network or
XGBoost.
(i) Explain why even a test accuracy slightly above 50% is not bad for this problem,
although it is a binary classification problem. Make every effort to have a test
accuracy of at least 60%.
(j) Make a table of the test accuracies for each stock, and identify the best three and
the worst three accuracies. Comment on your results.
2. Trading Scenario
(a) Your capital is C_t at the beginning of each day and E_t at the end of each day.
Assume that you are considering days {1, 2, ..., τ} in your test set, where τ is the
number of your test days. You make long/short decisions on days {1, 2, ..., τ - T}.
Because you have to wait T days to see the effect of your decisions on your capital,
you calculate your capital at the end of days {1 + T, 2 + T, ..., τ}. Repeat all of the
following steps for both T = 3 and T = 5.
(b) Start with an initial capital of C_1 = C_2 = ... = C_{1+T} = E_1 = E_2 = ... = E_T =
$90,000. Only 1/3 of your total money at the end of the previous day should
be invested at the beginning of each day. Thus, if C'_t is the amount you invest
in stocks on day t, you initially invest C'_1 = C'_2 = ... = C'_{1+T} = $30,000.
C'_t changes from $30,000 on day t = 2 + T, because the effect of your decisions
on day 1 changes your capital at the end of day 1 + T (which is E_{1+T}), and 1/3
of E_{1+T} is the capital available for investment on day t = 2 + T,
i.e. C'_{2+T} = E_{1+T}/3.
(c) Invest equal amounts of money in each company. Therefore, if you are considering,
say, M = 25 companies and I_t is the total amount you invest on day t, invest
I_t/M on day t in each company. This means you initially invest
I_t/M = $30,000/25 = $1200 in company m (if it makes your calculations simpler,
you can consider fractional shares, but small remainders do not seem to
significantly affect the results). If the price of each share of company m on
day t is p_t, this means you invest in (I_t/M)/p_t shares of company m on day t.
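The allocation rule in (b) and (c) can be checked with a short sketch. The function name, its parameters, and the ticker symbols and prices are hypothetical; fractional shares are allowed by default, as the text permits.

```python
def allocate(end_capital_previous_day, prices, parts=3, whole_shares=False):
    """Invest 1/parts of the capital, split equally across the companies in `prices`."""
    invested = end_capital_previous_day / parts
    per_company = invested / len(prices)
    shares = {}
    for ticker, price in prices.items():
        quantity = per_company / price
        shares[ticker] = int(quantity) if whole_shares else quantity
    return invested, per_company, shares


# 25 hypothetical companies with made-up prices.
prices = {f"STOCK{i}": 50.0 + i for i in range(25)}
invested, per_company, shares = allocate(90_000, prices)
```

With the initial capital of $90,000 this reproduces the figures in the text: $30,000 invested in total and $1200 per company.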
(d) Start making decisions in the first day in your training set. Trading is done using
long/short signals. If your predicted trade signal for company m is positive on
day t (i.e. if you predict that its price will go up by day t + T), long its shares,
i.e. calculate your return for the share of company m on day t using the following
formula: [formula missing in the source]
On the other hand, if your predicted trade signal for a company is negative on
day t (i.e. if you predict that its price will go down by day t + T), short its shares,
i.e. calculate your return for the share of company m on day t using the following
formula: [formula missing in the source]
[...] predicted no action for a stock using the regression methods, obviously.
(e) The effect of decisions on day t - T on your capital is revealed when you observe
the prices on day t. The total gains and losses on day t resulting from long/short
decisions on day t - T are calculated as: [formula missing in the source],
where q_{t-T} is the number of shares of company m that was traded on day t - T,
and the commission for each trade is considered to be $0.0075, unless \hat{r}(T) was
predicted to be 0 (no action, by a regression model), in which case q_{t-T} = 0. Thus,
your capital at the end of day t is:
E_t = C_t + (total gains and losses on day t), t > T
Obviously, C_{t+1} = E_t, t ∈ {T, ..., τ - 1}, but we introduced C_t and E_t for clarity
of the above descriptions.
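The daily capital update can be illustrated with a sketch. The exact gain formula is not legible in this copy of the handout, so the form used here is an assumption: each long or short position contributes signal * q * (p_t - p_{t-T}), a flat commission of $0.0075 is charged per executed trade, and no action positions contribute nothing. All names and numbers are hypothetical.

```python
COMMISSION = 0.0075  # dollars per trade, as stated in the text


def daily_gain(signals, shares, entry_prices, exit_prices, commission=COMMISSION):
    """ASSUMED form of the day-t gain from decisions made on day t - T.

    signals: +1 long, -1 short, 0 no action; shares: quantity bought or sold short.
    """
    total = 0.0
    for ticker, signal in signals.items():
        if signal == 0:
            continue  # no action: nothing traded, no commission
        price_move = exit_prices[ticker] - entry_prices[ticker]
        total += signal * shares[ticker] * price_move - commission
    return total


signals = {"AAA": 1, "BBB": -1, "CCC": 0}
shares = {"AAA": 10.0, "BBB": 20.0, "CCC": 5.0}
entry = {"AAA": 100.0, "BBB": 50.0, "CCC": 30.0}   # prices on day t - T
exit_ = {"AAA": 103.0, "BBB": 49.0, "CCC": 60.0}   # prices on day t
gain = daily_gain(signals, shares, entry, exit_)
end_capital = 90_000 + gain  # E_t = C_t + gains and losses on day t
```

An oracle trader can be simulated with the same function by setting each signal to the sign of the actual T-day return.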
(f) Plot C_t over the test period for each of your prediction algorithms on the same
graph and compare them. Which method makes you richer at the end of the
test period? You can include any custom-made algorithm you created to improve
the results in this comparison and argue that it works better than the standard
algorithms offered in the description of the project.
(g) Comparison with Oracle trading Dow Jones Industrial Average (DIA):
Repeat the above scenario for the Dow Jones Industrial Average (DIA) and an
omniscient trader (Oracle), i.e. instead of predicting the movements using any
of your algorithms, use the true movements. In other words, if the actual T-day
ahead return is positive on a day, long the stock, and if it is negative, short the
stock. Compare all of your algorithms with the performance of Oracle on DIA on
the same plot and draw conclusions.
