4.6.1 The Stock Market Data

This data set consists of percentage returns for the S&P 500 stock index(美國股票市場指標之一) over 1250 days, from the beginning of 2001 until the end of 2005.

For each date,

  • Lag1 through Lag5: the percentage returns for each of the five previous trading days 賺或賠的回報(比例)
  • Volume: the number of shares traded on the previous day, in billions 前一天的交易數量(百萬)
  • Today: the percentage return on the date in question
  • Direction: whether the market was Up or Down on thisdate) 當日漲或跌Y
In [108]:
#install.packages("ISLR")
rm(list = ls())
library(ISLR)
data(Smarket)
names(Smarket)
dim(Smarket)
  1. 'Year'
  2. 'Lag1'
  3. 'Lag2'
  4. 'Lag3'
  5. 'Lag4'
  6. 'Lag5'
  7. 'Volume'
  8. 'Today'
  9. 'Direction'
  1. 1250
  2. 9
In [109]:
str(Smarket)
'data.frame':	1250 obs. of  9 variables:
 $ Year     : num  2001 2001 2001 2001 2001 ...
 $ Lag1     : num  0.381 0.959 1.032 -0.623 0.614 ...
 $ Lag2     : num  -0.192 0.381 0.959 1.032 -0.623 ...
 $ Lag3     : num  -2.624 -0.192 0.381 0.959 1.032 ...
 $ Lag4     : num  -1.055 -2.624 -0.192 0.381 0.959 ...
 $ Lag5     : num  5.01 -1.055 -2.624 -0.192 0.381 ...
 $ Volume   : num  1.19 1.3 1.41 1.28 1.21 ...
 $ Today    : num  0.959 1.032 -0.623 0.614 0.213 ...
 $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
In [110]:
summary(Smarket)
      Year           Lag1                Lag2                Lag3          
 Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
 1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
 Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
 Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
 3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
 Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
      Lag4                Lag5              Volume           Today          
 Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
 1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
 Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
 Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
 3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
 Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
 Direction 
 Down:602  
 Up  :648  
           
           
           
           

pairwise correlations among the predictors

In [111]:
cor(Smarket[,names(Smarket) != "Direction"])

apply(cor(Smarket[,names(Smarket) != "Direction"]),
     2, FUN = function(x) abs(x) > 0.5)
A matrix: 8 × 8 of type dbl
YearLag1Lag2Lag3Lag4Lag5VolumeToday
Year1.00000000 0.029699649 0.030596422 0.033194581 0.035688718 0.029787995 0.53900647 0.030095229
Lag10.02969965 1.000000000-0.026294328-0.010803402-0.002985911-0.005674606 0.04090991-0.026155045
Lag20.03059642-0.026294328 1.000000000-0.025896670-0.010853533-0.003557949-0.04338321-0.010250033
Lag30.03319458-0.010803402-0.025896670 1.000000000-0.024051036-0.018808338-0.04182369-0.002447647
Lag40.03568872-0.002985911-0.010853533-0.024051036 1.000000000-0.027083641-0.04841425-0.006899527
Lag50.02978799-0.005674606-0.003557949-0.018808338-0.027083641 1.000000000-0.02200231-0.034860083
Volume0.53900647 0.040909908-0.043383215-0.041823686-0.048414246-0.022002315 1.00000000 0.014591823
Today0.03009523-0.026155045-0.010250033-0.002447647-0.006899527-0.034860083 0.01459182 1.000000000
A matrix: 8 × 8 of type lgl
YearLag1Lag2Lag3Lag4Lag5VolumeToday
Year TRUEFALSEFALSEFALSEFALSEFALSE TRUEFALSE
Lag1FALSE TRUEFALSEFALSEFALSEFALSEFALSEFALSE
Lag2FALSEFALSE TRUEFALSEFALSEFALSEFALSEFALSE
Lag3FALSEFALSEFALSE TRUEFALSEFALSEFALSEFALSE
Lag4FALSEFALSEFALSEFALSE TRUEFALSEFALSEFALSE
Lag5FALSEFALSEFALSEFALSEFALSE TRUEFALSEFALSE
Volume TRUEFALSEFALSEFALSEFALSEFALSE TRUEFALSE
TodayFALSEFALSEFALSEFALSEFALSEFALSEFALSE TRUE

cor(Volume,Year) > 0.5

Volum和Year有比較高的相關係數值,從圖也發現

In [126]:
# By plotting the data we see that Volume is increasing over time. In other words, the average number 
# of shares traded daily increased from 2001 to 2005.
par(mfrow = c(1,2))
plot(Smarket[,"Volume"])
plot(Smarket[,"Year"],Smarket[,"Volume"])

logistic Regression

$ P(Y=1|X)= \frac{exp(\beta_0+\beta_1 X_1+...+\beta_p X_p)}{1+exp(\beta_0+\beta_1 X_1+...+\beta_p X_p)} $

We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.

In [162]:
str(Smarket[,"Direction"]) # first level = failure(0), others = success(1)
contrasts(Smarket[,"Direction"])
 Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
A matrix: 2 × 1 of type dbl
Up
Down0
Up1
In [114]:
glm.log <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume ,
               data = Smarket ,family = binomial)
summary(glm.log)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Smarket)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.446  -1.203   1.065   1.145   1.326  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3

Lag1的p-value最小,係數為-0.073

代表: 若有一個$X_{Lag1} > 0$,則會降低看漲機率$P(Y=1|X)$

In [115]:
glm.probs <- predict(glm.log, type = "response")
# type="response" option tells to output probabilities of the form $P(Y =1 | X)$
head(glm.probs)
plot(glm.probs)
1
0.507084133395401
2
0.481467878454591
3
0.481138835214201
4
0.515222355813022
5
0.510781162691538
6
0.506956460534911
In [116]:
res_fit = rep("Down", dim(Smarket)[1])
res_fit[glm.probs > 0.5] = "Up"

table(Fitted = res_fit, True = Smarket$Direction)
mean(res_fit == Smarket[,"Direction"])

In this case, logistic regression correctly predicted the movement of the market 52.16% of the time

似乎只比亂猜(50%)好一點。另外,training error = 47.84%

上面的model是全部都拿去fitting,接著,分train & test data去做:

In [132]:
train = Smarket$Year < 2005
Smarket.2005 = Smarket[!train,]

glm.log_2005 = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume ,
               data = Smarket ,family = binomial, subset = train)
glm.probs_2005 = predict(glm.log_2005, Smarket.2005, type = "response")
In [136]:
summary(glm.log_2005)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Smarket, subset = train)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.302  -1.190   1.079   1.160   1.350  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.191213   0.333690   0.573    0.567
Lag1        -0.054178   0.051785  -1.046    0.295
Lag2        -0.045805   0.051797  -0.884    0.377
Lag3         0.007200   0.051644   0.139    0.889
Lag4         0.006441   0.051706   0.125    0.901
Lag5        -0.004223   0.051138  -0.083    0.934
Volume      -0.116257   0.239618  -0.485    0.628

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1383.3  on 997  degrees of freedom
Residual deviance: 1381.1  on 991  degrees of freedom
AIC: 1395.1

Number of Fisher Scoring iterations: 3
In [135]:
res_fit_2005 = rep("Down", dim(Smarket.2005)[1])
res_fit_2005[glm.probs_2005 > 0.5] = "Up"

table(Fitted = res_fit_2005, True = Smarket.2005$Direction)
mean(res_fit_2005 == Smarket.2005$Direction)
      True
Fitted Down Up
  Down   77 97
  Up     34 44
0.48015873015873

The results are rather disappointing: the test error rate is 52 %, which is worse than random guessing!

結果在testing data的正確率只有48% QQ

注意: model裡各個解釋變數的p-value都很大. 我們試著只用Lag1,Lag2去fit model: (之後的LDA, QDA, KNN都是)

In [163]:
glm.log_2005_x1x2 = glm(Direction ~ Lag1 + Lag2 ,
               data = Smarket ,family = binomial, subset = train)
glm.probs_2005_x1x2 = predict(glm.log_2005_x1x2, Smarket.2005, type = "response")

res_fit_2005_x1x2 = rep("Down", dim(Smarket.2005)[1])
res_fit_2005_x1x2[ glm.probs_2005_x1x2 > 0.5] = "Up"

table(Fitted = res_fit_2005_x1x2, True = Smarket.2005$Direction)
mean(res_fit_2005_x1x2 == Smarket.2005$Direction)
      True
Fitted Down  Up
  Down   35  35
  Up     76 106
0.55952380952381

結果在testing data的正確率提升到56%~

In [164]:
ls()
  1. 'glm.log'
  2. 'glm.log_2005'
  3. 'glm.log_2005_x1x2'
  4. 'glm.probs'
  5. 'glm.probs_2005'
  6. 'glm.probs_2005_x1x2'
  7. 'lda_mod'
  8. 'res_fit'
  9. 'res_fit_2005'
  10. 'res_fit_2005_x1x2'
  11. 'Smarket'
  12. 'Smarket.2005'
  13. 'train'

Linear Discriminant Analysis(LDA)

In [167]:
library(MASS)
lda_mod = lda(Direction ~ Lag1 + Lag2 ,
               data = Smarket , subset = train)
lda_mod
Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293
In [168]:
# Prior probabilities of groups:
table(Smarket[train,"Direction"])
# Group means:
apply(Smarket[ train & (Smarket[,"Direction"] == "Down") ,c("Lag1","Lag2") ] ,
      2, mean)
apply(Smarket[ train & (Smarket[,"Direction"] == "Up") ,c("Lag1","Lag2") ] ,
      2, mean)
Down   Up 
 491  507 
Lag1
0.0427902240325866
Lag2
0.0338940936863544
Lag1
-0.0395463510848126
Lag2
-0.0313254437869822
In [170]:
lda.pred = predict(lda_mod, Smarket.2005)
str(lda.pred)
List of 3
 $ class    : Factor w/ 2 levels "Down","Up": 2 2 2 2 2 2 2 2 2 2 ...
 $ posterior: num [1:252, 1:2] 0.49 0.479 0.467 0.474 0.493 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:252] "999" "1000" "1001" "1002" ...
  .. ..$ : chr [1:2] "Down" "Up"
 $ x        : num [1:252, 1] 0.0829 0.5911 1.1672 0.8334 -0.0379 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:252] "999" "1000" "1001" "1002" ...
  .. ..$ : chr "LD1"
  • class: LDA’s predictions about the movement of the market.
  • posterior a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class, computed from $P(Y=k | X= x) = \frac{\pi_k f_k(x)}{ \sum_{l=1}^K \pi_l f_l(x) }$, $f_k \sim N(\mu_k,\sigma^2)$
  • x contains the linear discriminants, $-0.642*Lag_1 - 0.514*Lag_2$ is large.
In [171]:
res_lda = lda.pred$class

table(fitted = res_lda, Smarket.2005$Direction)
mean(res_lda == Smarket.2005$Direction)
      
fitted Down  Up
  Down   35  35
  Up     76 106
0.55952380952381

note: LDA and logistic regression predictions are almost identical.

If applying a 50% threshold to the posterior probabilities

In [173]:
summary(lda.pred$posterior)
      Down              Up        
 Min.   :0.4578   Min.   :0.4798  
 1st Qu.:0.4847   1st Qu.:0.4994  
 Median :0.4926   Median :0.5074  
 Mean   :0.4923   Mean   :0.5077  
 3rd Qu.:0.5006   3rd Qu.:0.5153  
 Max.   :0.5202   Max.   :0.5422  
In [178]:
#lda.pred$posterior
sum(lda.pred$posterior[,2] >= 0.5) # =76+106
sum(lda.pred$posterior[,1] >= 0.5) # =35+35
182
70

Quadratic Discriminant Analysis(QDA)

In [183]:
qda_mod = qda(Direction ~ Lag1 + Lag2 ,
               data = Smarket , subset = train)
qda_mod
Call:
qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544
In [184]:
qda.pred = predict(qda_mod, Smarket.2005)
str(qda.pred)
List of 2
 $ class    : Factor w/ 2 levels "Down","Up": 2 2 2 2 2 2 2 2 2 2 ...
 $ posterior: num [1:252, 1:2] 0.487 0.476 0.464 0.474 0.49 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:252] "999" "1000" "1001" "1002" ...
  .. ..$ : chr [1:2] "Down" "Up"
In [185]:
res_qda = qda.pred$class

table(fitted = res_qda, Smarket.2005$Direction)
mean(res_qda == Smarket.2005$Direction)
      
fitted Down  Up
  Down   30  20
  Up     81 121
0.599206349206349

QDA在testing data的正確率高達60%; LDA=logistic= 56%.

K-Nearest Neighbors

In [190]:
library(class)
train_X = Smarket[ train, c("Lag1", "Lag2")]
test_X = Smarket[ !train, c("Lag1", "Lag2")]
train_Y = Smarket[ train, c("Direction")]
In [199]:
set.seed(1)
KNN_1 = knn(train = train_X, test = test_X, cl = train_Y, k = 1)
str(KNN_1)
table(Fitted = KNN_1, True = Smarket.2005$Direction)
mean(KNN_1 == Smarket.2005$Direction)
 Factor w/ 2 levels "Down","Up": 2 1 2 2 2 1 1 1 1 2 ...
      True
Fitted Down Up
  Down   43 58
  Up     68 83
0.5
In [200]:
KNN_3 = knn(train = train_X, test = test_X, cl = train_Y, k = 3)
str(KNN_3)
table(Fitted = KNN_3, True = Smarket.2005$Direction)
mean(KNN_3 == Smarket.2005$Direction)
 Factor w/ 2 levels "Down","Up": 1 1 1 2 2 1 2 1 2 2 ...
      True
Fitted Down Up
  Down   48 54
  Up     63 87
0.535714285714286
logistic LDA QDA KNN
56% 56% 60% 54%

代表這筆資料上,QDA model的assumption可能比較適合!

In [ ]: