4.6.1 The Stock Market Data¶

This data set consists of percentage returns for the S&P 500 stock index(美國股票市場指標之一) over 1250 days, from the beginning of 2001 until the end of 2005.¶

For each date,¶

Lag1 through Lag5: the percentage returns for each of the five previous trading days 賺或賠的回報(比例)
Volume: the number of shares traded on the previous day, in billions 前一天的交易數量(百萬)
Today: the percentage return on the date in question
Direction: whether the market was Up or Down on thisdate) 當日漲或跌Y

#install.packages("ISLR")
rm(list = ls())
library(ISLR)
data(Smarket)
names(Smarket)
dim(Smarket)

str(Smarket)

'data.frame':	1250 obs. of  9 variables:
 $ Year     : num  2001 2001 2001 2001 2001 ...
 $ Lag1     : num  0.381 0.959 1.032 -0.623 0.614 ...
 $ Lag2     : num  -0.192 0.381 0.959 1.032 -0.623 ...
 $ Lag3     : num  -2.624 -0.192 0.381 0.959 1.032 ...
 $ Lag4     : num  -1.055 -2.624 -0.192 0.381 0.959 ...
 $ Lag5     : num  5.01 -1.055 -2.624 -0.192 0.381 ...
 $ Volume   : num  1.19 1.3 1.41 1.28 1.21 ...
 $ Today    : num  0.959 1.032 -0.623 0.614 0.213 ...
 $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...

summary(Smarket)

      Year           Lag1                Lag2                Lag3          
 Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
 1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
 Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
 Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
 3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
 Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
      Lag4                Lag5              Volume           Today          
 Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
 1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
 Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
 Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
 3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
 Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
 Direction 
 Down:602  
 Up  :648

pairwise correlations among the predictors¶

cor(Smarket[,names(Smarket) != "Direction"])

apply(cor(Smarket[,names(Smarket) != "Direction"]),
     2, FUN = function(x) abs(x) > 0.5)

cor(Volume,Year) > 0.5¶

Volum和Year有比較高的相關係數值，從圖也發現

# By plotting the data we see that Volume is increasing over time. In other words, the average number 
# of shares traded daily increased from 2001 to 2005.
par(mfrow = c(1,2))
plot(Smarket[,"Volume"])
plot(Smarket[,"Year"],Smarket[,"Volume"])

logistic Regression¶

$P(Y=1|X)= \frac{exp(\beta_0+\beta_1 X_1+...+\beta_p X_p)}{1+exp(\beta_0+\beta_1 X_1+...+\beta_p X_p)}$

We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.

str(Smarket[,"Direction"]) # first level = failure(0), others = success(1)
contrasts(Smarket[,"Direction"])

 Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...

glm.log <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume ,
               data = Smarket ,family = binomial)
summary(glm.log)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Smarket)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.446  -1.203   1.065   1.145   1.326  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3

Lag1的p-value最小，係數為-0.073

代表: 若有一個 $X_{Lag1} > 0$ ，則會降低看漲機率 $P(Y=1|X)$

glm.probs <- predict(glm.log, type = "response")
# type="response" option tells to output probabilities of the form $P(Y =1 | X)$
head(glm.probs)
plot(glm.probs)

res_fit = rep("Down", dim(Smarket)[1])
res_fit[glm.probs > 0.5] = "Up"

table(Fitted = res_fit, True = Smarket$Direction)
mean(res_fit == Smarket[,"Direction"])

In this case, logistic regression correctly predicted the movement of the market 52.16% of the time¶

似乎只比亂猜(50%)好一點。另外，training error = 47.84%

上面的model是全部都拿去fitting，接著，分train & test data去做:

train = Smarket$Year < 2005
Smarket.2005 = Smarket[!train,]

glm.log_2005 = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume ,
               data = Smarket ,family = binomial, subset = train)
glm.probs_2005 = predict(glm.log_2005, Smarket.2005, type = "response")

summary(glm.log_2005)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Smarket, subset = train)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.302  -1.190   1.079   1.160   1.350  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.191213   0.333690   0.573    0.567
Lag1        -0.054178   0.051785  -1.046    0.295
Lag2        -0.045805   0.051797  -0.884    0.377
Lag3         0.007200   0.051644   0.139    0.889
Lag4         0.006441   0.051706   0.125    0.901
Lag5        -0.004223   0.051138  -0.083    0.934
Volume      -0.116257   0.239618  -0.485    0.628

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1383.3  on 997  degrees of freedom
Residual deviance: 1381.1  on 991  degrees of freedom
AIC: 1395.1

Number of Fisher Scoring iterations: 3

res_fit_2005 = rep("Down", dim(Smarket.2005)[1])
res_fit_2005[glm.probs_2005 > 0.5] = "Up"

table(Fitted = res_fit_2005, True = Smarket.2005$Direction)
mean(res_fit_2005 == Smarket.2005$Direction)

      True
Fitted Down Up
  Down   77 97
  Up     34 44

The results are rather disappointing: the test error rate is 52 %, which is worse than random guessing!¶

結果在testing data的正確率只有48% QQ

注意: model裡各個解釋變數的p-value都很大. 我們試著只用Lag1,Lag2去fit model: (之後的LDA, QDA, KNN都是)

glm.log_2005_x1x2 = glm(Direction ~ Lag1 + Lag2 ,
               data = Smarket ,family = binomial, subset = train)
glm.probs_2005_x1x2 = predict(glm.log_2005_x1x2, Smarket.2005, type = "response")

res_fit_2005_x1x2 = rep("Down", dim(Smarket.2005)[1])
res_fit_2005_x1x2[ glm.probs_2005_x1x2 > 0.5] = "Up"

table(Fitted = res_fit_2005_x1x2, True = Smarket.2005$Direction)
mean(res_fit_2005_x1x2 == Smarket.2005$Direction)

      True
Fitted Down  Up
  Down   35  35
  Up     76 106

結果在testing data的正確率提升到56%~

ls()

Linear Discriminant Analysis(LDA)¶

library(MASS)
lda_mod = lda(Direction ~ Lag1 + Lag2 ,
               data = Smarket , subset = train)
lda_mod

Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293

# Prior probabilities of groups:
table(Smarket[train,"Direction"])
# Group means:
apply(Smarket[ train & (Smarket[,"Direction"] == "Down") ,c("Lag1","Lag2") ] ,
      2, mean)
apply(Smarket[ train & (Smarket[,"Direction"] == "Up") ,c("Lag1","Lag2") ] ,
      2, mean)

Down   Up 
 491  507

lda.pred = predict(lda_mod, Smarket.2005)
str(lda.pred)

List of 3
 $ class    : Factor w/ 2 levels "Down","Up": 2 2 2 2 2 2 2 2 2 2 ...
 $ posterior: num [1:252, 1:2] 0.49 0.479 0.467 0.474 0.493 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:252] "999" "1000" "1001" "1002" ...
  .. ..$ : chr [1:2] "Down" "Up"
 $ x        : num [1:252, 1] 0.0829 0.5911 1.1672 0.8334 -0.0379 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:252] "999" "1000" "1001" "1002" ...
  .. ..$ : chr "LD1"

class: LDA’s predictions about the movement of the market.
posterior a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class, computed from $P(Y=k | X= x) = \frac{\pi_k f_k(x)}{ \sum_{l=1}^K \pi_l f_l(x) }$ , $f_k \sim N(\mu_k,\sigma^2)$
x contains the linear discriminants, $-0.642*Lag_1 - 0.514*Lag_2$ is large.

res_lda = lda.pred$class

table(fitted = res_lda, Smarket.2005$Direction)
mean(res_lda == Smarket.2005$Direction)

      
fitted Down  Up
  Down   35  35
  Up     76 106

note: LDA and logistic regression predictions are almost identical.

If applying a 50% threshold to the posterior probabilities¶

summary(lda.pred$posterior)

      Down              Up        
 Min.   :0.4578   Min.   :0.4798  
 1st Qu.:0.4847   1st Qu.:0.4994  
 Median :0.4926   Median :0.5074  
 Mean   :0.4923   Mean   :0.5077  
 3rd Qu.:0.5006   3rd Qu.:0.5153  
 Max.   :0.5202   Max.   :0.5422

#lda.pred$posterior
sum(lda.pred$posterior[,2] >= 0.5) # =76+106
sum(lda.pred$posterior[,1] >= 0.5) # =35+35

Quadratic Discriminant Analysis(QDA)¶

qda_mod = qda(Direction ~ Lag1 + Lag2 ,
               data = Smarket , subset = train)
qda_mod

Call:
qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

qda.pred = predict(qda_mod, Smarket.2005)
str(qda.pred)

List of 2
 $ class    : Factor w/ 2 levels "Down","Up": 2 2 2 2 2 2 2 2 2 2 ...
 $ posterior: num [1:252, 1:2] 0.487 0.476 0.464 0.474 0.49 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:252] "999" "1000" "1001" "1002" ...
  .. ..$ : chr [1:2] "Down" "Up"

res_qda = qda.pred$class

table(fitted = res_qda, Smarket.2005$Direction)
mean(res_qda == Smarket.2005$Direction)

      
fitted Down  Up
  Down   30  20
  Up     81 121

QDA在testing data的正確率高達60%; LDA=logistic= 56%.¶

K-Nearest Neighbors¶

library(class)
train_X = Smarket[ train, c("Lag1", "Lag2")]
test_X = Smarket[ !train, c("Lag1", "Lag2")]
train_Y = Smarket[ train, c("Direction")]

set.seed(1)
KNN_1 = knn(train = train_X, test = test_X, cl = train_Y, k = 1)
str(KNN_1)
table(Fitted = KNN_1, True = Smarket.2005$Direction)
mean(KNN_1 == Smarket.2005$Direction)

 Factor w/ 2 levels "Down","Up": 2 1 2 2 2 1 1 1 1 2 ...

      True
Fitted Down Up
  Down   43 58
  Up     68 83

KNN_3 = knn(train = train_X, test = test_X, cl = train_Y, k = 3)
str(KNN_3)
table(Fitted = KNN_3, True = Smarket.2005$Direction)
mean(KNN_3 == Smarket.2005$Direction)

 Factor w/ 2 levels "Down","Up": 1 1 1 2 2 1 2 1 2 2 ...

      True
Fitted Down Up
  Down   48 54
  Up     63 87

logistic	LDA	QDA	KNN
56%	56%	60%	54%

	Year	Lag1	Lag2	Lag3	Lag4	Lag5	Volume	Today
Year	1.00000000	0.029699649	0.030596422	0.033194581	0.035688718	0.029787995	0.53900647	0.030095229
Lag1	0.02969965	1.000000000	-0.026294328	-0.010803402	-0.002985911	-0.005674606	0.04090991	-0.026155045
Lag2	0.03059642	-0.026294328	1.000000000	-0.025896670	-0.010853533	-0.003557949	-0.04338321	-0.010250033
Lag3	0.03319458	-0.010803402	-0.025896670	1.000000000	-0.024051036	-0.018808338	-0.04182369	-0.002447647
Lag4	0.03568872	-0.002985911	-0.010853533	-0.024051036	1.000000000	-0.027083641	-0.04841425	-0.006899527
Lag5	0.02978799	-0.005674606	-0.003557949	-0.018808338	-0.027083641	1.000000000	-0.02200231	-0.034860083
Volume	0.53900647	0.040909908	-0.043383215	-0.041823686	-0.048414246	-0.022002315	1.00000000	0.014591823
Today	0.03009523	-0.026155045	-0.010250033	-0.002447647	-0.006899527	-0.034860083	0.01459182	1.000000000

	Year	Lag1	Lag2	Lag3	Lag4	Lag5	Volume	Today
Year	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE
Lag1	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
Lag2	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE
Lag3	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE
Lag4	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE
Lag5	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE
Volume	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE
Today	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE