Monday, 29 April 2013

Editing/Adding factor levels in R

I was trying to change few levels in my factor variable by simply coercing characters on that factor variable but it dint seem to work.

iris$Species[c(50:120)] <- rep("Random", 71)
## Warning: invalid factor level, NAs generated
##   [1] setosa    setosa    setosa    setosa    setosa    setosa    setosa   
##   [8] setosa    setosa    setosa    setosa    setosa    setosa    setosa   
##  [15] setosa    setosa    setosa    setosa    setosa    setosa    setosa   
##  [22] setosa    setosa    setosa    setosa    setosa    setosa    setosa   
##  [29] setosa    setosa    setosa    setosa    setosa    setosa    setosa   
##  [36] setosa    setosa    setosa    setosa    setosa    setosa    setosa   
##  [43] setosa    setosa    setosa    setosa    setosa    setosa    setosa   
##  [50] <NA>      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     

##  [57] <NA>      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
##  [64] <NA>      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     

##  [71] <NA>      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
##  [78] <NA>      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     

##  [85] <NA>      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
##  [92] <NA>      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     

##  [99] <NA>      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
## [106] <NA>      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     

## [113] <NA>      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
## [120] <NA>      virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica virginica
## [134] virginica virginica virginica virginica virginica virginica virginica
## [141] virginica virginica virginica virginica virginica virginica virginica
## [148] virginica virginica virginica
## Levels: setosa versicolor virginica

Well, I did find a way to find a work around for that by doing this:

iris$Species <- as.character(iris$Species)
iris$Species[c(50:120)] <- rep("Random", 71)
iris$Species <- as.factor(iris$Species)
##   [1] setosa    setosa    setosa    setosa    setosa    setosa    setosa   
##   [8] setosa    setosa    setosa    setosa    setosa    setosa    setosa   
##  [15] setosa    setosa    setosa    setosa    setosa    setosa    setosa   
##  [22] setosa    setosa    setosa    setosa    setosa    setosa    setosa   
##  [29] setosa    setosa    setosa    setosa    setosa    setosa    setosa   
##  [36] setosa    setosa    setosa    setosa    setosa    setosa    setosa   
##  [43] setosa    setosa    setosa    setosa    setosa    setosa    setosa   
##  [50] Random    Random    Random    Random    Random    Random    Random   
##  [57] Random    Random    Random    Random    Random    Random    Random   
##  [64] Random    Random    Random    Random    Random    Random    Random   
##  [71] Random    Random    Random    Random    Random    Random    Random   
##  [78] Random    Random    Random    Random    Random    Random    Random   
##  [85] Random    Random    Random    Random    Random    Random    Random   
##  [92] Random    Random    Random    Random    Random    Random    Random   
##  [99] Random    Random    Random    Random    Random    Random    Random   
## [106] Random    Random    Random    Random    Random    Random    Random   
## [113] Random    Random    Random    Random    Random    Random    Random   
## [120] Random    virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica virginica
## [134] virginica virginica virginica virginica virginica virginica virginica
## [141] virginica virginica virginica virginica virginica virginica virginica
## [148] virginica virginica virginica
## Levels: Random setosa virginica

This problem annoyed me at first, “Why would R not allow me to change/add factor levels!?!@#!@#?” but then Utkarsh and I had a conversation about this which made me think otherwise.

Excerpts from the conversation:

Utkarsh: It is usually not good to create data on the fly. Besides, when you create a factor variable, you should give the finite set of values it can take. This prevents future mistakes. It is called type checking. Python does not do it. R does it to some extent. C does it to some extent. Haskell does it very very strictly and it prevents about 50% of bugs from appearing. Let's say you misspell one of the levels.

In retrospect, it actually makes sense for us not to be able to add/edit the levels in factor variables. For a simple reason, we “might” make mistake, and misspelling a factor level could cause serious trouble. Lesson learnt!

Sunday, 28 April 2013

Forecasting stock returns using ARIMA model with exogenous variable in R

Why is it important?

Why is it important?

India has a lot to achieve in terms of becoming a developed nation from an economic standpoint. An aspect which, in my opinion, is of utmost importance is the formation of structurally sound and robust financial markets. A prerequisite for that is active participation of educated and informed traders in the market place which would result in better price discovery and in turn better functioning market in general.

Statistical modelling techniques supplemented with some subject understanding could be an informed trading strategy. In the long run it might not be possible to outplay the market using a simple backward looking statistical model, but in the short run intelligent estimates based on model and subject matter expertise could prove to be helpful. In our previous posts with Infosys stock prices, we used basic visualization and simple linear regression techniques to try and predict the future returns from historical returns. Lets step on the pedal and move over to some more sophisticated techniques to do the same. We will start with the same basics of running basic checks on the data and then take a deeper dive in terms of modelling technique to use.

data <- read.csv("01-10-2010-TO-01-10-2011INFYEQN.csv")
##         Date      Close.Price  
##  01-Apr-11:  1   Min.   :2183  
##  01-Aug-11:  1   1st Qu.:2801  
##  01-Dec-10:  1   Median :2993  
##  01-Feb-11:  1   Mean   :2929  
##  01-Jul-11:  1   3rd Qu.:3106  
##  01-Jun-11:  1   Max.   :3481  
##  (Other)  :245
plot(as.Date(data$Date, "%d-%b-%y"), data$Close.Price, xlab = "Dates", ylab = "Adjusted closing price", 
    type = "l", col = "red", main = "Adjusted closing price of INFOSYS for past 1 year")

plot of chunk unnamed-chunk-1

There seems to be a lot of randomness in the series and the adf.test results prove that the series is non-stationary (I(1)). Which means that the series will have to be first differenced to make is stationary. (Refer to this post for more understanding on stationarity).

library(tseries, quietly = T)
##  Augmented Dickey-Fuller Test
## data:  data$Close.Price 
## Dickey-Fuller = -2.451, Lag order = 6, p-value = 0.3858
## alternative hypothesis: stationary
infy_ret <- 100 * diff(log(data$Close.Price))

Auto-regressive moving average (ARMA) model

There is one primary difference between time series and cross sectional datasets and that is the presence of auto-correlation in time series data. The concept of auto-correlation is not applicable to cross sectional regression as there is no dependence in the observations. However, there is explicit dependent of time series' future value on its near past values. We arrive at the estimates in a time series model after solving the Yule Walker equations unlike MLE or simple OLS techniques in the case of cross sectional linear regressions.

The idea of an ARMA model is fairly intuitive to understand, however, the math gets extremely tricky. We will take a crack at explaining what an ARMA model is in laymen language. A typical ARMA(1,1) model can be expressed as :

\[ \begin{equation} z_t = \alpha + \phi z_{t-1} + \theta\epsilon_{t-1} + \epsilon_t \end{equation} \]

The (1,1) in the equation stand for the auto-regressive(\( z_{t} \)) and moving average(\( \epsilon_{t} \)) lag orders respectively. The intuitive understanding of the above equation is pretty straightforward. The current value of the time series \( z_t \) will depend on the past value of the series \( z_{t-1} \) and will correct itself to the error made in the last time period \( \epsilon_{t-1} \). Lets try and fit an ARMA model to our INFY returns data and see how the results turn out.

summary(arma(infy_ret, order = c(2, 2)))
## Call:
## arma(x = infy_ret, order = c(2, 2))
## Model:
## ARMA(2,2)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5723 -0.9214 -0.0719  0.9289  4.5550 
## Coefficient(s):
##            Estimate  Std. Error  t value Pr(>|t|)    
## ar1        -0.34007     0.02591   -13.12   <2e-16 ***
## ar2        -0.97524     0.01817   -53.67   <2e-16 ***
## ma1         0.44427     0.00702    63.31   <2e-16 ***
## ma2         1.01522     0.00802   126.60   <2e-16 ***
## intercept   0.00525     0.21054     0.02     0.98    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## Fit:
## sigma^2 estimated as 2.9,  Conditional Sum-of-Squares = 718.3,  AIC = 985.8

We can see that AR as well as MA coefficients are all significant at 99%, evident from the small p-values. Since our objective here is to forecast future returns lets evaluate the performance of the ARMA model in terms of out-of-sample forecast performance. For this we will divide the data into 2 parts, on one we will train the model and on the other we will test the out-of-sample forecast ability.

Here Wehave used ARIMA function to fit the model as the object type “arima” is easily compatible with forecast() and predict() function. ARIMA is nothing by a normal ARMA model with the order of integration included as an argument to the function. In our case, our series was I(1) but we have first differenced it already so in the ARIMA function we will keep the “I” part = 0.

library(forecast, quietly = T)
infy_ret_train <- infy_ret[1:(0.9 * length(infy_ret))]  # Train dataset
infy_ret_test <- infy_ret[(0.9 * length(infy_ret) + 1):length(infy_ret)]  # Test dataset

fit <- arima(infy_ret_train, order = c(2, 0, 2))
arma.preds <- predict(fit, n.ahead = (length(infy_ret) - (0.9 * length(infy_ret))))$pred
arma.forecast <- forecast(fit, h = 25)

plot(arma.forecast, main = "ARMA forecasts for INFY returns")

plot of chunk unnamed-chunk-4

accuracy(arma.preds, infy_ret_test)[2]  # RMSE values
##  RMSE 
## 2.489

Above are the results that we obtain with a simple ARMA(2,2) model. The orange and yellow region provide us the 99% and 95% confidence level for the forecasts respectively. An intrinsic shortcoming of ARMA models, which is evident from the plot above, is the assumption of mean reversion of the series. What this means is that after some time in future the forecasts would tend to the mean of the time series \( z_{t} \)'s historical values thus making it a poor model for long term predictions.

Now, there are some intuitive variables that one can introduce in the model based on subjective understanding to improve the model. In cases where one wishes to augment a simple univariate time series regression with some exogenous set of variable, ARIMAX function can be employed. In cases where the additional variables could have a feedback relation with the time series in question (i.e they are endogenous) one can employ Vector auto regressive (VAR) models. Let me try and elaborate a little on them before I start to sound confusing. In our example above in question, lets say that our hypothesis is that day of the week has an effect on the stock prices. To include this in our model all that we need is 4 new dummy variables for 4 days of the week (5th one by default goes to the intercept) and include them in the above ARMA model using ARIMAX function. Here these dummy variables will be completely exogenous to our dependent variable (INFY returns), because no matter how/what the stock price is for INFY, its not going to affect the day of the week! However, lets say we wanted to include NIFTY returns as an additional variable in the analysis, a VAR model would be preferable. The reason being that there could be a feedback relation between INFY returns and NIFTY returns which might be ignored if we use a simple ARIMAX function.

ARIMA model with day of the week variable

We will try and illustrate with an example the former where we will use day of the week as an exogenous variable to augment our ARMA model for INFY returns. The ARIMAX model can be simply written as:

\[ \begin{equation} z_t = \alpha + \phi z_{t-1} + \theta\epsilon_{t-1} + \gamma x_t + \epsilon_t \end{equation} \]

where, \( x_{t} \) is the exogenous variable. In our case we will have 4 dummy variables created for the 4 days.

data$day <- as.factor(weekdays(as.Date(data$Date, "%d-%b-%y")))
days <- data$day[2:nrow(data)]
xreg1 <- model.matrix(~as.factor(days))[, 2:5]
colnames(xreg1) <- c("Monday", "Thursday", "Tuesday", "Wednesday")

fit2 <- arima(infy_ret_train, order = c(2, 0, 2), xreg = xreg1[c(1:(0.9 * length(infy_ret))), 
fit1.preds <- forecast(fit2, h = 25, xreg = xreg1[c(226:250), ])
fit1.preds <- predict(fit2, n.ahead = 25, newxreg = xreg1[c(226:250), ])
plot(forecast(fit2, h = 20, xreg = xreg1[c(226:250), ]), main = "ARIMAX forecasts of INFY returns")

plot of chunk unnamed-chunk-5

accuracy(fit1.preds$pred, infy_ret_test)[2]
##  RMSE 
## 2.431

The performance of the ARIMA model with weekdays factor variable seems to be better than a simple ARMA model which is evident from the lower RMSE of the ARIMAX model. This is just one example of variables that could be used to augment a simple ARMA model, there could be many more variants of such variables that might further increase the performance of the model. In the next post we would try to cover vector auto regression and how/when it can be used.

Feedback/criticisms welcome.

Tuesday, 19 March 2013

Dealing with different object types in a vector in R

I came across a little problem while dealing with a vector in R which had one of the most simple solutions. These are, in my opinion, the most annoying problems with the most simple and commonsensical solution. Anyways, yet again Utkarsh comes to rescue and slaps me with the realization that no matter what one does nothing can match an open mind and common sense.Well, I hope someone struggling with a similar problem might bump into this post and might save him/her some frustration and time.


How to deal with different object types in one vector.


> testdata <- data.frame(id = c(1:5), x = c(1,2,3,NA,"Shre"))
> testdata
  id    x
1  1    1
2  2    2
3  3    3
4  4 <NA>
5  5 Shre
Here we have a data frame "testdata" where we have a variable "x" which say is supposed to be numeric but due to that funny entry "Shre" in the 5th row, the variable becomes a "factor".

> sapply(testdata[1,],class)
       id         x 
"integer"  "factor" 

 Now, if I try to convert the the variable "x" to numeric using as.numeric() this is what happens.
> testdata$x <- as.numeric(testdata$x) 
> testdata$x
[1]  1  2  3 NA  4
R understands "x" to be a factor and tries to coerce it to a numeric, which is a common mistake that I make in data analysis. This does not throw any error which makes it all the more difficult to detect/debug. Treating "x" to be factor the new value "Shre" is treated as the 4th category of the factor and is assigned numeric value "4" which is not what you want.

So an elegant way of dealing with this is:

> testdata$x <- as.numeric(as.character(testdata$x)) 
Warning message:
NAs introduced by coercion 
> testdata$x
[1]  1  2  3 NA NA
Tada! We have coerced "NA's" but at least we do not have a random value "4" assigned to an observation.

Tuesday, 4 December 2012

R FAQs for the fresh starters

R, which was largely predominant in the academic world, has started picking up a lot in businesses as well. At least that is what I am witnessing among my colleagues. Lot of people have started experimenting with R, choosing the path to enlightenment. With increase usage, however, have come an increased number of queries as well. What's interesting to see that though people are working on very different projects, the queries are largely the same, with most of them relating to data handling in R.

Keeping that in mind, I thought it would be nice to have a repository of Frequently Asked Questions. I can then directly refer the inquirers to the webpage. Below are the queries that I have been asked most often, not necessarily in order of number of queries though.

We will use the classic iris data set to illustrate the code with examples.

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

NOTE: In case you plan to go through the queries in sequence, please run data(iris) before each query, unless otherwise specified.

1. How to name or rename a column in a data frame?
We can use the names() function to do this.

To find the column  names of a data frame, do

To change the name of a column, say the 4th column, do
names(iris)[4] <- ''

To change the names of multiple columns, say 2nd and 4th, do
names(iris)[c(2, 4)] <- c(', '')

To change the names of all the columns, do
names(iris) <- c(', '' ....)

2. How to determine the column information like names, type, missing values etc. in R? Similar to proc contents in SAS.
There are two easy functions to do this.

To get brief info, do 

To get detailed info, do

3. How to export a data frame so that it can be used in other applications?
The best way is to export a csv file since most applications accept that format.
write.csv(iris, 'iris.csv', row.names= FALSE)

4. How to select a particular row/column in a data frame?
The easiest way to do this is to use the indexing notation [].

To select the first column only
iris[, 1]

To select first column and put contents in a new vector
new.vec <- iris[, 1]

To select multiple columns, say 1st, 2nd and 5th,  and put them in a data frame <- iris[, c(1, 2, 5)]

To select the first row only
iris[1, ]

To select first row and 3rd column
iris[1, 3]

To select multiple rows from the 3rd column
iris[c(1, 4, 10, 111), 3]

5. How to aggregate a data set based on a variable? Similar to group by in proc sql.
Say we want to aggregate the entire iris data set by Species such that the new data set will have only 3 rows and the columns will have the mean value of the respective column.

We can use the aggregate function to do this.
iris.agg <- aggregate(iris[, c(1, 2, 3, 4)], by= list(iris$Species), FUN= mean)


     Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026

In case a different function needs to be applied to different columns, we need to use the powerful ddply() function from plyr.


iris.agg <- ddply(iris, .variables= .(Species), summarise,
                  sl.mean = mean(Sepal.Length),
                  sw.median = median(Sepal.Width),
                  pl.max = max(Petal.Length),
         = sd(Petal.Width), .progress= 'text')


     Species sl.mean sw.median pl.max
1     setosa   5.006       3.4    1.9 0.1053856
2 versicolor   5.936       2.8    5.1 0.1977527
3  virginica   6.588       3.0    6.9 0.2746501

6. How to create deciles of a particular variable? Similar to proc rank in SAS.
We can use the cut() function for this.
For example, to create deciles or 10 bins from Sepal.Length, do

iris$sp.decile <- cut(iris$Sepal.Length, 
                      breaks= quantile(iris$Sepal.Length, 
                      probs= seq(0, 1, by= 0.1)),
                      include.lowest= TRUE, labels= c(1:10))


 1  2  3  4  5  6  7  8  9 10 
16 16 13 20 15 15 13 12 17 13

Here we create 'sp.decile' as another column in the iris data set. After this, in case, we need to determine, say, the mean of each variable, we can use the aggregate function as below. <- aggregate(iris[, c(1, 2, 3, 4)], by= 
list(iris$sp.decile), FUN= mean)

7. How to deal with missing values? 
Dealing with missing values in R is not very difficult, provided we use the correct notation.

Suppose we know the form of missing values in our file and it is . (period), i.e., for each observation that has a missing value, there is a . (period) in that cell. Then while importing the data, do

data.set <- read.csv('filename.csv', na.strings= '.')

In case the missing value is #N/A, then do
data.set <- read.csv('filename.csv', na.strings= '#N/A')

Similarly for other cases, we can substitute the missing value notation in the na.strings argument.

In case we are not sure of the missing values, then we first need to import the data and have a look at the values to decide.

These are some of the common doubts that I have come across. I'll keep adding to the list as I keep getting newer ones. Please do let me know if there something you believe should be added to this. I'll do it right away.

Monday, 8 October 2012

CrowdANALYTIX - Ideation Contest - Warranty Pricing

I recently completed an ideation contest on CrowdANALYTIX where the participants had to build an approach towards warranty pricing and fraud detection.

Ideation contests are quite different from the usual data mining contests where the objective is solely to minimize the error (or maximize the accuracy). They are centered more around defining the problem and conceptualizing it with a framework.

In the contest, we had to structure the problem first with respect to the business and then extend it to provide possible analytical solutions for optimizing warranty prices and detecting fraud, including potential data that we would require. Having no experience in either of these areas (warranty and fraud), I tried to draw parallels between insurance pricing and warranty pricing keeping the fundamental differences between them in mind.

Luckily, my approach was chosen to be one of the finalists and was posted on their website.

Sunday, 26 August 2012

Kaggle Prospect - Harvard Business Review

This post is meant for submitting visual analysis for the Harvard Business Review Contest on Kaggle

I used the subject lines for all the articles and all the years and mapped the articles into one of the following 18 categories

  1.  Business Ethics
  2.  Business Management
  3.  Crisis
  4.  Emerging Markets
  5.  Financial Performance
  6.  Health Care
  7.  Information Technology
  8.  Labor
  9.  Leadership
  10.  Management Systems
  11.  Marketing Strategy
  12.  Regulation
  13.  Social Media
  14.  Stock Market
  15.  Strategic Planning
  16.  Supply Chain
  17.  United States & World
  18.  Women & Management

Changes in popularity of these topics were visualized using the googleVis package for R. This visualization is available here (I could not figure out how to upload it on Kaggle).


  1. The average number of pages per article has gone down steadily since the 1950s falling to below 5 pages per article in for the first time in 1981 and then staying pretty much below that mark. Could this partly be attributed to the internet revolution?
  2. Recent trending topics are related to Emerging Markets & China, and Social Media.
  3. Some evergreen topics in HBR include - Business Management, Employees/Workforce, Labor, Marketing Strategy, and Strategic Planning.
  4. The lengthiest articles are on issues concerning United States & the World, Regulation and Management Systems