A BRIEF INTRO TO ‘R’ – APPLIED 
STATS & TIME SERIES ANALYSIS 
- Shanmukha Sreenivas P
THE R ENVIRONMENT
• R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes:
– An effective data handling and storage facility
– A suite of operators for calculations on arrays, in particular matrices
– A large, coherent, integrated collection of intermediate tools for data analysis
– Graphical facilities for data analysis
– A well-developed, simple and effective programming language (a dialect of ‘S’) which includes conditionals, loops, user-defined recursive functions and I/O facilities.
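A minimal sketch of the language features listed above (a conditional, a loop, a recursive function and file I/O). The names `classify` and `fact` are made up for illustration and are not part of R.

```r
# Conditional: classify a number by its sign
classify <- function(x) {
  if (x < 0) "negative" else if (x == 0) "zero" else "positive"
}

# Loop: add up the integers 1..10
total <- 0
for (i in 1:10) total <- total + i

# User-defined recursive function: factorial
fact <- function(n) if (n <= 1) 1 else n * fact(n - 1)

# I/O: write a small table to a temporary CSV file and read it back
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(n = 1:5, f = sapply(1:5, fact)), tmp, row.names = FALSE)
back <- read.csv(tmp)

cat("Sum of 1..10:", total, "\n")
print(back)
```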
“OPEN SOURCE”... THAT JUST 
MEANS I DON’T HAVE TO PAY FOR 
IT, RIGHT? 
• No. Much more:
– Provides full access to algorithms and their implementation
– Ability to fix bugs and extend software
– Provides a forum allowing researchers to explore and expand the methods used to analyze data
– Promotes reproducible research by providing open and accessible tools
– Most user-visible functions in R are written in… R! This makes it quite easy to see what functions are actually doing.
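As a small illustration of that last point, printing a function shows its source. The sketch below uses two base functions, `sd` (written in R) and `sum` (an internal primitive), purely as examples.

```r
# Printing a function written in R shows its R source code
print(sd)

# body() returns just the body of the function
print(body(sd))

# Internal primitives such as sum() only show a stub, not R code
print(sum)

# is.primitive() tells the two kinds apart
is.primitive(sd)
is.primitive(sum)
```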
WHAT IS IT?
• R is an interpreted computer language.
– Most user-visible functions are written in R itself, calling upon a smaller set of internal primitives.
– Procedures written in C, C++, or Fortran can be interfaced for efficiency, and additional primitives can be written.
– System commands can be called from within R.
• R is used for data manipulation, statistics, and graphics. It is made up of:
– operators (+ - <- * %*% …) for calculations on arrays & matrices
– a large, coherent, integrated collection of functions
– facilities for making a wide variety of publication-quality graphics
– user-written functions & sets of functions (packages); thousands of contributed packages on CRAN so far & growing
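A short sketch of the operators and of a user-written function. The helper `standardise` is a made-up example, not a base R function.

```r
A <- matrix(1:4, nrow = 2)   # 2 x 2 matrix, filled column by column
B <- diag(2)                 # 2 x 2 identity matrix

A * A      # element-wise product
A %*% B    # matrix product; multiplying by the identity returns A

# A user-written function: centre and scale a numeric vector
standardise <- function(x) (x - mean(x)) / sd(x)
z <- standardise(c(2, 4, 6, 8))
z
```

Note the difference between `*` (element by element) and `%*%` (true matrix multiplication); mixing them up is a common early mistake.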
R
ADVANTAGES
o Fast and free.
o State of the art: statistical researchers provide their methods as R packages. SPSS and SAS are years behind R!
o 2nd only to MATLAB for graphics.
o Mx, WinBUGS, and other programs use or will use R.
o Active user community.
o Excellent for simulation, programming, computer-intensive analyses, etc.
o Forces you to think about your analysis.
o Interfaces with database storage software (SQL).
DISADVANTAGES
o Not user friendly at the start: steep learning curve, minimal GUI.
o No commercial support; figuring out correct methods or how to use a function on your own can be frustrating.
o Easy to make mistakes and not know.
o Working with large datasets is limited by RAM.
o Data prep & cleaning can be messier & more mistake-prone in R vs. SPSS or SAS.
TYPICAL R SESSION
• Start up R via the GUI or your favorite text editor
• Two windows:
– One or more new or existing scripts (text files); these will be saved
– Terminal: output & temporary input, usually unsaved
STATISTICAL METHODS
• Statistics: “meaningful” quantities about a sample of objects, things, persons, events, phenomena, etc.
• Simple to complex issues. E.g.
– Correlation
– ANOVA
– MANOVA
– Regression: linear, multiple, logistic
– LDA
– PCA / Factor Analysis
– Frequency domain analysis
– Econometric modelling (TSA)
• Two main categories:
– Descriptive statistics
– Inferential statistics
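A few of the methods listed above, sketched with base R functions. The built-in `mtcars` and `iris` datasets and the chosen variables are only stand-ins for real data.

```r
# Correlation between fuel economy and weight
r <- cor(mtcars$mpg, mtcars$wt)

# Multiple linear regression
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit_lm)

# Logistic regression: transmission type (0/1) explained by weight
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit_logit)

# Principal component analysis on the four numeric iris columns
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)
```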
DESCRIPTIVE STATISTICS
• Use sample information to explain or abstract population “phenomena”.
• Common “phenomena”:
– Association (e.g. σ1,2.3 = 0.75)
– Tendency (left-skew, right-skew)
– Causal relationship (e.g. if X, then Y)
– Trend, pattern, dispersion, range
• Used in non-parametric analysis
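A minimal sketch of tendency, dispersion, range and skew on a made-up sample. Base R has no skewness function, so the `skewness` helper below is an assumption written out by hand (third central moment divided by the cubed population standard deviation).

```r
x <- c(2, 4, 4, 5, 7, 9, 30)   # hypothetical sample with one large value

mean(x)
median(x)
sd(x)         # dispersion
range(x)      # smallest and largest value
quantile(x)

# Moment-based skewness: positive means right-skew, negative means left-skew
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
skew <- skewness(x)
skew
```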
INFERENTIAL STATISTICS
• Use sample statistics to infer some “phenomena” of population parameters
• Hypothesis testing
• Common “phenomena”: cause-and-effect
– One-way relationship: ANOVA
– Multi-directional relationship: MANOVA
• Uses parametric analysis
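A sketch of the two tests named above using base R's `aov` and `manova`. The `mtcars` dataset and the choice of variables are illustrative assumptions only.

```r
# One-way ANOVA: does mean fuel economy differ by number of cylinders?
fit_aov <- aov(mpg ~ factor(cyl), data = mtcars)
summary(fit_aov)

# Extract the p-value of the F test for the hypothesis of equal group means
p_aov <- summary(fit_aov)[[1]][["Pr(>F)"]][1]

# MANOVA: two response variables tested jointly against the same factor
fit_manova <- manova(cbind(mpg, hp) ~ factor(cyl), data = mtcars)
summary(fit_manova)
```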
COMMON MISTAKES (CONTD.) – “ABUSE OF STATISTICS”
Issue | Data analysis technique: example of abuse | Data analysis technique: correct technique
Measure the “influence” of one variable on another | Using partial correlation (e.g. Spearman coeff.) | Using a regression parameter
Find the “relationship” between one variable and another | Multi-dimensional scaling, Likert scaling | Simple regression coefficient
Evaluate whether one model fits the data better than another | Using R² | Many, among others the Box-Cox χ² test for model equivalence
Evaluate the accuracy of a model’s “prediction” | Using R² and/or F-value | Hold-out sample’s MAPE, MAD
“Compare” whether one group is different from another | Multi-dimensional scaling, Likert scaling | Many, among others two-way ANOVA, χ², Z test
Determine whether a group of factors “significantly influences” the observed phenomenon | Multi-dimensional scaling, Likert scaling | Many, among others MANOVA, regression
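A sketch of the hold-out idea from the table: fit on one part of the data, then measure MAPE and MAD on rows the model has not seen. The dataset, the 8-row hold-out size and the model are assumptions for illustration; MAD is taken here to mean the mean absolute deviation of the prediction errors.

```r
set.seed(1)                       # make the random split reproducible

n        <- nrow(mtcars)
test_idx <- sample(n, 8)          # hold out 8 rows
train    <- mtcars[-test_idx, ]
test     <- mtcars[test_idx, ]

fit  <- lm(mpg ~ wt, data = train)        # fit on the training rows only
pred <- predict(fit, newdata = test)      # predict the held-out rows

mape <- mean(abs((test$mpg - pred) / test$mpg))  # mean absolute percentage error
mad  <- mean(abs(test$mpg - pred))               # mean absolute deviation

c(MAPE = mape, MAD = mad)
```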
TIME SERIES ANALYSIS
• A time series is a collection of observations made sequentially in time.
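In R, such a sequence is usually stored as a `ts` object. A minimal sketch with made-up monthly values starting in January 2012:

```r
# Hypothetical monthly observations, January to December 2012
y <- ts(c(51, 55, 55, 54, 80, 73, 65, 62, 58, 60, 57, 59),
        start = c(2012, 1), frequency = 12)

y                                         # prints with month labels
time(y)                                   # the time index of each observation
y_late <- window(y, start = c(2012, 6))   # June 2012 onwards
dy     <- diff(y)                         # first differences
```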
STOCHASTIC PROCESSES USEFUL
IN MODELING TIME SERIES
(1) a purely random process,
(2) a random walk,
(3) a moving average (MA) process,
(4) an autoregressive (AR) process,
(5) an autoregressive moving average (ARMA) process, and
(6) an autoregressive integrated moving average (ARIMA) process.
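The six processes can be simulated with base R, which helps build intuition for how each one looks. The coefficients (0.7, 0.5), the series length and the seed below are arbitrary choices for illustration.

```r
set.seed(42)
n <- 200

white_noise <- rnorm(n)                                # (1) purely random process
random_walk <- cumsum(white_noise)                     # (2) random walk: running sum of the noise
ma1   <- arima.sim(model = list(ma = 0.5), n = n)      # (3) MA(1)
ar1   <- arima.sim(model = list(ar = 0.7), n = n)      # (4) AR(1)
arma  <- arima.sim(model = list(ar = 0.7, ma = 0.5), n = n)          # (5) ARMA(1,1)
arima_series <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.7),
                          n = n)                                      # (6) ARIMA(1,1,0)

# Differencing a random walk recovers the underlying noise
rw_diff <- diff(random_walk)

plot.ts(cbind(white_noise, random_walk, ar1, ma1), main = "Simulated processes")
```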
M -> Multiplicative error
N -> No trend
N -> No seasonality
alpha = 0.1713
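With no trend and no seasonality, the point forecasts of such a model reduce to simple exponential smoothing: the level is updated by a fraction alpha of each new error, and every future forecast equals the last level. The sketch below writes that recursion by hand; the data, the helper name `ses_level` and the choice of the first observation as the starting level are assumptions (a fitted model would estimate the starting level).

```r
# Level recursion of simple exponential smoothing:
#   level_t = level_{t-1} + alpha * (y_t - level_{t-1})
ses_level <- function(y, alpha, l0 = y[1]) {
  level <- l0
  for (obs in y) {
    level <- level + alpha * (obs - level)
  }
  level
}

y     <- c(60, 58, 61, 55, 57, 59, 56)   # hypothetical history
alpha <- 0.1713                          # smoothing parameter from the slide

last_level <- ses_level(y, alpha)
fc <- rep(last_level, 7)                 # flat forecast for the next 7 periods
fc
```

A flat forecast line is the expected behaviour here, not a bug: without trend or seasonal terms the model has nothing else to extrapolate.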
VALIDATION
Date      Actual  ARIMA(1,1,2) forecast  Rel Err   ETS(M,N,N) forecast  Rel Err
13-03-12  65      60.48468               0.069466  57.33989             0.117848
12-03-12  73      55.66896               0.237412  57.33989             0.214522
11-03-12  80      58.24566               0.271929  57.33989             0.283251
10-03-12  54      56.86697               0.053092  57.33989             0.06185
09-03-12  55      57.60465               0.047357  57.33989             0.042543
08-03-12  55      57.20995               0.040181  57.33989             0.042543
07-03-12  51      57.42114               0.125905  57.33989             0.124312
MAPE                                     0.120763                       0.126696
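The MAPE row can be reproduced from the table: each relative error is |actual - forecast| / actual, and MAPE is their mean. The sketch below only re-uses the figures shown above; the function name `mape` is a made-up helper.

```r
# Values copied from the validation table
actual   <- c(65, 73, 80, 54, 55, 55, 51)
arima_fc <- c(60.48468, 55.66896, 58.24566, 56.86697,
              57.60465, 57.20995, 57.42114)
ets_fc   <- rep(57.33989, 7)

# Mean absolute percentage error, expressed as a fraction
mape <- function(actual, forecast) mean(abs(actual - forecast) / actual)

mape_arima <- mape(actual, arima_fc)   # about 0.1208
mape_ets   <- mape(actual, ets_fc)     # about 0.1267
c(ARIMA = mape_arima, ETS = mape_ets)
```

On this hold-out week the ARIMA(1,1,2) forecasts have the slightly lower MAPE.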
