A brief introduction to 'R' statistical package

A BRIEF INTRO TO ‘R’ – APPLIED
STATS & TIME SERIES ANALYSIS
- Shanmukha Sreenivas P

THE R ENVIRONMENT
 R is an integrated suite of software facilities for data
manipulation, calculation and graphical display.
 An effective data handling and storage facility
 A suite of operators for calculations on arrays, in particular
matrices
 A large, coherent, integrated collection of intermediate tools
for data analysis
 Graphical facilities for data analysis
 A well-developed, simple and effective programming
language (a dialect of ‘S’) which includes conditionals, loops,
user-defined recursive functions and I/O facilities.

“OPEN SOURCE”... THAT JUST
MEANS I DON’T HAVE TO PAY FOR
IT, RIGHT?
•No. Much more:
–Provides full access to algorithms and their implementation
–Ability to fix bugs and extend software
–Provides a forum allowing researchers to explore and
expand the methods used to analyze data
–Promotes reproducible research by providing open and
accessible tools
–Most of R is written in… R! This makes it quite easy to see
what functions are actually doing.

WHAT IS IT?
•R is an interpreted computer language.
–Most user-visible functions are written in R itself, calling upon a
smaller set of internal primitives.
– It is possible to interface procedures written in C, C++, or
FORTRAN languages for efficiency, and to write additional
primitives.
–System commands can be called from within R
•R is used for data manipulation, statistics, and graphics.
It is made up of:
– operators (+ - <- * %*% …) for calculations on arrays &
matrices
– large, coherent, integrated collection of functions
– facilities for making unlimited types of publication quality
graphics
– user written functions & sets of functions (packages); 800+
contributed packages so far & growing
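
To make the operator list concrete: in R, `*` multiplies two equal-sized matrices element by element, while `%*%` is the row-by-column matrix product. The sketch below imitates both in plain Python (not R) so the difference is visible. The function names and the 2x2 matrices are illustrative assumptions, not something taken from the slides.

```python
# Plain-Python imitation of two R operators on small matrices (lists of rows).

def elementwise_product(a, b):
    """Multiply matching entries, like R's `*` on equal-sized matrices."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def matrix_product(a, b):
    """Row-by-column product, like R's `%*%`."""
    columns_of_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns_of_b]
            for row in a]

# Hypothetical example matrices
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

print(elementwise_product(A, B))  # [[5, 12], [21, 32]]
print(matrix_product(A, B))       # [[19, 22], [43, 50]]
```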

R
ADVANTAGES
oFast and free.
oState of the art: Statistical
researchers provide their methods as
R packages. SPSS and SAS are
years behind R!
o2nd only to MATLAB for graphics.
oMx, WinBugs, and other programs
use or will use R.
oActive user community
oExcellent for simulation,
programming, computer intensive
analyses, etc.
oForces you to think about your
analysis.
oInterfaces with database storage
software (SQL)
DISADVANTAGES
oNot user friendly @ start - steep
learning curve, minimal GUI.
oNo commercial support; figuring out
correct methods or how to use a function
on your own can be frustrating.
oEasy to make mistakes and not know.
oWorking with large datasets is limited
by RAM
oData prep & cleaning can be messier &
more mistake prone in R vs. SPSS or
SAS

TYPICAL R SESSION
 Start up R via the GUI or your favorite text editor
 Two windows:
 1+ new or existing scripts (text files) - these will be saved
 Terminal – output & temporary input - usually unsaved

STATISTICAL METHODS
 Statistics: “meaningful” quantities about a sample of
objects, things, persons, events, phenomena, etc.
 Simple to complex issues. E.g.
 Correlation
 ANOVA
 MANOVA
 Regression – linear, multiple, logistic
 LDA
 PCA/ Factor Analysis
 Frequency domain analysis
 Econometric modelling (TSA)
 Two main categories:
* Descriptive statistics
* Inferential statistics

DESCRIPTIVE STATISTICS
 Use sample information to explain/make abstraction of
population “phenomena”.
 Common “phenomena”:
 * Association (e.g. σ1,2.3 = 0.75)
 * Tendency (left-skew, right-skew)
 * Causal relationship (e.g. if X, then, Y)
 * Trend, pattern, dispersion, range
 Used in non-parametric analysis
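
As a rough illustration of the quantities listed above (tendency, dispersion, range, skew), the following Python sketch summarizes a small sample using only the standard library. The sample values are hypothetical, and the moment-based skewness formula is one common definition among several.

```python
import statistics

# Hypothetical sample with one large value, so the right tail is longer
sample = [2, 3, 3, 4, 4, 4, 5, 5, 12]

mean = statistics.mean(sample)
median = statistics.median(sample)
std_dev = statistics.stdev(sample)        # sample standard deviation (n - 1)
value_range = max(sample) - min(sample)

# Moment-based skewness: positive suggests right skew, negative left skew
n = len(sample)
pop_sd = statistics.pstdev(sample)        # population standard deviation (n)
skewness = sum((x - mean) ** 3 for x in sample) / (n * pop_sd ** 3)

print("mean:", round(mean, 3), "median:", median)
print("std dev:", round(std_dev, 3), "range:", value_range)
print("skewness:", round(skewness, 3))
```

A mean above the median together with positive skewness is the usual signature of a right-skewed sample.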

INFERENTIAL STATISTICS
 Using sample statistics to infer some “phenomena” of
population parameters
 Hypothesis Testing
 Common “phenomena”: cause-and-effect
* One-way relationship - ANOVA
* Multi-directional relationship - MANOVA
 Use parametric analysis
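
A minimal sketch of the hypothesis-testing idea behind one-way ANOVA, written in Python rather than R: the F statistic compares variation between group means with variation within groups. The three groups are hypothetical data, and the function name is an assumption made for this example. Turning F into a p-value needs the F distribution, which is left out here.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical measurements for three groups
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_statistic = one_way_anova_f(groups)
print(round(f_statistic, 3))  # 13.0
```

A large F relative to the critical value for (2, 6) degrees of freedom would lead to rejecting the hypothesis that all group means are equal.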

COMMON MISTAKES (CONTD.) – “ABUSE OF
STATISTICS”
Issue | Data analysis technique: example of abuse | Data analysis technique: correct technique
Measure the “influence” of a variable on another | Using partial correlation (e.g. Spearman coeff.) | Using a regression parameter
Finding the “relationship” between one variable with another | Multi-dimensional scaling, Likert scaling | Simple regression coefficient
To evaluate whether a model fits data better than the other | Using R² | Many – a.o.t. Box-Cox χ² test for model equivalence
To evaluate accuracy of “prediction” of a model | Using R² and/or F-value | Hold-out sample’s MAPE, MAD
“Compare” whether a group is different from another | Multi-dimensional | Many – a.o.t. two-way anova, χ², Z test
To determine whether a group of factors “significantly influence” the observed phenomenon | Multi-dimensional | Many – a.o.t. manova, regression
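
The hold-out idea in the table (judge prediction accuracy on data the model never saw, using MAPE and MAD, rather than on in-sample R²) can be sketched as follows. This is Python, not R. The data are hypothetical, the model is an ordinary least-squares line, and MAD is taken here to mean the mean absolute deviation of the forecast errors.

```python
def fit_simple_regression(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: the first five points train the model,
# the last two are held out and used only for evaluation.
train_x, train_y = [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]
hold_x, hold_y = [6, 7], [12.2, 13.8]

intercept, slope = fit_simple_regression(train_x, train_y)
predictions = [intercept + slope * x for x in hold_x]

abs_errors = [abs(y - p) for y, p in zip(hold_y, predictions)]
mad = sum(abs_errors) / len(abs_errors)                       # mean absolute deviation
mape = sum(e / abs(y) for e, y in zip(abs_errors, hold_y)) / len(hold_y)

print("slope:", round(slope, 4), "intercept:", round(intercept, 4))
print("hold-out MAD:", round(mad, 4), "hold-out MAPE:", round(mape, 4))
```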

TIME SERIES ANALYSIS
 A time series is a collection of observations made
sequentially in time.

STOCHASTIC PROCESSES USEFUL
IN MODELING TIME SERIES
(1) a purely random process,
(2) a random walk,
(3) a moving average (MA) process,
(4) an autoregressive (AR) process,
(5) an autoregressive moving average (ARMA)
process, and
(6) an autoregressive integrated moving
average (ARIMA) process.
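
The six processes above build on one another, which a short simulation can show. The sketch below is Python rather than R and uses only the standard library. The orders (MA(1), AR(1), ARMA(1,1), ARIMA(1,1,1)), the coefficients `theta` and `phi`, the seed, and the series length are all assumptions chosen for illustration.

```python
import random

random.seed(42)
n = 200
theta = 0.6   # assumed MA coefficient
phi = 0.7     # assumed AR coefficient (|phi| < 1 keeps AR(1) stationary)

# (1) Purely random process: independent N(0, 1) shocks
noise = [random.gauss(0, 1) for _ in range(n)]

# (2) Random walk: x_t = x_{t-1} + e_t
random_walk, level = [], 0.0
for e in noise:
    level += e
    random_walk.append(level)

# (3) MA(1): x_t = e_t + theta * e_{t-1}
ma1 = [noise[0]] + [noise[t] + theta * noise[t - 1] for t in range(1, n)]

# (4) AR(1): x_t = phi * x_{t-1} + e_t
ar1 = [noise[0]]
for t in range(1, n):
    ar1.append(phi * ar1[-1] + noise[t])

# (5) ARMA(1,1): x_t = phi * x_{t-1} + e_t + theta * e_{t-1}
arma11 = [noise[0]]
for t in range(1, n):
    arma11.append(phi * arma11[-1] + noise[t] + theta * noise[t - 1])

# (6) ARIMA(1,1,1): cumulative sum of the ARMA(1,1) series,
#     so differencing it once gives back the ARMA(1,1) series
arima111, total = [], 0.0
for x in arma11:
    total += x
    arima111.append(total)

print(len(noise), len(random_walk), len(ma1), len(ar1), len(arma11), len(arima111))
```

Differencing the random walk returns the purely random process, and differencing the ARIMA(1,1,1) series returns the ARMA(1,1) series; that is the "integrated" part of the name.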


ETS(M,N,N)
M -> Multiplicative error
N -> No trend
N -> No seasonality
alpha = 0.1713
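
An ETS(M,N,N) model is simple exponential smoothing with a multiplicative error term: one level is updated at each observation, and with no trend or seasonality every future point forecast equals the last level. The Python sketch below shows that recursion using the alpha reported above. The history values and the choice of the first observation as the initial level are assumptions for illustration. Fitting software would normally estimate the initial level along with alpha.

```python
def ets_mnn_level(observations, alpha, initial_level):
    """Run the ETS(M,N,N) level recursion and return the final level."""
    level = initial_level
    for y in observations:
        relative_error = (y - level) / level           # multiplicative error
        level = level * (1 + alpha * relative_error)   # level update
    return level

alpha = 0.1713                              # smoothing parameter from the slide
history = [52.0, 58.0, 61.0, 55.0, 57.0]    # hypothetical past observations

# Assumption: start the level at the first observation
final_level = ets_mnn_level(history[1:], alpha, initial_level=history[0])

# No trend and no seasonality, so the point forecast is flat
forecasts = [final_level] * 7
print([round(f, 4) for f in forecasts])
```

Algebraically the update reduces to `level + alpha * (y - level)`, so the point forecasts coincide with ordinary simple exponential smoothing. The multiplicative error changes the prediction intervals, not the point forecasts.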

VALIDATION
Date      Actual  ARIMA(1,1,2) forecast  Rel Err   ETS(M,N,N) forecast  Rel Err
13-03-12  65      60.48468               0.069466  57.33989             0.117848
12-03-12  73      55.66896               0.237412  57.33989             0.214522
11-03-12  80      58.24566               0.271929  57.33989             0.283251
10-03-12  54      56.86697               0.053092  57.33989             0.06185
09-03-12  55      57.60465               0.047357  57.33989             0.042543
08-03-12  55      57.20995               0.040181  57.33989             0.042543
07-03-12  51      57.42114               0.125905  57.33989             0.124312
MAPE                                     0.120763                       0.126696
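
The relative errors and MAPE figures in the validation table follow from |actual - forecast| / actual, averaged over the seven days. The Python sketch below recomputes them from the table's own numbers. The function names are illustrative, and the MAPE is kept as a fraction, as in the table, rather than as a percentage.

```python
# Values copied from the validation table (13-03-12 down to 07-03-12)
actual = [65, 73, 80, 54, 55, 55, 51]
arima_forecast = [60.48468, 55.66896, 58.24566, 56.86697,
                  57.60465, 57.20995, 57.42114]
ets_forecast = [57.33989] * 7   # flat forecast from ETS(M,N,N)

def relative_errors(actual, forecast):
    """Absolute error divided by the actual value, per observation."""
    return [abs(a - f) / a for a, f in zip(actual, forecast)]

def mape(actual, forecast):
    """Mean absolute percentage error, expressed as a fraction."""
    errors = relative_errors(actual, forecast)
    return sum(errors) / len(errors)

mape_arima = mape(actual, arima_forecast)
mape_ets = mape(actual, ets_forecast)

print(round(mape_arima, 6))  # 0.120763, matching the table
print(round(mape_ets, 6))    # 0.126696, matching the table
```

On this hold-out week the ARIMA(1,1,2) forecasts have the slightly lower MAPE, about 12.1% against 12.7% for ETS(M,N,N).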
