Introduction to Basic Statistics & R Programming
History of R
• Research project in New Zealand
• 1995: Open Source project
• 1997: R-Core Group
• 2000: R-1.0.0 released
• 2003: R Foundation
• 2004: First international conference
• 2015: R-3.2.5 and R Consortium
What is R?
Language
Platform
Community
Ecosystem
• A programming language for statistics, analytics, and data science
• A data visualization framework
• Provided as Open Source
• Used by 2.5M+ data scientists, statisticians and analysts
• Taught in most university statistics programs
• Active and thriving user groups across the world
• CRAN: 7000+ freely available algorithms, test data and evaluation
• Many of these are applicable to big data if scaled
• New and recent graduates prefer it
Start working with R
• Install R
go to https://cran.r-project.org/
Select the ‘base’ sub-directory
And then click on ‘Download R for Windows’
• Install RStudio
http://www.rstudio.com
• Installing packages
install.packages("<package name>")
• Loading a package
library(<package name>)
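The install and load steps above can be sketched as one small helper. This is only an illustration: the name ensure_package is hypothetical (it is not part of R), and "stats" is used because it ships with every R installation, so nothing is downloaded when the example runs.

```r
# Hypothetical helper: install a package from CRAN only if it is missing,
# then attach it. R is case-sensitive, so the loader is library().
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # uses the configured CRAN mirror
  }
  library(pkg, character.only = TRUE)   # pkg holds the name as a string
  invisible(TRUE)
}

# "stats" ships with R, so this call only attaches it.
loaded <- ensure_package("stats")
```

For a package that is not yet installed, the same call would first fetch it from CRAN and then attach it.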
R Interfaces
Importing data from different sources
• Flat files (text, csv)
• Excel files
• Relational databases
• Web
• Other statistical software
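As a minimal sketch of the flat-file case, the example below writes a tiny CSV to a temporary file and reads it back with base R. The file contents and column names are made up for illustration. Excel files, databases and other statistical packages usually need add-on packages and are not shown here.

```r
# Write a small CSV to a temporary file so the example is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Asha,82", "Ben,75", "Chen,91"), path)

# read.csv() returns a data frame; the first line supplies the column names.
scores <- read.csv(path, stringsAsFactors = FALSE)
str(scores)

# Other delimited text files go through read.table(), for example:
same_scores <- read.table(path, header = TRUE, sep = ",",
                          stringsAsFactors = FALSE)
```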
Data Structures in R
• Vectors - A vector holds one or more elements, all of the same data type. The c() function is used to create a vector.
• Matrix - A matrix is a two-dimensional rectangular data set. It can be created by passing a vector to the matrix() function. All columns in a matrix must have the same mode (numeric, character, etc.) and the same length.
• Arrays - While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes a dim argument that sets the size of each dimension.
• Data frames - A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).
• List - A list is an R object that can contain elements of different types, such as vectors, functions and even other lists.
• Factors - A factor stores nominal values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), together with an internal vector of character strings (the original values) mapped to these integers.
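A short sketch of each structure described above, using only base R. All variable names and values are illustrative.

```r
v  <- c(10, 20, 30)                      # vector: every element has one type
m  <- matrix(1:6, nrow = 2, ncol = 3)    # matrix: filled column by column
a  <- array(1:24, dim = c(2, 3, 4))      # array: 2 x 3 x 4, three dimensions
df <- data.frame(id = 1:3,               # data frame: columns may differ in type
                 city = c("Pune", "Oslo", "Lima"),
                 stringsAsFactors = FALSE)
l  <- list(numbers = v, fn = mean,       # list: mixed contents, including
           nested = list(flag = TRUE))   # a function and another list
f  <- factor(c("low", "high", "low", "mid"))

levels(f)       # the character labels, sorted alphabetically by default
as.integer(f)   # the integer codes that point into levels(f)
```

Inspecting levels(f) next to as.integer(f) shows the two parts of a factor: the labels and the integer codes mapped to them.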
R Charts and Graphs
• Histogram
• Dot Plot
• Pie Chart
• Box Plot
• Scatter Plot
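The five chart types listed above all have base R functions. The sketch below uses the built-in mtcars data set as an illustrative choice. pdf(NULL) opens a null graphics device so the code also runs without a display; remove that line (and dev.off()) to see the plots on screen.

```r
pdf(NULL)   # null device: nothing is written to disk

h <- hist(mtcars$mpg, main = "Histogram of mpg", xlab = "Miles per gallon")
dotchart(mtcars$mpg, labels = rownames(mtcars), main = "Dot plot of mpg")
pie(table(mtcars$cyl), main = "Cars by number of cylinders")
b <- boxplot(mpg ~ cyl, data = mtcars, main = "mpg by number of cylinders")
plot(mtcars$wt, mtcars$mpg, main = "Weight vs mpg",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

invisible(dev.off())
```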
Basic Statistics
• Inferential vs Descriptive
• Sample vs population
• Central tendencies
1. Mean
2. Median
3. Mode
• Measures of Dispersion
1. Range
2. Interquartile Range and outliers
3. Variance
4. Standard deviation
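The measures above can be computed on a small made-up sample as follows. Base R has no built-in function for the statistical mode, so stat_mode below is a hypothetical helper written for this sketch. The outlier check uses the common 1.5 × IQR rule, which is one convention among several.

```r
x <- c(2, 4, 4, 5, 7, 9, 40)   # illustrative sample with one extreme value

# Central tendencies
x_mean   <- mean(x)
x_median <- median(x)
stat_mode <- function(v) {      # hypothetical helper: most frequent value
  ux <- unique(v)
  ux[which.max(tabulate(match(v, ux)))]
}
x_mode <- stat_mode(x)

# Measures of dispersion
x_range <- range(x)             # minimum and maximum
x_iqr   <- IQR(x)               # third quartile minus first quartile
x_var   <- var(x)               # sample variance (divides by n - 1)
x_sd    <- sd(x)                # square root of the sample variance

# Flag outliers: points more than 1.5 * IQR beyond the quartiles
q <- quantile(x, c(0.25, 0.75))
outliers <- x[x < q[1] - 1.5 * x_iqr | x > q[2] + 1.5 * x_iqr]
```

Here the single extreme value pulls the mean well above the median, which is why the median is often preferred for skewed data.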
Example for Basic statistics
Let's look at a demo of what we have covered so far!
Random variables
• A variable whose possible values are the numerical outcomes of a random experiment
• Types – Discrete vs continuous
• Expected value of random variables
• The Law of large numbers
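A minimal simulation of the last two points, assuming a fair six-sided die as the random experiment (the die, the seed and the number of rolls are illustrative choices). The expected value is the probability-weighted sum of the outcomes, and the running average of many rolls drifts toward it.

```r
set.seed(1)                      # fixed seed so the run is repeatable

faces <- 1:6
probs <- rep(1 / 6, 6)           # fair die: every face equally likely
expected_value <- sum(faces * probs)

# Roll the die many times and track the average after each roll.
rolls <- sample(faces, size = 100000, replace = TRUE, prob = probs)
running_mean <- cumsum(rolls) / seq_along(rolls)

# Averages after 10, 1,000 and 100,000 rolls approach the expected value.
running_mean[c(10, 1000, 100000)]
```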
Understanding Data distribution
Things to look for:
• Continuous or discrete
• Symmetry
• The upper and lower limits
• Likelihood of observing extreme values
• Probability of occurrence
Binomial Distribution
Basic assumptions:
1. Discrete distribution
2. Number of trials is fixed in advance
3. Just two outcomes for each trial
4. Trials are independent
5. All trials have the same probability of success
Uses include:
1. Estimating the probability of an outcome in any set of success-or-failure trials
2. Number of defective items in a batch of size n
3. Election results
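The defective-items use case can be sketched with R's built-in binomial functions. The batch size of 20 and the 5% defect probability are made-up numbers for illustration.

```r
n <- 20      # batch size (number of independent trials)
p <- 0.05    # probability that any one item is defective

p_none        <- dbinom(0, size = n, prob = p)   # P(exactly 0 defective)
p_at_most_2   <- pbinom(2, size = n, prob = p)   # P(0, 1 or 2 defective)
p_more_than_2 <- 1 - p_at_most_2                 # P(3 or more defective)
expected_defects <- n * p                        # mean of a binomial is n * p
```

dbinom() gives the probability of an exact count, while pbinom() gives the cumulative probability up to and including a count.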
Poisson Distribution
Basic assumptions:
1. Discrete distribution
2. The expected number of occurrences is proportional to the length of the time interval
3. Events occur at a constant average rate
4. Occurrences are independent
Uses include:
1. Number of events in an interval of time (or area) when the events are occurring at a constant rate
2. Number of dropped calls in telecom
3. Number of people arriving at a queue in a bank
4. Number of hits on a website
5. The number of typos in a book
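A short sketch with R's built-in Poisson functions, assuming an average rate of 3 events per interval (for example, 3 arrivals per minute at a queue; the rate is a made-up number).

```r
lambda <- 3    # assumed average number of events per interval

p_zero      <- dpois(0, lambda)   # P(no events in the interval)
p_at_most_5 <- ppois(5, lambda)   # P(5 or fewer events)

# Simulated counts: for a Poisson variable the mean and the variance
# are both equal to lambda, so the two sample figures should be close.
set.seed(1)
counts <- rpois(10000, lambda)
c(mean = mean(counts), variance = var(counts))
```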
Normal Distribution
Basic assumptions:
1. Symmetrical distribution about the mean (bell-shaped curve)
2. Commonly used in inferential statistics
3. Family of distributions characterized by μ (mean) and σ (standard deviation)
Uses include:
1. Modeling measurement errors and other quantities that cluster symmetrically around a mean
2. Approximating the distribution of sample means, which underlies many inferential tests
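A minimal sketch with R's built-in normal functions. The mean of 100 and standard deviation of 15 (named mu and sigma here) are illustrative values only.

```r
mu    <- 100   # assumed mean
sigma <- 15    # assumed standard deviation

# Probability of falling within 1 and 2 standard deviations of the mean
within_1sd <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)
within_2sd <- pnorm(mu + 2 * sigma, mu, sigma) - pnorm(mu - 2 * sigma, mu, sigma)

# Symmetry: the density is the same at equal distances either side of the mean
left  <- dnorm(mu - 10, mu, sigma)
right <- dnorm(mu + 10, mu, sigma)

# Value below which 95% of observations are expected to fall
cutoff_95 <- qnorm(0.95, mu, sigma)
```

The first two results are roughly 68% and 95%, the familiar rule of thumb for bell-shaped data.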
Correlation and Regression Analysis
• Pearson’s r
• Also known as the correlation coefficient between two variables.
• Measures the strength and direction of linear correlation.
• Value is between -1 and +1.
• +1 is a perfect positive correlation and -1 is a perfect negative correlation.
• Plotting the regression line (linear regression)
1. $Y = a + bX$; a is the intercept and b is the slope
2. $b = r \cdot \frac{S_y}{S_x}$ and $a = \bar{Y} - b\bar{X}$, where $S_x$ and $S_y$ are the standard deviations of X and Y
3. Note: Correlation is not causation
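As a check on the regression formulas, the sketch below computes the slope as r times the ratio of the standard deviation of Y to that of X, computes the intercept from the two means, and compares both with what lm() estimates. The built-in mtcars data set (car weight against miles per gallon) is an illustrative choice.

```r
x <- mtcars$wt     # predictor: car weight
y <- mtcars$mpg    # response: miles per gallon

r <- cor(x, y)                 # Pearson's r (the default method of cor())
b <- r * sd(y) / sd(x)         # slope
a <- mean(y) - b * mean(x)     # intercept

fit <- lm(y ~ x)               # least-squares fit for comparison
coef(fit)                      # intercept and slope as estimated by lm()
```

The hand-computed a and b agree with coef(fit), and the negative r shows that heavier cars in this data set tend to have lower mileage.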
Big Data and R
A basic definition of Big Data is data whose size exceeds RAM capacity. Because R stores data in memory, there are 3 ways to use R for Big Data:
• Extract Data as a sample/subset/summary
• Compute on the parts, repeat computation and combine results
• Compute on the whole
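The second approach (compute on the parts, then combine) can be sketched in base R by reading a file in fixed-size chunks and keeping only running totals in memory. The file here is a small temporary CSV generated for the example, and the chunk size of 250 lines is arbitrary; a real data set would be far larger than this.

```r
set.seed(42)
path <- tempfile(fileext = ".csv")
write.csv(data.frame(value = rnorm(1000)), path, row.names = FALSE)

con <- file(path, open = "r")
invisible(readLines(con, n = 1))        # skip the header line

total <- 0   # running sum of all values seen so far
n     <- 0   # running count of values seen so far
repeat {
  lines <- readLines(con, n = 250)      # next chunk of at most 250 lines
  if (length(lines) == 0) break         # end of file reached
  chunk <- as.numeric(lines)            # one numeric column per line
  total <- total + sum(chunk)
  n     <- n + length(chunk)
}
close(con)

chunked_mean <- total / n               # combine the partial results
```

Only one chunk is held in memory at a time, yet the combined result matches the mean computed on the whole file.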
Working with Big Data in R
• R can be integrated with many data stores and warehouses, such as Hadoop, SAP HANA, SQL databases and Oracle.
• Store data in a data warehouse that has the capacity, then pass subsets from the warehouse to R or pass the R code to the data warehouse.
• Major data warehouses now support R code and treat this as a selling point.
• If the data warehouse does not support R, we can still use R with the help of API packages like dplyr.
• Advantages of an API package like dplyr:
• Built-in SQL backend
• Connects to the DBMS
• Transforms R code to SQL and passes it to the DBMS
• Collects results from the DBMS into R
• Flexible enough to add your own SQL backend
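A hedged sketch of the dplyr workflow described above, using an in-memory SQLite database as a stand-in for a real warehouse. It assumes the dplyr, dbplyr, DBI and RSQLite packages are installed; if any is missing the block is skipped and result stays NULL. The table name "cars" and the query itself are illustrative.

```r
result <- NULL
needed <- c("dplyr", "dbplyr", "DBI", "RSQLite")

if (all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))) {
  library(dplyr)

  # Connect to the DBMS and load a demo table into it.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  DBI::dbWriteTable(con, "cars", mtcars)

  # Ordinary dplyr verbs on a remote table; nothing is computed in R yet.
  query <- tbl(con, "cars") %>%
    filter(mpg > 15) %>%
    group_by(cyl) %>%
    summarise(avg_mpg = mean(mpg, na.rm = TRUE))

  show_query(query)          # prints the SQL that dplyr generated
  result <- collect(query)   # runs the SQL in the DBMS, returns a data frame

  DBI::dbDisconnect(con)
}
```

The filtering and aggregation run inside the database; only the small summary table is pulled back into R by collect().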
Challenges of open source R
• Lack of scalability
• Inadequate access to important business data
• Insufficient business agility
• Limited business value
R from Microsoft brings
R Product Suite
• MS R Open
- Free, open source R distribution
• MS R Server
- Secure, scalable and supported distribution built on top of R Open
• SQL Server 2016 R Services
- For building applications in R and deploying them to production through a T-SQL interface
CRAN R, MRO and MRS Comparison
Columns: CRAN R | Microsoft R Open | Microsoft R Server
Data Size: In-memory | In-memory | In-memory or disk based
Speed of Analysis: Single threaded | Multi-threaded | Multi-threaded, parallel processing 1:N servers
Support: Community | Community | Community + Commercial
Analytic Breadth & Depth: 8000+ innovative analytic packages | 8000+ innovative analytic packages | 8000+ innovative packages + commercial parallel high-speed functions
License: Open Source | Open Source | Commercial license, supported release with indemnity
Microsoft R Server Platform
R Open, Microsoft R Server, DeployR, DevelopR
ConnectR
• High-speed & direct connectors
• HDFS, Teradata, SAS, SPSS, EDWs, ODBC
ScaleR
• Fully-parallelized analytics
• Data prep & data distillation
• Variety of big data stats, predictive modeling & machine learning
• User tools for distributing customized R algorithms across nodes
DistributedR
• Distributed computing framework
• Delivers cross-platform portability
R+CRAN
• Open source R
• 100% compatible with existing R scripts, functions and packages
RevoScaleR
• High-performance Math Kernel Library (MKL) to speed up linear algebra functions
SQL Server R Services:
Enterprise R Analytics in SQL Server 2016
Model & Deploy In SQL16:
• Support Entire Analytics Lifecycle
• Enable R Users to Run R Inside SQL 2016
• Enable SQL Users to Extend BI Applications Using R Analytics
Advantages:
• Scale By Eliminating Movement
• Scale Using Parallelized Analytics
• Reduced Security Exposure
• SQL Skill Reuse for Data Engineering
• SQL Skill Reuse for App development
• Improved Operational Stability for Applications
SQL 2016 lifecycle stages: Prepare, Model, Operationalize
R’s popularity is growing rapidly
• Language popularity: IEEE Spectrum Top Programming Languages, July 2015
• R usage growth: Rexer Data Miner Survey, 2007-2013
• #9: R
Bibliography
• Datacamp tutorials
• Coursera and EdX sites
• Download the ‘swirl’ package from the CRAN repository for hands-on practice
• Subscribe to www.r-bloggers.com
• For basic statistics : www.stattrek.com
