HARNESSING R: STATISTICAL TECHNIQUES FOR DATA-DRIVEN ENTREPRENEURSHIP
DR. SABERUNNISA. A
ASSISTANT PROFESSOR
THE MADURA COLLEGE (AUTONOMOUS)
MADURAI
WHAT IS R?
Open-source statistical software: Free to use, distributed under the GNU General Public License.
Programming language & environment: Specifically designed for statistical computing and graphics.
Widely used in Data Science, Machine Learning, and Statistics for:
Data manipulation & cleaning
Statistical modeling
Visualization (graphs, charts, plots)
Predictive analytics
Community-driven: Thousands of user-contributed packages on CRAN.
History and Background of R
Developed by Ross Ihaka and Robert Gentleman in 1993 at the University of
Auckland, New Zealand.
Based on the S programming language, originally developed at Bell Laboratories.
Released as free software under the GNU General Public License in 1995; version 1.0.0 was released in 2000.
Supported by the R Foundation for Statistical Computing (established in 2003).
Today, R is globally recognized as one of the leading tools for statistical computing,
data science, and research.
Features of R
Free and Open Source: Available at no cost, licensed under the GNU General Public License.
Huge Package Ecosystem (CRAN): 19,000+ packages covering statistics, ML, data visualization, bioinformatics, etc.
Cross-Platform Compatibility: Runs smoothly on Windows, Mac, and Linux systems.
Strong Visualization Libraries: Base R graphics and advanced tools like ggplot2, lattice, plotly.
Highly Extensible: Users can develop custom functions and packages.
Active Community Support: Global network of developers and researchers.
Installing and Using R
Download from CRAN (Comprehensive R Archive Network)
◦ Official website: https://cran.r-project.org
◦ Choose installer for Windows, Mac, or Linux.
RStudio IDE (Integrated Development Environment)
◦ Provides a user-friendly interface for coding in R.
◦ Features: script editor, console, plots, package manager.
Data Types in R
Basic Data Types:
Numeric → Decimal values (e.g., 3.14)
Integer → Whole numbers (e.g., 5L)
Character → Text values (e.g., "Hello")
Logical → Boolean values (TRUE, FALSE)
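The four basic types above can be checked interactively with class(). The following is a minimal sketch; the variable names are chosen here for illustration only.

```r
# One value of each basic type listed above
num <- 3.14     # numeric (double precision)
int <- 5L       # integer (the L suffix requests integer storage)
chr <- "Hello"  # character
lgl <- TRUE     # logical

# class() reports the type of each value
class(num)  # "numeric"
class(int)  # "integer"
class(chr)  # "character"
class(lgl)  # "logical"

# Without the L suffix, a whole number is still stored as numeric
class(5)    # "numeric"
```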
Data Structures
Vector → Collection of elements of the same type
Matrix → 2D array of numbers (rows × columns)
Factor → Categorical data (e.g., "Male", "Female")
List → Collection of mixed data types
Data Frame → Tabular structure (rows × columns), similar to Excel
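Each of the five structures above can be created with a single base R call. This is a small sketch; the values and names (such as "Asha") are hypothetical and used only for illustration.

```r
v <- c(2, 4, 6)                                     # vector: same-type elements
m <- matrix(1:6, nrow = 2, ncol = 3)                # matrix: 2 rows x 3 columns
f <- factor(c("Male", "Female", "Female"))          # factor: categorical data
l <- list(name = "Asha", age = 30, passed = TRUE)   # list: mixed types
d <- data.frame(id = 1:3, score = c(85, 90, 88))    # data frame: tabular data

dim(m)     # dimensions of the matrix
levels(f)  # categories of the factor, sorted alphabetically
str(l)     # compact display of the list's contents
nrow(d)    # number of rows in the data frame
```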
Data Import
Read the data from Excel
> library(readxl)
> data <- read_excel("C:/Users/D E L L/OneDrive/Desktop/data.xlsx")
> View(data)
Read a CSV file
> DATA <- read.csv("C:/Users/D E L L/OneDrive/Desktop/data/data - Copy.csv")
> View(DATA)
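The paths above are specific to one computer. To try the CSV import without such a file, the sketch below writes a small, hypothetical data frame to a temporary file and reads it back; sample_data and csv_path are names invented for this example.

```r
# Hypothetical sample data, used only so the example runs on any machine
sample_data <- data.frame(id = 1:3, score = c(85, 90, 88))

# Write it to a temporary CSV file, then import it with read.csv()
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_data, csv_path, row.names = FALSE)
imported <- read.csv(csv_path)

str(imported)   # check column names and types after import
head(imported)  # preview the first rows
```

In a non-interactive script, str() and head() are convenient substitutes for View(), which opens a viewer window.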
Basic R Commands
Arithmetic Operations
x <- 10
y <- 3
x + y # Addition
x - y # Subtraction
x * y # Multiplication
x / y # Division
x ^ y # Power
Creating Vectors (c())
v <- c(2, 4, 6, 8, 10)
print(v)
Indexing and Subsetting
v[1] # First element
v[2:4] # Elements 2 to 4
v[v > 5] # Elements greater than 5
Data Visualization in R
1. Base R Plots
Simple and built-in plotting system
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
plot(x, y, type = "o", col = "blue", main = "Base R Plot")
Introduction to Statistical Analysis
Role of Statistics in Data Analysis
◦ Helps in collecting, organizing, analyzing, and interpreting data.
◦ Converts raw data into meaningful insights.
◦ Essential for decision-making, prediction, and research validation.
Types of Statistical Analysis
Descriptive Statistics
◦ Summarizes data.
◦ Measures: mean, median, mode, variance, standard deviation.
◦ Example: Average exam score of a class.
Inferential Statistics
◦ Draws conclusions about a population from a sample.
◦ Techniques: hypothesis testing, confidence intervals, regression analysis.
◦ Example: Predicting election results from a survey.
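The descriptive measures listed above map directly onto base R functions. The sketch below uses hypothetical exam scores invented for illustration. Base R has no built-in function for the statistical mode (mode() returns the storage type instead), so one common workaround using table() is shown.

```r
# Hypothetical exam scores for a class of 8 students
scores <- c(72, 85, 90, 66, 85, 78, 94, 85)

mean(scores)    # arithmetic mean
median(scores)  # middle value
var(scores)     # sample variance (divides by n - 1)
sd(scores)      # sample standard deviation

# Most frequent value, found from a frequency table
freq <- table(scores)
mode_value <- as.numeric(names(freq)[which.max(freq)])
mode_value

summary(scores) # minimum, quartiles, mean, and maximum in one call
```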
Inferential statistics
A manufacturer claims that the average fuel efficiency of their new car model is 25 mpg. A
random sample of 10 cars gave the following mileages:
23, 25, 27, 24, 26, 22, 28, 25, 24, 23
Test at 5% significance level whether the claim is true using a one-sample t-test.
Step 1: State Hypotheses
Null Hypothesis (H₀): μ = 25 (average mileage = 25 mpg)
Alternative Hypothesis (H₁): μ ≠ 25 (average mileage is different from 25 mpg)
This is a two-tailed test.
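The test stated above can be run with t.test() from base R. The call below is a sketch of one way to do it, not necessarily the exact commands used for the slides. The variable name mileage is taken from the printed output, and the arguments alternative and conf.level are written out explicitly even though they match the defaults.

```r
# Sample mileages from the problem statement
mileage <- c(23, 25, 27, 24, 26, 22, 28, 25, 24, 23)

# Two-sided one-sample t-test of H0: mu = 25
result <- t.test(mileage, mu = 25,
                 alternative = "two.sided", conf.level = 0.95)
print(result)

# Individual pieces of the result can be extracted by name
result$statistic  # t value
result$parameter  # degrees of freedom
result$p.value    # p-value
result$conf.int   # 95% confidence interval for the mean
```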
One Sample t-test
data: mileage
t = -0.50233, df = 9, p-value = 0.6275
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
23.349 26.051
sample estimates: mean of x : 24.7
Step 4: Inference
Test statistic t = -0.50233,
p-value = 0.6275
Since p > 0.05, we fail to reject H₀.
Conclusion: At the 5% significance level, there is no significant evidence that the mean
mileage differs from 25 mpg. The sample is consistent with the manufacturer's claim.
Chi-Square Test
A die is suspected to be biased. To test this, it is rolled 60 times, and the observed frequencies of
outcomes are:
Face          1  2  3   4   5   6
Observed (O)  8  9  10  12  11  10
Test at the 5% significance level whether the die is fair using the Chi-square goodness of fit
test.
To find the goodness of fit:
# Observed frequencies
observed <- c(8, 9, 10, 12, 11, 10)
# Expected frequencies
expected_prob <- rep(1/6, 6) # probabilities for 6 faces
# Chi-square Goodness of Fit Test
test <- chisq.test(x = observed, p = expected_prob)
print(test)
Chi-squared test for given probabilities
data: observed
X-squared = 1, df = 5, p-value = 0.9626
Here, p-value = 0.9626 > 0.05. Therefore, we fail to reject the null hypothesis.
Conclusion: At the 5% significance level, there is no significant evidence that the die is
unfair. The observed frequencies are consistent with a fair die.
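To see where X-squared = 1 comes from, the statistic can also be computed by hand as the sum of (O - E)² / E over the six faces. The sketch below is an illustrative check rather than part of the original analysis; the variable names are chosen here.

```r
observed <- c(8, 9, 10, 12, 11, 10)
expected <- sum(observed) * rep(1/6, 6)  # 60 rolls, so 10 expected per face

# Chi-square statistic: sum of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom = number of categories - 1
df_chi <- length(observed) - 1

# Upper-tail p-value and the 5% critical value
p_value  <- pchisq(chi_sq, df = df_chi, lower.tail = FALSE)
critical <- qchisq(0.95, df = df_chi)

chi_sq    # test statistic
p_value   # matches the p-value reported by chisq.test()
critical  # reject H0 only if chi_sq exceeds this value
```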
A researcher wants to test whether the mean test scores differ among three different teaching
methods. The scores of students are recorded as follows:
Group A (Method 1): 85, 90, 88
Group B (Method 2): 70, 75, 80
Group C (Method 3): 95, 92, 89
Is there a significant difference in mean scores among the three teaching methods at the 5%
significance level?
Step 1: Hypotheses
Null Hypothesis (H₀): μ₁ = μ₂ = μ₃ (all groups have equal mean scores).
Alternative Hypothesis (H₁): At least one group mean differs.
ANOVA (Analysis of Variance)
•Compare means across 3+ groups
df <- data.frame( score = c(85,90,88,70,75,80,95,92,89),
group = rep(c("A","B","C"), each=3))
aov_res <- aov(score ~ group, data=df)
summary(aov_res)
F value = 17.41: This tells us how much larger the between-group variance is compared to the
within-group variance.
p-value = 0.00317: The probability of getting an F-value as large as 17.41 (or larger) if H₀ is true.
Since p = 0.00317 < 0.05, we reject the null hypothesis.
Conclusion: At least one group mean is significantly different.
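The F value and p-value discussed above can be pulled out of the ANOVA table programmatically. The sketch below refits the same model on the problem's data; the names scores, aov_table, f_value, and p_value are chosen here for illustration.

```r
# Scores for the three teaching methods from the problem statement
scores <- data.frame(
  score = c(85, 90, 88, 70, 75, 80, 95, 92, 89),
  group = rep(c("A", "B", "C"), each = 3)
)

aov_res <- aov(score ~ group, data = scores)

# summary() returns a list; its first element is the ANOVA table
aov_table <- summary(aov_res)[[1]]
f_value <- aov_table[1, "F value"]
p_value <- aov_table[1, "Pr(>F)"]

# F is the ratio of between-group to within-group mean squares
ms_between <- aov_table[1, "Mean Sq"]
ms_within  <- aov_table[2, "Mean Sq"]
ms_between / ms_within

f_value
p_value
```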
Probability Distributions in R
Binomial Distribution
A fair coin (p = 0.5 for heads) is tossed 10 times.
Find the probability of getting exactly 5 heads.
Simulate 10 random outcomes of tossing the coin 10 times.
Here,
n = 10 (number of trials)
k = 5 (number of successes)
p = 0.5 (probability of success on each trial)
# Probability of 5 successes in 10 trials with p=0.5
dbinom(5, size=10, prob=0.5)
# Generate random binomial values
rbinom(10, size=10, prob=0.5)
Output Example:
[1] 0.2460938 # Probability of exactly 5 successes
[1] 6 5 4 7 5 6 5 4 3 7 # Random outcomes (these vary from run to run)
Conclusion for the Problem
The probability of getting exactly 5 heads in 10 coin tosses is 0.2461 (≈ 24.6%).
◦ This means that if we repeat the experiment many times, about 1 in 4 repetitions will result in exactly 5
heads.
The random simulation using rbinom() shows how the number of heads can vary across
repeated experiments of 10 tosses each.
◦ The values fluctuate around 5, consistent with the theoretical expectation (n × p = 5).
Overall: The binomial model confirms that getting exactly 5 successes (heads) in 10 fair coin
tosses is the most likely outcome, but not guaranteed; it occurs about 25% of the time.
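These statements can be checked in R. The sketch below compares dbinom() with the binomial formula C(n, k) p^k (1 - p)^(n - k), finds the most likely number of heads, and estimates the same probability from a larger simulation. The seed and the number of simulated experiments are arbitrary choices made for this example.

```r
n <- 10
p <- 0.5

# Exactly 5 heads: formula versus dbinom()
by_formula <- choose(n, 5) * p^5 * (1 - p)^(n - 5)
by_dbinom  <- dbinom(5, size = n, prob = p)

# Probabilities for 0..10 heads, and the most likely count
probs <- dbinom(0:n, size = n, prob = p)
most_likely <- (0:n)[which.max(probs)]

# Simulate 10,000 experiments of 10 tosses each
set.seed(42)  # arbitrary seed, only for reproducibility
sims <- rbinom(10000, size = n, prob = p)
mean(sims == 5)  # share of experiments with exactly 5 heads

by_formula
by_dbinom
most_likely
```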
Poisson Distribution
A call center receives an average of 3 calls per minute.
Find the probability that exactly 5 calls are received in a given minute.
Find the probability that at most 2 calls are received in a given minute.
Simulate the number of calls received in 10 minutes using the Poisson distribution.
# 1. Probability of exactly 5 calls (lambda = 3)
dpois(5, lambda = 3)
# 2. Probability of at most 2 calls
ppois(2, lambda = 3)
# 3. Simulate number of calls in 10 minutes
rpois(10, lambda = 3)
# 1. Probability of exactly 5 calls
dpois(5, lambda = 3)
[1] 0.1008188
# 2. Probability of at most 2 calls
ppois(2, lambda = 3)
[1] 0.4231901
# 3. Simulate number of calls in 10 minutes
rpois(10, lambda = 3)
[1] 3 3 1 4 2 2 3 2 0 8
Conclusion
The probability of getting exactly 5 calls in one minute is ≈ 10%.
The probability of getting at most 2 calls is ≈ 42%.
The simulation shows how the number of calls fluctuates around the average (λ = 3).
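The relationship between dpois() and ppois() can be made explicit: the cumulative probability of at most 2 calls is the sum of the probabilities of 0, 1, and 2 calls. The sketch below also computes the complementary probability and checks the simulated average over a longer run; the seed and run length are arbitrary choices for this example.

```r
lambda <- 3

# P(X <= 2) two ways: cumulative function versus summing point probabilities
at_most_2   <- ppois(2, lambda = lambda)
summed_0_2  <- sum(dpois(0:2, lambda = lambda))

# P(X > 2) is the complement, available through lower.tail = FALSE
more_than_2 <- ppois(2, lambda = lambda, lower.tail = FALSE)

# Simulate 10,000 minutes; the sample mean should be close to lambda
set.seed(1)  # arbitrary seed, only for reproducibility
calls <- rpois(10000, lambda = lambda)
mean(calls)

at_most_2
summed_0_2
more_than_2
```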
Normal Distribution
The exam scores of students in a class are normally distributed with a mean (μ) = 70 and a
standard deviation (σ) = 10.
Find the probability that a randomly selected student scores less than 80.
Find the probability that a student scores between 60 and 75.
Simulate the exam scores of 10 students using the given distribution.
# 1. Probability of scoring less than 80
pnorm(80, mean = 70, sd = 10)
# 2. Probability of scoring between 60 and 75
pnorm(75, mean = 70, sd = 10) - pnorm(60, mean = 70, sd = 10)
# 3. Simulate exam scores for 10 students
rnorm(10, mean = 70, sd = 10)
P(X < 80):
pnorm(80, mean = 70, sd = 10)
# [1] 0.8413447
There is about an 84.13% chance that a student scores less than 80.
P(60 < X < 75):
pnorm(75, mean = 70, sd = 10) - pnorm(60, mean = 70, sd = 10)
# [1] 0.5328072
There is about a 53.28% chance that a student scores between 60 and 75.
Simulated scores:
rnorm(10, mean = 70, sd = 10)
# Example output: [1] 68.5 72.3 81.0 59.4 74.2 65.8 69.1 77.6 62.8 71.4
These are 10 randomly generated exam scores based on the normal distribution.
Conclusion
Most students are likely to score below 80 (84% probability).
A little over half of the students (53% probability) are expected to score between 60 and 75.
The simulated results reflect how scores cluster around the mean (70), with variation governed by the
standard deviation (10).
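The same probabilities can be obtained by standardizing: a score of 80 lies z = (80 - 70) / 10 = 1 standard deviation above the mean, so P(X < 80) equals the standard normal probability P(Z < 1). The sketch below shows this and compares the theoretical values with a larger simulation; the seed and sample size are arbitrary choices for this example.

```r
mu    <- 70
sigma <- 10

# Standardize the cutoff and use the standard normal distribution
z80 <- (80 - mu) / sigma
p_below_80_z   <- pnorm(z80)
p_below_80_raw <- pnorm(80, mean = mu, sd = sigma)

# P(60 < X < 75) as a difference of two cumulative probabilities
p_60_75 <- pnorm(75, mean = mu, sd = sigma) - pnorm(60, mean = mu, sd = sigma)

# Simulate 10,000 scores and compare the observed share below 80
set.seed(123)  # arbitrary seed, only for reproducibility
sim_scores <- rnorm(10000, mean = mu, sd = sigma)
mean(sim_scores < 80)

p_below_80_z
p_below_80_raw
p_60_75
```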
Entrepreneurship With R Software
Entrepreneurs can use R for business analytics, enabling data-driven decision-making
through statistical analysis, forecasting, and visualization of trends in customer behavior,
sales, and system performance. With R, they can load, manipulate, and visualize
complex datasets, identify patterns, build predictive models, and generate actionable insights from
market and operational data to gain a competitive edge.
Business Analytics & Data Mining:
R is an open-source statistical software environment used for high-end graphics and statistical
computations, making it a powerful tool for business analytics and data mining.
Data Visualization:
Entrepreneurs can use packages like ggplot2 to create clear, easy-to-read charts and graphs that
transform raw data into impactful visualizations, helping to identify trends and patterns in data.
Predictive Analytics:
R enables businesses to forecast trends, classify outcomes, and analyze time-dependent data using
regression analysis, classification models, and time series forecasting.
Customer Behavior Analysis:
By analyzing customer data, entrepreneurs can understand needs and preferences, leading to more
tailored products and services.
Sales & Performance Forecasting:
R can be used to forecast sales, model system performance, and predict potential losses,
supporting smarter, data-driven business decisions.
Quantitative Research:
R facilitates the analysis of public and proprietary datasets, helping entrepreneurs conduct
research, test theories, and generate strategic insights.
Market Trend Analysis:
Entrepreneurs can analyze social media trends and market data to refine strategies, improve
customer engagement, and make informed business decisions.
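As a small illustration of the sales forecasting use case above, the sketch below fits a straight-line trend with lm() and projects it three months ahead. The monthly figures are hypothetical numbers invented for this example, and a simple linear trend is only one of many possible forecasting models.

```r
# Hypothetical monthly unit sales for one year (invented for illustration)
sales <- data.frame(
  month = 1:12,
  units = c(120, 132, 128, 141, 150, 158, 155, 167, 175, 181, 190, 196)
)

# Fit a linear trend: units = intercept + slope * month
fit <- lm(units ~ month, data = sales)
coef(fit)  # the slope estimates the average monthly change in units

# Project the trend onto the next three months
future_units <- predict(fit, newdata = data.frame(month = 13:15))
future_units

# Plot the observed sales with the fitted trend line
plot(sales$month, sales$units, type = "o", col = "blue",
     main = "Monthly Sales with Linear Trend",
     xlab = "Month", ylab = "Units sold")
abline(fit, col = "red")
```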