Unit 4 Statistical Data Analysis for BTech 5th SEM

UNIT IV
Statistical Data Analysis & Inference
Notes by:
Dr. Seema Gulati

Populations and Samples in Statistical
Analysis
• Population: The complete set of all individuals,
objects, or events of interest in a study.
• Example: All registered voters in a country.
• Characteristics are called parameters (e.g., population mean μ, population
proportion p).
• Sample: A subset of the population selected for
analysis.
• Example: 1,000 voters surveyed before an election.
• Characteristics are called statistics (e.g., sample mean x̄, sample
proportion p̂).

Why Use Samples Instead of Populations?
 Cost & Practicality: Measuring an entire population is often
expensive or impossible.
 Time Efficiency: Sampling allows faster data collection and
analysis.
 Feasibility: Some tests are destructive (e.g., crash-testing cars).
 Accuracy: Proper sampling can yield highly precise estimates.

Sampling Methods
 Probability Sampling (Random Selection)
 Non-Probability Sampling (Non-Random)

Probability Sampling
• Simple Random Sampling: Every member has an equal chance of
selection.
• Stratified Sampling: Population divided into subgroups (strata),
then random samples are taken from each.
• Cluster Sampling: Population divided into clusters (e.g., cities),
and entire clusters are randomly selected.
• Systematic Sampling: Selecting every k-th element from a list
(e.g., every 10th person).
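
The probability sampling methods above can be sketched in base R. This is only an illustration under stated assumptions: the `population` data frame, its `stratum` column and the sample sizes are hypothetical and do not come from the notes.

```r
set.seed(42)  # make the random draws reproducible

# Hypothetical population of 100 units in two strata of 50 each
population <- data.frame(id = 1:100, stratum = rep(c("A", "B"), each = 50))

# Simple random sampling: 10 rows, each equally likely
srs <- population[sample(nrow(population), 10), ]

# Stratified sampling: 5 random rows from each stratum
rows_by_stratum <- split(seq_len(nrow(population)), population$stratum)
stratified <- population[unlist(lapply(rows_by_stratum, sample, size = 5)), ]

# Systematic sampling: random start, then every k-th row
k <- 10
start <- sample(k, 1)
systematic <- population[seq(start, nrow(population), by = k), ]
```

Cluster sampling would follow the same pattern, except that whole groups are drawn with `sample()` instead of individual rows.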

Non-Probability Sampling
• Convenience Sampling: Using readily available subjects (e.g.,
online surveys).
• Purposive Sampling: Selecting subjects based on researcher’s
judgment.
• Snowball Sampling: Existing participants recruit others (used in
hard-to-reach populations).

Sampling Bias & Errors
• Sampling Bias: When the sample is not representative of the
population.
• Example: Surveying only college students about national voting
trends.
• Non-Sampling Errors: Mistakes in data collection (e.g.,
measurement errors, response bias).
• Sampling Error: Natural variation between sample statistics and
population parameters.

Example: Population vs. Sample in
Research
Aspect | Population Example | Sample Example
Definition | All adults in India | 1,000 surveyed adults
Parameter/Statistic | True average income (μ) | Sample average income (x̄)
Data Collection | Census (rarely feasible) | Surveys, experiments

Statistical Modelling
Statistical modelling involves using mathematical equations to
represent relationships in data. Models help:
• Describe patterns in observed data
• Predict future outcomes
• Infer causal relationships
Statistical models can be broadly categorized into descriptive,
predictive, prescriptive, and inferential models, each serving a unique
purpose, from summarizing data to making predictions and
recommendations.

1. Descriptive Models:
Purpose: Summarize and describe the characteristics of a dataset.
Examples:
 Descriptive Statistics: Calculating measures like mean, median,
mode, standard deviation, and percentiles.
 Frequency Distributions and Histograms: Visualizing the
distribution of data.
 Scatterplots and Line Graphs: Showing relationships between
variables.

2. Predictive Models:
Purpose: Predict future outcomes or values based on historical data.
Examples:
 Regression Models: Predicting a continuous outcome variable (e.g., sales,
temperature) based on one or more predictor variables.
 Classification Models: Predicting a categorical outcome (e.g., spam/not spam,
customer churn).
 Time Series Analysis: Forecasting future values based on past data patterns.
 Machine Learning Algorithms: Using algorithms like decision trees, random
forests, and neural networks for complex predictions.

3. Prescriptive Models:
Purpose: Provide recommendations or guidance for decision-making.
Examples:
 Optimization Models: Finding the best solution to a problem (e.g., resource
allocation, scheduling).
 Decision Analysis: Evaluating different options and their potential
outcomes.
 Simulation Models: Simulating real-world scenarios to test different
strategies.

4. Inferential Models:
Purpose: Draw conclusions about a population based on a sample of data.
Examples:
 Hypothesis Testing: Determining if there is a statistically significant
difference between groups or variables.
 Confidence Intervals: Estimating the range within which a population
parameter is likely to fall.
 Statistical Tests: Using tests like t-tests, chi-squared tests, and ANOVA
to analyze data.

Probability Fundamentals
 Probability quantifies uncertainty, ranging from 0 (impossible) to 1 (certain).
• Random Variable (RV): A variable whose possible values are outcomes of a
random phenomenon.
• Discrete (e.g., dice rolls)
• Continuous (e.g., height measurements)
• Probability Rules:
• Addition Rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
• Multiplication Rule: P(A ∩ B) = P(A) × P(B|A)
• Bayes’ Theorem: Updates probabilities based on new evidence:
P(A|B) = P(B|A) × P(A) / P(B)
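
A small numeric sketch of Bayes' theorem in R may help. The scenario (a diagnostic test) and all three input probabilities are made-up assumptions for illustration only.

```r
# Hypothetical inputs (assumptions, not from the notes)
prior <- 0.01                # P(disease)
sensitivity <- 0.95          # P(positive | disease)
false_positive_rate <- 0.05  # P(positive | no disease)

# Total probability of a positive result, P(positive)
p_positive <- sensitivity * prior + false_positive_rate * (1 - prior)

# Bayes' theorem: P(disease | positive)
posterior <- sensitivity * prior / p_positive
posterior
```

Even with an accurate test, the posterior stays well below the sensitivity because the prior is small.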

Probability Distributions
 Probability distributions describe how probabilities are distributed
over a random variable’s values.
A. Discrete Distributions
Distribution | Description | Example Use
Bernoulli | Binary outcomes (success/failure) | Coin toss
Binomial | Count of successes in n trials | Number of defective items in a batch
Poisson | Counts of rare events in fixed intervals | Website visits per hour

Probability Distributions
B. Continuous Distributions
Distribution | Description | Example Use
Normal (Gaussian) | Symmetric, bell-shaped | Heights, IQ scores
Exponential | Time between events | Waiting times
Uniform | Equal probability over range | Random number generation
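
Base R ships density, distribution and random-generation functions for these distributions. The sketch below uses them with parameter values that are assumptions chosen purely for illustration.

```r
# Binomial: probability of exactly 2 defective items in a batch of 10,
# assuming each item is defective with probability 0.1
p_two_defects <- dbinom(2, size = 10, prob = 0.1)

# Poisson: probability of 0 website visits in an hour, assuming 3 per hour on average
p_no_visits <- dpois(0, lambda = 3)

# Normal: probability that an IQ score is below 100, assuming mean 100 and sd 15
p_below_mean <- pnorm(100, mean = 100, sd = 15)

# Exponential: probability of waiting longer than 2 time units, assuming rate 0.5
p_wait_over_2 <- pexp(2, rate = 0.5, lower.tail = FALSE)

# Uniform: five random numbers between 0 and 1
u <- runif(5, min = 0, max = 1)
```

A Bernoulli trial is the binomial case with `size = 1`.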

Getting started with R
1. Go to https://cloud.r-project.org/
2. Download and install R

What is R?
 R is a popular programming language used for statistical computing
and graphical presentation.
 Its most common use is to analyze and visualize data.

Why Use R?
• It is a great resource for data analysis, data visualization, data
science and machine learning
• It provides many statistical techniques (such as statistical tests,
classification, clustering and data reduction)
• It is easy to draw graphs in R, like pie charts, histograms, box plots,
scatter plots, etc.
• It works on different platforms (Windows, Mac, Linux)
• It is open-source and free
• It has strong community support
• It has many packages (libraries of functions) that can be used to
solve different problems

Feature | R | Python
Introduction | R is a language and environment for statistical programming, which includes statistical computing and graphics. | Python is a general-purpose programming language for data analysis and scientific computing.
Objective | It has many features which are useful for statistical analysis and representation. | It can be used to develop GUI applications and web applications as well as with embedded systems.
Workability | It has many easy-to-use packages for performing tasks. | It can easily perform matrix computation as well as optimization.
Integrated development environment | Various popular R IDEs are RStudio, RKWard, R Commander, etc. | Various popular Python IDEs are Spyder, Eclipse+PyDev, Atom, etc.
Libraries and packages | There are many packages and libraries like ggplot2, caret, etc. | Some essential packages and libraries are pandas, NumPy, SciPy, etc.
Scope | It is mainly used for complex data analysis in data science. | It takes a more streamlined approach for data science projects.

R Syntax
• To output text in R, use single or double quotes:
"Hello World!"
• To output numbers, just type the number (without quotes):
5
10
25
• To do simple calculations, add numbers together:
5 + 5

R Print Output
• You can output values in R without using a print function:
"Hello World!"
• A print() function is also available:
print("Hello World!")
• There are times you must use the print() function to produce output, for
example, inside a for loop, where values are not printed automatically:
for (x in 1:10) {
  print(x)
}

Comments
• Comments can be used to explain R code and to make it more readable.
They can also be used to prevent execution when testing alternative code.
• Comments start with a #. When executing code, R ignores everything from
the # to the end of the line.
# This is a comment
"Hello World!"
"Hello World!" # This is a comment

Multiline Comments
 There is no syntax in R for multiline comments.
 However, we can just insert a # for each line to create multiline comments:
# This is a comment
# written in
# more than just one line
"Hello World!"

Creating Variables in R
• Variables are containers for storing data values.
• R does not have a command for declaring a variable. A variable is created the
moment you first assign a value to it. To assign a value to a variable, use the <-
sign. To output (or print) the variable value, just type the variable name:
name <- "John"
age <- 40
name # output "John"
age # output 40
• In other programming languages, it is common to use = as an assignment operator.
In R, we can use both = and <- as assignment operators.
• However, <- is preferred in most cases because = does not act as an assignment
in some contexts in R (for example, inside a function call it names an argument).

Print / Output Variables
 Compared to many other programming languages, you do not have to
use a function to print/output variables in R. You can just type the name
of the variable:
name <- "John Doe"
name # auto-print the value of the name variable
 R does have a print() function available if you want to use it.
name <- "John Doe"
print(name) # print the value of the name variable

R Concatenate Elements
 You can also concatenate, or join, two or more elements, by using the
paste() function.
text <- "awesome"
paste("R is", text)
• paste() accepts any number of comma-separated arguments, so two variables
can be joined in the same way:
text1 <- "R is"
text2 <- "awesome"
paste(text1, text2)

 For numbers, the + character works as a mathematical operator:
num1 <- 5
num2 <- 10
num1 + num2
 If you try to combine a string (text) and a number, R will give you
an error:
num <- 5
text <- "Some text"
num + text
Result:
Error in num + text : non-numeric argument to binary operator

R Multiple Variables
 R allows you to assign the same value to multiple variables in one line:
# Assign the same value to multiple variables in one line
var1 <- var2 <- var3 <- "Orange"
# Print variable values
var1
var2
var3

R Variable Names (Identifiers)
• A variable can have a short name (like x and y) or a more descriptive
name (age, carname, total_volume). Rules for R variables are:
• A variable name must start with a letter or a period (.) and can be a
combination of letters, digits, periods (.) and underscores (_). If it starts
with a period, the period cannot be followed by a digit.
• A variable name cannot start with a number or underscore (_)
• Variable names are case-sensitive (age, Age and AGE are three different
variables)
• Reserved words cannot be used as variables (TRUE, FALSE, NULL, if...)

# Legal variable names:
myvar <- "John"
my_var <- "John"
myVar <- "John"
MYVAR <- "John"
myvar2 <- "John"
.myvar <- "John"
# Illegal variable names:
2myvar <- "John"
my-var <- "John"
my var <- "John"
_my_var <- "John"
my_v@ar <- "John"
TRUE <- "John"

Basic Data Types
 numeric - (10.5, 55, 787)
 integer - (1L, 55L, 100L, where the letter "L" declares this as an
integer)
 complex - (9 + 3i, where "i" is the imaginary part)
 character (string) - ("k", "R is exciting", "FALSE", "11.5")
 logical (boolean) - (TRUE or FALSE)

x <- 10.5
class(x) # numeric
x <- 1000L
class(x) # integer
x <- 9i + 3
class(x) # complex
x <- "R is exciting"
class(x) # character/string
x <- TRUE
class(x) # logical/boolean

Simple Math
 In R, you can use operators to perform common mathematical
operations on numbers.
 The + operator is used to add together two values:
10+5
 And the - operator is used for subtraction:
10-5

Built-in Math Functions
• R also has many built-in math functions that allow you to perform
mathematical tasks on numbers.
• For example, the min() and max() functions can be used to find the
lowest or highest number in a set:
max(5, 10, 15)
min(5, 10, 15)

 sqrt(): function returns the square root of a number:
sqrt(16)
 abs(): function returns the absolute (positive) value of a
number:
abs(-4.7)
 ceiling() and floor(): The ceiling() function rounds a number
upwards to its nearest integer, and the floor() function rounds a
number downwards to its nearest integer, and returns the
result:
ceiling(1.4)
floor(1.4)

String Literals
• Strings are used for storing text.
• A string is surrounded by either single quotation marks or double
quotation marks:
• "hello" is the same as 'hello'.
• To assign a string to a variable, write the variable name followed
by the <- operator and the string:

Multiline Strings
 You can assign a multiline string to a variable like this:

String Length
 There are many useful string functions in R.
 For example, to find the number of characters in a string, use the
nchar() function:

Check a String
 Use the grepl() function to check if a character or a sequence of
characters are present in a string:
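
The string slides above end where their code examples would be. A minimal sketch of string assignment, a multiline string, nchar() and grepl() follows; the variable names and the text values are illustrative assumptions.

```r
# Assign a string to a variable
greeting <- "Hello World!"

# A multiline string: the line break is stored as part of the string
multiline <- "Lorem ipsum dolor sit amet,
consectetur adipiscing elit."
cat(multiline)  # cat() prints the line break as an actual new line

# Number of characters in a string
n_chars <- nchar(greeting)

# Check whether a character or sequence of characters is present
has_H <- grepl("H", greeting)
has_X <- grepl("X", greeting)
```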

Assignment Operators
1. = (Simple Assignment)
2. <- (Leftward Assignment)
3. -> (Rightward Assignment)

my_var <- 3
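my_var # print my_var
my_var <<- 3
3 -> my_var
3 ->> my_var
<<- is the global assignment operator: inside a function it assigns to a
variable in an enclosing environment, or in the global environment if no such
variable exists. The assignment operators can also point to the right:
x <- 3 is equivalent to 3 -> x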

The if Statement
• An "if statement" is written with the if keyword, and it is used to
specify a block of code to be executed if a condition is TRUE:
a <- 33
b <- 200
if (b > a) {
  print("b is greater than a")
}

If Else
a <- 200
b <- 33
if (b > a) {
print("b is greater than a")
} else if (a == b) {
print("a and b are equal")
} else {
print("a is greater than b")
}

Nested If Statements
x <- 41
if (x > 10) {
print("Above ten")
if (x > 20) {
print("and also above 20!")
} else {
print("but not above 20.")
}
} else {
print("below 10.")
}

AND
 The & symbol (and) is a logical operator, and is used to combine
conditional statements:
a <- 200
b <- 33
c <- 500
if (a > b & c > a) {
print("Both conditions are true")
}

OR
 The | symbol (or) is a logical operator, and is used to combine
conditional statements.
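
A short sketch of the | operator, reusing the a, b and c values from the AND slide; the message text and the `result` variable are illustrative assumptions.

```r
a <- 200
b <- 33
c <- 500

# The condition is TRUE if at least one side is TRUE
if (a > b | a > c) {
  result <- "At least one of the conditions is true"
} else {
  result <- "Neither condition is true"
}
print(result)
```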

R While Loop
 With the while loop we can execute a set of statements as long as a
condition is TRUE:

Break
 With the break statement, we can stop the loop even if the while
condition is TRUE.
 The loop stops at 3 because we have chosen to finish the loop by using
the break statement when i is equal to 4 (i == 4).
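
The while loop and break behaviour described above can be sketched as follows. The loop bound of 6 and the `printed` vector, which only records what was printed, are assumptions added for illustration.

```r
printed <- c()  # records each printed value

i <- 1
while (i < 6) {
  print(i)
  printed <- c(printed, i)
  i <- i + 1
  if (i == 4) {
    break  # leave the loop although i < 6 is still TRUE
  }
}
```

Without the break statement the loop would continue until i reaches 6.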

For Loop
 A for loop is used for iterating over a sequence.

Functions in R
 To create a function, use the
function() keyword:
 Arguments are specified after the function
name, inside the parentheses. You can add as
many arguments as you want, just separate
them with a comma.

R recursive factorial function
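
The slides name these examples but their code did not survive. A minimal sketch of a function with two arguments and of a recursive factorial follows; the function names `full_name` and `factorial_recursive` are hypothetical.

```r
# A function with two arguments, separated by a comma
full_name <- function(fname, lname) {
  paste(fname, lname)
}
full_name("Peter", "Griffin")

# Recursive factorial: n! = n * (n - 1)!, with 0! = 1! = 1
factorial_recursive <- function(n) {
  if (n <= 1) {
    return(1)
  }
  n * factorial_recursive(n - 1)
}
factorial_recursive(5)
```

Base R also provides a built-in factorial() function, so the recursive version is only a teaching example.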

Vectors in R
 A vector is simply a list of items that are of the same type.
 To combine the list of items into a vector, use the c() function and
separate the items by a comma.

To create a vector with numerical values in a sequence,
use the : operator:
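
A brief sketch of both ways of building a vector; the variable names and values are illustrative assumptions.

```r
# Vector of strings built with c()
fruits <- c("banana", "apple", "orange")

# Vector of numerical values built with c()
numbers <- c(1, 2, 3)

# Vector with numerical values in a sequence, built with the : operator
sequence <- 1:10

fruits
numbers
sequence
```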

R Lists
 A list in R can contain many different data types inside it. A list is a
collection of data which is ordered and changeable.
# List of strings
thislist <- list("apple", "banana", "cherry")
# Print the list
thislist
thislist[1]
thislist[1] <- "blackcurrant"
# Print the updated list
thislist

To add an item to the right of a specified index, add "after=index
number" in the append() function:

Remove List Items
The following example creates a new, updated list without an "apple"
item:

Join Lists
 There are several ways to join, or
concatenate, two or more lists in R.
 The most common way is to use the c()
function, which combines two elements
together:
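
The three list operations described above (adding after an index, removing an item, joining lists) can be sketched like this. The list contents and variable names are illustrative assumptions.

```r
thislist <- list("apple", "banana", "cherry")

# Add "orange" to the right of index 2
with_orange <- append(thislist, "orange", after = 2)

# Create a new list without the first item ("apple")
without_apple <- thislist[-1]

# Join two lists with c()
list1 <- list("a", "b", "c")
list2 <- list(1, 2, 3)
joined <- c(list1, list2)
```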

R Matrices
 A matrix is a two-dimensional data set with columns and rows.
 A column is a vertical representation of data, while a row is a
horizontal representation of data.
 A matrix can be created with the matrix() function. Specify the nrow
and ncol parameters to get the amount of rows and columns:
# Create a matrix
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
# Print the matrix
thismatrix

Creating a string type matrix
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
thismatrix
You can access the items by using [ ] brackets. The first number "1" in the bracket
specifies the row-position, while the second number "2" specifies the column-
position:

The whole row can be accessed if you specify a comma after the
number in the bracket:
The whole column can be accessed if you specify a comma before the
number in the bracket:
More than one row can be accessed if you use the c() function:
thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape",
"pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
thismatrix[c(1,2),]
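
The access patterns described above (single item, whole row, whole column) can be sketched with the 2 x 2 string matrix; the result variable names are assumptions. Note that matrix() fills column by column by default.

```r
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)

# Item in row 1, column 2
item <- thismatrix[1, 2]

# Whole second row: comma after the number
second_row <- thismatrix[2, ]

# Whole second column: comma before the number
second_col <- thismatrix[, 2]
```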

Arrays
 Compared to matrices, arrays can have more than two dimensions.
 We can use the array() function to create an array, and the dim parameter to specify the
dimensions.

Accessing Arrays
 You can access the array elements by referring to the index position.
You can use the [] brackets to access the desired elements from an
array:

 You can also access the whole row or column from a matrix in an
array, by using the c() function:

Check if an Item Exists
 To find out if a specified item is present in an array, use the %in%
operator:

Number of Rows and Columns
• Use the dim() function to find the number of rows and columns in
an array:
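
The array slides can be illustrated together in one sketch. The values 1 to 24 and the 4 x 3 x 2 shape are assumptions chosen for illustration.

```r
# An array with 4 rows, 3 columns and 2 "layers" (matrices)
thisarray <- 1:24
multiarray <- array(thisarray, dim = c(4, 3, 2))

# Access one element: row 2, column 3, matrix 2
element <- multiarray[2, 3, 2]

# Access the whole first row of the first matrix
first_row <- multiarray[c(1), , 1]

# Check whether the value 2 is present in the array
has_two <- 2 %in% multiarray

# Dimensions (rows, columns, matrices) and total number of elements
array_dim <- dim(multiarray)
n_elements <- length(multiarray)
```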

Array Length
• Use the length() function to find the total number of elements in an array:

R Data Frames
• Data Frames are data displayed in a table format.
• A Data Frame can hold different types of data: the first column can be
character while the second and third are numeric or logical. However, all
values within one column must have the same type.
• Use the data.frame() function to create a data frame:

Summarize the Data
 Use the summary() function to summarize the data from
a Data Frame:

Access Items in Frames
 We can use single brackets [ ], double brackets [[ ]] or $
to access columns from a data frame:
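
A minimal sketch of creating, summarizing and accessing a data frame. It reuses the Training/Pulse/Duration columns that appear later in these notes; the variable names are assumptions.

```r
# Create a data frame
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Summarize each column
summary(Data_Frame)

# Three ways to access the first column
as_frame <- Data_Frame[1]             # single brackets return a data frame
as_vector <- Data_Frame[["Training"]] # double brackets return the column itself
by_name <- Data_Frame$Training        # $ also returns the column itself
```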

Add Rows
 Use the rbind() function to add new rows in a Data
Frame:

Add Columns
 Use the cbind() function to add new columns in a Data
Frame:

Remove Rows and Columns
• Use negative indices, built with the c() function, to remove rows and
columns in a Data Frame:

Number of Rows and Columns
• Use the dim() function to find the number of rows and
columns in a Data Frame:

Data Frame Length
 Use the length() function to find the number of columns
in a Data Frame (similar to ncol()):
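
The data frame operations from the preceding slides can be sketched together. The added row and the Steps values are made-up illustration data, and the variable names are assumptions.

```r
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Add a row with rbind()
with_row <- rbind(Data_Frame,
                  data.frame(Training = "Strength", Pulse = 110, Duration = 110))

# Add a column with cbind()
with_col <- cbind(Data_Frame, Steps = c(1000, 6000, 2000))

# Remove the first row and the first column with negative indices
trimmed <- Data_Frame[-c(1), -c(1)]

# Number of rows and columns, and number of columns
frame_dim <- dim(Data_Frame)
n_cols <- length(Data_Frame)
```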

Combining Data Frames
Use the rbind() function to combine two or more data frames in R vertically:
Data_Frame1 <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)
Data_Frame2 <- data.frame(
  Training = c("Stamina", "Stamina", "Strength"),
  Pulse = c(140, 150, 160),
  Duration = c(30, 30, 20)
)
# Stack the rows of the second data frame under the first
New_Data_Frame <- rbind(Data_Frame1, Data_Frame2)
New_Data_Frame

 Use the cbind() function to combine two or more data
frames in R horizontally:
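Data_Frame3 <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)
Data_Frame4 <- data.frame(
  Steps = c(3000, 6000, 2000),
  Calories = c(300, 400, 300)
)
# Place the columns of the second data frame next to the first
New_Data_Frame1 <- cbind(Data_Frame3, Data_Frame4)
New_Data_Frame1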

Factors
Factors are used to categorize data. Examples of factors are:
 Demography: Male/Female
 Music: Rock, Pop, Classic, Jazz
 Training: Strength, Stamina
To create a factor, use the factor() function and add a vector as argument:
# Create a factor
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
# Print the factor
music_genre

• From the example above, you can see that the factor has four levels
(categories): Classic, Jazz, Pop and Rock.
• To print only the levels, use the levels() function:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
levels(music_genre)

Set the levels, by adding the levels argument inside the
factor() function:
music_genre <-
factor(c("Jazz", "Rock", "Classic", "Classic",
"Pop", "Jazz", "Rock", "Jazz"), levels =
c("Classic", "Jazz", "Pop", "Rock", "Other"))
levels(music_genre)

Factor Length
 Use the length() function to find out how many items there are in the
factor:
music_genre <-
factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz")
)
length(music_genre)

Access Factors
 To access the items in a factor, refer to the index number, using []
brackets:

Change Item Value
 To change the value of a specific item, refer to the index number:
 You cannot change the value of a specific item if it is not already
specified in the factor.
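
A short sketch of accessing and changing factor items. The choice of index 3 and of the values "Pop" and "Opera" is an illustrative assumption.

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

# Access the third item
third <- music_genre[3]

# Changing an item to an existing level works
music_genre[3] <- "Pop"
changed <- music_genre[3]

# "Opera" is not a level, so R issues a warning and stores NA instead
suppressWarnings(music_genre[3] <- "Opera")
after_invalid <- music_genre[3]
```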

 However, if you have already specified it inside the levels argument,
it will work:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"),
  levels = c("Classic", "Jazz", "Pop", "Rock", "Opera"))
music_genre[3] <- "Opera"
music_genre[3]

R Plotting
• The plot() function is used to draw points (markers) in a diagram.
• The function takes parameters for specifying points in the diagram.
• Parameter 1 specifies points on the x-axis.
• Parameter 2 specifies points on the y-axis.
• At its simplest, you can use the plot() function to plot two numbers against each other:

Example
Draw one point in the diagram, at position (1, 3), that is, x = 1 and y = 3:
plot(1, 3)

To draw more points, use vectors:
Draw two points in the diagram, one at position (1, 3) and one in
position (8, 10):
plot(c(1, 8), c(3, 10))

Multiple Points
We can plot as many points as we need, but the number of points on both
axes must be the same:

For better organization, when you have many values, it is better to use variables:
x <- c(1, 2, 3, 4, 5)
y <- c(3, 7, 8, 9, 12)
plot(x, y)

Sequences of Points
plot(1:10)

Draw a Line
• The plot() function also takes a type parameter with the value "l" to
draw a line to connect all the points in the diagram:
plot(1:10, type = "l")

Plot Labels
The plot() function also accepts other parameters, such as main, xlab
and ylab, if you want to customize the graph with a main title and
different labels for the x-axis and y-axis:
Example
plot(1:10, main="My Graph", xlab="The x-axis", ylab="The y axis")

Graph Appearance
There are many other parameters you can use to change the appearance
of the points.
Colors
Use col="color" to add a color to the points:
Example
plot(1:10, col="red")
Size
Use cex=number to change the size of the points (1 is default, while
0.5 means 50% smaller, and 2 means 100% larger):
Example
plot(1:10, cex=2)

Point Shape
Use pch with a value from 0 to 25 to change the point shape format:
plot(1:10, pch=25, cex=2)

The values of the pch parameter range from 0 to 25, which means that
we can choose up to 26 different types of point shapes:

R Line
Line Graphs
A line graph has a line that connects all the points in a diagram.
To create a line, use the plot() function and add the type parameter with
a value of "l":
plot(1:10, type="l")

Line Color
The line color is black by default. To change the color, use the col
parameter:
Example
plot(1:10, type="l", col="blue")

Line Width
To change the width of the line, use the lwd parameter (1 is default,
while 0.5 means half as wide, and 2 means twice as wide):
plot(1:10, type="l", lwd=2)

Line Styles
The line is solid by default. Use the lty parameter with a value from 0
to 6 to specify the line format.
For example, lty=3 will display a dotted line instead of a solid line:
plot(1:10, type="l", lwd=5, lty=3)

Available parameter values for lty:
• 0 removes the line
• 1 displays a solid line
• 2 displays a dashed line
• 3 displays a dotted line
• 4 displays a "dot dashed" line
• 5 displays a "long dashed" line
• 6 displays a "two dashed" line

Multiple Lines
To display more than one line in a graph, use the plot() function together with
the lines() function:
line1 <- c(1,2,3,4,5,10)
line2 <- c(2,5,7,8,9,10)
plot(line1, type = "l", col = "blue")
lines(line2, type="l", col = "red")

R Scatter Plot
A "scatter plot" is a type of plot used to display the relationship
between two numerical variables, and plots one dot for each
observation.
It needs two vectors of same length, one for the x-axis (horizontal) and
one for the y-axis (vertical):
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
plot(x, y)

x <- c(5,7,8,7,2,2,9,4,11,12,9,6) # Car age
y <- c(99,86,87,88,111,103,87,94,78,77,85,86) # Car speed
plot(x, y, main="Observation of Cars", xlab="Car age", ylab="Car speed")
The plot shows the result of 12 cars passing by.

Inferences
 The x-axis shows how old the car is.
 The y-axis shows the speed of the car when it passes.
 Are there any relationships between the observations?
 It seems that the newer the car, the faster it drives, but that could be
a coincidence, after all we only registered 12 cars.
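
One way to put a number on the visual impression is the correlation coefficient, which these notes list among the basic statistics. The sketch below applies base R's cor() to the same 12 observations; using correlation here is an added illustration, not part of the original slide.

```r
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)              # car age
y <- c(99,86,87,88,111,103,87,94,78,77,85,86)  # car speed

# Pearson correlation between age and speed
r <- cor(x, y)
r
```

A negative value of r is consistent with newer cars driving faster, although with only 12 observations it remains weak evidence.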

Compare Plots
 There seems to be a relationship between the car speed and age, but what if we plot
the observations from another day as well? Will the scatter plot tell us something else?
 To compare the plot with another plot, use the points() function:
x1 <- c(5,7,8,7,2,2,9,4,11,12,9,6) # day one, the age and speed of 12 cars
y1 <- c(99,86,87,88,111,103,87,94,78,77,85,86)
x2 <- c(2,2,8,1,15,8,12,9,7,3,11,4,7,14,12) # day two, the age and speed of 15 cars
y2 <- c(100,105,84,105,90,99,90,95,94,100,79,112,91,80,85)
plot(x1, y1, main="Observation of Cars", xlab="Car age", ylab="Car speed", col="red",
cex=2)
points(x2, y2, col="blue", cex=2)

R Pie Charts
A pie chart is a circular graphical view of data.
Use the pie() function to draw pie charts:
Example
# Create a vector of pies
x <- c(10,20,30,40)
# Display the pie chart
pie(x)

• The pie chart draws one pie (slice) for each value in the vector (in this
case, 10, 20, 30, 40).
• By default, the plotting of the first pie starts from the x-axis and
moves counterclockwise.
• The size of each pie is determined by comparing the value with all
the other values, using the formula:
The value divided by the sum of all values: x/sum(x)

Start Angle
• Use the init.angle parameter to change the start angle of the pie chart.
• The value of init.angle is given in degrees, where the
default angle is 0.
# Create a vector of pies
x <- c(10,20,30,40)
# Display the pie chart and start the first pie at 90 degrees
pie(x, init.angle = 90)

Labels and Header
• Use the labels parameter (R also accepts the abbreviation label) to add
labels to the pie chart, and use the main parameter to add a header:
Example
# Create a vector of pies
x <- c(10,20,30,40)
# Create a vector of labels
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
# Display the pie chart with labels
pie(x, label = mylabel, main = "Fruits")

Colors
 You can add a color to each pie with the col parameter:
Example
# Create a vector of colors
colors <- c("blue", "yellow", "green", "black")
# Display the pie chart with colors
pie(x, label = mylabel, main = "Fruits", col = colors)

Legend
 To add a list of explanation for each pie, use the legend() function:
Example:
# Create a vector of labels
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
# Create a vector of colors
colors <- c("blue", "yellow", "green", "black")
# Display the pie chart with colors
pie(x, label = mylabel, main = "Pie Chart", col = colors)
# Display the explanation box
legend("bottomright", mylabel, fill = colors)

 The legend can be positioned as either:
bottomright, bottom, bottomleft, left, topleft, top, topright, right, center

R Bar Charts
• A bar chart uses rectangular bars to visualize data. Bar charts can be
displayed horizontally or vertically. The height or length of each bar is
proportional to the value it represents.
• Use the barplot() function to draw a vertical bar chart:
# x-axis values
x <- c("A", "B", "C", "D")
# y-axis values
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x)

In the above example:
 The x variable represents values in the x-axis (A,B,C,D)
 The y variable represents values in the y-axis (2,4,6,8)
 Then we use the barplot() function to create a bar chart of the values
 names.arg defines the names of each observation in the x-axis

Bar Color
 Use the col parameter to change the color of the bars:
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, col = "red")

Density / Bar Texture
 To change the bar texture, use the density parameter:
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, density = 10)

Bar Width
 Use the width parameter to change the width of the bars:
Example:
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, width = c(1,2,3,4))

Horizontal Bars
 If you want the bars to be displayed horizontally instead of vertically, use
horiz=TRUE:
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, horiz = TRUE)

Statistics Introduction
• Statistics is the science of collecting, analysing and drawing conclusions from data.
• Some basic statistical measures include:
• Mean, median and mode
• Minimum and maximum value
• Percentiles
• Variance and Standard Deviation
• Covariance and Correlation
• Probability distributions
 The R language was developed by two statisticians. It has many
built-in functionalities, in addition to libraries for the exact purpose
of statistical analysis.

R Data Set
 A data set is a collection of data, often presented in a table.
 There is a popular built-in data set in R called "mtcars" (Motor
Trend Car Road Tests), which is retrieved from the 1974 Motor
Trend US Magazine.
 Example:
# Print the mtcars data set
mtcars

Information About the Data Set
• Use the question mark (?) to get information about the mtcars data set:
# Use the question mark to get information about the data set
?mtcars
This opens the mtcars help page in R's local help viewer, for example at a
URL such as http://127.0.0.1:17504/library/datasets/html/mtcars.html
(the port number differs between sessions).

Get Information
 Use the dim() function to find the dimensions of the data set, and
the names() function to view the names of the variables:
# Create a variable of the mtcars data set for better organization
Data_Cars <- mtcars
# Use dim() to find the dimension of the data set
dim(Data_Cars)
# Use names() to find the names of the variables from the data set
names(Data_Cars)

Use the rownames() function to get the name of each row in the first
column, which is the name of each car:
Data_Cars <- mtcars
rownames(Data_Cars)

 From the examples above, we have found out that the data set
has 32 observations (Mazda RX4, Mazda RX4 Wag, Datsun 710,
etc) and 11 variables (mpg, cyl, disp, etc).
 A variable is defined as something that can be measured or counted.
 A brief explanation of the variables from the mtcars data set:
Variable Name | Description
mpg | Miles/(US) Gallon
cyl | Number of cylinders
disp | Displacement
hp | Gross horsepower
drat | Rear axle ratio
wt | Weight (1000 lbs)
qsec | 1/4 mile time
vs | Engine (0 = V-shaped, 1 = straight)
am | Transmission (0 = automatic, 1 = manual)
gear | Number of forward gears
carb | Number of carburetors

Print Variable Values
If you want to print all values that belong to a variable, access the data
frame by using the $ sign, and the name of the variable (for example
cyl (cylinders)):
Data_Cars <- mtcars
Data_Cars$cyl

Sort Variable Values
 To sort the values, use the sort() function:

Analyzing the Data
 Now that we have some information about the data set, we can start
to analyze it with some statistical numbers.
 For example, we can use the summary() function to get a statistical
summary of the data:
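
A minimal sketch of sorting one variable and summarizing the whole data set, using the built-in mtcars data. The variable names `sorted_cyl` and `car_summary` are assumptions.

```r
Data_Cars <- mtcars

# Sort the values of the cyl (cylinders) variable in ascending order
sorted_cyl <- sort(Data_Cars$cyl)
sorted_cyl

# Statistical summary of every variable in the data set
car_summary <- summary(Data_Cars)
car_summary
```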

 The summary() function returns six statistical numbers for each
variable:
• Min
• First quartile (25th percentile)
• Median
• Mean
• Third quartile (75th percentile)
• Max

Max and Min
 You learned from the R Math chapter that R has several built-in math functions.
For example, the min( ) and max( ) functions can be used to find the lowest or
highest value in a set:

 By observing the table, it looks like the largest hp value belongs to
a Maserati Bora, and the lowest belongs to a Honda Civic.
 However, it is much easier (and safer) to let R find out this for us.
 For example, we can use the which.max( ) and which.min( )
functions to find the index position of the max and min value in the
table:

Or even better, combine which.max( ) and which.min( ) with the
rownames( ) function to get the name of the car with the largest and
smallest horsepower:
The Maserati Bora is the car with the highest horsepower, and the
Honda Civic is the car with the lowest horsepower.
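
The steps described above can be sketched with the built-in mtcars data; the result variable names are assumptions.

```r
Data_Cars <- mtcars

# Highest and lowest horsepower
max_hp <- max(Data_Cars$hp)
min_hp <- min(Data_Cars$hp)

# Index positions of the highest and lowest horsepower
idx_max <- which.max(Data_Cars$hp)
idx_min <- which.min(Data_Cars$hp)

# Names of the cars at those positions
strongest <- rownames(Data_Cars)[idx_max]
weakest <- rownames(Data_Cars)[idx_min]
```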

Outliers
 Max and min can also be used to detect outliers. An outlier is a data
point that differs from rest of the observations.
 Example of data points that could have been outliers in the mtcars
data set:
• If maximum of forward gears of a car was 11
• If minimum of horsepower of a car was 0
• If maximum weight of a car was 50 000 lbs

R Mean
In statistics, there are often three values that interest us:
• Mean - The average value
• Median - The middle value
• Mode - The most common value

Mean
 To calculate the average value (mean) of a variable from the mtcars
data set, find the sum of all values, and divide the sum by the
number of values.
 Example: Sorted observation of wt. (weight)
 Find the average weight (wt) of a car:
1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465
2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215
3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570
3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424

Median
 The median value is the value in the middle, after you have sorted
all the values.
 Find the mid point value of weight (wt):

Mode
 The mode value is the value that appears the most number of times.
 R does not have a function to calculate the mode. However, we can
create our own function to find it.
1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465
2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215
3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570
3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424
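
Mean, median and mode of the wt variable can be computed as sketched below. The helper `mode_value` is a hypothetical function written for this illustration (base R has no statistical mode function); it returns the first most frequent value.

```r
Data_Cars <- mtcars

# Mean: sum of all values divided by the number of values
mean_wt <- mean(Data_Cars$wt)

# Median: middle value of the sorted observations
median_wt <- median(Data_Cars$wt)

# Mode: count each distinct value and pick the most frequent one
mode_value <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
mode_wt <- mode_value(Data_Cars$wt)
```

In the sorted list above, 3.440 appears three times, which is why it is the mode.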

R Percentiles
• A percentile is the value below which a given percentage of the
observations fall.
• If we take a look at the values of the wt (weight) variable from the
mtcars data set:
• What is the 75th percentile of the weight of the cars? The answer is
3.61, or 3,610 lbs, meaning that 75% of the cars weigh 3,610 lbs or
less:

• If you run the quantile() function without specifying the second (probs)
argument, you will get the percentiles of 0, 25, 50, 75 and 100:
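
Both calls can be sketched with the built-in mtcars data; the result variable names are assumptions.

```r
Data_Cars <- mtcars

# 75th percentile of the weight (in 1000 lbs)
p75 <- quantile(Data_Cars$wt, c(0.75))
p75

# Default call: the 0, 25, 50, 75 and 100 percentiles
all_quartiles <- quantile(Data_Cars$wt)
all_quartiles
```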

 Quartiles are data divided into four parts, when sorted in an
ascending order:
• The value of the first quartile cuts off the first 25% of the data
• The value of the second quartile cuts off the first 50% of the data
• The value of the third quartile cuts off the first 75% of the data
• The value of the fourth quartile cuts off the 100% of the data
