To perform basic calculations, descriptive statistics, frequency distribution and plotting various
graphs using R software
MADE BY
SHRIKRISHNA KESHARWANI
A Report On–
EXPERIMENT – 5
(Data Analysis Using R (Basic))
Submitted by-
SHRIKRISHNA KESHARWANI
Roll no.-
22CEM3R23
Subject-
TRANSPORTATION ANALYTICS LABORATORY
Bachelor of Technology
In
TRANSPORTATION ENGINEERING
DEPARTMENT OF CIVIL ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY WARANGAL
OCTOBER, 2022
Transportation Analytics Laboratory
SHRIKRISHNAKESHARWANI (22CEM3R23) 2
Table of Contents
i. Objective
ii. Software Used
iii. Concept and Theory
3.1 R language
3.2 Using R as a calculator
3.3 Descriptive Statistics
3.3.1 Mean
3.3.2 Median
3.3.3 Mode
3.3.4 Range
3.3.5 Class Interval
3.3.6 Standard Deviation
3.3.7 Variance
3.3.8 Skewness
3.3.9 Kurtosis
3.3.10 Histogram
3.4 Frequency distribution Curve
3.5 Ogive curve/ S curve/ Cumulative frequency curve
3.6 Graphical tools
3.6.1 Bar diagrams
3.6.2 Pie diagrams
3.6.3 Histogram
3.6.4 Kernel density
3.6.7 Stem and leaf plots
3.6.8 Boxplots
3.7 Quantiles
3.7.1 Quartiles
3.7.2 Deciles
3.7.3 Percentiles
iv. Procedure
v. Data Analysis
5.1 Basic calculations (codes)
5.2 Other functions (codes)
5.3 Missing data, Quantiles and descriptive statistics (codes)
5.4 Frequency distribution and cumulative frequency diagram
5.5 Graphics and plots
6. Results & Discussion
7. Conclusion
Reference
List of Figures-
Figure 1 positively and negatively skewed
Figure 2 Types of Kurtosis
Figure 3 Histogram
Figure 4 box plot
Figure 5 Cumulative frequency for male and females
Figure 6 qualification of persons
Figure 7 accident statistics
Figure 8 3D pie chart indicating qualification of person
Figure 9 histogram showing speed of vehicles
Figure 10 kernel density plot
Figure 11 box plot by gender
i. Objective-
To perform basic calculations, descriptive statistics, frequency distribution and plotting various
graphs using R software.
ii. Software Used-
iii. Concept and Theory
3.1 R language-
R is a language and environment for statistical computing and graphics. R provides a wide
variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series
analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible. The S
language is often the vehicle of choice for research in statistical methodology, and R provides
an Open Source route to participation in that activity.
3.2 Using R as a calculator
R can be used as a powerful calculator by entering equations directly at the prompt in the
command console. Simply type your arithmetic expression and press ENTER. R will evaluate
the expressions and respond with the result.
3.3 Descriptive Statistics.
3.3.1. Mean:
mean(x) gives the arithmetic mean of the data in the data vector x.
3.3.2. Median:
The median is the value that divides the ordered observations into two equal parts:
At least 50% of the values are greater than or equal to the median, and
at least 50% of the values are less than or equal to the median.
The median is a better average than the arithmetic mean when the data contain extreme observations.
3.3.3 Mode-
The mode is the value that occurs most frequently in a set of observations.
Distributions with one mode are called unimodal and those with two modes are called
bimodal.
3.3.4. Range:
The range is the difference between the highest and lowest values.
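The four measures above can be tried out with a minimal sketch. The vector `speeds` and the helper `stat_mode` are hypothetical names introduced only for illustration; base R has no built-in function for the statistical mode (its `mode()` function reports the storage type of an object), so the helper tabulates the values instead.

```r
# Hypothetical spot speeds (km/h); the last value is an extreme observation
speeds <- c(12, 15, 15, 18, 20, 22, 40)

mean(speeds)     # arithmetic mean, pulled upward by the extreme value 40
median(speeds)   # middle value of the sorted data, less affected by the extreme value

# Illustrative helper: tabulate the values and return the most frequent one(s)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[as.vector(counts) == max(counts)])
}
stat_mode(speeds)

range(speeds)        # returns the minimum and the maximum
diff(range(speeds))  # range as a single number: maximum minus minimum
```

Here the mean is larger than the median because of the single extreme value, which is the situation in which the median is the better average.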
3.3.5 Class Interval:
Class Interval = Range / (1 + 3.322 * log10(N))
Where N is the total number of data points (count).
3.3.6 Standard Deviation:
In statistics, the standard deviation is a measure of the amount of variation or dispersion of a
set of values. A low standard deviation indicates that the values tend to be close to the mean of
the set, while a high standard deviation indicates that the values are spread out over a wider
range.
3.3.7 Variance:
In statistics, variance is the expectation of the squared deviation of a random variable from its
mean. Informally, it measures how far a set of numbers is spread out from their average value.
3.3.8 Skewness:
Skewness is a measure of the degree of asymmetry of a frequency distribution. When the
distribution stretches further to the right than to the left, it is said to be right-skewed, or
positively skewed. When a distribution is right-skewed, the mean typically lies to the right of
the median, which in turn lies to the right of the mode. The opposite holds for a left-skewed
distribution.
I. Positively skewed (right skewed)
II. Negatively skewed (left skewed)
Figure 1 positively and negatively skewed
3.3.9 Kurtosis:
Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from
the tails of a normal distribution. In other words, kurtosis identifies whether the tails of a given
distribution contain extreme values.
i. Leptokurtic: A curve having a higher peak than the normal curve. There is too much
concentration of items near the center. (kurtosis value > 3)
ii. Platykurtic: A curve having a lower peak (flatter) than the normal curve. There is
less concentration of items near the center. (kurtosis value < 3)
iii. Mesokurtic: A curve having a normal peak, i.e. the normal curve. There is equal
distribution around the center value (mean). (kurtosis value = 3)
Figure 2 Types of Kurtosis
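The standard deviation and variance described in 3.3.6 and 3.3.7 can be checked by hand against R's built-in functions. The sketch below reuses the `defective` vector from the stem and leaf example in section 5.5; the names `deviations`, `var_manual` and `sd_manual` are introduced here only for illustration.

```r
# Data from the stem and leaf example in this report
defective <- c(46, 24, 53, 44, 18, 34, 65, 54, 66, 35, 48, 56, 73, 38, 49)

n <- length(defective)
deviations <- defective - mean(defective)

# Sample variance: sum of squared deviations divided by (n - 1)
var_manual <- sum(deviations^2) / (n - 1)
# Standard deviation: square root of the variance
sd_manual <- sqrt(var_manual)

# Built-in equivalents; var() and sd() also use the (n - 1) denominator
var(defective)
sd(defective)
```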
3.3.10 Histogram:
A histogram is a graphical representation of the distribution of data, which is an estimate of the
probability distribution of a continuous variable, usually in bar graph form. The shape of a
histogram describes how the values are distributed from low to high. Taller bars in the
histogram indicate that more data points are clustered in that class.
Figure 3 Histogram
3.4 Frequency distribution Curve:
Frequency distribution, in statistics, is a graph or data set organized to show the frequency of
occurrence of each possible outcome of a repeatable event observed many times.
3.5 Ogive curve/ S curve/ Cumulative frequency curve:
It is the representation of the cumulative frequencies for the classes in the frequency
distribution.
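A frequency distribution and its ogive can be built in a few lines of base R. The following is a minimal sketch, assuming the class width is taken from the class-interval formula of 3.3.5 with the constant 3.322 (that is, log2(10), as in Sturges' rule) and rounded up to a whole number; the rounding and the variable names are choices made for this illustration. The data are the `defective` values from section 5.5.

```r
# Data from the stem and leaf example in this report
defective <- c(46, 24, 53, 44, 18, 34, 65, 54, 66, 35, 48, 56, 73, 38, 49)

# Class width from Range / (1 + 3.322 * log10(N))
n <- length(defective)
width <- diff(range(defective)) / (1 + 3.322 * log10(n))
width <- ceiling(width)   # rounded up to a convenient whole number (a choice, not a rule)

# Class boundaries covering the whole data range
breaks <- seq(floor(min(defective)), max(defective) + width, by = width)

# Frequency distribution: number of observations in each class [lower, upper)
classes <- cut(defective, breaks = breaks, right = FALSE)
freq <- table(classes)

# Cumulative frequencies for the ogive
cum_freq <- cumsum(freq)
print(cbind(freq, cum_freq))

# Ogive: cumulative frequency plotted against the upper class boundary
plot(breaks[-1], cum_freq, type = "o",
     xlab = "Upper class boundary", ylab = "Cumulative frequency",
     main = "Ogive (sketch)")
```

The last cumulative frequency always equals the number of observations, which is a quick check that no value fell outside the class boundaries.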
3.6 Graphical tools
Graphics summarise the information contained in a data set. Their advantage is that they convey
the information hidden in the data more compactly.
Various types of plots:
3.6.1 Bar diagrams-
A bar diagram visualizes the relative or absolute frequencies of the observed values of a variable.
It consists of one bar for each category.
The width of the bars is immaterial.
3.6.2 Pie diagrams-
Pie charts visualize the absolute and relative frequencies.
3.6.3 Histogram-
A histogram groups the data into classes and plots one bar per class, with a height that
reflects the class frequency.
The data are continuous.
The area of each bar (= height x width) is proportional to the frequency or relative frequency.
3.6.4 Kernel density-
A kernel density plot is a smooth curve that represents the data distribution.
A kernel based on the normal distribution is called a "Gaussian kernel". This is the default
kernel in R.
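The relation between a histogram and a kernel density curve can be seen in a short sketch. The `speeds` values below are hypothetical and serve only as an illustration; `density()` is the base R function for kernel density estimation, and its `kernel` argument is written out here although "gaussian" is already the default.

```r
# Hypothetical vehicle speeds, used only for illustration
speeds <- c(10.2, 11.5, 12.8, 14.1, 14.6, 15.3, 16.0, 17.4, 19.9, 24.5)

# Kernel density estimate with the Gaussian kernel
dens <- density(speeds, kernel = "gaussian")

# Histogram on the density scale (bar areas sum to 1) with the smooth curve on top
hist(speeds, freq = FALSE, main = "Histogram with kernel density (sketch)",
     xlab = "Speed")
lines(dens, col = "red", lwd = 2)
```

Plotting the histogram with `freq = FALSE` puts both displays on the same vertical scale, so the curve and the bars can be compared directly.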
3.6.7 Stem and leaf plots
Stem and leaf plots show the absolute frequency in different classes, like a frequency
distribution table or histogram.
They are more suitable for small datasets.
A stem and leaf plot is a tabular presentation in which each data value is split into
a stem (the leading digit or digits) and a leaf (usually the last digit).
Example: "56" is split into stem "5" and leaf "6".
stem() produces a stem and leaf plot of the values in x. The parameter scale can be used
to expand the scale of the plot.
Usage: stem(x, scale = 1)
o scale: controls the plot length
3.6.8 Boxplots-
A boxplot is a graph that summarizes the distribution of a variable by using its median,
quartiles, minimum and maximum values.
Figure 4 box plot
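A box plot by group can be sketched as follows. The column names `Pspeed` and `Gen` follow the ones used in section 5.4, but the data frame `ped` and all of its values are hypothetical and are not the data analysed in this report.

```r
# Hypothetical pedestrian speeds with a 0/1 gender code; illustrative values only
ped <- data.frame(
  Pspeed = c(0.9, 1.0, 1.1, 1.2, 1.4, 0.8, 1.0, 1.05, 1.1, 1.3),
  Gen    = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
fivenum(ped$Pspeed)

# One box per group: the formula interface splits Pspeed by Gen
boxplot(Pspeed ~ Gen, data = ped,
        xlab = "Gender code", ylab = "Pedestrian speed",
        main = "Box plot by gender (sketch)")
```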
3.7 Quantiles-
Quantiles partition the data into proportions.
3.7.1 Quartiles: the values that divide the given data into four equal parts, say Q1, Q2, Q3, Q4
– Q1 25%
– Q2 50% (median)
– Q3 75%
– Q4 100%
3.7.2 Deciles: the values that divide the given data into 10 equal parts, say
D1, D2, ..., D10
– D1 10%
– D2 20%
– ..
– D10 100%
3.7.3 Percentiles: the values that divide the given data into 100 equal parts, say P1, P2, ..., P100
– P1 1%
– P2 2%
– ..
– P100 100%
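Quartiles, deciles and percentiles can all be obtained from the base R function `quantile()` by changing the `probs` argument. The sketch below uses the `defective` values from section 5.5; the object names are chosen for illustration, and the default interpolation method of `quantile()` is assumed.

```r
# Data from the stem and leaf example in this report
defective <- c(46, 24, 53, 44, 18, 34, 65, 54, 66, 35, 48, 56, 73, 38, 49)

quartiles   <- quantile(defective, probs = (1:4) / 4)      # Q1 to Q4
deciles     <- quantile(defective, probs = (1:10) / 10)    # D1 to D10
percentiles <- quantile(defective, probs = (1:100) / 100)  # P1 to P100

quartiles
deciles
```

Q2, D5 and P50 all coincide with the median, and Q4, D10 and P100 with the maximum.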
iv. Procedure:
The following is the procedure followed for the analysis:
i. The given data is imported into RStudio and attached using the code > attach(filename).
ii. Some of the data has been typed manually in RStudio.
iii. Basic calculations, descriptive statistics, frequency distributions and
various graphs are then produced using the corresponding codes.
v. Data Analysis:
5.1 Basic calculations (codes)-
Addition: > 2+3
[1] 5
Multiplication: > 2*3
[1] 6
Subtraction: > 2-3
[1] -1
Division: > 2/3
[1] 0.6666667
Cube (power): > 2^3 or 2**3
[1] 8
Square: > c(2,3,5,7)^2 # squares of 2, 3, 5, 7
[1]  4  9 25 49
> c(2,3,5,7)^c(2,3) # 2 squared, 3 cubed, 5 squared and 7 cubed (the shorter vector is recycled)
[1]   4  27  25 343
> c(2,3,5,7)*c(8,9) # 2x8, 3x9, 5x8, 7x9
[1] 16 27 40 63
> c(2,3,5,7)+c(8,9) # 2+8, 3+9, 5+8, 7+9
[1] 10 12 13 16
Maximum (max): max (values)
>max (1.2, 3.4,-7.8) or max(c (1.2, 3.4,-7.8))
[1] 3.4
Minimum (min): min (values)
>min (1.2, 3.4,-7.8)
[1] -7.8
5.2 Other functions. (Codes)
Absolute value – abs()
Square root – sqrt()
Rounding – round()
Sum and product – sum(), prod()
Exponential – exp()
Trigonometric functions – sin (), cos () etc.
Hyperbolic functions – sinh(), cosh() etc.
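Each of the functions listed above can be tried directly at the prompt. The inputs in this short sketch are arbitrary values chosen for illustration.

```r
abs(-7.8)            # absolute value: 7.8
sqrt(49)             # square root: 7
round(2/3, 2)        # rounding to 2 decimal places: 0.67
sum(c(2, 3, 5, 7))   # sum of the elements: 17
prod(c(2, 3, 5, 7))  # product of the elements: 210
exp(1)               # exponential function; exp(1) is Euler's number e
sin(pi / 2)          # trigonometric functions take angles in radians: 1
cosh(0)              # hyperbolic cosine: 1
```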
5.3 Missing data, Quantiles and descriptive statistics (codes)-
Table 1 DESCRIPTIVE STATISTICS
Variable   Mean       Median   Mode        Std. Dev
Pspeed     1.090429   1.08     1.02, 1.2   0.234301
Vspeed     15.31603   14.06    10.20       4.402192
Vgap       5.058886   4.52     6.48        2.373786
Wtime      8.394057   4.57     10.84       8.209393
> time.na=c(NA, 45, 83, 74, 55, 66)
> time.na
[1] NA 45 83 74 55 66
# For mean calculation-
> mean(time.na, na.rm=TRUE)
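The effect of `na.rm` is easiest to see by running the same function with and without it. This is a minimal sketch on the `time.na` vector defined above; the additional calls to `is.na()`, `median()` and `quantile()` are included only to show that the same argument applies to other summary functions.

```r
time.na <- c(NA, 45, 83, 74, 55, 66)

mean(time.na)                   # NA: a missing value propagates through the calculation
mean(time.na, na.rm = TRUE)     # 64.6: mean of the five observed values
sum(is.na(time.na))             # 1: number of missing values
median(time.na, na.rm = TRUE)   # 66: middle of the five observed values
quantile(time.na, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
```

Without `na.rm = TRUE`, `quantile()` stops with an error when the data contain missing values, whereas `mean()` and `median()` return NA.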
> library(plyr) # count(data, 'variable') with a freq column is the plyr version
> library(ggplot2)
> female=MYDATA[MYDATA$Gen==0,]
> View(male)
# for male data: frequency of each speed, then cumulative relative frequency
> mydata_male=count(male,'Pspeed')
> View(mydata_male)
> cumu1m=cumsum(mydata_male$freq)
> cumu1percm=cumu1m/nrow(male)
> mydata_male=cbind(mydata_male,cumu1percm)
> View(mydata_male)
# for female data
> View(female)
> mydata_female=count(female,'Pspeed')
> View(mydata_female)
> cumu1f=cumsum(mydata_female$freq)
> cumu1percf=cumu1f/nrow(female)
> mydata_female=cbind(mydata_female,cumu1percf)
> View(mydata_female)
# both curves in one plot, without a legend
> ggplot()+
geom_line(data = mydata_male,aes(x=Pspeed,y=cumu1percm),color="red")+
geom_line(data = mydata_female,aes(x=Pspeed,y=cumu1percf),color="blue")
# same plot with a legend, axis labels and a title
> lgd= scale_color_manual("legend",values = c(male="red", female="blue"))
> ggplot()+
geom_line(data = mydata_male,aes(x=Pspeed,y=cumu1percm,color="male"),size=1.3)+
geom_line(data = mydata_female,aes(x=Pspeed,y=cumu1percf,color="female"),size=1.3)+
lgd+xlab("Pedestrian speed")+ylab("cumulative frequency")+
ggtitle("CDF curves for male and female")
Figure 5 Cumulative frequency for male and females.
5.5 Graphics and plots-
#FOR BAR GRAPH PLOT
Example: qualification of 10 persons, coded as 1 for graduate (G) and 2 for non-graduate (N)
G N G N G G G N G G
1 2 1 2 1 1 1 2 1 1
> quali=c(1,2,1,2,1,1,1,2,1,1)
> quali
[1] 1 2 1 2 1 1 1 2 1 1
> barplot(quali)
> barplot(table(quali))
> barplot(table(quali)/length(quali))
#to give title to the graph
> barplot(table(quali),main = "qualification of persons")
#to further add category names, axis labels and colours to the graph
> barplot(table(quali),main = "qualification of persons",names.arg=c("graduate","non graduate"),
xlab="qualification",ylab = "no. of persons",col=c("blue","green"),ylim = c(0,10))
Figure 6 qualification of persons
#subdivided bar plot by using matrix command.
Example: data on the number of accidents at 3 locations during 10-11 am on 4
consecutive days.
No. of accidents   Location 1   Location 2   Location 3
Day 1              10           20           30
Day 2              26           53           40
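A subdivided (stacked) bar plot can be produced by passing a matrix to `barplot()`. The following is a minimal sketch that uses only the two rows shown above, so it is not the full four-day data set; the object name `accidents` and the choice of days as rows and locations as columns are assumptions made for this illustration.

```r
# Only the rows for Day 1 and Day 2 shown above; Days 3 and 4 are not included
accidents <- matrix(c(10, 20, 30,
                      26, 53, 40),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("Day 1", "Day 2"),
                                    c("Location 1", "Location 2", "Location 3")))

# With a matrix, barplot() draws one bar per column and stacks the rows inside it
barplot(accidents, legend.text = rownames(accidents),
        xlab = "Location", ylab = "No. of accidents",
        main = "Subdivided bar plot (sketch)")

colSums(accidents)   # total height of each bar
```

Passing `beside = TRUE` to `barplot()` would place the days side by side instead of stacking them.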
#FOR PIE CHARTS-
Example: qualification of 10 persons, coded as 1 for graduate (G) and 2 for non-graduate (N)
>pie(table(quali))
> table(quali)
quali
1 2
7 3
> pie(table(quali))
#3d pie chart
>install.packages("plotrix")
>library(plotrix)
> pie3D(table(quali))
> pie3D(table(quali),explode = 0.2, labels=c("graduate", "non-graduate"),main =
"qualification of person")
Figure 8 3D pie chart indicating qualification of person
Figure 10 kernel density plot
# Stem and leaf plot-
> defective=c(46,24,53,44,18,34,65,54,66,35,48,56,73,38,49)
> defective
[1] 46 24 53 44 18 34 65 54 66 35 48 56 73 38 49
> stem(defective,scale=1)
The decimal point is 1 digit(s) to the right of the |
0 | 8
2 | 4458
4 | 4689346
6 | 563
> stem(defective,scale=2)
The decimal point is 1 digit(s) to the right of the |
1 | 8
2 | 4
3 | 458
[1] 46 24 53 44 18 34 65 54 66 35 48 56 73 38 49
> skewness(defective)
[1] -0.1701834
> kurtosis(defective)
[1] 2.38949
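The functions skewness() and kurtosis() are not part of base R, and the library() call that provides them does not appear in the code shown above. The values printed are consistent with the moment-based definitions (as used, for example, in the moments package, which is an assumption here). The sketch below computes both measures by hand; `central_moment`, `skew` and `kurt` are names introduced only for this illustration.

```r
defective <- c(46, 24, 53, 44, 18, 34, 65, 54, 66, 35, 48, 56, 73, 38, 49)

# k-th central moment, with denominator n (not n - 1)
central_moment <- function(x, k) mean((x - mean(x))^k)

m2 <- central_moment(defective, 2)
m3 <- central_moment(defective, 3)
m4 <- central_moment(defective, 4)

skew <- m3 / m2^1.5   # negative value: the left tail is slightly longer
kurt <- m4 / m2^2     # compared with 3, the value for a normal distribution

skew
kurt
```

With a kurtosis below 3 and a small negative skewness, the `defective` data are platykurtic and slightly left-skewed in the terms of 3.3.8 and 3.3.9.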
6. Results & Discussion:
The cumulative frequency curves show that the average speed of female pedestrians is more
than that of male pedestrians.
The bar graph and pie chart show that the number of graduates is greater than the number of
non-graduates.
The subdivided bar plot for road accidents shows that there is a higher number of road accidents
at location 3.
7. Conclusion:
1. R programming is a very effective tool for statistics and data analysis for transport
engineers.
2. Various types of graphs can easily be plotted with the help of RStudio.
3. R is designed to handle large data sets, to be reproducible and to create
detailed visualizations.
4. Although R is a popular language used by many programmers, it is especially effective
when used for
Data analysis
Statistical inference
Machine learning algorithms.
Reference
1. https://web.cs.ucla.edu/~gulzar/rstudio/basic-tutorial.html.
2. https://cran.r-project.org/doc/contrib/usingR.pdf.