Introduction to Big Data & Basic Data Analysis 
Mohammad Reza Gerami 
gerami@aryatadbir.com 
mrgerami@aut.ac.ir
Big Data Everywhere!
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
– Social networks
How much data? 
•Google processes 20 PB a day (2008) 
•Wayback Machine has 3 PB + 100 TB/month (3/2009) 
•Facebook has 2.5 PB of user data + 15 TB/day (4/2009) 
•eBay has 6.5 PB of user data + 50 TB/day (5/2009) 
•CERN’s Large Hadron Collider (LHC) generates 15 PB a year
640K ought to be enough for anybody.
Maximilien Brice, © CERN
The Earthscope 
•The Earthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more. (http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ- -uI)
Types of Data
•Relational Data (Tables/Transaction/Legacy Data) 
•Text Data (Web) 
•Semi-structured Data (XML) 
•Graph Data 
–Social Network, Semantic Web (RDF), … 
•Streaming Data 
–You can only scan the data once
What to do with these data? 
•Aggregation and Statistics 
–Data warehouse and OLAP 
•Indexing, Searching, and Querying 
–Keyword based search 
–Pattern matching (XML/RDF) 
•Knowledge discovery 
–Data Mining 
–Statistical Modeling
Statistics 101
Random Sample and Statistics 
•Population: the set or universe of all entities under study.
•However, looking at the entire population may not be feasible, or may be too expensive.
•Instead, we draw a random sample from the population and compute appropriate statistics from the sample, which give estimates of the corresponding population parameters of interest.
Statistic 
•Let $S_i$ denote the random variable corresponding to data point $x_i$; then a statistic $\hat{\theta}$ is a function $\hat{\theta} : (S_1, S_2, \ldots, S_n) \to \mathbb{R}$.
•If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.
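Empirical Cumulative Distribution Function
$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le x)$
where $I(x_i \le x) = 1$ if $x_i \le x$ and $0$ otherwise.
Inverse Cumulative Distribution Function
$F^{-1}(q) = \min\{x \mid F(x) \ge q\}$; the sample version replaces $F$ with $\hat{F}$.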
Example
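The slide's own worked example did not survive extraction. As a stand-in, here is a minimal Python sketch of the empirical CDF and its inverse, using the standard definitions (fraction of points at or below x, and the smallest sample value whose empirical CDF reaches q). The sample values and the function names `ecdf` and `inverse_ecdf` are assumptions made for illustration.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return F_hat, where F_hat(x) is the fraction of sample points <= x."""
    xs = sorted(sample)
    n = len(xs)

    def f_hat(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(xs, x) / n

    return f_hat

def inverse_ecdf(sample):
    """Return F_hat_inv, where F_hat_inv(q) = min{x : F_hat(x) >= q}."""
    xs = sorted(sample)
    f_hat = ecdf(sample)

    def f_hat_inv(q):
        if not 0 < q <= 1:
            raise ValueError("q must lie in (0, 1]")
        # Scan sorted values and return the first one whose ECDF reaches q
        return next(x for x in xs if f_hat(x) >= q)

    return f_hat_inv

# Hypothetical sample, not taken from the slides
sample = [5, 1, 3, 3, 8]
F_hat = ecdf(sample)
F_hat_inv = inverse_ecdf(sample)

at_three = F_hat(3)        # three of five points are <= 3
median = F_hat_inv(0.5)    # smallest value with F_hat >= 0.5
```

The inverse is implemented as a direct scan so that it mirrors the definition; for large samples an index computation on the sorted list would be faster.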
Measures of Central Tendency (Mean)
Population Mean: $\mu = E[X]$
Sample Mean (Unbiased, not robust): $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$
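Measures of Central Tendency (Median)
Population Median: a value $m$ such that $P(X \le m) \ge \frac{1}{2}$ and $P(X \ge m) \ge \frac{1}{2}$
or, in terms of the cumulative distribution function $F$: $F(m) = 0.5$, i.e. $m = F^{-1}(0.5)$
Sample Median: $\hat{F}(m) = 0.5$, i.e. $m = \hat{F}^{-1}(0.5)$, where $\hat{F}$ is the empirical CDF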
Example
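The slide's own example is missing from the extracted text. The sketch below, on made-up numbers, illustrates why the sample mean is described as not robust while the median is: one extreme value moves the mean a long way and the median hardly at all. It uses Python's `statistics` module, whose `median` averages the two middle values for an even-sized sample, a slightly different convention from the inverse-CDF definition.

```python
import statistics

# Hypothetical sample, not taken from the slides
values = [2, 3, 4, 5, 6]
with_outlier = values + [1000]   # same sample plus one extreme value

mean_clean = statistics.mean(values)
median_clean = statistics.median(values)

mean_outlier = statistics.mean(with_outlier)      # pulled far upward
median_outlier = statistics.median(with_outlier)  # barely changes

print("mean:  ", mean_clean, "->", mean_outlier)
print("median:", median_clean, "->", median_outlier)
```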
Measures of Dispersion (Range)
Range: $r = \max\{X\} - \min\{X\}$
Not robust, sensitive to extreme values
Sample Range: $\hat{r} = \max_i\{x_i\} - \min_i\{x_i\}$
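Measures of Dispersion (Inter-Quartile Range)
Inter-Quartile Range (IQR): $IQR = F^{-1}(0.75) - F^{-1}(0.25)$, where $F^{-1}$ is the inverse cumulative distribution function
More robust
Sample IQR: $\widehat{IQR} = \hat{F}^{-1}(0.75) - \hat{F}^{-1}(0.25)$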
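Measures of Dispersion (Variance and Standard Deviation)
Standard Deviation: $\sigma = \sqrt{\mathrm{var}(X)}$
Variance: $\sigma^2 = \mathrm{var}(X) = E[(X - \mu)^2]$
Sample Variance & Standard Deviation: $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$ and $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$ (dividing by $n - 1$ instead of $n$ gives the unbiased estimator of the variance)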
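A short Python sketch of the three dispersion measures, on a made-up sample. The IQR is computed from the inverse empirical CDF (the smallest sample value whose cumulative fraction reaches q), which is an assumption about the convention intended here; library quantile functions often interpolate and can give slightly different quartiles. The helper name `inverse_ecdf` is hypothetical.

```python
import statistics

def inverse_ecdf(xs_sorted, q):
    """Smallest sample value whose cumulative fraction k/n reaches q."""
    n = len(xs_sorted)
    for k, x in enumerate(xs_sorted, start=1):
        if k / n >= q:
            return x
    raise ValueError("q must lie in (0, 1]")

# Hypothetical sample, not taken from the slides
sample = sorted([2, 4, 4, 4, 5, 5, 7, 9])

sample_range = max(sample) - min(sample)
iqr = inverse_ecdf(sample, 0.75) - inverse_ecdf(sample, 0.25)

var_n = statistics.pvariance(sample)         # divides by n
var_unbiased = statistics.variance(sample)   # divides by n - 1
std_n = statistics.pstdev(sample)            # square root of var_n

print(sample_range, iqr, var_n, var_unbiased, std_n)
```

Replacing the largest value with something extreme changes the range and variance sharply but leaves the IQR untouched, which is the sense in which the IQR is more robust.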
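Univariate Normal Distribution
$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}$
Multivariate Normal Distribution
$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right\}$, where $d$ is the number of dimensions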
OLAP and Data Mining
Warehouse Architecture
[Figure: layered diagram, top to bottom: Clients; Query & Analysis; Warehouse (with Metadata); Integration; Sources]
Star Schemas
•A star schema is a common organization for data at a warehouse. It consists of:
1. Fact table: a very large accumulation of facts such as sales. Often “insert-only.”
2. Dimension tables: smaller, generally static information about the entities involved in the facts.
Terms
• Fact table
• Dimension tables
• Measures
sale(orderId, date, custId, prodId, storeId, qty, amt)
customer(custId, name, address, city)
product(prodId, name, price)
store(storeId, city)
Star
customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
product
prodId  name  price
p1      bolt  10
p2      nut   5
store
storeId  city
c1       nyc
c2       sfo
c3       la
sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50
Cube
Fact table view:
sale
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8
Multi-dimensional cube (dimensions = 2):
      c1   c2   c3
p1    12        50
p2    11   8
3-D Cube
Fact table view:
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
Multi-dimensional cube (dimensions = 3):
day 1
      c1   c2   c3
p1    12        50
p2    11   8
day 2
      c1   c2   c3
p1    44   4
p2
ROLAP vs. MOLAP 
•ROLAP: Relational On-Line Analytical Processing 
•MOLAP: Multi-Dimensional On-Line Analytical Processing 
Aggregates
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
Result: 81
Aggregates
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
ans
date  sum
1     81
2     48
Another Example
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId
Result:
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48
• Rollup: moving from this result to the coarser by-day totals
• Drill-down: moving from the by-day totals to this finer result
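The three aggregate queries from these slides can be checked against the six-row sale table with a small, self-contained sketch using Python's built-in `sqlite3` module and an in-memory database. The column types and the `ORDER BY` clauses are assumptions added here so that the output order is deterministic; the last query also selects prodId so that each result row is labeled with its product.

```python
import sqlite3

# The six fact-table rows shown on the Aggregates slides
rows = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", rows)

# Total for day 1
day1_total = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Totals by day (the rolled-up view)
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Totals by day and product (the drilled-down view)
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

conn.close()
print(day1_total, by_day, by_day_product)
```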
Aggregates
•Operators: sum, count, max, min, median, avg
•“Having” clause
•Using dimension hierarchy
–average by region (within store)
–maximum by month (within date)
What is Data Mining? 
•Discovery of useful, possibly unexpected, patterns in data 
•Non-trivial extraction of implicit, previously unknown and potentially useful information from data 
•Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data Mining Tasks 
•Classification [Predictive] 
•Clustering [Descriptive] 
•Association Rule Discovery [Descriptive] 
•Sequential Pattern Discovery [Descriptive] 
•Regression [Predictive] 
•Deviation Detection [Predictive] 
•Collaborative Filtering [Predictive]
Classification: Definition
•Given a collection of records (training set)
–Each record contains a set of attributes; one of the attributes is the class.
•Find a model for the class attribute as a function of the values of the other attributes.
•Goal: previously unseen records should be assigned a class as accurately as possible.
–A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
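The train/test procedure described above can be sketched in a few lines of Python. Everything here is illustrative: the records are made up, and the one-nearest-neighbour rule is just a stand-in model chosen because it needs no library; the slides do not prescribe a particular classifier.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and hold out a fraction of them as the test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, features):
    """Assign the class of the closest training record (squared distance)."""
    nearest = min(
        train,
        key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], features)),
    )
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict_1nn(train, feats) == label for feats, label in test)
    return correct / len(test)

# Hypothetical records: ((age,), class). Not taken from the slides.
records = [((a,), "A") for a in (20, 22, 25, 27, 29)] + \
          [((a,), "B") for a in (50, 52, 55, 58, 60)]

train_set, test_set = train_test_split(records)
test_accuracy = accuracy(train_set, test_set)
print(len(train_set), len(test_set), test_accuracy)
```

The two classes are deliberately far apart, so the held-out records are all classified correctly; on real data the test accuracy is the estimate of how well the model handles unseen records.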
Decision Trees
Example:
• Conducted a survey to see which customers were interested in a new model car
• Want to select customers for an advertising campaign
Training set (sale):
custId  car     age  city  newCar
c1      taurus  27   sf    yes
c2      van     35   la    yes
c3      van     40   sf    yes
c4      taurus  22   sf    yes
c5      merc    50   la    no
c6      taurus  25   la    no
Clustering
[Figure: data points plotted against age, income, and education]
K-Means Clustering 
Association Rule Mining
Sales records (market-basket data):
tran1  cust33  p2, p5, p8
tran2  cust45  p5, p8, p11
tran3  cust12  p1, p9
tran4  cust40  p5, p8, p11
tran5  cust12  p2, p9
tran6  cust12  p9
• Trend: Products p5, p8 often bought together
• Trend: Customer 12 likes product p9
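Both trends can be recovered from the six transactions by simple counting. The sketch below counts how often each pair of products appears in the same basket and how often each customer buys each product; the minimum support of 3 is an assumed threshold picked for this tiny example, not a value from the slides.

```python
from collections import Counter
from itertools import combinations

# Market-basket data from the slide: transaction -> (customer, items)
transactions = {
    "tran1": ("cust33", {"p2", "p5", "p8"}),
    "tran2": ("cust45", {"p5", "p8", "p11"}),
    "tran3": ("cust12", {"p1", "p9"}),
    "tran4": ("cust40", {"p5", "p8", "p11"}),
    "tran5": ("cust12", {"p2", "p9"}),
    "tran6": ("cust12", {"p9"}),
}

MIN_SUPPORT = 3  # assumed threshold for "often"

pair_counts = Counter()
customer_item_counts = Counter()
for customer, items in transactions.values():
    # Sort so that each unordered pair is always counted under one key
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1
    for item in items:
        customer_item_counts[(customer, item)] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= MIN_SUPPORT}
loyal_purchases = {
    k: c for k, c in customer_item_counts.items() if c >= MIN_SUPPORT
}

print(frequent_pairs)    # pairs bought together at least MIN_SUPPORT times
print(loyal_purchases)   # (customer, product) combinations seen that often
```

Counting every pair works for a handful of transactions; real association-rule miners prune the candidate itemsets instead of enumerating them all.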
Association Rule Discovery 
•Marketing and Sales Promotion: 
–Let the rule discovered be 
{Bagels, … } --> {Potato Chips} 
–Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
–Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
–Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!
•Supermarket shelf management.
•Inventory Management
Collaborative Filtering 
•Goal: predict what movies/books/… a person may be interested in, on the basis of 
–Past preferences of the person 
–Other people with similar past preferences 
–The preferences of such people for a new movie/book/… 
•One approach based on repeated clustering 
–Cluster people on the basis of preferences for movies 
–Then cluster movies on the basis of being liked by the same clusters of people 
–Again cluster people based on their preferences for (the newly created clusters of) movies 
–Repeat above till equilibrium 
•Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest 
Other Types of Mining 
•Text mining: application of data mining to textual documents 
–cluster Web pages to find related pages 
–cluster pages a user has visited to organize their visit history 
–classify Web pages automatically into a Web directory 
•Graph Mining: 
–Deal with graph data 
Data Streams 
•What are Data Streams? 
–Continuous streams 
–Huge, Fast, and Changing 
•Why Data Streams? 
–The arriving speed of streams and the huge amount of data are beyond our capability to store them. 
–“Real-time” processing 
•Window Models 
–Landmark window (Entire Data Stream)
–Sliding Window 
–Damped Window 
•Mining Data Streams
A Simple Problem 
•Finding frequent items 
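–Given a sequence (x1, …, xN) where xi ∈ [1, m], and a real number θ between zero and one.
–Looking for the xi whose frequency > θ (that is, items occurring more than Nθ times)
–Naïve Algorithm (m counters)
•The number of frequent items ≤ 1/θ: if P items are frequent, they account for P × (Nθ) ≤ N occurrences
•Problem: N >> m >> 1/θ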
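The naïve algorithm keeps one counter per distinct value, so its memory grows with m. A minimal Python sketch, with a made-up stream and a hypothetical function name:

```python
from collections import Counter

def frequent_items_naive(stream, theta):
    """Return the items occurring more than theta * N times, using one counter per item."""
    counts = Counter()
    n = 0
    for x in stream:
        counts[x] += 1
        n += 1
    return {x for x, c in counts.items() if c > theta * n}

# Hypothetical stream with N = 10 and values in [1, 4]
stream = [1, 2, 1, 3, 1, 2, 1, 4, 1, 2]
theta = 0.25

frequent = frequent_items_naive(stream, theta)
print(frequent)   # items seen more than 2.5 times
```

The result can never hold more than 1/θ items, which is what makes algorithms with only about 1/θ counters plausible when m is much larger.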
KRP algorithm: Karp et al. (TODS’03)
[Figure: worked example with θ = 0.35, ⌈1/θ⌉ = 3, N = 30, m = 12]
N / ⌈1/θ⌉ ≤ Nθ
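The following Python sketch shows the counter-decrement idea behind this kind of algorithm; it is an illustration under stated assumptions, not a transcription of the paper's pseudocode. Whenever the number of counters reaches ⌈1/θ⌉, every counter is decremented and zeros are dropped. Each such round discards ⌈1/θ⌉ occurrences, so there are at most N / ⌈1/θ⌉ ≤ Nθ rounds, and any item occurring more than Nθ times must still hold a counter at the end. The survivors are only candidates, so a second pass with exact counts is needed. Function names and the stream are hypothetical.

```python
import math

def frequent_candidates(stream, theta):
    """One pass, fewer than ceil(1/theta) counters kept between items."""
    k = math.ceil(1 / theta)
    counters = {}
    for x in stream:
        counters[x] = counters.get(x, 0) + 1
        if len(counters) >= k:
            # Decrement round: removes one occurrence of k distinct items
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)

def verify(stream, candidates, theta):
    """Second pass: exact counts, but only for the candidate items."""
    counts = dict.fromkeys(candidates, 0)
    n = 0
    for x in stream:
        n += 1
        if x in counts:
            counts[x] += 1
    return {x for x, c in counts.items() if c > theta * n}

# Hypothetical stream (N = 10), using the slide's theta = 0.35
stream = [1, 2, 1, 3, 1, 2, 1, 4, 1, 2]
theta = 0.35

candidates = frequent_candidates(stream, theta)
frequent = verify(stream, candidates, theta)
print(candidates, frequent)
```

On this stream the first pass keeps a false positive alongside the truly frequent item, which is why the verification pass is part of the sketch.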
Streaming Sample Problem 
•Scan the dataset once 
•Sample K records 
–Each record has an equal probability of being sampled
–With N records in total, that probability is K/N
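One standard way to meet these requirements is reservoir sampling; the slide does not name a technique, so treat this as an assumed choice. The sketch keeps the first K records, then replaces a random slot with the i-th record with probability K/i, which leaves every record with probability K/N of being in the final sample even though N is not known in advance. The function name and the optional `rng` parameter are hypothetical.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k records chosen uniformly at random in a single pass."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, record in enumerate(stream, start=1):
        if i <= k:
            # Fill the reservoir with the first k records
            reservoir.append(record)
        else:
            # Keep the i-th record with probability k / i
            j = rng.randrange(i)   # uniform integer in [0, i)
            if j < k:
                reservoir[j] = record
    return reservoir

# Hypothetical stream of 100 records, sampled with a fixed seed
sample = reservoir_sample(range(100), 5, random.Random(0))
short = reservoir_sample(range(3), 5, random.Random(0))
print(sample, short)
```

If the stream holds fewer than K records, the sketch simply returns all of them.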
