A. D. Patel Institute Of Technology
Data Mining And Business Intelligence (2170715): A. Y. 2019-20
Data Compression – Numerosity Reduction
Prepared By :
Dhruv V. Shah (160010116053)
B.E. (IT) Sem - VII
Guided By :
Prof. Ravi D. Patel
(Dept. of IT, ADIT)
Department Of Information Technology
A.D. Patel Institute Of Technology (ADIT)
New Vallabh Vidyanagar, Anand, Gujarat
Outline
 Introduction
 Data Reduction Strategies
 Numerosity Reduction
 Numerosity Reduction Methods
1) Parametric Methods
1.1) Regression
1.2) Log-Linear Model
2) Non-Parametric Methods
2.1) Histograms
2.2) Clustering
2.3) Sampling
2.4) Data Cube Aggregation.
 References
Introduction
 Why Do We Need Data Reduction?
 A database/data warehouse may store terabytes of data.
 Complex data analysis/mining may take a very long time to run on the complete data set.
 Data Reduction:
 Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data.
 That is, mining on the reduced data set should be more efficient yet produce the same
analytical results.
Data Reduction Strategies
 Data cube aggregation
 Attribute Subset Selection
 Numerosity reduction — e.g., fit data into models
 Dimensionality reduction — e.g., data compression
 Discretization and concept hierarchy generation
Numerosity Reduction
 What is Numerosity Reduction?
 These techniques replace the original data volume by alternative, smaller forms of data
representation.
 There are two categories of numerosity reduction methods:
1) Parametric
2) Non-Parametric
Numerosity Reduction Methods
1) Parametric Methods :
 A model is used to estimate the data, so that only the model parameters need to be stored,
not the actual data.
 These methods assume that the data fit some model, estimate the model parameters, and store
only those parameters.
 The Regression and Log-Linear methods are used for creating such models.
 Regression :
 Regression can be a simple linear regression or multiple linear regression.
 When there is only a single independent attribute, the model is called simple linear
regression; if there are multiple independent attributes, it is called multiple linear
regression.
 In linear regression, the data are modeled to fit a straight line.
Cont.…
 For example,
a random variable y can be modeled as a linear function of another random variable x with the
equation y = ax + b, where a and b (the regression coefficients) specify the slope and
y-intercept of the line, respectively.
 In multiple linear regression, y is modeled as a linear function of two or more
predictor (independent) variables.
 Log-Linear Model :
 Log-linear model can be used to estimate the probability of each data point in a
multidimensional space for a set of discretized attributes, based on a smaller subset of
dimensional combinations.
 This allows a higher-dimensional data space to be constructed from lower-dimensional
attributes.
 Regression and log-linear models can both be used on sparse data, although their application
may be limited.
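 As a minimal sketch of the idea (assuming the simplest possible log-linear model, full
independence), each cell probability of a 2-D discretized space is estimated from the 1-D
marginals alone, so only the marginals need to be stored. The table below is hypothetical.

    import numpy as np

    # Hypothetical 2-D contingency table over two discretized attributes.
    counts = np.array([[30.0, 10.0],
                       [20.0, 40.0]])
    n = counts.sum()

    # 1-D marginal probabilities: the "smaller subset of dimensional combinations".
    p_row = counts.sum(axis=1) / n
    p_col = counts.sum(axis=0) / n

    # Independence log-linear model: log p(i, j) = log p(i) + log p(j).
    p_hat = np.outer(p_row, p_col)
    print(p_hat * n)  # estimated cell counts reconstructed from the marginals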
2) Non-Parametric Methods :
 Do not assume a model for the data.
 Methods for storing reduced representations of the data include histograms, clustering,
sampling, and data cube aggregation.
Cont.…
1) Histograms :
 Divide the data into buckets and store the average (or sum) for each bucket.
 Partitioning rules:
1) Equal-width:
Equal bucket range
2) Equal-frequency (or equal-depth):
Each bucket holds roughly the same number of values; binning is used to approximate the
data distribution, as in the example below
 Binning Method :
 Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
 Smoothing by bin means:
Bin 1: 9, 9, 9, 9 (4 + 8 + 9 + 15)/4
Bin 2: 23, 23, 23, 23 (21+ 21+ 24 + 25)/4
Bin 3: 29, 29, 29, 29 (26 + 28 + 29 + 34)/4
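 The bin means above can be reproduced with a short Python sketch (equal-depth bins of
4 values each; note that the slide rounds the means of bins 2 and 3 to integers):

    prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
    depth = 4

    for i in range(0, len(prices), depth):
        bin_values = prices[i:i + depth]
        mean = sum(bin_values) / len(bin_values)
        # Smoothing by bin means: every value in the bin is replaced by the mean.
        print([mean] * len(bin_values))
    # Bin means: 9.0, 22.75, 29.25 (rounded to 9, 23, 29 on the slide)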
Cont.…
3) V-optimal:
Chooses the partitioning with the least histogram variance, where histogram variance is a
weighted sum of the original values that each bucket represents (bucket weight = number of
values in the bucket)
4) MaxDiff:
Considers the difference between each pair of adjacent values and places a bucket boundary
between each of the β−1 pairs with the largest differences, where β is the number of
buckets (see the sketch below)
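 A minimal sketch of the MaxDiff rule (assuming sorted input; the sample values are made up):

    def maxdiff_buckets(values, beta):
        # Differences between adjacent values, paired with their positions.
        diffs = [(values[i + 1] - values[i], i) for i in range(len(values) - 1)]
        # Positions of the beta - 1 largest differences become bucket boundaries.
        cuts = sorted(i for _, i in sorted(diffs, reverse=True)[:beta - 1])
        buckets, start = [], 0
        for c in cuts:
            buckets.append(values[start:c + 1])
            start = c + 1
        buckets.append(values[start:])
        return buckets

    print(maxdiff_buckets([1, 1, 4, 5, 5, 7, 9, 14, 16, 18, 27, 30], 3))
    # -> [[1, 1, 4, 5, 5, 7, 9], [14, 16, 18], [27, 30]]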
Cont….
 Multi-dimensional histogram
Fig. Histogram with Singleton buckets
Cont.…
Fig. Equal-width Histogram
 List of prices:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
2) Clustering :
 Clustering partitions the whole data set into groups/clusters.
 In data reduction, the cluster representation of the data is used to replace the actual data,
as in the sketch below.
 It also helps to detect outliers in the data.
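 A minimal sketch of clustering-based reduction (assuming scikit-learn is available; the
data are synthetic): the 300 original points are replaced by 3 centroids plus cluster sizes.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))        # hypothetical 2-D data set

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # Data reduction: keep only the centroids and cluster sizes, not all points.
    centroids = km.cluster_centers_
    sizes = np.bincount(km.labels_)
    print(centroids, sizes)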
Fig. Clustering (clusters C1, C2, C3)
3) Sampling :
 Sampling: obtaining a small sample s to represent the whole data set N
 Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of
the data
 Choose a representative subset of the data
 Simple random sampling may have very poor performance in the presence of skew
 Develop adaptive sampling methods
 Stratified sampling
 Approximate the percentage of each class (or subpopulation of interest) in the
overall database.
 Used in conjunction with skewed data.
 Sampling may not reduce database I/Os (page at a time).
Sampling Techniques :
 Simple Random Sample Without Replacement (SRSWOR)
 Simple Random Sample With Replacement (SRSWR)
 Cluster Sample
 Stratified Sample
Random Sample With or Without Replacement
Fig. SRSWOR & SRSWR (samples drawn from the raw data)
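 Both variants are one-liners with Python's standard library (population and sample size
are hypothetical):

    import random

    N = list(range(1, 101))   # the whole data set
    s = 10                    # sample size

    # SRSWOR: each tuple can be drawn at most once.
    srswor = random.sample(N, s)

    # SRSWR: a drawn tuple is "replaced", so it may be drawn again.
    srswr = random.choices(N, k=s)

    print(srswor, srswr)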
Cluster Sample
 Tuples are grouped into M mutually disjoint clusters
 An SRS of m clusters is taken, where m < M
 In practice, tuples in a database are retrieved a page at a time, so each page can be treated
as a cluster and SRSWOR applied to the pages (see the sketch below)
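 A minimal sketch of cluster sampling, treating fixed-size pages as the clusters (sizes are
assumptions for illustration):

    import random

    tuples = list(range(1000))            # the whole data set
    page_size = 100
    # Group tuples into M mutually disjoint "pages" (the clusters).
    pages = [tuples[i:i + page_size] for i in range(0, len(tuples), page_size)]

    # SRSWOR of m pages, m < M; every tuple in a chosen page enters the sample.
    m = 3
    sample = [t for page in random.sample(pages, m) for t in page]
    print(len(sample))   # 300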
Stratified Sample
 Data is divided into mutually disjoint parts called strata
 An SRS is drawn at each stratum
 This ensures representative samples even in the presence of skewed data (see the sketch below)
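 A minimal sketch of proportional stratified sampling (the strata and sampling rate are
assumptions for illustration):

    import random
    from collections import defaultdict

    # Hypothetical tuples of (customer_id, age_group); age_group is the stratum.
    data = [(i, random.choice(["young", "middle", "senior"])) for i in range(1000)]

    strata = defaultdict(list)
    for row in data:
        strata[row[1]].append(row)

    # SRS within each stratum at the same rate, so each subpopulation keeps
    # roughly its original share even if the data are skewed.
    rate = 0.1
    sample = []
    for group in strata.values():
        sample.extend(random.sample(group, max(1, int(len(group) * rate))))
    print(len(sample))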
Cluster and Stratified Sampling
Fig. Cluster & Stratified Sampling
Features of Sampling :
 Cost depends on the size of the sample.
 Sub-linear in the size of the data.
 Linear with respect to the number of dimensions.
 Estimates the answer to an aggregate query, as shown below.
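 For example, a SUM query can be estimated by scaling the sample mean up to the full
data-set size (the numbers here are synthetic):

    import random

    sales = [random.uniform(10, 500) for _ in range(100_000)]  # full data set
    sample = random.sample(sales, 1_000)

    # Scale the sample mean by N to estimate the total; the cost depends only
    # on the sample size, not on the size of the full data set.
    estimate = len(sales) * (sum(sample) / len(sample))
    print(estimate, sum(sales))   # estimate vs. exact answer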
4) Data Cube Aggregation :
 A data cube is generally used to make data easier to interpret. It is especially useful for
representing data along dimensions that correspond to measures of business interest.
 Every dimension of a cube represents a certain characteristic of the database.
 Data Cubes store multidimensional aggregated information.
 Data cubes provide fast access to precomputed, summarized data, thereby benefiting online
analytical processing (OLAP) as well as data mining.
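 A minimal sketch with pandas (assuming it is available; the table contents are
hypothetical): quarterly sales are aggregated up to yearly totals per location, and this
smaller, precomputed cuboid replaces the detailed data for year-level analysis.

    import pandas as pd

    # Hypothetical detailed data: one row per (year, quarter, location).
    df = pd.DataFrame({
        "year":     [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
        "quarter":  ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
        "location": ["Anand"] * 8,
        "sales":    [120, 150, 170, 200, 130, 160, 180, 210],
    })

    # Aggregate away the quarter dimension: a precomputed, smaller cuboid.
    yearly = df.groupby(["year", "location"], as_index=False)["sales"].sum()
    print(yearly)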
Categories of Data Cube :
 Dimensions:
 Represent categories of data such as time or location.
 Each dimension includes different levels of categories.
 Example :
Categories of Data Cube :
 Measures:
 These are the actual data values that occupy the cells as defined by the dimensions selected.
 Measures include facts or variables typically stored as numerical fields.
 Example :
References
 https://en.wikipedia.org/wiki/Data_cube
 https://www.geeksforgeeks.org/numerosity-reduction-in-data-mining/
 http://www.lastnightstudy.com/Show?id=44/Data-Reduction-In-Data-Mining