DATA WAREHOUSING AND DATA MINING
KISHORE KUMAR M
UNIT-II
Contents:
 Association Rules:
 Problem Definition
 Frequent Item set Generation
 The APRIORI Principle, Support and Confidence Measures,
Association Rule Generation; APRIORI Algorithm,
 The Partition Algorithms, FP-Growth Algorithms,
 Compact Representation of Frequent Item set – Maximal
Frequent Item set, Closed Frequent Item set.
Mining Frequent Patterns
Frequent patterns are patterns (such as itemsets, subsequences, or substructures)
that appear in a data set frequently. For example, a set of items, such as milk and bread,
that appear frequently together in a transaction data set is a frequent itemset. A
subsequence, such as buying first a PC, then a digital camera, and then a memory card, if
it occurs frequently in a shopping history database, is a (frequent) sequential pattern.
Finding such frequent patterns plays an essential role in mining associations,
correlations, and many other interesting relationships among data.
Frequent pattern mining searches for recurring relationships in a given data set.
This unit presents the basic concepts of frequent pattern mining for the discovery of interesting
associations and correlations between itemsets in transactional and relational databases.
 Association rule mining: Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in transaction databases, relational
databases, and other information repositories.
"Body -> Head [support, confidence]".
 Applications:
Basket data analysis, cross-marketing, catalog design, loss-leader analysis,
clustering, classification, etc.
Example.
major(x, "CS") ^ takes(x, "DB") -> grade(x, "A") [1%, 75%]
 Market Basket Analysis (the earliest form of frequent pattern mining for
association rules):
A typical example of frequent itemset mining is market basket analysis. This
process analyzes customer buying habits by finding associations between the different
items that customers place in their "shopping baskets" (Figure). The discovery of such
associations can help retailers develop marketing strategies by gaining insight into
which items are frequently purchased together by customers. For instance, if customers
are buying milk, how likely are they to also buy bread (and what kind of bread) on the
same trip to the supermarket? Such information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space.
For example, the information that customers who purchase computers also tend to
buy antivirus software at the same time can be represented by the association rule
computer => antivirus_software [support = 2%, confidence = 60%].
Rule support and confidence are two measures of rule interestingness. A support
of 2% for Association Rule means that 2% of all the transactions under analysis show
that computer and antivirus software are purchased together. A confidence of 60%
means that 60% of the customers who purchased a computer also bought the software.
Fig. Market Basket Analysis
o Frequent Itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count, min sup.
o Strong Association Rules: From the frequent itemsets these rules must satisfy
minimum support and minimum confidence.
o Maximal Frequent Itemsets: A frequent itemset is a maximal frequent itemset
if none of its proper supersets is frequent.
o Closed Frequent Itemsets: The set of closed frequent itemsets contains complete
information regarding the frequent itemsets.
Every subset of a frequent itemset is itself frequent. We can reduce the
number of frequent itemsets (subsets) generated in the first step of frequent
itemset mining by using closed frequent itemsets and maximal frequent
itemsets. A closed itemset is a set that has no proper superset with the same
support count in the given dataset; it is a closed frequent itemset if, in
addition, it satisfies the minimum support count. A maximal frequent
itemset, also called a max-itemset, is an itemset that is frequent but has no
frequent proper super-itemset in the same dataset. A small sketch that
distinguishes the two representations is given below.
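Both compact representations can be checked directly from a table of frequent itemsets and their support counts. The following is a minimal Python sketch; the helper name maximal_and_closed and the toy support counts are illustrative assumptions.

def maximal_and_closed(freq):
    # freq: dict mapping frozenset itemsets to their support counts,
    # containing every frequent itemset.
    maximal, closed = set(), set()
    for x in freq:
        supersets = [y for y in freq if x < y]            # proper frequent supersets of x
        if not supersets:                                  # no frequent proper superset -> maximal
            maximal.add(x)
        if all(freq[y] != freq[x] for y in supersets):     # no superset with the same support -> closed
            closed.add(x)
    return maximal, closed

# Toy (hypothetical) frequent itemsets with their support counts:
freq = {frozenset("a"): 4, frozenset("b"): 3, frozenset("ab"): 3}
print(maximal_and_closed(freq))   # maximal: only {a, b}; closed: {a} and {a, b}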
 Market basket analysis is just one form of frequent pattern mining. In fact, there are many
kinds of frequent patterns, association rules, and correlation relationships. Frequent
pattern mining can be classified in various ways, based on the following criteria:
o Based on the completeness of patterns to be mined: As we discussed in the previous
subsection, we can mine the complete set of frequent itemsets, the closed frequent
itemsets, and the maximal frequent itemsets, given a minimum support threshold.
o Based on the levels of abstraction involved in the rule set: Some methods for
association rule mining can find rules at differing levels of abstraction.
o Based on the number of data dimensions involved in the rule: If the items or
attributes in an association rule reference only one dimension, then it is a single-
dimensional association rule. If a rule references two or more dimensions, such as the
dimensions age, income, and buys, then it is a multidimensional association rule.
o Based on the types of values handled in the rule: If a rule involves associations
between the presence or absence of items, it is a Boolean association rule. If a rule
describes associations between quantitative items or attributes, then it is a quantitative
association rule.
o Based on the kinds of rules to be mined: Frequent pattern analysis can generate
various kinds of rules and other interesting relationships. Association rules are the
most popular kind of rules generated from frequent patterns.
o Based on the kinds of patterns to be mined: Many kinds of frequent patterns can be
mined from different kinds of data sets. For this chapter, our focus is on frequent
itemset mining, that is, the mining of frequent itemsets (sets of items) from
transactional or relational data sets. However, other kinds of frequent patterns can be
found from other kinds of data sets.
Efficient and Scalable Frequent Itemset Mining Methods
This section covers the simplest form of frequent patterns—single-dimensional, single-level,
Boolean frequent itemsets, such as those discussed for market basket analysis: Apriori,
the basic algorithm for finding frequent itemsets; how to generate strong association rules
from frequent itemsets; and methods for mining frequent itemsets that, unlike Apriori, do
not involve the generation of "candidate" frequent itemsets.
 The Apriori Algorithm: Finding Frequent Item sets Using Candidate Generation
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for
mining frequent itemsets for Boolean association rules. The name of the algorithm is
based on the fact that the algorithm uses prior knowledge of frequent itemset
properties, as we shall see in what follows. Apriori employs an iterative approach known as
a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
A two-step process is followed, consisting of join and prune actions.
1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1
with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1;
they are joinable if their first (k-2) items are in common.
2. The prune step: Ck is a superset of Lk, that is, its members may or may not be
frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database
to determine the count of each candidate in Ck would result in the determination of Lk
(i.e., all candidates having a count no less than the minimum support count are
frequent by definition, and therefore belong to Lk). Ck, however, can be huge, and so
this could involve heavy computation. To reduce the size of Ck, the Apriori property is
used.
 Algorithm: Apriori. Find frequent itemsets using an iterative level-wise approach
based on candidate generation.
Input:
D, a database of transactions;
min sup, the minimum support count threshold.
Output: L, frequent itemsets in D.
Method:
o Join Step: Ck is generated by joining Lk-1with itself
o Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a
frequent k-itemset
o Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
 To Generate Candidates:
o Suppose the items in Lk-1 are listed in an order
o Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
o Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
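The join and prune steps above can be written directly in code. The following is a minimal Python sketch; the function name apriori_gen and the sorted-tuple representation of itemsets are illustrative assumptions.

def apriori_gen(prev_frequent):
    # prev_frequent: set of frequent (k-1)-itemsets, each a sorted tuple of items.
    frequent = set(prev_frequent)
    prev = sorted(prev_frequent)
    candidates = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join step: first k-2 items identical, last item of p smaller than last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # Prune step: every (k-1)-subset of the candidate must itself be frequent.
                subsets = (cand[:j] + cand[j + 1:] for j in range(len(cand)))
                if all(s in frequent for s in subsets):
                    candidates.add(cand)
    return candidates

# e.g. apriori_gen({("I1", "I2"), ("I1", "I3"), ("I2", "I3")}) yields {("I1", "I2", "I3")}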
Example Apriori. Let’s look at a concrete example, based on the AllElectronics
transaction database, D, of Table . There are nine transactions in this database, that is,
|D| = 9. We use the Apriori algorithm for finding frequent itemsets in D.
Figure Generation of candidate itemsets & frequent itemsets, where the minimum support count is 2.
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-
itemsets, C1. The algorithm simply scans all of the transactions in order to count the
number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set of
frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets
satisfying minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join to
generate a candidate set of 2-itemsets, C2. C2 consists of 2-itemsets. Note that no
candidates are removed from C2 during the prune step because each subset of the
candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset
in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-
itemsets in C2 having minimum support.
6. The set of candidate 3-itemsets, C3, is generated by joining L2 with itself. Based on
the Apriori property that all subsets of a frequent itemset must also be frequent, we can
determine that the four latter candidates cannot possibly be frequent. We therefore remove
them from C3.
7. The transactions in D are scanned in order to determine L3, consisting of those
candidate 3-itemsets in C3 having minimum support.
8. The algorithm joins L3 with itself to generate a candidate set of 4-itemsets, C4. Although
the join results in {I1, I2, I3, I5}, this itemset is pruned because its subset {I2, I3, I5} is not
frequent. Thus, C4 = ∅, and the algorithm terminates.
 Generating Association Rules from Frequent Itemsets
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them (where strong
association rules satisfy both minimum support and minimum confidence).
Confidence is a conditional probability expressed in terms of itemset support counts:
confidence(A => B) = support count(A U B) / support count(A), where support count(A U B)
is the number of transactions containing the itemset A U B, and support count(A) is the
number of transactions containing the itemset A.
Example: Generating association rules. Let’s try an example based on the
transactional data for AllElectronics shown in Table(above) . Suppose the data contain
the frequent itemset l = {I1, I2, I5}. What are the association rules that can be
generated from l? The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2},
and {I5}. The resulting association rules are as shown below, each listed with its
confidence:
If the minimum confidence threshold is, say, 70%, then only the second, third, and last
rules above are output, because these are the only ones generated that are strong.
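A small Python sketch of this rule-generation step is shown below. The function name generate_rules is an assumption; the 2- and 3-itemset support counts in the usage example are illustrative values chosen to be consistent with the worked example above (the 1-itemset counts match the list L given later for this database).

from itertools import combinations

def generate_rules(frequent_itemsets, support, min_conf):
    # frequent_itemsets: iterable of frozensets to generate rules from.
    # support: dict mapping frozensets (including all needed subsets) to support counts.
    for itemset in frequent_itemsets:
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = support[itemset] / support[ante]   # confidence = sup(l) / sup(antecedent)
                if conf >= min_conf:
                    yield ante, itemset - ante, conf

support = {frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I5"]): 2,
           frozenset(["I1", "I2"]): 4, frozenset(["I1", "I5"]): 2,
           frozenset(["I2", "I5"]): 2, frozenset(["I1", "I2", "I5"]): 2}
for a, c, conf in generate_rules([frozenset(["I1", "I2", "I5"])], support, 0.70):
    print(set(a), "->", set(c), f"{conf:.0%}")   # with these counts, exactly three rules reach 70%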
 Improving the Efficiency of Apriori
Many variations of the Apriori algorithm have been proposed that focus on improving
the efficiency of the original algorithm.
Hash-based technique(hashing itemsets into corresponding buckets): A hash-based
technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1. For
example, when scanning each transaction in the database to generate the frequent 1-
itemsets, L1, from the candidate 1-itemsets in C1, we can generate all of the 2-itemsets
for each transaction, hash (i.e., map) them into the different buckets of a hash table
structure, and increase the corresponding bucket counts.
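The bucket counts can thus be collected in the same pass that counts 1-itemsets; any 2-itemset whose bucket count is below the minimum support count cannot be frequent and can be dropped from C2. A minimal Python sketch follows; the function names and the choice of hash function are illustrative assumptions.

from collections import defaultdict
from itertools import combinations

def bucket_counts_for_pairs(transactions, num_buckets=7):
    # Hash every 2-itemset of every transaction into a bucket and count the bucket.
    buckets = defaultdict(int)
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % num_buckets] += 1
    return buckets

def may_be_frequent(pair, buckets, num_buckets, min_sup):
    # A pair whose bucket count is below min_sup cannot be frequent.
    return buckets[hash(tuple(sorted(pair))) % num_buckets] >= min_sup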
Transaction reduction (reducing the number of transactions scanned in future
iterations): A transaction that does not contain any frequent k-itemsets cannot contain
any frequent (k+1)-itemsets. Therefore, such a transaction can be marked or removed
from further consideration because subsequent scans of the database for j-itemsets,
where j > k, will not require it.
Partitioning (partitioning the data to find candidate itemsets): A partitioning
technique can be used that requires just two database scans to mine the frequent
itemsets. It consists of two phases. In Phase I, the algorithm subdivides the
transactions of D into n nonoverlapping partitions and finds the local frequent itemsets
in each partition. Any itemset that is frequent in D must be frequent in at least one
partition, so all local frequent itemsets together form the candidate
itemsets with respect to D. In Phase II, a second
scan of D is conducted in which the actual support of each candidate is assessed in
order to determine the global frequent itemsets.
Figure Mining by partitioning the data.
Sampling (mining on a subset of the given data): The basic idea of the sampling
approach is to pick a random sample S of the given data D, and then search for
frequent itemsets in S instead of D. The sample size of S is such that the search for
frequent itemsets in S can be done in main memory, and so only one scan of the
transactions in S is required overall. Because we are searching for frequent itemsets in
S rather than in D, it is possible that we will miss some of the global frequent itemsets.
Dynamic itemset counting (adding candidate itemsets at different points during a
scan): A dynamic itemset counting technique was proposed in which the database is
partitioned into blocks marked by start points. In this variation, new candidate itemsets
can be added at any start point, unlike in Apriori, which determines new candidate
itemsets only immediately before each complete database scan.
 Mining Frequent Itemsets without Candidate Generation
The Apriori candidate generate-and-test method significantly reduces the size of
candidate sets, leading to good performance gains. However, it can still suffer from two
nontrivial costs:
o It may need to generate a huge number of candidate sets.
o It may need to repeatedly scan the database and check a large set of
candidates by pattern matching.
 FP-growth :
An interesting method that avoids these costs is frequent-pattern growth, or simply
FP-growth, which adopts a divide-and-conquer strategy as follows. First, it
compresses the database representing frequent items into a frequent-pattern tree, or
FP-tree, which retains the itemset association information. It then divides the
compressed database into a set of conditional databases, each associated with one
frequent item or "pattern fragment," and mines each such database separately.
Steps:
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree
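A compact Python sketch of these three steps is given below. The class and function names and the toy transactions in the usage line are illustrative assumptions; it performs the two database scans and builds an FP-tree with a header table of node-links.

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item name, or None for the root
        self.count = 0            # number of transactions sharing this prefix
        self.parent = parent
        self.children = {}        # item -> child FPNode
        self.link = None          # next node carrying the same item (node-link)

def build_fp_tree(transactions, min_sup):
    # Scan 1: count item supports and keep only the frequent items.
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    support = {i: c for i, c in support.items() if c >= min_sup}
    # Global order: descending support count (ties broken by item name).
    rank = {item: r for r, item in enumerate(sorted(support, key=lambda i: (-support[i], i)))}
    root, header = FPNode(None, None), {item: None for item in rank}
    # Scan 2: insert each transaction's frequent items in the global order.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=lambda i: rank[i]):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                child.link, header[item] = header[item], child   # thread the node-link chain
            child.count += 1
            node = child
    return root, header

# Toy usage (hypothetical transactions):
tree, header = build_fp_tree([{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}], min_sup=2)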
 Mining Frequent Patterns Using FP-tree:
 General idea (divide-and-conquer)
o Recursively grow frequent pattern path using the FP-tree
 Method
o For each item, construct its conditional pattern-base, and then its
conditional FP-tree
o Repeat the process on each newly created conditional FP-tree
o Until the resulting FP-tree is empty, or it contains only one path (single
path will generate all the combinations of its sub-paths, each of which is a
frequent pattern)
 Major Steps to Mine FP-tree
1. Construct conditional pattern base for each node in the FP-tree
2. Construct conditional FP-tree from each conditional pattern-base
3. Recursively mine conditional FP-trees and grow frequent patterns obtained so far
If the conditional FP-tree contains a single path, simply enumerate all the patterns
Step 1: From FP-tree to Conditional Pattern Base
o Starting at the frequent header table in the FP-tree
o Traverse the FP-tree by following the link of each frequent item
o Accumulate all of transformed prefix paths of that item to form a conditional
pattern base
 Properties of FP-tree for Conditional Pattern Base Construction
 Node-link property
o For any frequent item ai, all the possible frequent patterns that contain ai can
be obtained by following ai's node-links, starting from ai's head in the FP-
tree header
 Prefix path property
o To calculate the frequent patterns for a node ai in a path P, only the prefix
sub-path of ai in P need to be accumulated, and its frequency count should
carry the same count as node ai.
Step 2: Construct Conditional FP-tree
 For each pattern-base
o Accumulate the count for each item in the base
o Construct the FP-tree for the frequent items of the pattern base
Step 3: Recursively mine the conditional FP-tree
Single FP-tree Path Generation
 Suppose an FP-tree T has a single path P
 The complete set of frequent pattern of T can be generated by enumeration of all
the combinations of the sub-paths of P
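When the conditional FP-tree is a single path, every combination of the items on that path is a frequent pattern, and its support is the count of the deepest node in the combination. A small Python sketch; the function name and the toy path are illustrative assumptions.

from itertools import combinations

def patterns_from_single_path(path):
    # path: list of (item, count) pairs along the single path, from the root downward.
    patterns = {}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = frozenset(item for item, _ in combo)
            patterns[items] = min(count for _, count in combo)   # support = smallest count in the combo
    return patterns

print(patterns_from_single_path([("I2", 4), ("I1", 2)]))
# {frozenset({'I2'}): 4, frozenset({'I1'}): 2, frozenset({'I1', 'I2'}): 2}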
 Principles of Frequent Pattern Growth
 Pattern growth property
o Let α be a frequent itemset in DB, B be α's conditional pattern base, and β
be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is
frequent in B.
 "abcdef" is a frequent pattern, if and only if
o "abcde" is a frequent pattern, and
o "f" is frequent in the set of transactions containing "abcde"
 Why Is Frequent Pattern Growth Fast?
 Our performance study shows
o FP-growth is an order of magnitude faster than Apriori, and is also faster
than tree-projection
 Reasoning
o No candidate generation, no candidate test
o Use compact data structure
 Example FP-growth (finding frequent itemsets without candidate generation). We re-
examine the mining of transaction database, D, of Table using the frequent pattern
growth approach.
The first scan of the database is the same as Apriori, which derives the set of frequent
items (1-itemsets) and their support counts (frequencies). Let the minimum support
count be 2. The set of frequent items is sorted in the order of descending support
count. This resulting set or list is denoted L. Thus, we have L = {{I2: 7}, {I1: 6}, {I3:
6}, {I4: 2}, {I5: 2}}.
An FP-tree is then constructed as follows. First, create the root of the tree, labeled
with "null." Scan database D a second time.
Figure An FP-tree registers compressed, frequent pattern information.
The FP-tree is mined as follows. Start from each frequent length-1 pattern, construct
its conditional pattern base, then construct its (conditional) FP-tree, and perform
mining recursively on such a tree. Mining of the FP-tree is summarized in Table.
The FP-growth method transforms the problem of finding long frequent patterns
to searching for shorter ones recursively and then concatenating the suffix. It uses the
least frequent items as a suffix, offering good selectivity. The method substantially
reduces the search costs.
 Mining Frequent Itemsets Using Vertical Data Format
Both the Apriori and FP-growth methods mine frequent patterns from a set of
transactions in TID-itemset format (that is, {TID : itemset}), where TID is a
transaction-id and itemset is the set of items bought in transaction TID. This data
format is known as horizontal data format. Alternatively, data can also be presented in
item-TID set format (that is, {item : TID_set}), where item is an item name, and TID set
is the set of transaction identifiers containing the item. This format is known as
vertical data format. Frequent itemsets can also be mined efficiently using vertical data
format, which is the essence of the ECLAT (Equivalence CLASS Transformation)
algorithm developed by Zaki.
Example Mining frequent itemsets using vertical data format. Consider the horizontal
data format of the transaction database, D, of Table.
This illustrates the process of mining frequent itemsets by exploring the vertical
data format. First, we transform the horizontally formatted data to the vertical format
by scanning the data set once. The support count of an itemset is simply the length of
the TID set of the itemset. Starting with k = 1, the frequent k-itemsets can be used to
construct the candidate (k+1)-itemsets based on the Apriori property. The computation
is done by intersection of the TID sets of the frequent k-itemsets to compute the TID
sets of the corresponding (k+1)-itemsets. This process repeats, with k incremented by
1 each time, until no frequent itemsets or no candidate itemsets can be found.
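A minimal Python sketch of this vertical-format mining loop is shown below. The function name eclat_like and the dictionary representation of D are illustrative assumptions; the support of an itemset is the size of its TID set, and (k+1)-itemset TID sets are obtained by intersecting the TID sets of frequent k-itemsets.

from collections import defaultdict
from itertools import combinations

def eclat_like(transactions, min_sup):
    # transactions: dict mapping TID -> set of items (horizontal format).
    # Transform to vertical format: item -> set of TIDs containing it.
    tidsets = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            tidsets[item].add(tid)
    freq = {frozenset([i]): tids for i, tids in tidsets.items() if len(tids) >= min_sup}
    result, k = dict(freq), 1
    while freq:
        nxt = {}
        # Build candidate (k+1)-itemsets from pairs of frequent k-itemsets and
        # count them by intersecting their TID sets.
        for (a, ta), (b, tb) in combinations(freq.items(), 2):
            cand = a | b
            if len(cand) == k + 1 and cand not in nxt:
                tids = ta & tb
                if len(tids) >= min_sup:
                    nxt[cand] = tids
        result.update(nxt)
        freq, k = nxt, k + 1
    return {itemset: len(tids) for itemset, tids in result.items()}

# Toy usage (hypothetical transactions):
print(eclat_like({"T100": {"I1", "I2"}, "T200": {"I2", "I3"}, "T300": {"I1", "I2", "I3"}}, min_sup=2))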
 Mining Closed Frequent Itemsets
A recommended methodology is to search for closed frequent itemsets directly during
the mining process. This requires us to prune the search space as soon as we can
identify the case of closed itemsets during mining. Pruning strategies include the
following:
Item merging: If every transaction containing a frequent itemset X also contains an
itemset Y but not any proper superset of Y, then X U Y forms a frequent closed itemset
and there is no need to search for any itemset containing X but not Y.
Sub-itemset pruning: If a frequent itemset X is a proper subset of an already found
frequent closed itemset Y and support count(X) = support count(Y), then X and all of
X’s descendants in the set enumeration tree cannot be frequent closed itemsets and
thus can be pruned.
Item skipping: In the depth-first mining of closed itemsets, at each level, there will be
a prefix itemset X associated with a header table and a projected database. If a local
frequent item p has the same support in several header tables at different levels, we
can safely prune p from the header tables at higher levels.
Mining Various Kinds of Association Rules
We have studied efficient methods for mining frequent itemsets and association
rules. In this section, we consider additional application requirements by extending our
scope to include mining multilevel association rules, multidimensional association rules,
and quantitative association rules in transactional and/or relational databases and data
warehouses.
 Mining Multilevel Association Rules
Data mining systems should provide capabilities for mining association rules at
multiple levels of abstraction, with sufficient flexibility for easy traversal among
different abstraction spaces.
Example: A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher level, more general concepts. Data can be generalized by replacing
low-level concepts within the data by their higher-level concepts, or ancestors, from a
concept hierarchy.
The concept hierarchy of Figure has five levels, respectively referred to as levels
0 to 4, starting with level 0 at the root node for all (the most general abstraction level).
Here, level 1 includes computer, software, printer&camera, and computer accessory,
level 2 includes laptop computer, desktop computer, office software, antivirus
software, . . . , and level 3 includes IBM desktop computer, . . . , Microsoft office
software, and so on. Level 4 is the most specific abstraction level of this hierarchy.
Association rules generated from mining data at multiple levels of abstraction are
called multiple-level or multilevel association rules. Multilevel association rules can
be mined efficiently using concept hierarchies under a support-confidence framework.
o Using uniform minimum support for all levels (referred to as uniform
support): The same minimum support threshold is used when mining at each
level of abstraction. For example, in Figure (below), a minimum support
threshold of 5% is used throughout (e.g., for mining from “computer” down
to “laptop computer”). Both “computer” and “laptop computer” are found to
be frequent, while “desktop computer” is not.
Figure Multilevel mining with uniform support
o Using reduced minimum support at lower levels (referred to as reduced
support): Each level of abstraction has its own minimum support threshold.
The deeper the level of abstraction, the smaller the corresponding threshold is.
For example, in Figure (below), the minimum support thresholds for levels 1
and 2 are 5% and 3%, respectively. In this way, “computer,” “laptop
computer,” and “desktop computer” are all considered frequent.
Figure Multilevel mining with reduced support.
o Using item or group-based minimum support (referred to as group-based
support): Because users or experts often have insight as to which groups are
more important than others, it is sometimes more desirable to set up user-
specific, item, or group-based minimum support thresholds when mining
multilevel rules. For example, a user could set up the minimum support
thresholds based on product price, or on items of interest, such as by setting
particularly low support thresholds for laptop computers and flash drives in
order to pay particular attention to the association patterns containing items in
these categories.
 Mining Multidimensional Association Rules from Relational Databases
and DataWarehouses:
For instance, in mining our AllElectronics database, we may discover the Boolean
association rule.
We treat each distinct predicate in a rule as a dimension. Hence, we can refer to Rule (1) as a
single dimensional or intra dimensional association rule because it contains a single
distinct predicate (e.g., buys) with multiple occurrences (i.e., the predicate occurs more
than once within the rule).
Additional relational information regarding the customers who purchased the
items, such as customer age, occupation, credit rating, income, and address, may also
be stored. Considering each database attribute or warehouse dimension as a predicate,
we can therefore mine association rules containing multiple predicates, such as
Association rules that involve two or more dimensions or predicates can be referred to
as multidimensional association rules. Multidimensional association rules with no
repeated predicates are called inter dimensional association rules.
Multidimensional association rules with repeated predicates, which contain
multiple occurrences of some predicates are called hybrid-dimensional association
rules.
Note: Database attributes can be categorical or quantitative. Categorical attributes
have a finite number of possible values, with no ordering among the values (e.g.,
occupation, brand, color). Categorical attributes are also called nominal attributes,
because their values are "names of things." Quantitative attributes are numeric and
have an implicit ordering among values (e.g., age, income, price).
 Based on the kind of attributes involved, techniques for mining multidimensional association
rules can be categorized into two approaches.
In the first approach, quantitative attributes are discretized using predefined
concept hierarchies. This discretization occurs before mining. For instance, a concept
hierarchy for income may be used to replace the original numeric values of this
attribute by interval labels, such as "0..20K", "21K..30K", "31K..40K", and so
on. Here, discretization is static and predetermined. This is known as mining
multidimensional association rules using static discretization of quantitative
attributes.
In the second approach, quantitative attributes are discretized or clustered into
“bins” based on the distribution of the data. These bins may be further combined
during the mining process. The discretization process is dynamic and established so as
to satisfy some mining criteria, such as maximizing the confidence of the rules mined.
Because this strategy treats the numeric attribute values as quantities rather than as
predefined ranges or categories, association rules mined from this approach are also
referred to as (dynamic) quantitative association rules.
Figure Lattice of cuboids, making up a 3-D data cube. Each cuboid represents a
different group-by. The base cuboid contains the three predicates age, income, and
buys.
Mining Quantitative Association Rules: Quantitative association rules are
multidimensional association rules in which the numeric attributes are dynamically
discretized during the mining process so as to satisfy some mining criteria, such as
maximizing the confidence or compactness of the rules mined.
To mine quantitative association rules having two quantitative attributes on the
left-hand side of the rule and one categorical attribute on the right-hand side of the
rule.
An example of such a two-dimensional quantitative association rule is
To find such rules, we can use a system called ARCS (Association Rule Clustering
System), which borrows ideas from image processing. Essentially, this approach maps
pairs of quantitative attributes onto a 2-D grid for tuples satisfying a given categorical
attribute condition. The grid is then searched for clusters of points from which the
association rules are generated. The following steps are involved in ARCS:
Binning: The partitioning process is referred to as binning, that is, the intervals
are considered "bins." Three common binning strategies are as follows:
Equal-width binning, where the interval size of each bin is the same.
Equal-frequency binning, where each bin has approximately the same number of
tuples assigned to it.
Clustering-based binning, where clustering is performed on the quantitative attribute
to group neighboring points into the same bin. A small sketch of the first two strategies follows.
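The following Python sketch illustrates equal-width and equal-frequency binning on a hypothetical age attribute; the values and the bin count are illustrative assumptions.

import numpy as np

ages = np.array([23, 25, 31, 34, 38, 41, 45, 52, 58, 63])   # hypothetical quantitative attribute
num_bins = 4

# Equal-width binning: every bin spans the same interval of the attribute range.
width_edges = np.linspace(ages.min(), ages.max(), num_bins + 1)
width_bins = np.digitize(ages, width_edges[1:-1])            # bin index per tuple

# Equal-frequency binning: every bin holds roughly the same number of tuples.
freq_edges = np.quantile(ages, [0.25, 0.50, 0.75])
freq_bins = np.digitize(ages, freq_edges)

print(width_bins, freq_bins)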
The same 2-D array can be used to generate rules for any value of the categorical
attribute, based on the same two quantitative attributes.
Finding frequent predicate sets: Once the 2-D array containing the count distribution
for each category is set up, it can be scanned to find the frequent predicate sets that
also satisfy minimum confidence. Strong association rules can then be generated from
these predicate sets.
Clustering the association rules: The strong association rules obtained in the
previous step are then mapped to a 2-D grid. Figure (below) shows a 2-D grid for 2-D
quantitative association rules predicting the condition buys(X, “HDTV”) on the rule
right-hand side, given the quantitative attributes age and income. The four Xs
correspond to the rules.
Figure A 2-D grid for tuples representing customers who purchase high-definition TVs.
ARCS employs a clustering algorithm for this purpose. The algorithm scans the grid,
searching for rectangular clusters of rules.
From Association Mining to Correlation Analysis
The support and confidence measures are insufficient at filtering out uninteresting
association rules. To tackle this weakness, a correlation measure can be used to augment
the support-confidence framework for association rules. This leads to correlation rules of
the form A => B [support, confidence, correlation].
That is, a correlation rule is measured not only by its support and confidence but also by
the correlation between itemsets A and B.
There are several measures that can be used for correlation analysis. They are as follows:
 Lift is a simple correlation measure that is given as follows. The occurrence of itemset
A is independent of the occurrence of itemset B if P(AUB) = P(A)P(B); otherwise,
itemsets A and B are dependent and correlated as events. The lift between the
occurrence of A and B can be measured by computing lift(A, B) = P(AUB) / (P(A)P(B)).
· If the result is less than 1, then the occurrence of A is negatively correlated
with the occurrence of B.
· If the result is greater than 1, then A and B are positively correlated, meaning
that the occurrence of one implies the occurrence of the other.
· If the result is equal to 1, then A and B are independent and there is no
correlation between them.
From the table, we can see that the probability of purchasing a computer game
is P(game) = 0.60, the probability of purchasing a video is P(video) = 0.75,
and the probability of purchasing both is P(game, video) = 0.40.
P(game, video) / (P(game) X P(video)) = 0.40 / (0.60 X 0.75) = 0.89. Because
this value is less than 1, there is a negative correlation between the occurrence
of {game} and {video}.
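The same arithmetic in a couple of lines of Python:

p_game, p_video, p_both = 0.60, 0.75, 0.40
lift = p_both / (p_game * p_video)            # lift(game, video)
print(round(lift, 2))                         # 0.89 < 1 -> negatively correlated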
 Correlation analysis using χ2.
To compute the correlation using χ2 analysis, we need the observed value and
expected value (displayed in parentheses) for each slot of the contingency table, as shown
in Table. From the table, we can compute the χ2 value as follows:
Because the χ2 value is greater than 1, and the observed value of the slot
(game, video) = 4,000 is less than the expected value 4,500, buying game and
buying video are negatively correlated.
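A short Python sketch of the χ2 computation, using a contingency table derived from the probabilities quoted above; a total of 10,000 transactions is assumed, so the individual counts are assumptions consistent with the text.

import numpy as np

#                  video   no video
obs = np.array([[4000,    2000],     # game
                [3500,     500]])    # no game

row = obs.sum(axis=1, keepdims=True)
col = obs.sum(axis=0, keepdims=True)
expected = row @ col / obs.sum()                 # expected counts under independence
chi2 = ((obs - expected) ** 2 / expected).sum()
print(expected[0, 0], round(chi2, 1))            # 4500.0 expected for (game, video); chi2 ≈ 555.6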
 ALL CONFIDENCE: In this method, given an itemset X = {i1, i2, …, ik}, the all
confidence of X is defined as all_conf(X) = sup(X) / max{sup(ij) : ij ∈ X}, that is, the
support of X divided by the largest single-item support among the items in X.
 COSINE: In this method, attribute relevance is measured by the cosine measure,
cosine(A, B) = P(A U B) / sqrt(P(A) P(B)).
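Both measures can be computed directly from support counts, as in this small Python sketch; the game/video counts reuse the assumed contingency table above.

def all_confidence(sup_x, item_sups):
    # all_conf(X) = sup(X) / max over items i in X of sup(i)
    return sup_x / max(item_sups)

def cosine(sup_ab, sup_a, sup_b):
    # cosine(A, B) = P(A U B) / sqrt(P(A) P(B)); the transaction total cancels,
    # so raw support counts can be used directly.
    return sup_ab / (sup_a * sup_b) ** 0.5

print(round(all_confidence(4000, [6000, 7500]), 3))   # 0.533
print(round(cosine(4000, 6000, 7500), 3))             # 0.596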
Comparison of four correlation measures on typical data sets
Constraint-Based Association Mining:
A data mining process may uncover thousands of rules from a given set of data,
most of which end up being unrelated or uninteresting to the users. To focus the mining
process on patterns that are interesting to the user, constraints are specified. This strategy
is known as constraint-based mining. The constraints can include the following:
 Knowledge type constraints: These specify the type of knowledge to be mined,
such as association or correlation.
 Data constraints: These specify the set of task-relevant data.
 Dimension/level constraints: These specify the desired dimensions (or attributes) of
the data, or levels of the concept hierarchies, to be used in mining.
 Interestingness constraints: These specify thresholds on statistical measures of rule
interestingness, such as support, confidence, and correlation.
 Rule constraints: These specify the form of rules to be mined. Such constraints may
be expressed as meta rules (rule templates), as the maximum or minimum number of
predicates that can occur in the rule antecedent or consequent, or as relationships
among attributes, attribute values, and/or aggregates.
The above constraints can be specified using a high-level declarative data mining
query language and user interface.
 Meta Rule-Guided Mining of Association Rules: Meta rules allow users to
specify the syntactic form of rules that they are interested in mining. The rule forms
can be used as constraints to help improve the efficiency of the mining process. Meta
rules may be based on the analyst’s experience, expectations, or intuition regarding the
data or may be automatically generated based on the database schema.
A meta rule can be used to specify this information describing the form of rules
you are interested in finding. An example of such a meta rule is
In general, a meta rule forms a hypothesis regarding the relationships that the user
is interested in probing or confirming.
A meta rule is a rule template of the form P1 ∧ P2 ∧ … ∧ Pl => Q1 ∧ Q2 ∧ … ∧ Qr,
where each Pi and Qj is either an instantiated predicate or a predicate variable.
 Constraint Pushing: Mining Guided by Rule Constraints: Rule constraints
specify expected set/subset relationships of the variables in the mined rules, constant
initiation of variables, and aggregate functions.
Our association mining query is to “Find the sales of which cheap items (where the
sum of the prices is less than $100) may promote the sales of which expensive items
(where the minimum price is $500) of the same group for Chicago customers in
2004.” This can be expressed in the DMQL data mining query language as follows,
Rule constraints can be classified into the following five categories with respect to
frequent itemset mining: (1) antimonotonic, (2) monotonic, (3) succinct, (4)
convertible, and (5) inconvertible
Constraints belonging to the first four of these categories can be used during frequent
itemset mining to guide the process, leading to more efficient and effective mining.
 A constraint Ca is anti-monotone iff, for any pattern S not satisfying Ca, none of
the super-patterns of S can satisfy Ca.
 sum(S.Price) ≤ v is anti-monotone
 sum(S.Price) ≥ v is not anti-monotone
 sum(S.Price) = v is partly anti-monotone
 A constraint Cm is monotone iff. for any pattern S satisfying Cm, every super-
pattern of S also satisfies it.
 Succinctness:
 For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
 Given A1, the set of items of size 1 satisfying C, then any set S satisfying C is
based on A1, i.e., S contains a subset belonging to A1
Example :
 sum(S.Price) ≥ v is not succinct
 min(S.Price) ≤ v is succinct
 Convertible Constraint
 Suppose all items in patterns are listed in a total order R
 A constraint C is convertible anti-monotone iff a pattern S satisfying the
constraint implies that each suffix of S w.r.t. R also satisfies C
 A constraint C is convertible monotone iff a pattern S satisfying the
constraint implies that each pattern of which S is a suffix w.r.t. R also
satisfies C
Example:
Let R be the value descending order over the set of items
o E.g. I = {9, 8, 6, 4, 3, 1}
avg(S) ≥ v is convertible monotone w.r.t. R
o If S is a suffix of S1, avg(S1) ≥ avg(S)
 {8, 4, 3} is a suffix of {9, 8, 4, 3}
 avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
o If S satisfies avg(S) ≥ v, so does S1
 {8, 4, 3} satisfies constraint avg(S) ≥ 4, so does {9, 8, 4, 3}

Dm unit ii r16

  • 1.
    DATA WAREHOUSING ANDDATA MINING DATA WAREHOUSING AND DATA MINING UNIT-II KISHORE KUMAR M
  • 2.
    UNIT-II kishore.mamidala@gmail.com 2 UNIT- II Contents:  AssociationRules:  Problem Definition  Frequent Item set Generation  The APRIORI Principle, Support and Confidence Measures, Association Rule Generation; APRIORI Algorithm,  The Partition Algorithms, FP-Growth Algorithms,  Compact Representation of Frequent Item set – Maximal Frequent Item set, Closed Frequent Item set. Mining Frequent Patterns Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear in a data set frequently. For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern. Finding such frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data. Frequent pattern mining searches for recurring relationships in a given data set. The basic concepts of frequent pattern mining for the discovery of interesting associations and correlations between itemsets in transactional and relational databases.  Association rule mining: Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. ―Body -> Head [support, confidence]‖.  Applications: Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc. Example. major(x, ―CS‖) ^ takes(x, ―DB‖) -> grade(x, ―A‖) [1%, 75%]
  • 3.
    UNIT-II kishore.mamidala@gmail.com 3  Market BasketAnalysis: ( an example of market basket analysis, the earliest form of frequent pattern mining for association rules) A typical example of frequent itemset mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their ―shopping baskets‖ (Figure). The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to increased sales by helping retailers do selective marketing and plan their shelf space. For example, the information that customers who purchase computers also tend to buy antivirus software at the same time is represented in Association Rule. Rule support and confidence are two measures of rule interestingness. A support of 2% for Association Rule means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together. A confidence of 60% means that 60% of the customers who purchased a computer also bought the software. Fig. Market Basket Analysis
  • 4.
    UNIT-II kishore.mamidala@gmail.com 4 o Frequent Itemsets:By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min sup. o Strong Association Rules: From the frequent itemsets these rules must satisfy minimum support and minimum confidence. o Maximal Frequent Itemsets: There is one frequent itemset as a maximal frequent itemset because it has a frequent super-set. o Closed Frequent Itemsets: The set of closed frequent itemsets contains complete information regarding the frequent itemsets. A frequent itemset contains all the subsets as frequent itemsets. We can reduce the number of frequent itemsets(subsets) generated in the first step of frequent itemset mining by using closed frequent itemsets and maximal frequent itemsets. The closed itemset can be defined as a set that has no proper superset with the same support count as the given datset. It refers to a closed frequent itemset if the itemset satisfies the least support count. For an itemset to be a maximal frequent itemset, which is also called Max-Itemset the set needs to be frequent, but it must not have frequent proper super-itemset in the same dataset.  Market basket analysis is just one form of frequent pattern mining. In fact, there are many kinds of frequent patterns, association rules, and correlation relationships. Frequent pattern mining can be classified in various ways, based on the following criteria: o Based on the completeness of patterns to be mined: As we discussed in the previous subsection, we can mine the complete set of frequent itemsets, the closed frequent itemsets, and the maximal frequent itemsets, given a minimum support threshold. o Based on the levels of abstraction involved in the rule set: Some methods for association rule mining can find rules at differing levels of abstraction. o Based on the number of data dimensions involved in the rule: If the items or attributes in an association rule reference only one dimension, then it is a single- dimensional association rule. If a rule references two or more dimensions, such as the dimensions age, income, and buys, then it is a multidimensional association rule. o Based on the types of values handled in the rule: If a rule involves associations between the presence or absence of items, it is a Boolean association rule. If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. o Based on the kinds of rules to be mined: Frequent pattern analysis can generate various kinds of rules and other interesting relationships. Association rules are the most popular kind of rules generated from frequent patterns. o Based on the kinds of patterns to be mined: Many kinds of frequent patterns can be mined from different kinds of data sets. For this chapter, our focus is on frequent
  • 5.
    UNIT-II kishore.mamidala@gmail.com 5 itemset mining, thatis, the mining of frequent itemsets (sets of items) from transactional or relational data sets. However, other kinds of frequent patterns can be found from other kinds of data sets. Efficient and Scalable Frequent Itemset Mining Methods It specifies the simplest form of frequent patterns—single-dimensional, single-level, Boolean frequent itemsets, such as those discussed for market basket analysis. Apriori, the basic algorithm for finding frequent itemsets, how to generate strong association rules from frequent itemsets. And methods for mining frequent itemsets that, unlike Apriori, do not involve the generation of ―candidate‖ frequent itemsets.  The Apriori Algorithm: Finding Frequent Item sets Using Candidate Generation Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties, as we shall see following. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. Apriori property: All nonempty subsets of a frequent itemset must also be frequent. A two-step process is followed, consisting of join and prune actions. 1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. 2. The prune step: Ck is a superset of Lk, that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (i.e., all candidates having a count no less than the minimum support count are frequent by definition, and therefore belong to Lk). Ck, however, can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used.  Algorithm: Apriori find frequent itemsets using an iterative level-wise approach based on candidate generation. Input: D, a database of transactions; min sup, the minimum support count threshold. Output: L, frequent itemsets in D. Method: o Join Step: Ck is generated by joining Lk-1with itself o Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset o Pseudo-code:
  • 6.
    UNIT-II kishore.mamidala@gmail.com 6 Ck: Candidate itemsetof size k Lk : frequent itemset of size k L1 = {frequent items}; for (k = 1; Lk !=; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end return k Lk;  To Generate Candidates: o Suppose the items in Lk-1 are listed in an order o Step 1: self-joining Lk-1 insert into Ck select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1 o Step 2: pruning forall itemsets c in Ck do forall (k-1)-subsets s of c do if (s is not in Lk-1) then delete c from Ck Example Apriori. Let’s look at a concrete example, based on the AllElectronics transaction database, D, of Table . There are nine transactions in this database, that is, |D| = 9. We use the Apriori algorithm for finding frequent itemsets in D.
  • 7.
    UNIT-II kishore.mamidala@gmail.com 7 Figure Generation ofcandidate itemsets & frequent itemsets, where the minimum support count is 2. 1. In the first iteration of the algorithm, each item is a member of the set of candidate 1- itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item. 2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. 3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join to generate a candidate set of 2-itemsets, C2. C2 consists of 2-itemsets. Note that no candidates are removed fromC2 during the prune step because each subset of the candidates is also frequent.
  • 8.
    UNIT-II kishore.mamidala@gmail.com 8 4. Next, thetransactions inDare scanned and the support count of each candidate itemset in C2 is accumulated. 5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2- itemsets in C2 having minimum support. 6. The generation of the set of candidate 3-itemsets, Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent.We therefore remove them fromC3. 7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support. 8. The algorithm uses to generate a candidate set of 4-itemsets, C4. Although the join results in {I1, I2, I3, I5}, this itemset is pruned because its subset {I2, I3,I5} is not frequent. Thus, C4 = NULL.  Generating Association Rules from Frequent Itemsets Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence). The conditional probability is expressed in terms of itemset support count, where support count(AUB) is the number of transactions containing the itemsets AUB, and support count(A) is the number of transactions containing the itemset. Example: Generating association rules. Let’s try an example based on the transactional data for AllElectronics shown in Table(above) . Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be generated from l? The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The resulting association rules are as shown below, each listed with its confidence: If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output, because these are the only ones generated that are strong.  Improving the Efficiency of Apriori Many variations of the Apriori algorithm have been proposed that focus on improving the efficiency of the original algorithm.
  • 9.
    UNIT-II kishore.mamidala@gmail.com 9 Hash-based technique(hashing itemsetsinto corresponding buckets): A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1. For example, when scanning each transaction in the database to generate the frequent 1- itemsets, L1, from the candidate 1-itemsets in C1, we can generate all of the 2-itemsets for each transaction, hash (i.e., map) them into the different buckets of a hash table structure, and increase the corresponding bucket counts. Transaction reduction (reducing the number of transactions scanned in future iterations): A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration because subsequent scans of the database for j-itemsets, where j > k, will not require it. Partitioning (partitioning the data to find candidate itemsets): A partitioning technique can be used that requires just two database scans to mine the frequent itemsets. It consists of two phases. In Phase I, the algorithm subdivides the transactions of D into n non overlapping partitions. Therefore, all local frequent itemsets are candidate itemsets with respect to D are identified. In Phase II, a second scan of D is conducted in which the actual support of each candidate is assessed in order to determine the global frequent itemsets. Figure Mining by partitioning the data. Sampling (mining on a subset of the given data): The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D. The sample size of S is such that the search for frequent itemsets in S can be done in main memory, and so only one scan of the transactions in S is required overall. Because we are searching for frequent itemsets in S rather than in D, it is possible that we will miss some of the global frequent itemsets. Dynamic itemset counting (adding candidate itemsets at different points during a scan): A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points. In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan.  Mining Frequent Itemsets without Candidate Generation The Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gain. It can suffer from two nontrivial costs:
  • 10.
    UNIT-II kishore.mamidala@gmail.com 10 o It mayneed to generate a huge number of candidate sets. o It may need to repeatedly scan the database and check a large set of candidates by pattern matching.  FP-growth : An interesting method in this attempt is called frequent-pattern growth, or simply FP-growth, which adopts a divide-and-conquer strategy as follows. First, it compresses the database representing frequent items into a frequent-pattern tree, or FP-tree,which retains the itemset association information. It then divides the compressed database into a set of conditional databases, each associated with one frequent item or ―pattern fragment,‖ and mines each such database separately. Steps: 1. Scan DB once, find frequent 1-itemset (single item pattern) 2. Order frequent items in frequency descending order 3. Scan DB again, construct FP-tree  Mining Frequent Patterns Using FP-tree:  General idea (divide-and-conquer) o Recursively grow frequent pattern path using the FP-tree  Method o For each item, construct its conditional pattern-base, and then its conditional FP-tree o Repeat the process on each newly created conditional FP-tree o Until the resulting FP-tree is empty, or it contains only one path (single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)  Major Steps to Mine FP-tree 1. Construct conditional pattern base for each node in the FP-tree 2. Construct conditional FP-tree from each conditional pattern-base 3. Recursively mine conditional FP-trees and grow frequent patterns obtained so far If the conditional FP-tree contains a single path, simply enumerate all the patterns Step 1: From FP-tree to Conditional Pattern Base o Starting at the frequent header table in the FP-tree o Traverse the FP-tree by following the link of each frequent item
  • 11.
    UNIT-II kishore.mamidala@gmail.com 11 o Accumulate allof transformed prefix paths of that item to form a conditional pattern base  Properties of FP-tree for Conditional Pattern Base Construction  Node-link property o For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP- tree header  Prefix path property o To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P need to be accumulated, and its frequency count should carry the same count as node ai. Step 2: Construct Conditional FP-tree  For each pattern-base o Accumulate the count for each item in the base o Construct the FP-tree for the frequent items of the pattern base Step 3: Recursively mine the conditional FP-tree Single FP-tree Path Generation  Suppose an FP-tree T has a single path P  The complete set of frequent pattern of T can be generated by enumeration of all the combinations of the sub-paths of P  Principles of Frequent Pattern Growth  Pattern growth property o Let  be a frequent itemset in DB, B be 's conditional pattern base, and  be an itemset in B. Then    is a frequent itemset in DB iff  is frequent in B.  ―abcdef ‖ is a frequent pattern, if and only if o ―abcde ‖ is a frequent pattern, and o ―f ‖ is frequent in the set of transactions containing ―abcde ‖  Why Is Frequent Pattern Growth Fast?  Our performance study shows
  • 12.
    UNIT-II kishore.mamidala@gmail.com 12 o FP-growth isan order of magnitude faster than Apriori, and is also faster than tree-projection  Reasoning o No candidate generation, no candidate test o Use compact data structure  Example FP-growth (finding frequent itemsets without candidate generation). We re- examine the mining of transaction database, D, of Table using the frequent pattern growth approach.
The first scan of the database is the same as in Apriori: it derives the set of frequent items (1-itemsets) and their support counts (frequencies). Let the minimum support count be 2. The set of frequent items is sorted in descending order of support count. The resulting list is denoted L. Thus, we have L = {I2: 7, I1: 6, I3: 6, I4: 2, I5: 2}.
An FP-tree is then constructed as follows. First, create the root of the tree, labeled "null." Then scan database D a second time; the items in each transaction are processed in L order (descending support count) and a branch is created or extended for each transaction, as sketched below.
Figure: An FP-tree registers compressed, frequent-pattern information.
The FP-tree is mined as follows. Starting from each frequent length-1 pattern, construct its conditional pattern base, then construct its (conditional) FP-tree, and perform mining recursively on that tree. The mining of the FP-tree is summarized in the Table.
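As a small illustration of the second scan, the snippet below shows how one transaction (a hypothetical one, not quoted from the Table) is filtered and reordered according to L before being inserted into the tree as a single branch.

# Descending-support order derived above: L = {I2:7, I1:6, I3:6, I4:2, I5:2}
L_order = ["I2", "I1", "I3", "I4", "I5"]
rank = {item: r for r, item in enumerate(L_order)}

transaction = {"I5", "I1", "I2"}          # hypothetical transaction for illustration
ordered = sorted((i for i in transaction if i in rank), key=lambda i: rank[i])
print(ordered)                            # ['I2', 'I1', 'I5'] -> inserted as one branch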
The FP-growth method transforms the problem of finding long frequent patterns into searching for shorter ones recursively and then concatenating the suffix. It uses the least frequent items as suffixes, offering good selectivity. The method substantially reduces the search costs.

• Mining Frequent Itemsets Using Vertical Data Format
Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions in TID-itemset format (that is, {TID : itemset}), where TID is a transaction-id and itemset is the set of items bought in transaction TID. This data format is known as horizontal data format. Alternatively, data can also be presented in item-TID_set format (that is, {item : TID_set}), where item is an item name and TID_set is the set of identifiers of the transactions containing the item. This format is known as vertical data format.
Frequent itemsets can also be mined efficiently using the vertical data format, which is the essence of the ECLAT (Equivalence CLASS Transformation) algorithm developed by Zaki.
Example: Mining frequent itemsets using the vertical data format. Consider the horizontal data format of the transaction database D of the Table. First, we transform the horizontally formatted data into the vertical format by scanning the data set once. The support count of an itemset is simply the length of its TID set. Starting with k = 1, the frequent k-itemsets can be used to construct the candidate (k+1)-itemsets based on the Apriori property. The computation is done by intersecting the TID sets of the frequent k-itemsets to compute the TID sets of the corresponding (k+1)-itemsets. This process repeats, with k incremented by 1 each time, until no frequent itemsets or no candidate itemsets can be found. A minimal sketch of this TID-set intersection idea is given below.
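The following Python sketch illustrates the vertical-format idea: convert horizontal transactions into item-to-TID-set lists, then grow itemsets by intersecting TID sets. The function name, toy data, and structure are illustrative assumptions, not the Table from the text.

from collections import defaultdict
from itertools import combinations

def eclat(transactions, min_sup):
    """Mine frequent itemsets via TID-set intersection (vertical data format)."""
    # One scan: horizontal -> vertical format {item: set of TIDs}.
    tidsets = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            tidsets[item].add(tid)

    # Frequent 1-itemsets: support count = size of the TID set.
    current = {frozenset([i]): tids for i, tids in tidsets.items() if len(tids) >= min_sup}
    frequent = dict(current)

    # Grow (k+1)-itemsets by intersecting TID sets of frequent k-itemsets.
    while current:
        nxt = {}
        for (a, ta), (b, tb) in combinations(current.items(), 2):
            union = a | b
            if len(union) == len(a) + 1:          # join only itemsets differing by one item
                tids = ta & tb
                if len(tids) >= min_sup:
                    nxt[union] = tids
        frequent.update(nxt)
        current = nxt
    return {tuple(sorted(k)): len(v) for k, v in frequent.items()}

D = {"T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"}, "T300": {"I2", "I3"}}   # toy data
print(eclat(D, min_sup=2))      # {('I2',): 3} with this toy data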
• Mining Closed Frequent Itemsets
A recommended methodology is to search for closed frequent itemsets directly during the mining process. This requires us to prune the search space as soon as we can identify closed itemsets during mining. Pruning strategies include the following (a small sketch of sub-itemset pruning follows this list):
Item merging: If every transaction containing a frequent itemset X also contains an itemset Y but not any proper superset of Y, then X ∪ Y forms a frequent closed itemset and there is no need to search for any itemset containing X but not Y.
Sub-itemset pruning: If a frequent itemset X is a proper subset of an already found frequent closed itemset Y and support_count(X) = support_count(Y), then X and all of X's descendants in the set enumeration tree cannot be frequent closed itemsets and thus can be pruned.
Item skipping: In the depth-first mining of closed itemsets, at each level there will be a prefix itemset X associated with a header table and a projected database. If a local frequent item p has the same support in several header tables at different levels, we can safely prune p from the header tables at the higher levels.

Mining Various Kinds of Association Rules
We have studied efficient methods for mining frequent itemsets and association rules. In this section, we consider additional application requirements by extending our scope to include mining multilevel association rules, multidimensional association rules, and quantitative association rules in transactional and/or relational databases and data warehouses.
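The snippet below is a minimal sketch of the sub-itemset pruning test, assuming the closed itemsets found so far are kept as (itemset, support count) pairs; the function name and sample values are hypothetical.

def can_prune(candidate, candidate_support, closed_found):
    """Sub-itemset pruning: candidate X is pruned if some already-found closed
    itemset Y is a proper superset of X with the same support count."""
    for closed_set, closed_support in closed_found:
        if candidate < closed_set and candidate_support == closed_support:
            return True          # X and its descendants cannot be closed
    return False

closed_found = [(frozenset({"I1", "I2"}), 4)]             # hypothetical closed itemset
print(can_prune(frozenset({"I1"}), 4, closed_found))      # True  -> prune {I1}'s branch
print(can_prune(frozenset({"I1"}), 6, closed_found))      # False -> keep exploring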
• Mining Multilevel Association Rules
Data mining systems should provide capabilities for mining association rules at multiple levels of abstraction, with sufficient flexibility for easy traversal among the different abstraction spaces.
Example: A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Data can be generalized by replacing low-level concepts within the data by their higher-level concepts, or ancestors, from a concept hierarchy. The concept hierarchy of the Figure has five levels, referred to as levels 0 to 4, starting with level 0 at the root node for all (the most general abstraction level). Here, level 1 includes computer, software, printer&camera, and computer accessory; level 2 includes laptop computer, desktop computer, office software, antivirus software, . . . ; and level 3 includes IBM desktop computer, . . . , Microsoft office software, and so on. Level 4 is the most specific abstraction level of this hierarchy.
Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
o Using uniform minimum support for all levels (referred to as uniform support): The same minimum support threshold is used when mining at each level of abstraction. For example, in the Figure (below), a minimum support
threshold of 5% is used throughout (e.g., for mining from "computer" down to "laptop computer"). Both "computer" and "laptop computer" are found to be frequent, while "desktop computer" is not.
Figure: Multilevel mining with uniform support.
o Using reduced minimum support at lower levels (referred to as reduced support): Each level of abstraction has its own minimum support threshold. The deeper the level of abstraction, the smaller the corresponding threshold. For example, in the Figure (below), the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively. In this way, "computer," "laptop computer," and "desktop computer" are all considered frequent.
Figure: Multilevel mining with reduced support.
o Using item or group-based minimum support (referred to as group-based support): Because users or experts often have insight as to which groups are more important than others, it is sometimes more desirable to set up user-specific, item-based, or group-based minimum support thresholds when mining multilevel rules. For example, a user could set the minimum support thresholds based on product price, or on items of interest, such as setting particularly low support thresholds for laptop computers and flash drives in order to pay particular attention to the association patterns containing items in these categories.
A small sketch contrasting uniform and reduced support is given below.
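As a rough illustration of the uniform versus reduced support strategies, the sketch below applies a different minimum support per level of the concept hierarchy. The support values are hypothetical, chosen only to match the narrative above (the actual figures are in the original document's diagrams).

# Hypothetical relative supports, chosen to match the narrative above.
support = {"computer": 0.10, "laptop computer": 0.06, "desktop computer": 0.04}
level = {"computer": 1, "laptop computer": 2, "desktop computer": 2}

def frequent_items(min_sup_by_level):
    """Return the items whose support meets the threshold of their own level."""
    return [i for i, s in support.items() if s >= min_sup_by_level[level[i]]]

print(frequent_items({1: 0.05, 2: 0.05}))   # uniform 5%: desktop computer is dropped
print(frequent_items({1: 0.05, 2: 0.03}))   # reduced 5%/3%: all three are frequent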
• Mining Multidimensional Association Rules from Relational Databases and Data Warehouses:
For instance, in mining our AllElectronics database, we may discover a Boolean association rule, referred to here as Rule (1). Each distinct predicate in a rule is a dimension. Hence, we can refer to Rule (1) as a single-dimensional or intra-dimensional association rule because it contains a single distinct predicate (e.g., buys) with multiple occurrences (i.e., the predicate occurs more than once within the rule).
Additional relational information regarding the customers who purchased the items, such as customer age, occupation, credit rating, income, and address, may also be stored. Considering each database attribute or warehouse dimension as a predicate, we can therefore mine association rules containing multiple predicates. Association rules that involve two or more dimensions or predicates are referred to as multidimensional association rules. Multidimensional association rules with no repeated predicates are called inter-dimensional association rules. Multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicate, are called hybrid-dimensional association rules.
Note: Database attributes can be categorical or quantitative. Categorical attributes have a finite number of possible values, with no ordering among the values (e.g., occupation, brand, color). Categorical attributes are also called nominal attributes, because their values are "names of things." Quantitative attributes are numeric and have an implicit ordering among values (e.g., age, income, price).
• Based on the kind of attributes, techniques for mining multidimensional association rules can be categorized into two approaches.
In the first approach, quantitative attributes are discretized using predefined concept hierarchies. This discretization occurs before mining. For instance, a concept hierarchy for income may be used to replace the original numeric values of this attribute by interval labels, such as "0..20K", "21K..30K", "31K..40K", and so on. Here, discretization is static and predetermined. This is known as mining multidimensional association rules using static discretization of quantitative attributes.
In the second approach, quantitative attributes are discretized or clustered into "bins" based on the distribution of the data. These bins may be further combined during the mining process. The discretization is dynamic and established so as to satisfy some mining criterion, such as maximizing the confidence of the rules mined. Because this strategy treats the numeric attribute values as quantities rather than as predefined ranges or categories, association rules mined from this approach are also referred to as (dynamic) quantitative association rules. A small static-discretization sketch is given below.
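Here is a minimal sketch of static discretization, mapping raw income values to the predefined interval labels mentioned above. The bin edges, the catch-all label, and the sample tuples are illustrative assumptions.

def discretize_income(income):
    """Replace a numeric income by a predefined interval label (static discretization)."""
    bins = [(0, 20_000, "0..20K"), (20_001, 30_000, "21K..30K"), (30_001, 40_000, "31K..40K")]
    for lo, hi, label in bins:
        if lo <= income <= hi:
            return label
    return ">40K"                      # catch-all label for values beyond the hierarchy

# Hypothetical customer tuples: (age, income, buys)
tuples = [(23, 18_000, "laptop"), (35, 32_000, "HDTV")]
generalized = [(age, discretize_income(inc), item) for age, inc, item in tuples]
print(generalized)     # [(23, '0..20K', 'laptop'), (35, '31K..40K', 'HDTV')]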
Figure: Lattice of cuboids making up a 3-D data cube. Each cuboid represents a different group-by. The base cuboid contains the three predicates age, income, and buys.

Mining Quantitative Association Rules:
Quantitative association rules are multidimensional association rules in which the numeric attributes are dynamically discretized during the mining process so as to satisfy some mining criteria, such as maximizing the confidence or compactness of the rules mined. Here we focus on mining quantitative association rules having two quantitative attributes on the left-hand side of the rule and one categorical attribute on the right-hand side.
To find such rules, we can use a system called ARCS (Association Rule Clustering System), which borrows ideas from image processing. Essentially, this approach maps pairs of quantitative attributes onto a 2-D grid for tuples satisfying a given categorical attribute condition. The grid is then searched for clusters of points, from which the association rules are generated. The following steps are involved in ARCS:
Binning: The partitioning process is referred to as binning; the resulting intervals are considered "bins." Three common binning strategies are as follows (a short sketch of the first two follows this list):
Equal-width binning, where the interval size of each bin is the same.
Equal-frequency binning, where each bin has approximately the same number of tuples assigned to it.
Clustering-based binning, where clustering is performed on the quantitative attribute to group neighboring points into the same bin.
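A minimal sketch contrasting equal-width and equal-frequency binning on a toy list of ages; the data, bin count, and function names are illustrative assumptions.

def equal_width_bins(values, k):
    """Split the value range into k intervals of equal size."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

def equal_frequency_bins(values, k):
    """Split the sorted values into k groups with (roughly) the same number of tuples."""
    s = sorted(values)
    size = len(s) // k
    return [s[i * size:(i + 1) * size] if i < k - 1 else s[i * size:] for i in range(k)]

ages = [21, 22, 23, 25, 30, 31, 45, 50, 60]    # toy quantitative attribute
print(equal_width_bins(ages, 3))                # three intervals of equal width over [21, 60]
print(equal_frequency_bins(ages, 3))            # three groups of three tuples each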
The same 2-D array can be used to generate rules for any value of the categorical attribute, based on the same two quantitative attributes.
Finding frequent predicate sets: Once the 2-D array containing the count distribution for each category is set up, it can be scanned to find the frequent predicate sets (those satisfying minimum support). Strong association rules satisfying minimum confidence can then be generated from these predicate sets.
Clustering the association rules: The strong association rules obtained in the previous step are then mapped to a 2-D grid. The Figure (below) shows a 2-D grid for 2-D quantitative association rules predicting the condition buys(X, "HDTV") on the rule right-hand side, given the quantitative attributes age and income. The four Xs correspond to the rules.
Figure: A 2-D grid for tuples representing customers who purchase high-definition TVs.
ARCS employs a clustering algorithm for this purpose. The algorithm scans the grid, searching for rectangular clusters of rules.

From Association Mining to Correlation Analysis
The support and confidence measures are insufficient for filtering out uninteresting association rules. To tackle this weakness, a correlation measure can be used to augment the support-confidence framework for association rules. This leads to correlation rules of the form A -> B [support, confidence, correlation]. That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B.
There are several methods for correlation analysis. They are as follows:
• Lift is a simple correlation measure. The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. The lift between the occurrences of A and B can be measured by computing
lift(A, B) = P(A ∪ B) / (P(A) P(B)).
· If the result is less than 1, then the occurrence of A is negatively correlated with the occurrence of B.
· If the result is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.
· If the result is equal to 1, then A and B are independent and there is no correlation between them.
From the table, we can see that the probability of purchasing a computer game is P(game) = 0.60, the probability of purchasing a video is P(video) = 0.75, and the probability of purchasing both is P(game, video) = 0.40. Then
P(game, video) / (P(game) × P(video)) = 0.40 / (0.60 × 0.75) = 0.89.
Because this value is less than 1, there is a negative correlation between the occurrence of {game} and {video}. A short sketch reproducing this lift value and the χ² value below follows this passage.
• Correlation analysis using χ²: To compute the correlation using χ² analysis, we need the observed value and the expected value (displayed in parentheses) for each slot of the contingency table, as shown in the Table. From the table, we can compute the χ² value as
χ² = Σ (observed − expected)² / expected.
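The following sketch reproduces these numbers in Python. The contingency-table counts are derived from the probabilities stated above (P(game) = 0.60, P(video) = 0.75, P(game, video) = 0.40, with an implied total of 10,000 transactions since the observed (game, video) slot is 4,000); treat them as an assumption consistent with the text rather than a quotation of the original Table.

# Probabilities stated in the text.
p_game, p_video, p_both = 0.60, 0.75, 0.40

lift = p_both / (p_game * p_video)
print(round(lift, 2))                       # 0.89 -> less than 1, negative correlation

# Contingency table implied by the probabilities with N = 10,000 transactions:
#                 video      not video
# game            4000       2000
# not game        3500        500
N = 10_000
observed = {("game", "video"): 4000, ("game", "!video"): 2000,
            ("!game", "video"): 3500, ("!game", "!video"): 500}
row = {"game": 6000, "!game": 4000}
col = {"video": 7500, "!video": 2500}

chi2 = sum((obs - row[r] * col[c] / N) ** 2 / (row[r] * col[c] / N)
           for (r, c), obs in observed.items())
print(round(chi2, 1))                       # 555.6 -> far from the independence value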
Because the χ² value is greater than 1, and the observed value of the slot (game, video) is 4,000, which is less than the expected value of 4,500, buying a game and buying a video are negatively correlated.
• ALL_CONFIDENCE: Given an itemset X = {i1, i2, …, ik}, the all_confidence of X is defined as
all_conf(X) = sup(X) / max_item_sup(X),
where max_item_sup(X) = max{sup(ij) : ij ∈ X} is the maximum (single-item) support of the items in X.
• COSINE: This measure assesses the relevance between two itemsets A and B as
cosine(A, B) = P(A ∪ B) / √(P(A) × P(B)) = sup(A ∪ B) / √(sup(A) × sup(B)).
A small sketch computing both measures for the game/video data is given below.
Comparison of four correlation measures on typical data sets.
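Sticking with the game/video probabilities above, the snippet below computes all_confidence and cosine; the supports are the same values assumed in the previous sketch.

from math import sqrt

sup_game, sup_video, sup_both = 0.60, 0.75, 0.40    # relative supports from the text

all_conf = sup_both / max(sup_game, sup_video)      # sup(X) / max_item_sup(X)
cosine = sup_both / sqrt(sup_game * sup_video)      # sup(A∪B) / sqrt(sup(A)*sup(B))

print(round(all_conf, 2))    # 0.53
print(round(cosine, 2))      # 0.6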
Constraint-Based Association Mining:
A data mining process may uncover thousands of rules from a given set of data, most of which end up being unrelated or uninteresting to the users. To find the interesting patterns, constraints are included. This strategy is known as constraint-based mining. The constraints can include the following:
• Knowledge type constraints: These specify the type of knowledge to be mined, such as association or correlation.
• Data constraints: These specify the set of task-relevant data.
• Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, or the levels of the concept hierarchies, to be used in mining.
• Interestingness constraints: These specify thresholds on statistical measures of rule interestingness, such as support, confidence, and correlation.
• Rule constraints: These specify the form of rules to be mined. Such constraints may be expressed as meta rules (rule templates), as the maximum or minimum number of predicates that can occur in the rule antecedent or consequent, or as relationships among attributes, attribute values, and/or aggregates.
The above constraints can be specified using a high-level declarative data mining query language and user interface.
• Meta Rule-Guided Mining of Association Rules:
Meta rules allow users to specify the syntactic form of the rules they are interested in mining. The rule forms can be used as constraints to help improve the efficiency of the mining process. Meta rules may be based on the analyst's experience, expectations, or intuition regarding the data, or may be automatically generated based on the database schema. A meta rule describes the form of the rules the user is interested in finding. An example of such a meta rule is
P1(X, Y) ∧ P2(X, W) -> buys(X, "office software"),
where P1 and P2 are predicate variables that are instantiated to attributes from the given database during the mining process.
In general, a meta rule forms a hypothesis regarding the relationships that the user is interested in probing or confirming. A meta rule is a rule template of the form
P1 ∧ P2 ∧ … ∧ Pl -> Q1 ∧ Q2 ∧ … ∧ Qr,
where the Pi (i = 1, …, l) and Qj (j = 1, …, r) are predicates that are either instantiated or left as predicate variables.

• Constraint Pushing: Mining Guided by Rule Constraints:
Rule constraints specify expected set/subset relationships of the variables in the mined rules, constant initialization of variables, and aggregate functions. Our association mining query is: "Find the sales of which cheap items (where the sum of the prices is less than $100) may promote the sales of which expensive items (where the minimum price is $500) of the same group for Chicago customers in 2004." This can be expressed as a query in the DMQL data mining query language.
Rule constraints can be classified into the following five categories with respect to frequent itemset mining: (1) anti-monotone, (2) monotone, (3) succinct, (4) convertible, and (5) inconvertible.
Constraints belonging to the first four of these categories can be used during frequent itemset mining to guide the process, leading to more efficient and effective mining.
• A constraint Ca is anti-monotone iff, for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy Ca.
• sum(S.Price) ≤ v is anti-monotone
• sum(S.Price) ≥ v is not anti-monotone
• sum(S.Price) = v is partly anti-monotone
• A constraint Cm is monotone iff, for any pattern S satisfying Cm, every super-pattern of S also satisfies it.
• Succinctness:
• For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C.
• Given that A1 is the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1, i.e., it contains a subset belonging to A1.
Example:
• sum(S.Price) ≥ v is not succinct
• min(S.Price) ≤ v is succinct
• Convertible Constraint
• Suppose all items in patterns are listed in a total order R.
• A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies C.
• A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that each pattern of which S is a suffix w.r.t. R also satisfies C.
Example: Let R be the value-descending order over the set of items
o E.g., I = {9, 8, 6, 4, 3, 1}
avg(S) ≥ v is convertible monotone w.r.t. R
o If S is a suffix of S1, then avg(S1) ≥ avg(S)
• {8, 4, 3} is a suffix of {9, 8, 4, 3}
• avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
o If S satisfies avg(S) ≥ v, so does S1
• {8, 4, 3} satisfies the constraint avg(S) ≥ 4, so does {9, 8, 4, 3}
A small sketch of how an anti-monotone constraint prunes candidates is given below.
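The sketch below illustrates constraint pushing with the anti-monotone constraint sum(S.Price) ≤ v: once an itemset violates it, every superset can be pruned without counting its support. The item prices and the threshold are hypothetical.

price = {"pen": 5, "book": 40, "headset": 70, "camera": 300}    # hypothetical prices

def violates_sum_le(itemset, v):
    """Anti-monotone check: if the sum of prices already exceeds v, prune the itemset
    and all of its supersets (their sums can only be equal or larger)."""
    return sum(price[i] for i in itemset) > v

v = 100
print(violates_sum_le({"book", "headset"}, v))              # True  -> prune this branch
print(violates_sum_le({"book", "headset", "camera"}, v))    # True  -> never even generated
print(violates_sum_le({"pen", "book"}, v))                  # False -> keep exploring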