Lecture 08
Information Retrieval
The Vector Space Model for Scoring
Introduction
 The representation of a set of documents as vectors in a common vector space is known as the vector space model and is fundamental to a host of information retrieval operations, ranging from scoring documents on a query to document classification and document clustering.
 We first develop the basic ideas underlying vector space scoring; a
pivotal step in this development is the view of queries as vectors
in the same vector space as the document collection.
Dot Products
 We denote by 𝑽(𝒅) the vector derived from document 𝒅, with one component in the vector for each dictionary term.
 Unless otherwise specified, you may assume that the components are
computed using the 𝒕𝒇 − 𝒊𝒅𝒇 weighting scheme, although the
particular weighting scheme is immaterial to the discussion that
follows.
 The set of documents in a collection may then be viewed as a set of vectors in a vector space, in which there is one axis for each term.
 So we have a |V|-dimensional vector space:
 Terms are axes of the space
 Documents are points or vectors in this space
 Very high-dimensional: tens of millions of dimensions when you apply
this to a web search engine
 These are very sparse vectors – most entries are zero
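 A minimal sketch of this representation in Python (the toy corpus and helper names are invented, not from the lecture): each document becomes a dict mapping terms to tf-idf weights, so the zero components are never stored:

    import math
    from collections import Counter

    # Toy two-document corpus; each document vector is stored sparsely
    # as a dict of term -> tf-idf weight, since most components are zero.
    docs = {
        "d1": "gossip gossip jealous gossip",
        "d2": "affection affection jealous",
    }

    N = len(docs)
    tfs = {d: Counter(text.split()) for d, text in docs.items()}

    # df[t] = number of documents containing term t
    df = Counter(t for counts in tfs.values() for t in counts)

    def tfidf_vector(counts):
        # tf-idf weight: tf * log10(N / df), the scheme assumed in the lecture
        return {t: tf * math.log10(N / df[t]) for t, tf in counts.items()}

    vectors = {d: tfidf_vector(counts) for d, counts in tfs.items()}
    print(vectors["d1"])  # 'jealous' occurs in every document, so its idf (and weight) is 0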
 How do we quantify the similarity between two documents in this
vector space?
 A first attempt might consider the magnitude of the vector difference
between two document vectors.
 This measure suffers from a drawback: two documents with very
similar content can have a significant vector difference simply
because one is much longer than the other. Thus the relative
distributions of terms may be identical in the two documents, but the
absolute term frequencies of one may be far larger.
[Figure: vectors for documents 𝑑1 and 𝑑2, illustrating that two documents with similar term distributions but different lengths can have a large vector difference.]
 To compensate for the effect of document length, the standard way of quantifying the similarity between two documents 𝒅𝟏 and 𝒅𝟐 is to compute the cosine similarity of their vector representations:

    \mathrm{sim}(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)|\,|\vec{V}(d_2)|}

 where the numerator represents the dot product (also known as the inner product) of the vectors 𝑽(𝒅𝟏) and 𝑽(𝒅𝟐). The dot product 𝒙 · 𝒚 of two vectors is defined as:

    \vec{x} \cdot \vec{y} = \sum_{i=1}^{M} x_i y_i

 while the denominator is the product of their Euclidean lengths.
 Let 𝑽(𝒅) denote the document vector for 𝒅, with M components 𝑽₁(𝒅), ..., 𝑽ₘ(𝒅). The Euclidean length of 𝒅 is defined to be:

    |\vec{V}(d)| = \sqrt{\sum_{i=1}^{M} \vec{V}_i^2(d)}

 The effect of the denominator is thus to length-normalize the vectors 𝑽(𝒅𝟏) and 𝑽(𝒅𝟐) to the unit vectors 𝒗(𝒅𝟏) = 𝑽(𝒅𝟏)/|𝑽(𝒅𝟏)| and 𝒗(𝒅𝟐) = 𝑽(𝒅𝟐)/|𝑽(𝒅𝟐)|.
 We can then rewrite the similarity as:

    \mathrm{sim}(d_1, d_2) = \vec{v}(d_1) \cdot \vec{v}(d_2)
 Example: for Doc1, with components (27, 3, 0, 14):

    \sqrt{27^2 + 3^2 + 0^2 + 14^2} = \sqrt{934} \approx 30.56

 so the length-normalized vector is (27/30.56, 3/30.56, 0/30.56, 14/30.56) ≈ (0.88, 0.10, 0, 0.46).
Cosine Similarity
 Combining the definitions above, the cosine similarity of any two vectors 𝒙 and 𝒚 can be written out in components as:

    \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \frac{\sum_{i=1}^{M} x_i y_i}{\sqrt{\sum_{i=1}^{M} x_i^2}\,\sqrt{\sum_{i=1}^{M} y_i^2}}

 Example: [the worked cosine-similarity examples appeared as figures on the original slides]
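 A minimal sketch of this computation in Python, using dicts as sparse vectors (the two example vectors are invented for illustration):

    import math

    def cosine(u, v):
        """Cosine similarity of two sparse vectors given as dicts."""
        if len(u) > len(v):          # iterate over the smaller dict
            u, v = v, u
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    d1 = {"jealous": 2.0, "gossip": 3.0}
    d2 = {"jealous": 1.0, "affection": 4.0}
    print(round(cosine(d1, d2), 3))  # 0.135: only the shared term 'jealous' contributes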
Queries as Vectors
 There is a far more compelling reason to represent
documents as vectors: we can also view a query as a vector.
 So, we represent queries as vectors in the space.
 Rank documents according to their proximity to the query in this
space.
 Consider the query q = "jealous gossip":

    term        query
    affection   0
    jealous     1
    gossip      1
    wuthering   0
 Consider the query q = "jealous gossip" again. Under log-frequency weighting, each query term with tf = 1 receives weight 1 + log₁₀(1) = 1; length normalization then divides each component by √2 ≈ 1.41, giving roughly 0.70:

    term        log-frequency weight   after length normalization
    affection   0                      0
    jealous     1                      0.70
    gossip      1                      0.70
    wuthering   0                      0

 The key idea now: assign to each document 𝒅 a score equal to the dot product 𝑽(𝒒) · 𝑽(𝒅).
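 The same computation in Python (the log_tf helper is our name; note that 1/√2 is 0.707, which the slide displays as 0.70):

    import math
    from collections import Counter

    def log_tf(tf):
        # log-frequency weighting: 1 + log10(tf) for tf > 0, else 0
        return 1 + math.log10(tf) if tf > 0 else 0.0

    query = "jealous gossip"
    weights = {t: log_tf(tf) for t, tf in Counter(query.split()).items()}

    length = math.sqrt(sum(w * w for w in weights.values()))   # sqrt(2)
    unit = {t: round(w / length, 3) for t, w in weights.items()}
    print(unit)  # {'jealous': 0.707, 'gossip': 0.707}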
 Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model.
 Instead: rank more relevant documents higher than less relevant documents.
 To summarize: by viewing a query as a "bag of words", we are able to treat it as a very short document.
 As a consequence, we can use the cosine similarity between the query vector and a document vector as a measure of the score of the document for that query.
 The resulting scores can then be used to select the top-scoring documents for a query. Thus we have:

    \mathrm{score}(q, d) = \frac{\vec{V}(q) \cdot \vec{V}(d)}{|\vec{V}(q)|\,|\vec{V}(d)|} = \vec{v}(q) \cdot \vec{v}(d)
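 A toy end-to-end run in Python (the document vectors below are invented and approximately length-normalized, so each score reduces to the dot product 𝒗(𝒒) · 𝒗(𝒅)):

    # Rank documents by their dot product with the unit query vector.
    def dot(u, v):
        return sum(w * v.get(t, 0.0) for t, w in u.items())

    docs = {
        "doc1": {"affection": 0.99, "jealous": 0.10, "gossip": 0.10},
        "doc2": {"affection": 0.80, "jealous": 0.60},
        "doc3": {"jealous": 0.58, "gossip": 0.58, "wuthering": 0.58},
    }
    query = {"jealous": 0.70, "gossip": 0.70}

    K = 2
    top_k = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)[:K]
    print(top_k)  # ['doc3', 'doc2']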
Computing Vector Scores
 In a typical setting we have a collection of documents, each represented by a vector, a free-text query represented by a vector, and a positive integer K.
 We seek the K documents of the collection with the highest vector space scores on the given query.
 We now initiate the study of determining the K documents with the highest vector space scores for a query.
 Typically, we seek these top K documents ordered by decreasing score; for instance, many search engines use K = 10 to retrieve and rank-order the first page of the ten best results.
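 The scoring algorithm discussed in the following paragraphs appeared as a figure on the original slide. The pseudocode below is reconstructed from the standard CosineScore algorithm in Manning et al., Introduction to Information Retrieval, numbered so that the step references in the text line up:

    CosineScore(q)
    1   float Scores[N] = 0
    2   Initialize Length[N]
    3   for each query term t
    4   do fetch the postings list for t
    5      calculate w(t,q), the weight of t in the query vector
    6      for each pair (d, tf(t,d)) in the postings list
    7      do calculate w(t,d), the weight of t in document d
    8         Scores[d] += w(t,d) × w(t,q)
    9   for each d do Scores[d] = Scores[d] / Length[d]
    10  return the top K components of Scores[]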
 The array Length holds the lengths (normalization factors) for each of the N documents, whereas the array Scores holds the scores for each of the documents. When the scores are finally computed in Step 9, all that remains in Step 10 is to pick off the K documents with the highest scores.
 The outermost loop beginning at Step 3 repeats the updating of Scores, iterating over each query term t in turn.
 In Step 5 we calculate the weight in the query vector for term t.
 Steps 6-8 update the score of each document by adding in the contribution
from term t.
 This process of adding in contributions one query term at a time is
sometimes known as term-at-a-time scoring or accumulation, and the N
elements of the array Scores are therefore known as accumulators.
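 A sketch of term-at-a-time scoring in Python (the postings lists, weights, and lengths below are invented stand-ins for a real inverted index):

    from collections import defaultdict

    # postings maps each term to a list of (doc_id, w_td) pairs.
    postings = {
        "jealous": [(1, 0.6), (3, 0.9)],
        "gossip":  [(2, 0.4), (3, 0.7)],
    }
    query_weights = {"jealous": 0.70, "gossip": 0.70}
    length = {1: 1.0, 2: 1.0, 3: 1.2}    # per-document normalization factors

    scores = defaultdict(float)          # the accumulators (stored sparsely here)
    for t, w_tq in query_weights.items():        # one query term at a time
        for d, w_td in postings.get(t, []):
            scores[d] += w_td * w_tq             # add in t's contribution

    for d in scores:                             # length-normalize at the end
        scores[d] /= length[d]

    print(sorted(scores.items(), key=lambda x: -x[1]))  # doc 3 scores highest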
 It would appear necessary to store, with each postings entry, the weight 𝒘𝒇_{𝒕,𝒅} of term t in document d (we have thus far used either tf or tf-idf for this weight, but leave open the possibility of other functions to be developed in later sections).
 In fact this is wasteful, since storing this weight may require a floating-point number.
 Two ideas help alleviate this space problem:
 First, if we are using inverse document frequency, we need not precompute 𝒊𝒅𝒇_𝒕; it suffices to store N/𝒅𝒇_𝒕 at the head of the postings for t.
 Second, we store the term frequency 𝒕𝒇_{𝒕,𝒅} for each postings entry.
 Finally, Step 10 extracts the top K scores; this requires a priority queue data structure, often implemented using a heap. Such a heap takes no more than 2N comparisons to construct, following which each of the K top scores can be extracted in 2 log N steps.
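 The construct-then-extract scheme can be sketched with Python's heapq (the scores are invented for illustration):

    import heapq

    # Build a heap over all N scores in linear time, then pop the K best;
    # each pop costs O(log N).
    scores = {1: 0.42, 2: 0.28, 3: 0.93, 4: 0.11, 5: 0.77}
    K = 3

    # heapq is a min-heap, so negate the scores to pop the largest first.
    heap = [(-s, d) for d, s in scores.items()]
    heapq.heapify(heap)                                # linear-time construction
    top_k = [heapq.heappop(heap)[1] for _ in range(K)]
    print(top_k)  # [3, 5, 1]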