Disk Based Data Structures
   So far search trees were limited to main
    memory structures
       Assumption: the dataset organized in a search tree
        fits in main memory (including the tree overhead)
   Counter-example: transaction data of a bank >
    1 GB per day
       use secondary storage media (punch cards, hard
        disks, magnetic tapes, etc.)
   Consequence: make a search tree structure
    secondary-storage-enabled
Hard Disks

 Large amounts of
  storage, but slow
  access!
 Identifying a page
  takes a long time (seek
  time plus rotational
  delay – 5-
  10ms), reading it is fast
       It pays off to read or
        write data in pages (or
        blocks) of 2-16 Kb in
        size.
Algorithm analysis
   The running time of disk-based algorithms is
    measured in terms of
     computing time (CPU)
     number of disk accesses
         sequential reads
         random reads



   Regular main-memory algorithms that work one
    data element at a time can not be “ported” to
    secondary storage in a straight-forward way
Principles

 Pointers   in data structures are no longer
  addresses in main memory but
  locations in files
 If x is a pointer to an object
    ifx is in main memory key[x] refers to it
    otherwise DiskRead(x) reads the object
     from disk into main memory (DiskWrite(x) –
     writes it back to disk)
Principles (2)
  A typical    working pattern
    01   …
    02   x    a pointer to some   object
    03   DiskRead(x)
    04   operations that access   and/or modify x
    05   DiskWrite(x) //omitted   if nothing changed
    06   other operations, only   access no modify
    07   …

  Operations:
    DiskRead(x:pointer_to_a_node)
    DiskWrite(x:pointer_to_a_node)
    AllocateNode():pointer_to_a_node
Binary-trees vs. B-trees
 Size of B-tree nodes is determined by the page
  size. One page – one node.
 A B-tree of height 2 may contain > 1 billion keys!
 Heights of Binary-tree and B-tree are logarithmic
     B-tree: logarithm of base, e.g., 1000
     Binary-tree: logarithm of base 2

                                                          1 node
                         1000                             1000 keys
                             1001
                                                          1001 nodes,
     1000             1000              …          1000
                                                          1,001,000 keys
        1001                 1001                  1001
                                                          1,002,001 nodes,
               1000    1000         …       1000
                                                          1,002,001,000 keys
B-tree Definitions
   Node x has fields
     n[x]: the number of keys of that the node
     key1[x] … keyn[x][x]: the keys in ascending order
     leaf[x]: true if leaf node, false if internal node
     if internal node, then c1[x], …, cn[x]+1[x]: pointers to
      children
   Keys separate the ranges of keys in the sub-
    trees. If ki is an arbitrary key in the subtree ci[x]
    then ki keyi[x] ki+1
B-tree Definitions (2)

 Every  leaf has the same depth
 In a B-tree of a degree t all nodes except
  the root node have between t and 2t
  children (i.e., between t–1 and 2t–1 keys).
 The root node has between 0 and 2t
  children (i.e., between 0 and 2t–1 keys)
Height of a B-tree
   B-tree T of height h, containing n 1 keys and
    minimum degree t 2, the following restriction
    on the height holds:          n 1
                          h log t         depth
                                                #of
                                   2            nodes
                   1
                                            0 1

        t-1                                 t-1            1   2
    t                                  t

t-1      t-1   …   t-1                t-1    t-1   … t-1   2   2t
                     h
         n 1 (t 1)         2t i   1
                                      2t h 1
                     i 1
B-tree Operations

 An implementation needs to suport the
 following B-tree operations
   Searching   (simple)
   Creating an empty tree (trivial)
   Insertion (complex)
   Deletion (complex)
Searching
 Straightforward      generalization of a binary
 tree search
   BTreeSearch(x,k)
   01   i    1
   02   while i   n[x] and k > keyi[x]
   03      i    i+1
   04   if i   n[x] and k = keyi[x] then
   05      return(x,i)
   06   if leaf[x] then
   08      return NIL
   09      else DiskRead(ci[x])
   10           return BTtreeSearch(ci[x],k)
Creating an Empty Tree
 Empty     B-tree = create a root & write it to
 disk!

BTreeCreate(T)
01   x    AllocateNode();
02   leaf[x]    TRUE;
03   n[x]    0;
04   DiskWrite(x);
05   root[T]    x
Splitting Nodes

 Nodes  fill up and reach their maximum
  capacity 2t – 1
 Before we can insert a new key, we have
  to “make room,” i.e., split nodes
Splitting Nodes (2)
 Result:
        one key of x moves up to parent +
 2 nodes with t-1 keys


      x                            x
            ... N W ...                ... N S W ...
      y = ci[x]
                               y = ci[x]           z = ci+1[x]
       P Q R S T V W
                                  P Q R          T V W


 T1               ...     T8
Splitting Nodes (2)
BTreeSplitChild(x,i,y)
 z     AllocateNode()
 leaf[z]      leaf[y]            x: parent node
 n[z]      t-1                   y: node to be split and child of x
 for j      1 to t-1             i: index in x
     keyj[z]     keyj+t[y]
                                 z: new node
 if not leaf[y] then
     for j     1 to t
         cj[z]    cj+t[y]
 n[y]      t-1                       x
 for j      n[x]+1 downto i+1
     cj+1[x]    cj[x]                      ... N W ...
 ci+1[x]     z
 for j      n[x] downto i            y = ci[x]
     keyj+1[x]     keyj[x]
 keyi[x]      keyt[y]                 P Q R S T V W
 n[x]      n[x]+1
 DiskWrite(y)
 DiskWrite(z)                   T1               ...         T8
 DiskWrite(x)
Split: Running Time
 A local operation that does not traverse
  the tree
 (t) CPU-time, since two loops run t times
 3 I/Os
Inserting Keys
 Done   recursively, by starting from the root
  and recursively traversing down the tree to
  the leaf level
 Before descending to a lower level in the
  tree, make sure that the node contains <
  2t – 1 keys:
   so  that if we split a node in a lower level we
    will have space to include a new key
Inserting Keys (2)
 Special   case: root is full (BtreeInsert)

 BTreeInsert(T)
 r    root[T]
 if n[r] = 2t – 1 then
    s    AllocateNode()
    root[T]    s
    leaf[s]    FALSE
    n[s]    0
    c1[s]    r
    BTreeSplitChild(s,1,r)
    BTreeInsertNonFull(s,k)
 else BTreeInsertNonFull(r,k)
Splitting the Root

 Splitting   the root requires the creation of a
  new root
                              root[T]
   root[T]                                s
                     r
                                          H
        A D F H L N P
                                   r
                                  A D F       L N P
   T1          ...       T8

 The  tree grows at the top instead of the
  bottom
Inserting Keys
  BtreeNonFull  tries to insert a key k into
   a node x, which is assumed to be non-
   full when the procedure is called
  BTreeInsert and the recursion in
   BTreeInsertNonFull guarantee that this
   assumption is true!
Inserting Keys: Pseudo Code
BTreeInsertNonFull(x,k)
01 i    n[x]
02 if leaf[x] then
03    while i    1 and k < keyi[x]
04       keyi+1[x]    keyi[x]
05       i     i - 1                    leaf insertion
06    keyi+1[x]    k
07    n[x]     n[x] + 1
08    DiskWrite(x)
09 else while i    1 and k < keyi[x]
10       i     i - 1
11    i    i + 1                       internal node:
12    DiskRead ci[x]                   traversing tree
13    if n[ci[x]] = 2t – 1 then
14       BTreeSplitChild(x,i,ci[x])
15       if k > keyi[x] then
16           i    i + 1
17    BTreeInsertNonFull(ci[x],k)
Insertion: Example
 initial tree (t = 3)
                              G M P X

     A C D E            J K    N O    R S T U V     Y Z
  B inserted
                              G M P X

  A B C D E             J K     N O   R S T U V     Y Z


  Q inserted
                              G M P T X

  A B C D E             J K     N O   Q R S   U V    Y Z
Insertion: Example (2)

   L inserted             P

                 G M             T X

   A B C D E     J K L   N O   Q R S   U V   Y Z



    F inserted            P

                 C G M           T X

   A B   D E F   J K L   N O   Q R S   U V   Y Z
Insertion: Running Time

 Disk  I/O: O(h), since only O(1) disk
  accesses are performed during recursive
  calls of BTreeInsertNonFull
 CPU: O(th) = O(t logtn)
 At any given time there are O(1) number
  of disk pages in main memory
Deleting Keys
 Done recursively, by starting from the root and
  recursively traversing down the tree to the leaf
  level
 Before descending to a lower level in the
  tree, make sure that the node contains t keys
  (cf. insertion < 2t – 1 keys)
 BtreeDelete distinguishes three different
  stages/scenarios for deletion
     Case 1: key k found in leaf node
     Case 2: key k found in internal node
     Case 3: key k suspected in lower level node
Deleting Keys (2)
    initial tree              P


                    C G M            T X

    A B   D E F     J K L    N O   Q R S    U V   Y Z

     F deleted:                P
     case 1
                    C G M             T X

      A B     D E    J K L   N O   Q R S    U V   Y Z
               x

   Case 1: If the key k is in node x, and x is a leaf,
    delete k from x
Deleting Keys (3)
   Case 2: If the key k is in node x, and x is not a
    leaf, delete k from x
     a) If the child y that precedes k in node x has at least t
      keys, then find the predecessor k’ of k in the sub-tree
      rooted at y. Recursively delete k’, and replace k with
      k’ in x.
     b) Symmetrically for successor node z



         M deleted:                  P
         case 2a
                       C G L     x         T X


             A B    D E   J K   N O      Q R S   U V   Y Z
                           y
Deleting Keys (4)
   If both y and z have only t –1 keys, merge k with
    the contents of z into y, so that x loses both k
    and the pointers to z, and y now contains 2t – 1
    keys. Free z and recursively delete k from y.


      G deleted:                  P
      case 2c
                       C L    x-k       T X

          A B      D E J K    N O     Q R S   U V   Y Z

                y = y+k + z - k
Deleting Keys - Distribution
 Descending down the tree: if k not found in
  current node x, find the sub-tree ci[x] that has to
  contain k.
 If ci[x] has only t – 1 keys take action to ensure
  that we descent to a node of size at least t.
 We can encounter two cases.
       If ci[x] has only t-1 keys, but a sibling with at least t
        keys, give ci[x] an extra key by moving a key from x to
        ci[x], moving a key from ci[x]’s immediate left and right
        sibling up into x, and moving the appropriate child from
        the sibling into ci[x] - distribution
Deleting Keys – Distribution(2)
   x              ... k ...                           x            ... k’ ...


ci[x]       ...               k’                   ci[x]   ... k

                  A       B                                A B
                                             C L P T X
        delete B
        ci[x]         A B      E J K         N O   Q R S   U V          Y Z
                                   sibling

        B deleted:                             E L P T X

                              A C     J K     N O      Q R S   U V              Y Z
Deleting Keys - Merging
   If    ci[x] and both of ci[x]’s siblings have t – 1
        keys, merge ci with one sibling, which
        involves moving a key from x down into
        the new merged node to become the
        median key for that node

    x         ... l’ k m’...       x             ... l’ m’ ...


ci[x]        ... l          m…         ...l k m ...

                A       B               A B
Deleting Keys – Merging (2)
                                 P

delete D         ci[x]   C L     sibling   T X


           A B   D E J K       N O   Q R S       U V   Y Z




D deleted:
                               C L P T X

           A B    E J K        N O   Q R S       U V   Y Z

tree shrinks in height
Deletion: Running Time
 Most of the keys are in the leaf, thus deletion
  most often occurs there!
 In this case deletion happens in one downward
  pass to the leaf level of the tree
 Deletion from an internal node might require
  “backing up” (case 2)
 Disk I/O: O(h), since only O(1) disk operations
  are produced during recursive calls
 CPU: O(th) = O(t logtn)
Two-pass Operations
 Simpler,
         practical versions of algorithms
 use two passes (down and up the tree):
   Down   – Find the node where deletion or
    insertion should occur
   Up – If needed, split, merge, or distribute;
    propagate splits, merges, or distributes up the
    tree
 Toavoid reading the same nodes
 twice, use a buffer of nodes

Lec16

  • 1.
    Disk Based DataStructures  So far search trees were limited to main memory structures  Assumption: the dataset organized in a search tree fits in main memory (including the tree overhead)  Counter-example: transaction data of a bank > 1 GB per day  use secondary storage media (punch cards, hard disks, magnetic tapes, etc.)  Consequence: make a search tree structure secondary-storage-enabled
  • 2.
    Hard Disks  Largeamounts of storage, but slow access!  Identifying a page takes a long time (seek time plus rotational delay – 5- 10ms), reading it is fast  It pays off to read or write data in pages (or blocks) of 2-16 Kb in size.
  • 3.
    Algorithm analysis  The running time of disk-based algorithms is measured in terms of  computing time (CPU)  number of disk accesses  sequential reads  random reads  Regular main-memory algorithms that work one data element at a time can not be “ported” to secondary storage in a straight-forward way
  • 4.
    Principles  Pointers in data structures are no longer addresses in main memory but locations in files  If x is a pointer to an object  ifx is in main memory key[x] refers to it  otherwise DiskRead(x) reads the object from disk into main memory (DiskWrite(x) – writes it back to disk)
  • 5.
    Principles (2) A typical working pattern 01 … 02 x a pointer to some object 03 DiskRead(x) 04 operations that access and/or modify x 05 DiskWrite(x) //omitted if nothing changed 06 other operations, only access no modify 07 …  Operations:  DiskRead(x:pointer_to_a_node)  DiskWrite(x:pointer_to_a_node)  AllocateNode():pointer_to_a_node
  • 6.
    Binary-trees vs. B-trees Size of B-tree nodes is determined by the page size. One page – one node.  A B-tree of height 2 may contain > 1 billion keys!  Heights of Binary-tree and B-tree are logarithmic  B-tree: logarithm of base, e.g., 1000  Binary-tree: logarithm of base 2 1 node 1000 1000 keys 1001 1001 nodes, 1000 1000 … 1000 1,001,000 keys 1001 1001 1001 1,002,001 nodes, 1000 1000 … 1000 1,002,001,000 keys
  • 7.
    B-tree Definitions  Node x has fields  n[x]: the number of keys of that the node  key1[x] … keyn[x][x]: the keys in ascending order  leaf[x]: true if leaf node, false if internal node  if internal node, then c1[x], …, cn[x]+1[x]: pointers to children  Keys separate the ranges of keys in the sub- trees. If ki is an arbitrary key in the subtree ci[x] then ki keyi[x] ki+1
  • 8.
    B-tree Definitions (2) Every leaf has the same depth  In a B-tree of a degree t all nodes except the root node have between t and 2t children (i.e., between t–1 and 2t–1 keys).  The root node has between 0 and 2t children (i.e., between 0 and 2t–1 keys)
  • 9.
    Height of aB-tree  B-tree T of height h, containing n 1 keys and minimum degree t 2, the following restriction on the height holds: n 1 h log t depth #of 2 nodes 1 0 1 t-1 t-1 1 2 t t t-1 t-1 … t-1 t-1 t-1 … t-1 2 2t h n 1 (t 1) 2t i 1 2t h 1 i 1
  • 10.
    B-tree Operations  Animplementation needs to suport the following B-tree operations  Searching (simple)  Creating an empty tree (trivial)  Insertion (complex)  Deletion (complex)
  • 11.
    Searching  Straightforward generalization of a binary tree search BTreeSearch(x,k) 01 i 1 02 while i n[x] and k > keyi[x] 03 i i+1 04 if i n[x] and k = keyi[x] then 05 return(x,i) 06 if leaf[x] then 08 return NIL 09 else DiskRead(ci[x]) 10 return BTtreeSearch(ci[x],k)
  • 12.
    Creating an EmptyTree  Empty B-tree = create a root & write it to disk! BTreeCreate(T) 01 x AllocateNode(); 02 leaf[x] TRUE; 03 n[x] 0; 04 DiskWrite(x); 05 root[T] x
  • 13.
    Splitting Nodes  Nodes fill up and reach their maximum capacity 2t – 1  Before we can insert a new key, we have to “make room,” i.e., split nodes
  • 14.
    Splitting Nodes (2) Result: one key of x moves up to parent + 2 nodes with t-1 keys x x ... N W ... ... N S W ... y = ci[x] y = ci[x] z = ci+1[x] P Q R S T V W P Q R T V W T1 ... T8
  • 15.
    Splitting Nodes (2) BTreeSplitChild(x,i,y) z AllocateNode() leaf[z] leaf[y] x: parent node n[z] t-1 y: node to be split and child of x for j 1 to t-1 i: index in x keyj[z] keyj+t[y] z: new node if not leaf[y] then for j 1 to t cj[z] cj+t[y] n[y] t-1 x for j n[x]+1 downto i+1 cj+1[x] cj[x] ... N W ... ci+1[x] z for j n[x] downto i y = ci[x] keyj+1[x] keyj[x] keyi[x] keyt[y] P Q R S T V W n[x] n[x]+1 DiskWrite(y) DiskWrite(z) T1 ... T8 DiskWrite(x)
  • 16.
    Split: Running Time A local operation that does not traverse the tree  (t) CPU-time, since two loops run t times  3 I/Os
  • 17.
    Inserting Keys  Done recursively, by starting from the root and recursively traversing down the tree to the leaf level  Before descending to a lower level in the tree, make sure that the node contains < 2t – 1 keys:  so that if we split a node in a lower level we will have space to include a new key
  • 18.
    Inserting Keys (2) Special case: root is full (BtreeInsert) BTreeInsert(T) r root[T] if n[r] = 2t – 1 then s AllocateNode() root[T] s leaf[s] FALSE n[s] 0 c1[s] r BTreeSplitChild(s,1,r) BTreeInsertNonFull(s,k) else BTreeInsertNonFull(r,k)
  • 19.
    Splitting the Root Splitting the root requires the creation of a new root root[T] root[T] s r H A D F H L N P r A D F L N P T1 ... T8  The tree grows at the top instead of the bottom
  • 20.
    Inserting Keys BtreeNonFull tries to insert a key k into a node x, which is assumed to be non- full when the procedure is called  BTreeInsert and the recursion in BTreeInsertNonFull guarantee that this assumption is true!
  • 21.
    Inserting Keys: PseudoCode BTreeInsertNonFull(x,k) 01 i n[x] 02 if leaf[x] then 03 while i 1 and k < keyi[x] 04 keyi+1[x] keyi[x] 05 i i - 1 leaf insertion 06 keyi+1[x] k 07 n[x] n[x] + 1 08 DiskWrite(x) 09 else while i 1 and k < keyi[x] 10 i i - 1 11 i i + 1 internal node: 12 DiskRead ci[x] traversing tree 13 if n[ci[x]] = 2t – 1 then 14 BTreeSplitChild(x,i,ci[x]) 15 if k > keyi[x] then 16 i i + 1 17 BTreeInsertNonFull(ci[x],k)
  • 22.
    Insertion: Example initialtree (t = 3) G M P X A C D E J K N O R S T U V Y Z B inserted G M P X A B C D E J K N O R S T U V Y Z Q inserted G M P T X A B C D E J K N O Q R S U V Y Z
  • 23.
    Insertion: Example (2) L inserted P G M T X A B C D E J K L N O Q R S U V Y Z F inserted P C G M T X A B D E F J K L N O Q R S U V Y Z
  • 24.
    Insertion: Running Time Disk I/O: O(h), since only O(1) disk accesses are performed during recursive calls of BTreeInsertNonFull  CPU: O(th) = O(t logtn)  At any given time there are O(1) number of disk pages in main memory
  • 25.
    Deleting Keys  Donerecursively, by starting from the root and recursively traversing down the tree to the leaf level  Before descending to a lower level in the tree, make sure that the node contains t keys (cf. insertion < 2t – 1 keys)  BtreeDelete distinguishes three different stages/scenarios for deletion  Case 1: key k found in leaf node  Case 2: key k found in internal node  Case 3: key k suspected in lower level node
  • 26.
    Deleting Keys (2) initial tree P C G M T X A B D E F J K L N O Q R S U V Y Z F deleted: P case 1 C G M T X A B D E J K L N O Q R S U V Y Z x  Case 1: If the key k is in node x, and x is a leaf, delete k from x
  • 27.
    Deleting Keys (3)  Case 2: If the key k is in node x, and x is not a leaf, delete k from x  a) If the child y that precedes k in node x has at least t keys, then find the predecessor k’ of k in the sub-tree rooted at y. Recursively delete k’, and replace k with k’ in x.  b) Symmetrically for successor node z M deleted: P case 2a C G L x T X A B D E J K N O Q R S U V Y Z y
  • 28.
    Deleting Keys (4)  If both y and z have only t –1 keys, merge k with the contents of z into y, so that x loses both k and the pointers to z, and y now contains 2t – 1 keys. Free z and recursively delete k from y. G deleted: P case 2c C L x-k T X A B D E J K N O Q R S U V Y Z y = y+k + z - k
  • 29.
    Deleting Keys -Distribution  Descending down the tree: if k not found in current node x, find the sub-tree ci[x] that has to contain k.  If ci[x] has only t – 1 keys take action to ensure that we descent to a node of size at least t.  We can encounter two cases.  If ci[x] has only t-1 keys, but a sibling with at least t keys, give ci[x] an extra key by moving a key from x to ci[x], moving a key from ci[x]’s immediate left and right sibling up into x, and moving the appropriate child from the sibling into ci[x] - distribution
  • 30.
    Deleting Keys –Distribution(2) x ... k ... x ... k’ ... ci[x] ... k’ ci[x] ... k A B A B C L P T X delete B ci[x] A B E J K N O Q R S U V Y Z sibling B deleted: E L P T X A C J K N O Q R S U V Y Z
  • 31.
    Deleting Keys -Merging  If ci[x] and both of ci[x]’s siblings have t – 1 keys, merge ci with one sibling, which involves moving a key from x down into the new merged node to become the median key for that node x ... l’ k m’... x ... l’ m’ ... ci[x] ... l m… ...l k m ... A B A B
  • 32.
    Deleting Keys –Merging (2) P delete D ci[x] C L sibling T X A B D E J K N O Q R S U V Y Z D deleted: C L P T X A B E J K N O Q R S U V Y Z tree shrinks in height
  • 33.
    Deletion: Running Time Most of the keys are in the leaf, thus deletion most often occurs there!  In this case deletion happens in one downward pass to the leaf level of the tree  Deletion from an internal node might require “backing up” (case 2)  Disk I/O: O(h), since only O(1) disk operations are produced during recursive calls  CPU: O(th) = O(t logtn)
  • 34.
    Two-pass Operations  Simpler, practical versions of algorithms use two passes (down and up the tree):  Down – Find the node where deletion or insertion should occur  Up – If needed, split, merge, or distribute; propagate splits, merges, or distributes up the tree  Toavoid reading the same nodes twice, use a buffer of nodes