Introduction
• Hashing is an important data structure designed to solve the problem of
efficiently finding and storing data in an array. For example, if you
have a list of 20000 numbers, and you have given a number to search
in that list- you will scan each number in the list until you find a match.
• Hashing is mainly used to overcome the drawbacks of linear sets
&binary sets.
• O(n)-time complexity of linear search
• O(logn)-time complexity for binary search
• O(1)-hashing
What is hashing?
• Hashing is a popular technique in computer science that involves mapping large data sets
to fixed-length values. It is a process of converting a data set of variable size into a data
set of a fixed size.
• It is also known as the message digest function.
• It is a technique that uniquely identifies a specific item from a collection of similar items.
• Hashing is a technique in which we can perform searching operation in a constant
computing time.
Static hashing
1.Hash tables
2.Hash functions
Hash tables
• Hash table is data structure(array) which contains fixed cells where
each cell contains key & value.
• It can be defined as a bucket where the data are stored in an
array format. These data have their own index value. If the
index values are known then the process of accessing the
data is quicker.
Hash function
• The mathematical function to be applied on keys to obtain
indexes for their corresponding values into the Hash Table.
• Hash Function can be defined as an algorithm or a function
that is used to map or convert data of bigger size or length
to a fixed or small index or hash value
• The hashing techniques in the data structure are very
interesting, such as:
• hash = hashfunc(key)
• index = hash % array_size
Hash functions
1.Division Method
2.Mid Square Method
3.Folding Method(pairing)
4.Multiplication Method
Division Method
• By performing module operation we have to insert record into a hash
table.
• Say that we have a Hash Table of size 'S', and we want to
store a (key, value) pair in the Hash Table. The Hash
Function, according to the Division method, would
be:H(key) = key mod M
• Here M is an integer value used for calculating the Hash
value, and M should be greater than S. Sometimes, S is used
as M.
Size of the Hash Table = 5 (M, S)
Key: Value pairs: {10: "Sudha", 11: "Venkat", 12: "Jeevani"}
For every pair:
{10: "Sudha"}
Key mod M = 10 mod 5 = 0
{11: "Venkat"}
Key mod M = 11 mod 5 = 1
{12: "Jeevani"}
Key mod M = 12 mod 5 = 2
Mid square method
• It is a two-step process of computing the Hash value. Given a {key:
value} pair, the Hash Function would be calculated by:
1.Square the key -> key * key
2.Choose some digits from the middle of the number to obtain the Hash
value.
• Suppose the size of the Hash Table is 10 and the key: value pairs are:
{10: "Sudha, 11: "Venkat", 12: "Jeevani"}
Number of digits to be selected: Indexes: (0 - 9), so 1
H(10) = 10 * 10 = 100 = 0
H(11) = 11 * 11 = 121 = 2
3. Folding Method
• Given a {key: value} pair and the table size is 100 (0 - 99
indexes), the key is broken down into 2 segments each
except the last segment. The last segment can have less
number of digits. Now, the Hash Function would be:
• H(x) = (sum of equal-
sized segments) mod (size of the Hash Table)
• For suppose "k" is a 10-digit key and the size of the table is
100(0 - 99), k is divided into:
sum = (k1k2) + (k3k4) + (k5k6) + (k7k8) + (k9k10)
Now, H(x) = sum % 100
Let us now take an example:
The {key: value} pairs: {1234: "Sudha", 5678: "Venkat"}
Size of the table: 100 (0 - 99)
For {1234: "Sudha"}:
1234 = 12 + 34 = 46
46 % 100 = 46
For {5678: "Venkat"}:
5678 = 56 + 78 = 134
134 % 100 = 34
Multiplication method
1.We must choose a constant between 0 and 1, say, A.
2.Multiply the key with the chosen A.
3.Now, take the fractional part from the product and multiply it by the
table size.
4.The Hash will be the floor (only the integer part) of the above result.
• So, the Hash Function under this method will be:
H(x) = floor(size(key*A mod 1))
For example:
{Key: value} pairs: {1234: "Sudha", 5678: "Venkat"}
Size of the table: 100
A = 0.56
For {1234: "Sudha"}:
H(1234) = floor(size(1234*0.56 mod 1))
= floor(100 * 0.04)
= floor(4) = 4
For {5678: "Venkat"}:
H(5678) = floor(size(5678*0.56 mod 1))
= floor(99 * 0.68)
= floor(67.32)
= 67
Collision in hashing
1.in this, the hash function is used to compute the index of the array.
2.The hash value is used to store the key in the hash table, as an index.
3.The hash function can return the same hash value for two or more keys.
4.When two or more keys are given the same hash value, it is called
a collision. To handle this collision, we use collision resolution techniques.
Collision resolution techniques
1.Separate chaining (open hashing)
2.Open addressing (closed hashing)->
1.Linear probing
2.Quadratic probing
3.Double hashing
Separate chaining technique
• Means maintaining linked lists for storing records
• In separate chaining technique, each bucket in hash table is associated with
a linked list or some other data structure that can store multiple elements.
• When a collision occurs, the colliding elements are added to the linked list
associated with that bucket.
• The key and its corresponding value are stored as a node in list
Separate chaining technique
Advantages
• There is no problem in overflow & collision.
Disadvantages
It requires more memory in order to store the data
Open Addressing
• In open addressing, when a collision occurs, the hash table’s slots
themselves are used to store the colliding elements. If a slot is already
occupied, the algorithm probes or searches for the next available slot
in a predetermined manner.
Linear probing
• In linear probing, if a collision occurs at a specific slot, the algorithm sequentially
searches for the next available (empty) slot by incrementing the index one by one
until it finds an empty slot.
• - The probing sequence is defined by the linear function:
• ‘hash(key) + i’
• where ‘hash(key)’ is the original hash value and ‘I’ is the probe number.
Drawbacks of linear probing
1.The main problem is primary clustering(most of the records stored in
a single cluster).
2.It takes too much time to find an empty slot.
Quadratic probing
• In this, when the collision occurs, we probe for i2th
slot in ith
iteration,
and this probing is performed until an empty slot is found.
• The cache performance in quadratic probing is lower than the linear
probing. Quadratic probing also reduces the problem of clustering.
• Drawbacks
Secondary clustering-whenever the half of the hash table is full it is
very difficult to find the location of the new element.
Double hashing
• H(K)=K%10
• H2(x) = P - (x%P), where P is a prime number smaller than N.
• It takes longer to determine two hash functions. The double probing
gives the very poor the cache performance, but there has no clustering
problem in it.
Dynamic hashing
• The drawback of static hashing is that it does not expand or shrink
dynamically as the size of the database grow or shrink.
• In dynamic hashing data buckets grow or shrinks(added or removed
dynamically) as records increases or decreases.
• Dynamic hashing is also called extended hashing.
Dynamic hashing/extended hashing
• It is a dynamic hashing method where in directories &buckets are used to
hash the data.
• It is aggressively flexible method in which the hash function also
experiences dynamic change.
• Directories store addresses of the buckets in pointer and id assigned to
each directory. Which may change each time when directory expansion
takes place.
• Buckets are used to hash the actual data.
Frequently used terms in Extendible Hashing:
• Directories: These containers store pointers to buckets. Each directory is given a unique id which may change
each time when expansion takes place.
• Buckets: They store the hashed keys. Directories point to buckets. A bucket may contain more than one pointers
to it if its local depth is less than the global depth.
• Global Depth: It is associated with the Directories. They denote the number of bits which are used by the hash
function to categorize the keys. Global Depth = Number of bits in directory id.
• Local Depth: It is the same as that of Global Depth except for the fact that Local Depth is associated with the
buckets and not the directories. Local depth in accordance with the global depth is used to decide the action that
to be performed in case an overflow occurs. Local Depth is always less than or equal to the Global Depth.
Frequently used terms in Extendible Hashing:
• Bucket Splitting: When the number of elements in a bucket exceeds a particular size, then the
bucket is split into two parts.
• Directory Expansion: Directory Expansion Takes place when a bucket overflows. Directory
Expansion is performed when the local depth of the overflowing bucket is equal to the global depth.
hashing in data structures and its applications

hashing in data structures and its applications

  • 1.
    Introduction • Hashing isan important data structure designed to solve the problem of efficiently finding and storing data in an array. For example, if you have a list of 20000 numbers, and you have given a number to search in that list- you will scan each number in the list until you find a match. • Hashing is mainly used to overcome the drawbacks of linear sets &binary sets. • O(n)-time complexity of linear search • O(logn)-time complexity for binary search • O(1)-hashing
  • 2.
    What is hashing? •Hashing is a popular technique in computer science that involves mapping large data sets to fixed-length values. It is a process of converting a data set of variable size into a data set of a fixed size. • It is also known as the message digest function. • It is a technique that uniquely identifies a specific item from a collection of similar items. • Hashing is a technique in which we can perform searching operation in a constant computing time.
  • 3.
  • 4.
    Hash tables • Hashtable is data structure(array) which contains fixed cells where each cell contains key & value. • It can be defined as a bucket where the data are stored in an array format. These data have their own index value. If the index values are known then the process of accessing the data is quicker.
  • 5.
    Hash function • Themathematical function to be applied on keys to obtain indexes for their corresponding values into the Hash Table. • Hash Function can be defined as an algorithm or a function that is used to map or convert data of bigger size or length to a fixed or small index or hash value • The hashing techniques in the data structure are very interesting, such as: • hash = hashfunc(key) • index = hash % array_size
  • 6.
    Hash functions 1.Division Method 2.MidSquare Method 3.Folding Method(pairing) 4.Multiplication Method
  • 7.
    Division Method • Byperforming module operation we have to insert record into a hash table. • Say that we have a Hash Table of size 'S', and we want to store a (key, value) pair in the Hash Table. The Hash Function, according to the Division method, would be:H(key) = key mod M • Here M is an integer value used for calculating the Hash value, and M should be greater than S. Sometimes, S is used as M.
  • 8.
    Size of theHash Table = 5 (M, S) Key: Value pairs: {10: "Sudha", 11: "Venkat", 12: "Jeevani"} For every pair: {10: "Sudha"} Key mod M = 10 mod 5 = 0 {11: "Venkat"} Key mod M = 11 mod 5 = 1 {12: "Jeevani"} Key mod M = 12 mod 5 = 2
  • 9.
    Mid square method •It is a two-step process of computing the Hash value. Given a {key: value} pair, the Hash Function would be calculated by: 1.Square the key -> key * key 2.Choose some digits from the middle of the number to obtain the Hash value. • Suppose the size of the Hash Table is 10 and the key: value pairs are: {10: "Sudha, 11: "Venkat", 12: "Jeevani"} Number of digits to be selected: Indexes: (0 - 9), so 1 H(10) = 10 * 10 = 100 = 0 H(11) = 11 * 11 = 121 = 2
  • 10.
    3. Folding Method •Given a {key: value} pair and the table size is 100 (0 - 99 indexes), the key is broken down into 2 segments each except the last segment. The last segment can have less number of digits. Now, the Hash Function would be: • H(x) = (sum of equal- sized segments) mod (size of the Hash Table) • For suppose "k" is a 10-digit key and the size of the table is 100(0 - 99), k is divided into: sum = (k1k2) + (k3k4) + (k5k6) + (k7k8) + (k9k10) Now, H(x) = sum % 100
  • 11.
    Let us nowtake an example: The {key: value} pairs: {1234: "Sudha", 5678: "Venkat"} Size of the table: 100 (0 - 99) For {1234: "Sudha"}: 1234 = 12 + 34 = 46 46 % 100 = 46 For {5678: "Venkat"}: 5678 = 56 + 78 = 134 134 % 100 = 34
  • 12.
    Multiplication method 1.We mustchoose a constant between 0 and 1, say, A. 2.Multiply the key with the chosen A. 3.Now, take the fractional part from the product and multiply it by the table size. 4.The Hash will be the floor (only the integer part) of the above result. • So, the Hash Function under this method will be: H(x) = floor(size(key*A mod 1))
  • 13.
    For example: {Key: value}pairs: {1234: "Sudha", 5678: "Venkat"} Size of the table: 100 A = 0.56 For {1234: "Sudha"}: H(1234) = floor(size(1234*0.56 mod 1)) = floor(100 * 0.04) = floor(4) = 4 For {5678: "Venkat"}: H(5678) = floor(size(5678*0.56 mod 1)) = floor(99 * 0.68) = floor(67.32) = 67
  • 14.
    Collision in hashing 1.inthis, the hash function is used to compute the index of the array. 2.The hash value is used to store the key in the hash table, as an index. 3.The hash function can return the same hash value for two or more keys. 4.When two or more keys are given the same hash value, it is called a collision. To handle this collision, we use collision resolution techniques.
  • 15.
    Collision resolution techniques 1.Separatechaining (open hashing) 2.Open addressing (closed hashing)-> 1.Linear probing 2.Quadratic probing 3.Double hashing
  • 16.
    Separate chaining technique •Means maintaining linked lists for storing records • In separate chaining technique, each bucket in hash table is associated with a linked list or some other data structure that can store multiple elements. • When a collision occurs, the colliding elements are added to the linked list associated with that bucket. • The key and its corresponding value are stored as a node in list
  • 17.
    Separate chaining technique Advantages •There is no problem in overflow & collision. Disadvantages It requires more memory in order to store the data
  • 18.
    Open Addressing • Inopen addressing, when a collision occurs, the hash table’s slots themselves are used to store the colliding elements. If a slot is already occupied, the algorithm probes or searches for the next available slot in a predetermined manner.
  • 19.
    Linear probing • Inlinear probing, if a collision occurs at a specific slot, the algorithm sequentially searches for the next available (empty) slot by incrementing the index one by one until it finds an empty slot. • - The probing sequence is defined by the linear function: • ‘hash(key) + i’ • where ‘hash(key)’ is the original hash value and ‘I’ is the probe number.
  • 20.
    Drawbacks of linearprobing 1.The main problem is primary clustering(most of the records stored in a single cluster). 2.It takes too much time to find an empty slot.
  • 21.
    Quadratic probing • Inthis, when the collision occurs, we probe for i2th slot in ith iteration, and this probing is performed until an empty slot is found. • The cache performance in quadratic probing is lower than the linear probing. Quadratic probing also reduces the problem of clustering. • Drawbacks Secondary clustering-whenever the half of the hash table is full it is very difficult to find the location of the new element.
  • 22.
    Double hashing • H(K)=K%10 •H2(x) = P - (x%P), where P is a prime number smaller than N. • It takes longer to determine two hash functions. The double probing gives the very poor the cache performance, but there has no clustering problem in it.
  • 23.
    Dynamic hashing • Thedrawback of static hashing is that it does not expand or shrink dynamically as the size of the database grow or shrink. • In dynamic hashing data buckets grow or shrinks(added or removed dynamically) as records increases or decreases. • Dynamic hashing is also called extended hashing.
  • 24.
    Dynamic hashing/extended hashing •It is a dynamic hashing method where in directories &buckets are used to hash the data. • It is aggressively flexible method in which the hash function also experiences dynamic change. • Directories store addresses of the buckets in pointer and id assigned to each directory. Which may change each time when directory expansion takes place. • Buckets are used to hash the actual data.
  • 27.
    Frequently used termsin Extendible Hashing: • Directories: These containers store pointers to buckets. Each directory is given a unique id which may change each time when expansion takes place. • Buckets: They store the hashed keys. Directories point to buckets. A bucket may contain more than one pointers to it if its local depth is less than the global depth. • Global Depth: It is associated with the Directories. They denote the number of bits which are used by the hash function to categorize the keys. Global Depth = Number of bits in directory id. • Local Depth: It is the same as that of Global Depth except for the fact that Local Depth is associated with the buckets and not the directories. Local depth in accordance with the global depth is used to decide the action that to be performed in case an overflow occurs. Local Depth is always less than or equal to the Global Depth.
  • 28.
    Frequently used termsin Extendible Hashing: • Bucket Splitting: When the number of elements in a bucket exceeds a particular size, then the bucket is split into two parts. • Directory Expansion: Directory Expansion Takes place when a bucket overflows. Directory Expansion is performed when the local depth of the overflowing bucket is equal to the global depth.