Lecture 3 - Data Structure File Organization

Data Structures
Unit III | L3: File Organization

Dr. Krishnendu Rarhi

Heap
• A heap is a complete binary tree data structure that satisfies the heap
property. In a min-heap, the value of every node is less than or equal to
the values of its children; in a max-heap, it is greater than or equal to
them. Heaps are usually used to implement priority queues, where the
smallest (or largest) element is always at the root of the tree.

Heap Properties
• The minimum or maximum element is always at the root of the heap,
allowing constant-time access.
• The relationship between a parent node at index ‘i’ and its children is
given by the formulas: left child at index 2i+1 and right child at index 2i+2,
for 0-based indexing of node numbers. Conversely, the parent of node i is
at index floor((i-1)/2).
• Because the tree is a complete binary tree, all levels are filled except
possibly the last level, and the last level is filled from left to right.
• When we insert an item, we insert it at the last available slot and then
rearrange the nodes so that the heap property is maintained.
• When we remove an item, we swap the root with the last node and remove it
from there, which ensures that the max (or min) item is the one removed.
Then we rearrange the remaining nodes to restore the heap property (max or min).

Heapify
• Heapify is the process of rearranging elements to maintain the
heap property. It is used when the root is removed (we replace
the root with the last node and then call heapify) and when a
heap is built (we call heapify on each node from the last
internal node up to the root). A single heapify call takes
O(log n) time.
• For a max-heap, it ensures that the maximum element is at the root
of the binary tree and that every subtree follows the same property.
• For a min-heap, it ensures that the minimum element is at the root
and that every subtree follows the same property.

Heap- Insertion
• Inserting a new element can violate the heap property. After
placing the element in the last available slot, we therefore
heapify: the element is repeatedly compared with its parent
and swapped upward while the two are out of order. This
operation takes O(log n) time.

Heap- Insertion
Assume the heap (a max-heap) is initially as follows:
        8
       / \
      4   5
     / \
    1   2
Now we insert 10 into the heap. It goes into the last available slot:
        8
       / \
      4   5
     / \  /
    1   2 10
After repeatedly comparing it with its parent node and
swapping where required, the final heap looks like this:
        10
       /  \
      4    8
     / \   /
    1   2 5
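
The insertion above can be traced with a short sketch. It assumes the heap is stored in a Python list using the 0-based index formulas from the Heap Properties slide; the function name max_heap_insert is hypothetical and chosen only for illustration.

```python
def max_heap_insert(heap, value):
    """Append value to a max-heap stored in a list, then sift it up."""
    heap.append(value)                 # place the new item in the last slot
    i = len(heap) - 1
    while i > 0:
        parent = (i - 1) // 2          # parent of node i (0-based indexing)
        if heap[parent] >= heap[i]:    # heap property holds, stop
            break
        heap[parent], heap[i] = heap[i], heap[parent]
        i = parent
    return heap


heap = [8, 4, 5, 1, 2]                 # the initial max-heap from the slide
max_heap_insert(heap, 10)
print(heap)                            # [10, 4, 8, 1, 2, 5]
```

The loop runs at most once per level of the tree, which is where the O(log n) bound comes from.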

Heap- Deletion
• Deleting from a heap always removes the root element of the
tree, which is then replaced with the last element of the
tree.
• Removing the root can violate the heap property, so we
heapify from the root to restore it.
• It takes O(log n) time.

Heap- Deletion
Assume the heap (a max-heap) is initially as follows:
        15
       /  \
      5    7
     / \
    2   3
Now we delete 15 from the heap. The root is temporarily
replaced by the last leaf node of the tree (3):
        3
       / \
      5   7
     /
    2
After the heapify operation, the final heap looks like this:
        7
       / \
      5   3
     /
    2

Heap- Operations
• getMax (for max-heap) or getMin (for min-heap):
• Returns the maximum element of a max-heap or the minimum
element of a min-heap. That element is always the root node
itself, so the operation takes O(1) time.
• removeMax or removeMin:
• Returns and deletes the maximum element of a max-heap or the
minimum element of a min-heap. In short, it deletes the root
element of the heap and then restores the heap property, which
takes O(log n) time.

Heap- Implementation
• maxHeapify is the function responsible for restoring the property of the
Max Heap. It rearranges the node i and its subtrees so that the
heap property is maintained.
1. Suppose we are given an array, arr[], representing the complete binary tree. The
left and the right child of the i-th node are at indices 2*i+1 and 2*i+2.
2. We set the index of the current element, i, as the ‘MAXIMUM’.
3. If the left child exists and arr[2 * i + 1] > arr[MAXIMUM], i.e., the left child is
larger than the current value, its index is set as ‘MAXIMUM’.
4. Similarly, if the right child exists and arr[2 * i + 2] > arr[MAXIMUM], i.e., the right
child is larger than the largest value found so far, its index is set as ‘MAXIMUM’.
5. If ‘MAXIMUM’ is no longer i, swap the element at ‘MAXIMUM’ with the current element.
6. Repeat steps 2 to 5 from the swapped position until the property of the heap is restored.
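
A minimal sketch of these steps, assuming a Python list and 0-based indices. Note that each child is compared with the largest value found so far, and the swap happens only when a child is actually larger. The helper remove_max is a hypothetical name added here to replay the deletion example (15, 5, 7, 2, 3) from the Heap- Deletion slide.

```python
def max_heapify(arr, n, i):
    """Sift arr[i] down within arr[:n] until the max-heap property holds."""
    maximum = i
    left, right = 2 * i + 1, 2 * i + 2
    if left < n and arr[left] > arr[maximum]:
        maximum = left
    if right < n and arr[right] > arr[maximum]:
        maximum = right
    if maximum != i:                      # a child is larger: swap and continue
        arr[i], arr[maximum] = arr[maximum], arr[i]
        max_heapify(arr, n, maximum)


def remove_max(heap):
    """Remove and return the root of a non-empty max-heap."""
    root = heap[0]
    last = heap.pop()                     # detach the last node
    if heap:
        heap[0] = last                    # move it to the root ...
        max_heapify(heap, len(heap), 0)   # ... and restore the heap property
    return root


heap = [15, 5, 7, 2, 3]
removed = remove_max(heap)
print(removed, heap)                      # 15 [7, 5, 3, 2]
```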

Heap- Advantages
• Time Efficient: Heaps have a worst-case time complexity of O(log n) for inserting and deleting
elements, making them efficient for large datasets. We can convert any array to a heap in
O(n) time. Most importantly, we can get the min or max in O(1) time.
• Space Efficient : A Heap tree is a complete binary tree, therefore can be stored in an array
without wastage of space.
• Dynamic: Heaps can be dynamically resized as elements are inserted or deleted, making
them suitable for dynamic applications that require adding or removing elements in real-
time.
• Priority-based: Heaps allow elements to be processed based on priority, making them
suitable for real-time applications, such as load balancing, medical applications, and stock
market analysis.
• In-place: Most of the applications of heap require in-place rearrangements of elements. For
example HeapSort.

Heap- Disadvantages
• Lack of flexibility: The heap data structure is not very flexible, as it is designed to
maintain a specific order of elements. This means that it may not be suitable for some
applications that require more flexible data structures.
• Not ideal for searching: While the heap data structure allows efficient access to the top
element, it is not ideal for searching for a specific element in the heap. Searching for an
element in a heap requires traversing the entire tree, which has a time complexity of
O(n).
• Not a stable data structure: The heap data structure is not a stable data structure, which
means that the relative order of equal elements may not be preserved when the heap is
constructed or modified.
• Complexity: Insertion and deletion each take O(log n) time, so passing n elements
through a heap (for example, to sort them) has a worst-case time complexity of O(n log n),
which may not be optimal for some applications that require faster algorithms.

Heap- Applications
• Priority Queues: Heaps are commonly used to implement priority queues, where
elements with higher priority are extracted first. This is useful in many applications such
as scheduling tasks, handling interruptions, and processing events.
• Sorting Algorithms: Heapsort, a comparison-based sorting algorithm, is implemented
using the Heap data structure. It has a time complexity of O(n log n), making it efficient
for large datasets.
• Graph algorithms: Heaps are used in graph algorithms such as Prim’s algorithm,
Dijkstra’s algorithm, and the A* search algorithm.
• Lossless Compression: Heaps are used in data compression algorithms such as Huffman
coding, which uses a priority queue implemented as a min-heap to build a Huffman tree.
• Medical Applications: In medical applications, heaps are used to store and manage
patient information based on priority, such as vital signs, treatments, and test results

Heap- Applications
• Load balancing: Heaps are used in load balancing algorithms to distribute tasks or
requests to servers, by processing elements with the lowest load first.
• Order statistics: The Heap data structure can be used to efficiently find the kth
smallest (or largest) element in an array.
• Resource allocation: Heaps can be used to efficiently allocate resources in a
system, such as memory blocks or CPU time, by assigning a priority to each
resource and processing requests in order of priority.
• Job scheduling: The heap data structure is used in job scheduling algorithms,
where tasks are scheduled based on their priority or deadline. The heap data
structure allows efficient access to the highest-priority task, making it a useful data
structure for job scheduling applications.

Heap- Comparison
| S.No | Heap | Tree |
|------|------|------|
| 1 | Heap is a kind of tree itself. | A tree is not a kind of heap. |
| 2 | Usually, a heap is of two types: Max-Heap and Min-Heap. | A tree can be of various types, e.g. binary tree, BST (binary search tree), AVL tree, etc. |
| 3 | Heap is ordered. | Binary tree is not ordered, but BST is ordered. |
| 4 | Insert and remove take O(log(N)) time in the worst case. | Insert and remove take O(N) in the worst case, when the tree is skewed. |
| 5 | Finding the min/max value is O(1) in the respective min/max heap. | Finding the min/max value is O(log(N)) in a balanced BST and O(N) in a binary tree. |
| 6 | A heap is the usual implementation of a priority queue. | A tree can also be described as a connected undirected graph with no cycle. |
| 7 | A heap can be built in linear time. | BST: O(N * log(N)); binary tree: O(N). |
| 8 | Applications: Prim’s algorithm and Dijkstra’s algorithm. | Applications: spanning trees, trie, B+ tree, BST, heap. |

Heap Sort
• Heap sort is a comparison-based sorting technique based on the binary
heap data structure. It can be seen as an optimization over selection
sort, where we first find the max (or min) element and swap it with
the last (or first), then repeat the same process for the remaining
elements. In heap sort, we use a binary heap so that we can find and
move the max element in O(log n) instead of O(n), and hence
achieve O(n log n) time complexity.

Algorithm
• First convert the array into a max heap using heapify. Note that this
happens in place: the array elements are rearranged to follow the heap property.
Then, one by one, swap the root node of the max heap with the last node,
shrink the heap, and heapify. Repeat this process while the size of the heap is greater than 1.
• Rearrange the array elements so that they form a max heap.
• Repeat the following steps until the heap contains only one element:
  • Swap the root element of the heap (which is the largest element in the current heap) with the
  last element of the heap.
  • Remove the last element of the heap (which is now in its correct position). We only
  reduce the heap size; the element is not removed from the actual array.
  • Heapify the remaining elements of the heap.
• Finally, we get the sorted array.

Algorithm
Step 1: Treat the Array as a Complete Binary Tree
• We first need to visualize the array as a complete binary tree. For an
array of size n, the root is at index 0, the left child of an element at
index i is at 2i + 1, and the right child is at 2i + 2.

Algorithm
Step 2: Build a Max Heap

Algorithm
Step 3: Sort the array by placing largest element at end of unsorted
array
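
The three steps can be put together in a short sketch. It assumes an in-place sort of a Python list in ascending order; the names sift_down and heap_sort and the sample data are illustrative, not taken from the slides.

```python
def sift_down(arr, n, i):
    """Iteratively move arr[i] down within arr[:n] to restore the max-heap."""
    while True:
        largest = i
        left, right = 2 * i + 1, 2 * i + 2     # Step 1: children of node i
        if left < n and arr[left] > arr[largest]:
            largest = left
        if right < n and arr[right] > arr[largest]:
            largest = right
        if largest == i:
            return
        arr[i], arr[largest] = arr[largest], arr[i]
        i = largest


def heap_sort(arr):
    """Sort arr in place in ascending order and return it."""
    n = len(arr)
    # Step 2: build a max heap, from the last internal node up to the root.
    for i in range(n // 2 - 1, -1, -1):
        sift_down(arr, n, i)
    # Step 3: move the current maximum to the end of the unsorted part,
    # shrink the heap by one, and restore the heap property at the root.
    for end in range(n - 1, 0, -1):
        arr[0], arr[end] = arr[end], arr[0]
        sift_down(arr, end, 0)
    return arr


data = [9, 4, 3, 8, 10, 2, 5]
print(heap_sort(data))                         # [2, 3, 4, 5, 8, 9, 10]
```

Building the heap costs O(n), and each of the n - 1 extractions costs O(log n), which gives the O(n log n) total.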

Hashing
• Hashing is a technique used in data structures that efficiently stores
and retrieves data in a way that allows for quick access. It involves
mapping data to a specific index in a hash table using a hash function
that enables fast retrieval of information based on its key. This
method is commonly used in databases, caching systems, and various
programming applications to optimize search and retrieval operations.
The great thing about hashing is, we can achieve all three operations
(search, insert and delete) in O(1) time on average.

Hashing- Need
• Array Efficiency: While arrays allow data to be stored at a known index
in constant time O(1), searching for a value takes O(log n) time in a
sorted array (binary search) and O(n) in an unsorted one. For large
datasets, this can be inefficient.
• Need for Improvement: Despite being useful, the inefficiency of
arrays in search operations led to the need for a more effective data
structure, especially for handling large volumes of data.
• Hashing Solution: Hashing offers a more efficient alternative by
allowing both storage and retrieval of data in constant time O(1),
greatly improving performance for large datasets.

Hashing- Components
• Key: A key can be anything, such as a string or an integer, that is fed
as input to the hash function, the technique that determines an index
or location for storing an item in a data structure.
• Hash Function: The hash function receives the input key and returns
the index of an element in an array called a hash table. The index is
known as the hash index.
• Hash Table: A hash table is a data structure that maps keys to values
using a special function called a hash function. It stores the data in
an associative manner in an array, where each data value is placed at
the index computed from its key.

Hashing- Working
• Suppose we have a set of strings {“ab”, “cd”, “efg”} and we would like
to store it in a table.
• Our main objective here is to search or update the values stored in
the table quickly, in O(1) time; we are not concerned about the
ordering of strings in the table. Each string in the set acts as the
key, and the string itself is also the value stored. But how do we
store the value corresponding to a key?

Hashing- Working
Step 1: We know that a hash function (which is some mathematical formula) is
used to calculate the hash value, which acts as the index in the data structure where
the value will be stored.
Step 2: So, let’s assign
“a” = 1,
“b”=2, .. etc, to all alphabetical characters.
Step 3: Therefore, the numerical value by summation of all characters of the string:
“ab” = 1 + 2 = 3,
“cd” = 3 + 4 = 7 ,
“efg” = 5 + 6 + 7 = 18

Hashing- Working
Step 4: Now, assume that we have a table of size 7 to store these
strings. The hash function used here is the sum of the characters
in the key mod the table size. We can compute the location of a string
in the array by taking sum(string) mod 7.
Step 5: So we will then store
“ab” in 3 mod 7 = 3,
“cd” in 7 mod 7 = 0, and
“efg” in 18 mod 7 = 4.
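
Steps 1 to 5 can be sketched as follows, assuming lowercase alphabetic keys only. The function name letter_sum_hash is hypothetical and used only for this illustration.

```python
def letter_sum_hash(key, table_size):
    """Sum the letter positions (a=1, b=2, ...) and reduce mod table_size."""
    total = sum(ord(ch) - ord("a") + 1 for ch in key)
    return total % table_size


table = [None] * 7                       # table of size 7, as in Step 4
for s in ["ab", "cd", "efg"]:
    table[letter_sum_hash(s, 7)] = s     # no collisions occur for these keys

print(table)   # ['cd', None, None, 'ab', 'efg', None, None]
```

Summing characters is easy to follow but collides often: any two strings with the same letters in a different order, such as "ab" and "ba", land in the same slot.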

Hash Function
• A hash function is a function that takes an input (or ‘message’) and
returns a fixed-size string of bytes. The output, typically a number, is
called the hash code or hash value. The main purpose of a hash
function is to efficiently map data of arbitrary size to fixed-size values,
which are often used as indexes in hash tables
• For example: Consider an array as a Map where the key is the index
and the value is the value at that index. So for an array A if we have
index i which will be treated as the key then we can find the value by
simply looking at the value at A[i]

Hash Function- Properties
• Deterministic: A hash function must consistently produce the same output for the same input.
• Fixed Output Size: The output of a hash function should have a fixed size, regardless of the size
of the input.
• Efficiency: The hash function should be able to process input quickly.
• Uniformity: The hash function should distribute the hash values uniformly across the output
space to avoid clustering.
• Pre-image Resistance (cryptographic hash functions): It should be computationally infeasible
to reverse the hash function, i.e., to find the original input given a hash value.
• Collision Resistance (cryptographic hash functions): It should be difficult to find two
different inputs that produce the same hash value.
• Avalanche Effect: A small change in the input should produce a significantly different hash
value.

Hash Function- Applications
• Hash Tables: The most common use of hash functions in DSA is in
hash tables, which provide an efficient way to store and retrieve data.
• Data Integrity: Hash functions are used to ensure the integrity of data
by generating checksums.
• Cryptography: Cryptographic applications rely on secure hash
functions such as SHA-256.
• Data Structures: Hash functions are utilized in various data structures
such as Bloom filters and hash sets.

Hash Function- Types
• Division Method
• Multiplication Method
• Mid-Square Method
• Folding Method
• Cryptographic Hash Functions
• Universal Hashing
• Perfect Hashing

Collision in Hashing
• A collision in hashing occurs when two different keys map to the same
hash value. For many hash algorithms, collisions can be created
intentionally. The probability of a hash collision depends on the size
of the hash output (or table), the distribution of hash values, and the
quality of the hash function.
• The hashing process generates a small number for a big key, so there
is a possibility that two keys could produce the same value. The
situation where a newly inserted key maps to an already occupied slot
is called a collision, and it must be handled using some collision
handling technique.

Collision Handling
• Separate Chaining: The idea is to make each cell of the hash table
point to a linked list of records that have the same hash
value. Chaining is simple but requires additional memory outside the
table.
• Example: We are given a hash function and have to insert some
elements into the hash table, using separate chaining as the
collision resolution technique.
Hash function = key % 5,
Elements = 12, 15, 22, 25 and 37.
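
The example can be worked through with a small sketch. As a simplifying assumption, each chain is a Python list rather than a linked list; the function name build_chained_table is illustrative.

```python
def build_chained_table(keys, size):
    """Insert integer keys into a table of chains using key % size."""
    table = [[] for _ in range(size)]    # one empty chain per slot
    for key in keys:
        table[key % size].append(key)    # colliding keys share a chain
    return table


table = build_chained_table([12, 15, 22, 25, 37], 5)
for index, chain in enumerate(table):
    print(index, chain)
# 0 [15, 25]
# 1 []
# 2 [12, 22, 37]
# 3 []
# 4 []
```

Keys 15 and 25 collide at index 0, and 12, 22 and 37 collide at index 2; each chain simply grows to hold them.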

Collision Handling
• Open Addressing: In open addressing, all elements are stored in the
hash table itself. Each table entry contains either a record or NIL.
When searching for an element, we examine the table slots one by
one until the desired element is found or it is clear that the element is
not in the table.
• Linear Probing: In linear probing, the hash table is searched sequentially,
starting from the original hash location. If the location that we get
is already occupied, we check the next location.

Collision Handling
• Linear Probing
• Calculate the hash key, i.e. key = data % size.
• If hashTable[key] is empty, store the value directly:
hashTable[key] = data.
• If the hash index already holds some value, check the
next index using key = (key+1) % size.
• If hashTable[key] at the next index is available, store the value there.
Otherwise try the next index.
• Repeat the above process until a free slot is found.
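
A minimal sketch of these steps, assuming integer data and None as the marker for an empty slot. The function name linear_probe_insert is hypothetical, and the full-table check is an addition not mentioned on the slide.

```python
def linear_probe_insert(hash_table, data):
    """Insert data using linear probing; return the slot that was used."""
    size = len(hash_table)
    key = data % size                    # initial hash index
    for _ in range(size):                # visit each slot at most once
        if hash_table[key] is None:
            hash_table[key] = data
            return key
        key = (key + 1) % size           # occupied: try the next index
    raise OverflowError("hash table is full")


hash_table = [None] * 5
for value in [12, 15, 22, 25, 37]:
    linear_probe_insert(hash_table, value)

print(hash_table)                        # [15, 25, 12, 22, 37]
```

Here 22 collides with 12 at index 2 and moves to index 3, 25 collides with 15 at index 0 and moves to index 1, and 37 probes indices 2 and 3 before settling at index 4.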

Collision Handling
• Quadratic Probing: Quadratic probing is an open addressing scheme
for resolving hash collisions in hash tables.
It operates by taking the original hash index and
adding successive values of an arbitrary quadratic polynomial until an
open slot is found.
• Example: H + 1², H + 2², H + 3², H + 4², ..., H + k²
• In this method we look for the i²-th probe (slot) in the i-th
iteration, for i = 0, 1, ..., n - 1. We always start from the original
hash location. Only if that location is occupied do we check the
other slots.

Collision Handling
• Quadratic Probing:
Let hash(x) be the slot index computed using the hash function and n
be the size of the hash table.
If the slot hash(x) % n is full, then we try (hash(x) + 1²) % n.
If (hash(x) + 1²) % n is also full, then we try (hash(x) + 2²) % n.
If (hash(x) + 2²) % n is also full, then we try (hash(x) + 3²) % n.
This process is repeated for increasing values of i until an empty slot is found.
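
The probe sequence can be sketched as below. It assumes hash(x) = x for integer keys and gives up after n probes, because quadratic probing is not guaranteed to visit every slot; both choices, along with the function name and sample keys, are assumptions made for illustration.

```python
def quadratic_probe_insert(table, x):
    """Insert x by probing (hash(x) + i*i) % n for i = 0, 1, ..., n - 1."""
    n = len(table)
    home = x % n                          # assumption: hash(x) is x itself
    for i in range(n):
        slot = (home + i * i) % n
        if table[slot] is None:
            table[slot] = x
            return slot
    raise OverflowError("no free slot found within n probes")


table = [None] * 7
for value in [22, 30, 50]:
    quadratic_probe_insert(table, value)

print(table)   # [None, 22, 30, None, None, 50, None]
```

Key 50 hashes to slot 1 (taken by 22), then tries slot 1 + 1² = 2 (taken by 30), and finally lands in slot 1 + 2² = 5.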

Collision Handling
• Double Hashing: Double hashing is a collision resolution technique in
open-addressed hash tables. It makes use of two hash
functions.
• The first hash function, h1(k), takes the key and gives a location in
the hash table. If that location is empty, we can simply place our key
there.
• If the location is occupied (a collision), we use a secondary hash
function h2(k) in combination with the first hash function h1(k) to find a new
location in the hash table.
• This combination of hash functions is of the form h(k, i) = (h1(k) + i * h2(k)) % n
where i is a non-negative integer that indicates the collision number, k is the element/key being
hashed, and n is the hash table size.
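
A minimal sketch of the formula h(k, i) = (h1(k) + i * h2(k)) % n. The slides do not specify h1 and h2, so the choices here, h1(k) = k % n and h2(k) = 1 + (k % (n - 1)), are assumptions; the second is one common way to make sure the step size is never zero. A prime table size is also assumed so that the probe sequence can reach every slot.

```python
def double_hash_insert(table, k):
    """Insert k using h(k, i) = (h1(k) + i * h2(k)) % n."""
    n = len(table)
    h1 = k % n                            # assumed primary hash function
    h2 = 1 + (k % (n - 1))                # assumed secondary hash, never 0
    for i in range(n):                    # i counts the collisions so far
        slot = (h1 + i * h2) % n
        if table[slot] is None:
            table[slot] = k
            return slot
    raise OverflowError("no free slot found within n probes")


table = [None] * 7                        # 7 is prime
for key in [10, 17, 24]:
    double_hash_insert(table, key)

print(table)   # [None, None, 17, 10, 24, None, None]
```

All three keys share h1 = 3, but their step sizes differ (17 steps by 6, 24 steps by 1), so they scatter instead of following the same probe path.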

File Organization
• A file is a collection of records. Using the primary key, we can access the records.
The type and frequency of access are determined by the type of file organization
used for a given set of records.
• File organization is a logical relationship among the various records. It defines
how file records are mapped onto disk blocks.
• File organization describes the way in which records are stored in
blocks, and how the blocks are placed on the storage medium.
• The first approach to mapping the database to files is to use several files and store
records of only one fixed length in any given file. An alternative approach is to structure
our files so that they can contain records of multiple lengths.
• Files of fixed-length records are easier to implement than files of variable-length
records.

File Organization- Objective
• Records should be selected optimally, i.e., as fast as
possible.
• Insert, delete, and update transactions on the records should
be quick and easy.
• Duplicate records should not be introduced as a result of an insert,
update, or delete.
• Records should be stored efficiently to minimize the cost of storage.

File Organization- Type

Sequential File Organization
Pile File Method:
• It is quite a simple method. In this method, we store the records in a
sequence, i.e., one after another. Records are stored in
the order in which they are inserted into the tables.
• To update or delete a record, the record is first
searched for in the memory blocks. When it is found, it is
marked for deletion, and the new record is inserted.

Pile File Method:
• Insertion of a new record: Suppose we have a sequence of records
R1, R3, and so on up to R9 and R8, where each record is simply
a row in a table. If we want to insert a new record R2 into
the sequence, it is placed at the end of the file.

Sorted File Method:
• In this method, a new record is always inserted at the end of the file,
and the sequence is then sorted in ascending or descending order.
Records are sorted on the primary key or any other key.
• When a record is modified, the record is updated and
the file is sorted again, so that the updated record ends up in the
right place.

Sorted File Method:
• Insertion of a new record: Suppose there is a preexisting sorted
sequence of records R1, R3, and so on up to R6 and R7. If a
new record R2 has to be inserted, it is first
placed at the end of the file, and the sequence is then sorted.
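
The difference between the two methods can be sketched with an in-memory list standing in for the file. Records are modelled as (key, name) tuples; the function names and this representation are assumptions for illustration, not part of the slides.

```python
def pile_insert(records, record):
    """Pile file: the new record simply goes at the end."""
    records.append(record)


def sorted_insert(records, record):
    """Sorted file: append at the end, then re-sort on the key."""
    records.append(record)
    records.sort(key=lambda r: r[0])


pile = [(1, "R1"), (3, "R3"), (9, "R9"), (8, "R8")]
pile_insert(pile, (2, "R2"))
print([name for _, name in pile])          # ['R1', 'R3', 'R9', 'R8', 'R2']

sorted_file = [(1, "R1"), (3, "R3"), (6, "R6"), (7, "R7")]
sorted_insert(sorted_file, (2, "R2"))
print([name for _, name in sorted_file])   # ['R1', 'R2', 'R3', 'R6', 'R7']
```

The pile insert is a cheap append, while the sorted insert pays for a sort on every insertion in exchange for keeping the file in key order.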

Advantages:
• It is a fast and efficient method for handling huge amounts of data.
• Files can easily be stored on cheaper storage
media such as magnetic tapes.
• It is simple in design and requires little effort to store the data.
• It is used when most of the records have to be accessed, as in
calculating student grades or generating salary slips.
• It is used for report generation and statistical calculations.

Disadvantages:
• It wastes time, because we cannot jump directly to a required
record; we have to move through the file sequentially.
• The sorted file method takes more time and space for sorting the records.

Indexed File Organization
• Indexed file organization stores the records sequentially, ordered by
the value of the RECORD-KEY (generally in ascending order). A
RECORD-KEY in an indexed file is a field that must be part of the
record/data. For an indexed file, two types of files are created:
1. Data file: It consists of the records in sequential order.
2. Index file: It consists of each RECORD-KEY and the address of the
corresponding record in the data file.
• An indexed file can be accessed sequentially, in the same way as in sequential
file organization, and also randomly, but only if the RECORD-KEY is known.
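
A rough sketch of the two-file idea, using a Python list for the data file and a dictionary for the index file. The record contents, the use of list positions as addresses, and the function names are all hypothetical.

```python
# Data file: records kept in ascending RECORD-KEY order.
data_file = [(101, "record A"), (205, "record B"), (310, "record C")]

# Index file: RECORD-KEY -> address (here, the position in data_file).
index_file = {key: address for address, (key, _) in enumerate(data_file)}


def read_random(record_key):
    """Random access: look up the address in the index, then fetch."""
    address = index_file[record_key]
    return data_file[address]


def read_sequential():
    """Sequential access: walk the data file from start to end."""
    return [record for record in data_file]


print(read_random(205))        # (205, 'record B')
print(read_sequential()[0])    # (101, 'record A')
```

Random access fails with a KeyError when the key is not in the index, which mirrors the rule that the RECORD-KEY must be known.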

Relative File Organization
• Relative file organization stores records on the basis of their
relative address. Each record is identified by its Relative Record
Number, which is the position of the record from
the beginning of the file. These records can be accessed sequentially,
in the same way as in sequential file organization, and also randomly. To
access records randomly, the user must specify the relative record number.
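
With fixed-length records, a relative record number translates directly into a byte offset. The sketch below assumes 8-byte ASCII records, record numbers starting at 1, and an in-memory io.BytesIO buffer standing in for a file on disk; all of these are illustrative assumptions.

```python
import io

RECORD_SIZE = 8   # assumption: every record occupies exactly 8 bytes


def write_record(f, rrn, text):
    """Store text as the record with relative record number rrn (from 1)."""
    f.seek((rrn - 1) * RECORD_SIZE)
    f.write(text.encode("ascii").ljust(RECORD_SIZE)[:RECORD_SIZE])


def read_record(f, rrn):
    """Jump straight to record rrn without reading the records before it."""
    f.seek((rrn - 1) * RECORD_SIZE)
    return f.read(RECORD_SIZE).decode("ascii").rstrip()


f = io.BytesIO()
for rrn, text in enumerate(["alpha", "beta", "gamma"], start=1):
    write_record(f, rrn, text)

print(read_record(f, 2))   # beta
```

Because the offset is computed as (rrn - 1) * RECORD_SIZE, random access costs one seek regardless of how many records the file holds.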
