Data Deduplication: Venti and its improvements

Umair Amjad
12-5044
umairamjadawan@gmail.com
Department of Computer Science, National University of Computer and Emerging Sciences, Pakistan
Abstract
The entire world is adopting digital technologies and moving from legacy approaches to digital ones. Data
is now available in digital form everywhere. To store this massive volume of data, storage systems must be
efficient and intelligent enough to detect redundant data and avoid storing it. Data deduplication
techniques are widely used by storage servers to avoid storing multiple copies of the same data.
Deduplication identifies duplicate portions of data that are about to be stored and also removes
duplication from data already held in storage systems, yielding significant cost savings. This paper
discusses data deduplication, taking Venti as a base case and describing it in detail, and identifies
areas for improvement in Venti that are addressed by other papers.
Keywords – Data deduplication; data storage; hash index; Venti; archival data
1. Introduction
The world is producing a large amount of digital data, and that amount is growing
rapidly. According to one study, the information added to the digital universe each
year is growing by 57% annually. This growth is placing a considerable load on storage
systems. Thirty-five percent of this information is generated by enterprises and must
therefore be retained for regulatory compliance and legal reasons. It is thus critical
to back up the data regularly to a disaster recovery site to ensure data availability
and integrity. Rapidly growing data poses many challenges to existing storage systems.
One observation is that a significant fraction of information contains duplicates, due
to backups, copies, and version updates. Deduplication techniques have therefore been
invented to avoid storing redundant information.
A number of trends have motivated the
creation of deduplication solutions. Archival
systems such as Venti have identified significant
information redundancy within and across
machines due to updated versions and commonly
installed applications and libraries. In addition to
storage overhead, duplicate file content can also
have other negative effects on the system. As files
are accessed, they are cached in memory and in
the hard disk cache. Duplicate content can
consume unnecessary memory cache that could
be used to cache additional unique content.
Deduplication solves these issues by locating
identical content and handling it appropriately.
Instead of storing the same file content multiple
times, we can have a new file that references the
identical content already stored in the system. The
use of deduplication results in more efficient use of
both memory cache and storage capacity.
This paper takes Venti as the base case for data deduplication and identifies the areas
it leaves unaddressed. For each of these areas, a solution is then presented with
reference to other research papers.
2. Background
In storage archives, a large quantity of data is redundant or differs only slightly
from another chunk of data. The term data deduplication refers to techniques that save
only a single instance of replicated data and provide links to that instance in place
of storing further copies. Many techniques exist for eliminating redundancy from stored
data, and data deduplication has gained popularity in the research community. Data
deduplication is a specialized data compression technique for eliminating redundant
data, typically to improve storage utilization. In the deduplication process, redundant
data is detected and not stored.
With the evolution of backup services from tape to disk, data deduplication has become
a key element of the backup process. It ensures that only one copy of a piece of data
is saved in the data center. Every user who wants to access that data is linked to this
single instance. Data deduplication therefore helps to decrease the size of the data
center. In other words, deduplication means that the number of replicas of data that
would usually be duplicated in the cloud is controlled and managed, in order to shrink
the physical storage space required for them. The basic steps of deduplication are:
1. Files are divided into small segments.
2. New and existing data are checked for similarity by comparing fingerprints
created by a hashing algorithm.
3. Metadata structures are updated.
4. Segments are compressed.
5. All duplicate data is deleted and a data integrity check is performed.
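The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline of any particular product: the names `SEGMENT_SIZE`, `deduplicate`, and `restore` are hypothetical, the store is an in-memory dictionary, and duplicates are simply never stored rather than deleted afterwards.

```python
import hashlib
import zlib

SEGMENT_SIZE = 4096  # assumed segment size in bytes, chosen for illustration


def deduplicate(data: bytes, store: dict) -> list:
    """Store `data` in `store` and return the fingerprints needed to rebuild it."""
    recipe = []
    for offset in range(0, len(data), SEGMENT_SIZE):
        segment = data[offset:offset + SEGMENT_SIZE]      # step 1: segment
        fingerprint = hashlib.sha1(segment).hexdigest()   # step 2: fingerprint
        if fingerprint not in store:                      # only new data is kept
            store[fingerprint] = zlib.compress(segment)   # step 4: compress
        recipe.append(fingerprint)                        # step 3: metadata
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original bytes, checking each segment's integrity (step 5)."""
    parts = []
    for fingerprint in recipe:
        segment = zlib.decompress(store[fingerprint])
        if hashlib.sha1(segment).hexdigest() != fingerprint:
            raise ValueError("segment failed integrity check")
        parts.append(segment)
    return b"".join(parts)
```

Writing three segments of which two are identical leaves only two entries in the store, while the returned recipe still lists three fingerprints.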
2.1 Types of Data Deduplication
There are two major categories of data
deduplication on which all research is based.
1. Offline data deduplication (target based): In offline deduplication, data is first
written to the storage disk and the deduplication process takes place at a later time.
It is performed at the target storage system. The client is unmodified and is not aware
of any deduplication. This approach improves storage utilization, and no one needs to
wait for hash calculations, but it does not save bandwidth.
2. Online data deduplication (source based): In online deduplication, duplicate data is
eliminated before being written to the storage disk. It is performed on the data at the
source, before it is transferred. A deduplication-aware backup agent installed on the
client backs up only unique data. The result is improved bandwidth and storage
efficiency, at the cost of extra computational load on the backup client. Duplicates
are replaced by pointers, and the actual duplicate data is never sent over the network.
Once the timing of deduplication has been decided, a number of existing techniques can
be applied. The most widely used deduplication approaches are file-level hashing and
block-level hashing.
1. File-level hashing: In file-level hashing, the whole file is passed to a hashing
function, typically a cryptographic hash such as MD5 or SHA-1. The resulting hash is
used to find files that are entirely duplicated. This approach is fast, with low
computation and low additional metadata overhead. It works very well for complete
system backups, where fully duplicated files are common. However, the coarse
granularity of matching prevents it from matching two files that differ by only a
single byte or bit of data.
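A minimal sketch of file-level hashing follows, assuming file contents are already in memory as a mapping from name to bytes. The helper names `file_fingerprint` and `find_duplicate_files` are hypothetical and chosen only for this illustration.

```python
import hashlib


def file_fingerprint(content: bytes) -> str:
    """Return the SHA-1 digest of an entire file's content."""
    return hashlib.sha1(content).hexdigest()


def find_duplicate_files(files: dict) -> dict:
    """Group file names by whole-file fingerprint; keep groups with several names."""
    groups = {}
    for name, content in files.items():
        groups.setdefault(file_fingerprint(content), []).append(name)
    return {fp: names for fp, names in groups.items() if len(names) > 1}


files = {
    "a.txt": b"hello world",
    "b.txt": b"hello world",   # exact copy of a.txt
    "c.txt": b"hello worle",   # differs from a.txt in a single byte
}
duplicates = find_duplicate_files(files)
```

Here `a.txt` and `b.txt` fall into one group, while `c.txt` is treated as entirely new data even though it differs by one byte, which is the limitation described above.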
2. Block-level hashing: The file is broken into a number of smaller sections before
deduplication. The number of sections depends on the approach being used. The two most
common types of block-level hashing are fixed-size chunking and variable-length
chunking. In fixed-size chunking, a file is divided into a number of fixed-size pieces
called chunks. In variable-length chunking, a file is broken up into chunks of variable
length. Each chunk is passed to a cryptographic hash function (usually MD5 or SHA-1) to
obtain the chunk identifier, which is used to locate duplicate data.
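Fixed-size chunking can be sketched as follows. The chunk size of 8 bytes is an assumption made only to keep the example small, and `fixed_size_chunks` and `chunk_ids` are hypothetical helper names.

```python
import hashlib


def fixed_size_chunks(data: bytes, size: int = 8) -> list:
    """Split data into consecutive chunks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_ids(data: bytes, size: int = 8) -> list:
    """Return the SHA-1 chunk identifier of every fixed-size chunk."""
    return [hashlib.sha1(chunk).hexdigest() for chunk in fixed_size_chunks(data, size)]


original = b"AAAAAAAABBBBBBBBCCCCCCCC"
edited = b"AAAAAAAABBBBBBBBDDDDDDDD"   # only the last 8 bytes were overwritten

shared = set(chunk_ids(original)) & set(chunk_ids(edited))
```

Because the edit overwrites data in place, two of the three chunk identifiers are shared and only the modified chunk would need to be stored again.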

Under file-level deduplication, a change inside a file forces the entire file to be
stored again. Presentations and other documents may need only a simple change, such as
updating a page to show a new report or new dates, yet this leads to the entire
document being stored again. Block-level deduplication stores only one version of the
document plus the parts that changed between versions. File-level technology generally
achieves a compression ratio of less than 5:1, while block-level technology can reduce
stored data by 20:1 or even 50:1.
2.2 Methodologies of Deduplication
At present, research on deduplication focuses on two aspects. One is to remove as much
duplicate data as possible and so reduce the storage capacity required. The other is to
improve the efficiency of the resources needed to achieve this. Most traditional backup
systems use file-level deduplication. However, deduplication technology can exploit
inter-file and intra-file redundancy to eliminate duplicate or similar data at block or
byte granularity. Some architectures follow the source deduplication strategy, but with
this approach the user faces a delay in sending data to the backup store. The remaining
architectures, which support target deduplication, provide single-system deduplication:
at the target side a single system (server) handles all user requests to store data and
maintains the hash index for all the disks attached to it.
Venti: Venti is a network storage system. It uses hash values to identify block
contents, which reduces the storage space occupied by data. Venti builds blocks for
large storage applications and enforces a write-once policy to avoid data collisions.
Because it emerged in the early stages of network storage, it is not well suited to
dealing with vast amounts of data, and the system is not scalable.
3. Venti as a base case
The key idea behind Venti is to identify data blocks by a hash of their contents,
called a fingerprint in this paper. The fingerprint is the source of all of Venti's
main benefits. Because blocks are addressed by the fingerprint of their contents, a
block cannot be modified without changing its address (write-once behavior). Writes are
idempotent, since multiple writes of the same data can be coalesced and do not require
additional storage. Multiple clients can share data blocks on a Venti server without
cooperating or coordinating.
Inherent integrity checking of data is ensured: when a block is retrieved, both the
client and the server can compute the fingerprint of the data and compare it to the
requested fingerprint. Features such as replication, caching, and load balancing are
also facilitated; because the contents of a particular block are immutable, the problem
of data coherency is greatly reduced. The main challenge of the work, on the other
hand, is also brought about by hashing. The design of Venti requires a hash function
that generates a unique fingerprint for every data block that a client may want to
store. Venti employs a cryptographic hash function, SHA-1, for which finding two
distinct inputs that hash to the same value was considered computationally infeasible.
(No SHA-1 collisions were known when Venti was published; a practical collision was
first demonstrated in 2017.) As to the choice of storage technology, the authors make a
reasonable argument for using magnetic disks by comparing the prices and performance of
disks and optical storage systems.
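The write-once, idempotent, self-verifying behavior described above can be illustrated with a toy content-addressed store. This is a sketch under simplifying assumptions (an in-memory dictionary instead of a disk log); the class name `BlockStore` and its methods are hypothetical and are not Venti's actual interface.

```python
import hashlib


class BlockStore:
    """Toy in-memory, write-once block store addressed by SHA-1 fingerprint."""

    def __init__(self):
        self._blocks = {}

    def write(self, data: bytes) -> bytes:
        """Store a block and return its fingerprint; repeated writes are no-ops."""
        fingerprint = hashlib.sha1(data).digest()
        self._blocks.setdefault(fingerprint, data)
        return fingerprint

    def read(self, fingerprint: bytes) -> bytes:
        """Return the block, verifying that it still matches its fingerprint."""
        data = self._blocks[fingerprint]
        if hashlib.sha1(data).digest() != fingerprint:
            raise ValueError("block does not match the requested fingerprint")
        return data

    def __len__(self) -> int:
        return len(self._blocks)
```

Two clients that write the same bytes receive the same fingerprint and consume the space of a single block, without needing to coordinate with each other.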

Venti stores blocks in an append-only log, which is divided into self-contained
sections called arenas. Each block is prefixed by a header that describes the contents
of the block. The primary purpose of the header is to provide integrity checking during
normal operation and to assist in data recovery. The header includes a magic number,
the fingerprint and size of the block, the time when the block was first written, and
the identity of the user that wrote it. The header also includes a user-supplied type
identifier, which is explained in Section 7 of the Venti paper. Note that only one copy
of a given block is stored in the log, so the user and time fields correspond to the
first time the block was stored on the server. The encoding field in the block header
indicates whether the data was compressed and, if so, the algorithm used. The e-size
field indicates the size of the data after compression, enabling the location of the
next block in the arena to be determined.
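The role of the header fields can be illustrated with a small sketch. The field order, the field widths in `HEADER_FORMAT`, the `MAGIC` value, and the encoding codes are all assumptions invented for this example; they are not Venti's actual on-disk layout. The point is only to show how the e-size field lets a reader step from one block to the next.

```python
import hashlib
import struct
import time
import zlib
from collections import namedtuple

# Hypothetical layout: magic, fingerprint, type, size, write time, user, encoding, e-size.
HEADER_FORMAT = ">I20sBHI32sBH"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
MAGIC = 0x56454E54             # arbitrary marker chosen for this sketch
ENC_NONE, ENC_ZLIB = 0, 1      # assumed encoding codes

BlockHeader = namedtuple(
    "BlockHeader", "magic fingerprint type size wtime user encoding esize")


def pack_block(data: bytes, block_type: int, user: str, compress: bool = True) -> bytes:
    """Return header + payload for one block (data must be under 64 KiB here)."""
    payload = zlib.compress(data) if compress else data
    encoding = ENC_ZLIB if compress else ENC_NONE
    header = struct.pack(
        HEADER_FORMAT, MAGIC, hashlib.sha1(data).digest(), block_type,
        len(data), int(time.time()), user.encode(), encoding, len(payload))
    return header + payload


def next_block_offset(log: bytes, offset: int) -> int:
    """Use the e-size field to find where the following block starts."""
    header = BlockHeader._make(struct.unpack_from(HEADER_FORMAT, log, offset))
    if header.magic != MAGIC:
        raise ValueError("no block header at this offset")
    return offset + HEADER_SIZE + header.esize
```

Walking a log built from two packed blocks with `next_block_offset` lands exactly on the end of the log after two steps.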
In addition to a log of data blocks, an
arena includes a header, a directory, and a trailer.
The header identifies the arena. The directory
contains a copy of the block header and offset for
every block in the arena. By replicating the
headers of all the blocks in one relatively small part
of the arena, the server can rapidly check or
rebuild the system's global block index. The
directory also facilitates error recovery if part of the
arena is destroyed or corrupted. The trailer
summarizes the current state of the arena itself,
including the number of blocks and the size of the
log. Within the arena, the data log and the directory
start at opposite ends and grow towards each
other. When the arena is filled, it is marked as
sealed, and a fingerprint is computed for the
contents of the entire arena. Sealed arenas are
never modified.
The basic operation of Venti is to store and
retrieve blocks based on their fingerprints. A
fingerprint is 160 bits long, and the number of
possible fingerprints far exceeds the number of
blocks stored on a server. The disparity between
the number of fingerprints and blocks means it is
impractical to map the fingerprint directly to a
location on a storage device. Instead, Venti uses an
index to locate a block within the log. The index is
implemented using a disk-resident hash table. The
index is divided into fixed-sized buckets, each of
which is stored as a single disk block. Each bucket
contains the index map for a small section of the
fingerprint space. A hash function is used to map
fingerprints to index buckets in a roughly uniform
manner, and then the bucket is examined using
binary search. This structure is simple and
efficient, requiring one disk access to locate a
block in almost all cases.
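A rough in-memory model of this index is shown below. The bucket count, the use of the leading fingerprint bytes as the bucket hash, and the class name `Index` are assumptions for illustration; in Venti each bucket would be a single disk block rather than a Python list.

```python
import bisect
import hashlib

NUM_BUCKETS = 1024  # assumed number of fixed-size buckets


class Index:
    """Maps fingerprints to log offsets using hashed buckets and binary search."""

    def __init__(self, num_buckets: int = NUM_BUCKETS):
        self._buckets = [[] for _ in range(num_buckets)]

    def _bucket_for(self, fingerprint: bytes) -> list:
        # Fingerprints are already close to uniform, so their leading bytes
        # serve as the bucket hash in this sketch.
        return self._buckets[int.from_bytes(fingerprint[:4], "big") % len(self._buckets)]

    def insert(self, fingerprint: bytes, offset: int) -> None:
        bucket = self._bucket_for(fingerprint)
        pos = bisect.bisect_left(bucket, (fingerprint,))
        if pos < len(bucket) and bucket[pos][0] == fingerprint:
            return  # block is already indexed (write-once)
        bucket.insert(pos, (fingerprint, offset))  # keep the bucket sorted

    def lookup(self, fingerprint: bytes):
        """Return the log offset for a fingerprint, or None if it is unknown."""
        bucket = self._bucket_for(fingerprint)
        pos = bisect.bisect_left(bucket, (fingerprint,))  # binary search in bucket
        if pos < len(bucket) and bucket[pos][0] == fingerprint:
            return bucket[pos][1]
        return None
```

Only one bucket is touched per lookup, which mirrors the single disk access described above.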
Three applications (Vac, physical backup, and use with the Plan 9 file system) are
demonstrated to show the effectiveness of Venti. In addition to the Venti prototype, a
collection of tools for integrity checking and error recovery was built. The authors
also give preliminary performance results for read and write operations with the
prototype. By using disks, they show an access time for archival data that is
comparable to that of non-archival data. However, they also identify the main problem:
uncached sequential read performance is particularly poor, because each block in a
sequential read requires a random read of the index. They point out one possible
solution: read-ahead.
4. Improvements in Venti
This section discusses three aspects of the Venti design that require improvement.

4.1 Hash Collisions:
'A Comparison Study of Deduplication Implementations with Small-Scale Workloads'
addresses Venti's exposure to hash collisions. The design of Venti requires a hash
function that generates a unique fingerprint for every data block that a client may
want to store. For a server of a given capacity, the likelihood that two different
blocks will have the same hash value, known as a collision, can be determined. Although
the probability of two different blocks producing identical keys is extremely low, the
Small-Scale Workloads study guards against it by using two hash functions, SHA-256 and
MD5, simultaneously. Each hash function maps to one of two hash tables.
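One way to read the two-table idea is sketched below: a block is treated as a duplicate only when both digests point to the same stored block. How the cited study actually combines its two tables is not described here, so the class `DualHashStore` and its policy are assumptions made for illustration.

```python
import hashlib


class DualHashStore:
    """Toy store that accepts a duplicate only if SHA-256 and MD5 both agree."""

    def __init__(self):
        self._by_sha256 = {}   # SHA-256 digest -> set of block ids
        self._by_md5 = {}      # MD5 digest -> set of block ids
        self._blocks = []      # block id -> data

    def write(self, data: bytes) -> int:
        """Store the block if it is new and return its block id."""
        sha_ids = self._by_sha256.setdefault(hashlib.sha256(data).digest(), set())
        md5_ids = self._by_md5.setdefault(hashlib.md5(data).digest(), set())
        common = sha_ids & md5_ids
        if common:
            # Both tables agree on an existing block: treat it as a duplicate.
            return next(iter(common))
        # New block, or a collision under only one of the two hash functions.
        block_id = len(self._blocks)
        self._blocks.append(data)
        sha_ids.add(block_id)
        md5_ids.add(block_id)
        return block_id

    def block_count(self) -> int:
        return len(self._blocks)
```

A false duplicate would now require the same pair of blocks to collide under both functions at once.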
4.2 Fixed-size chunking:
With fixed-size chunking, inserting or deleting even a single byte shifts every
subsequent chunk boundary, so the chunks after the edit no longer match previously
stored ones. 'A Low-bandwidth Network File System' (LBFS) addresses this problem by
considering only non-overlapping chunks of files and avoiding sensitivity to shifting
file offsets: it sets chunk boundaries based on file contents rather than on position
within a file. Insertions and deletions therefore affect only the surrounding chunks.
To divide a file into chunks, LBFS examines every (overlapping) 48-byte region of the
file and, with probability 2^-13 over each region's contents, considers it to be the
end of a data chunk. LBFS selects these boundary regions, called breakpoints, using
Rabin fingerprints. The LBFS paper illustrates with a figure how a file might be
divided and what happens to chunk boundaries after a series of edits:
1. The original file is divided into variable-length chunks, with breakpoints
determined by a hash of each 48-byte region.
2. Some text is inserted into the file. The text is inserted in chunk c4, producing a
new, larger chunk c8. However, all other chunks remain the same. Thus, one need only
send c8 to transfer the new file to a recipient that already has the old version.
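Content-defined chunking can be sketched as follows. As a simplification, a polynomial rolling hash stands in for Rabin fingerprints, and the 6-bit mask and break value are arbitrary choices that keep chunks short on toy input. The names `chunk_boundaries` and `chunks` are hypothetical, and no minimum or maximum chunk size is enforced.

```python
import random

WINDOW = 48                   # bytes per examined region, as in the text
BASE = 257                    # assumed base of the polynomial rolling hash
MOD = (1 << 61) - 1           # assumed modulus (a Mersenne prime)
MASK = (1 << 6) - 1           # 6 low bits: a breakpoint about every 64 bytes
BREAK_VALUE = 0x2A            # arbitrary value the masked hash must equal


def chunk_boundaries(data: bytes) -> list:
    """Return the end offset of every chunk, chosen from the content itself."""
    if len(data) < WINDOW:
        return [len(data)] if data else []
    top = pow(BASE, WINDOW - 1, MOD)
    h = 0
    for byte in data[:WINDOW]:
        h = (h * BASE + byte) % MOD
    boundaries = []
    for end in range(WINDOW, len(data) + 1):
        # h is the hash of data[end - WINDOW:end] at this point.
        if h & MASK == BREAK_VALUE:
            boundaries.append(end)
        if end < len(data):
            # Slide the window: drop the oldest byte, append the next one.
            h = ((h - data[end - WINDOW] * top) * BASE + data[end]) % MOD
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))
    return boundaries


def chunks(data: bytes) -> list:
    """Split data at the content-defined boundaries."""
    result, start = [], 0
    for end in chunk_boundaries(data):
        result.append(data[start:end])
        start = end
    return result


rng = random.Random(1234)
original = bytes(rng.getrandbits(8) for _ in range(4000))
edited = original[:2000] + b"inserted text" + original[2000:]
edited_chunks = set(chunks(edited))
unchanged = [c for c in chunks(original) if c in edited_chunks]
```

Because each boundary depends only on the 48 bytes before it, chunks that lie entirely before the insertion, or start at least 48 bytes after it, are reproduced unchanged in the edited file.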
4.3 Better Access Control:
'A Low-bandwidth Network File System' uses an RPC library that supports authenticating
and encrypting traffic between a client and server, to which the LBFS authors added
support for compression. The entire LBFS protocol, RPC headers and all, is passed
through gzip compression, tagged with a message authentication code, and then
encrypted. At mount time, the client and server negotiate a session key, the server
authenticates itself to the user, and the user authenticates herself to the client, all
using public key cryptography. The client and server communicate over TCP using Sun
RPC.
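The compress-then-authenticate part of this pipeline can be sketched with the Python standard library. This is not LBFS's wire format: HMAC-SHA-256 is an assumed choice of message authentication code, the session key is taken as given, the function names `seal` and `open_sealed` are hypothetical, and the final encryption step is omitted entirely.

```python
import gzip
import hashlib
import hmac

TAG_SIZE = hashlib.sha256().digest_size  # length of the authentication tag


def seal(message: bytes, key: bytes) -> bytes:
    """Compress a message and append a MAC computed over the compressed bytes."""
    compressed = gzip.compress(message)
    tag = hmac.new(key, compressed, hashlib.sha256).digest()
    return compressed + tag


def open_sealed(packet: bytes, key: bytes) -> bytes:
    """Verify the MAC, then decompress; raise ValueError if the check fails."""
    compressed, tag = packet[:-TAG_SIZE], packet[-TAG_SIZE:]
    expected = hmac.new(key, compressed, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("message authentication failed")
    return gzip.decompress(compressed)
```

A receiver holding the same session key recovers the original message, while a packet altered in transit is rejected before it is decompressed.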
'POTSHARDS: Secure Long-Term Archival Storage Without Encryption' uses secret splitting
and approximate pointers to move security from encryption to authentication and to
avoid reliance on encryption algorithms that may be compromised at some point in the
future. Unlike encryption, secret splitting provides information-theoretic security. In
addition, each user maintains a separate, recoverable index over her data, so a
compromised index does not affect other users and a lost index is not equivalent to
data deletion. More importantly, if a user loses her index, both the index and the data
itself can be securely reconstructed from the user's shares stored across multiple
archives.
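The idea of secret splitting can be illustrated with the simplest scheme, an XOR-based split in which all n shares are required to recover the data. This is a generic sketch, not a description of the exact scheme POTSHARDS uses; `split_secret` and `combine_shares` are hypothetical names.

```python
import secrets


def xor_bytes(left: bytes, right: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(left, right))


def split_secret(secret: bytes, n: int) -> list:
    """Split `secret` into n shares; every share is needed to rebuild it."""
    if n < 2:
        raise ValueError("need at least two shares")
    # n - 1 shares are pure random bytes of the same length as the secret.
    shares = [secrets.token_bytes(len(secret)) for _ in range(n - 1)]
    # The last share is the secret XORed with all of the random shares.
    last = secret
    for share in shares:
        last = xor_bytes(last, share)
    return shares + [last]


def combine_shares(shares: list) -> bytes:
    """Recover the secret by XORing all shares together."""
    result = bytes(len(shares[0]))
    for share in shares:
        result = xor_bytes(result, share)
    return result
```

Any subset of fewer than n shares is indistinguishable from random bytes, which is why the security does not depend on the strength of an encryption algorithm.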

5. Conclusion
Archival data is growing exponentially, so there is a strong need for systems that can
eliminate data duplication effectively. This paper has described Venti in depth along
with its areas for improvement. Three major issues of Venti were discussed, but there
may be cases in which the proposed solutions fail. For hashing, a failure may occur
when SHA-256 and MD5 both produce colliding keys for the same pair of blocks. Similarly,
content-based chunking is computationally expensive, so its cost could be reduced by
further improvement. Venti has not been evaluated in a distributed environment, which
makes that an ideal candidate for future work.
6. References
[1] A. Upadhyay, P. R. Balihalli, S. Ivaturi and S. Rao, "Deduplication and Compression
Techniques in Cloud Design," IEEE, 2012.
[2] B. Zhu (Data Domain, Inc.), "Avoiding the Disk Bottleneck in the Data Domain
Deduplication File System," in 6th USENIX Conference on File and Storage Technologies.
[3] P. Kulkarni, J. LaVoie, F. Douglis and J. Tracey, "Redundancy Elimination Within
Large Collections of Files," in Proc. USENIX 2004 Annual Technical Conference, 2004.
[4] D. Russell, "Data De-duplication Will Be Even Bigger in 2010," Gartner, 8 February
2010.
[5] M. W. Storer, K. M. Greenan, D. D. E. Long and E. L. Miller, "Secure Data
Deduplication," in Proceedings of the 2008 ACM Workshop on Storage Security and
Survivability, October 2008.
[6] "Fujitsu's storage systems and related technologies supporting cloud computing,"
2010. [Online]. Available: http://www.fujitsu.com/global/
[7] S. Quinlan and S. Dorward, "Venti: A New Approach to Archival Storage," in
Proceedings of the 1st USENIX Conference on File and Storage Technologies, Monterey,
CA: USENIX Association, 2002, pp. 89-101.
[8] D. Bhagwat, K. Eshghi, D. D. E. Long and M. Lillibridge, "Extreme Binning: Scalable,
Parallel Deduplication for Chunk-based File Backup," in 2009 IEEE International
Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication
Systems (MASCOTS), 2009, pp. 237-245.
[9] J. Black, "Compare-by-Hash: A Reasoned Analysis," in Proceedings of the 2006 USENIX
Annual Technical Conference, 2006, pp. 85-90.
[10] D. Borthakur, "The Hadoop Distributed File System: Architecture and Design," 2007.
URL: hadoop.apache.org/hdfs/docs/current/hdfs_design.pdf, accessed Oct 2011.
