A Decentralized Structured Storage Model
- Avinash Lakshman & Prashant Malik
- Presented by
Srinidhi Katla
CASSANDRA
Topics covered:
• What is Cassandra
• Motive
• Data Model
• Architecture
• The After Story
• Applications
Features of Cassandra
• Distributed storage system
• Manages very large amounts of data
• Highly available
• No single point of failure
• Simple data model
• Dynamic control over data layout and format
• Designed to run on cheap commodity hardware
• Handles high write throughput without sacrificing read efficiency
Motives behind Cassandra
• Storage needs of the Inbox Search problem:
o High write throughput
o Ever-increasing number of users
o Keeping search latencies low despite geographically distributed data
• Operational requirements:
o Scalability
o Handling hardware failures
• Inbox Search was launched in 2008 for 100 million users
• Deployed as the backend storage system for multiple services within Facebook
Data Model
• Based on ideas from Amazon's Dynamo and Google's Bigtable.
• Table: a distributed multi-dimensional map indexed by a key.
• Consists of: row key, column, column family, super column family.
• Row key: roughly equivalent to the primary index of an RDBMS.
• Column: a (name, value, timestamp) triple (e.g., "color = red").
• Column family: a set of columns grouped together. Two kinds:
o Simple column family
o Super column family: a column family within a column family
(A sketch of this nesting follows below.)
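A minimal sketch of the nesting described above, using plain nested maps; the class and field names here are hypothetical stand-ins, not Cassandra's actual internal representation. A simple column family maps a row key to columns, and a super column family adds one more level of nesting.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: models the conceptual nesting, not Cassandra's real classes.
public class DataModelSketch {

    // A column is a (name, value, timestamp) triple.
    record Column(String name, String value, long timestamp) {}

    public static void main(String[] args) {
        // Simple column family: rowKey -> columnName -> Column
        Map<String, Map<String, Column>> userProfile = new TreeMap<>();
        userProfile
                .computeIfAbsent("user:42", k -> new TreeMap<>())
                .put("color", new Column("color", "red", System.currentTimeMillis()));

        // Super column family: rowKey -> superColumnName -> columnName -> Column
        Map<String, Map<String, Map<String, Column>>> inbox = new TreeMap<>();
        inbox
                .computeIfAbsent("user:42", k -> new TreeMap<>())
                .computeIfAbsent("cassandra", k -> new TreeMap<>())   // search term as super column
                .put("msg:1001", new Column("msg:1001", "", System.currentTimeMillis()));

        System.out.println(userProfile);
        System.out.println(inbox);
    }
}
```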
Column Family
Image courtesy: http://www.ebaytechblog.com/author/jhpatel/#.VSPslfnF8SM
Column Family (contd.)
• A column is accessed using the convention:
column_family:column
• A super column is accessed as:
column_family:supercolumn:column
Facebook super column abstraction
• Term search:
User ID = row key
Terms searched = super columns
Message identifiers of the messages containing the word = columns
• Interactions:
User ID = row key
Recipients' IDs = super columns
Individual message identifiers = columns
(The term-search layout is sketched in code below.)
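Continuing the nested-map sketch above (same hypothetical representation), the term-search case can be written down directly: the user ID is the row key, each searched term is a super column, and the identifiers of messages containing that term are the columns under it. The column_family:supercolumn:column access convention becomes a three-level map lookup.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: the term-search layout from the slide, modeled with plain
// nested maps rather than Cassandra's real storage classes.
public class TermSearchSketch {
    public static void main(String[] args) {
        // row key (user ID) -> super column (search term) -> column (message ID) -> value
        Map<String, Map<String, Map<String, String>>> termSearch = new TreeMap<>();

        String userId = "user:42";
        termSearch
                .computeIfAbsent(userId, k -> new TreeMap<>())
                .computeIfAbsent("cassandra", k -> new TreeMap<>())
                .put("msg:1001", "");
        termSearch.get(userId).get("cassandra").put("msg:1007", "");

        // The column_family:supercolumn:column convention becomes a three-level lookup,
        // e.g. TermSearch:cassandra:msg:1001.
        System.out.println("messages containing 'cassandra': "
                + termSearch.get(userId).get("cassandra").keySet());
    }
}
```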
API
• Cassandra exposes a Thrift API (sketched below):
insert(table, key, rowMutation)
get(table, key, columnName)
delete(table, key, columnName)
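A hedged sketch of what calling these three operations might look like from Java. The CassandraClient interface and InMemoryClient class below are hypothetical stand-ins; the real Thrift-generated client has different types and signatures, so treat this purely as runnable pseudocode for the call shapes listed above.

```java
// Hypothetical client interface mirroring the three calls on the slide.
// The real Thrift-generated API differs; this only shows the shape of the calls.
public class ApiSketch {

    interface CassandraClient {
        void insert(String table, String key, String rowMutation);   // write a column mutation
        String get(String table, String key, String columnName);     // read one column
        void delete(String table, String key, String columnName);    // remove one column
    }

    public static void main(String[] args) {
        CassandraClient client = new InMemoryClient();                // stand-in implementation
        client.insert("Inbox", "user:42", "cassandra:msg:1001=1");
        System.out.println(client.get("Inbox", "user:42", "cassandra:msg:1001"));
        client.delete("Inbox", "user:42", "cassandra:msg:1001");
    }

    // Trivial in-memory stand-in so the sketch runs end to end.
    static class InMemoryClient implements CassandraClient {
        private final java.util.Map<String, String> store = new java.util.HashMap<>();
        public void insert(String table, String key, String rowMutation) {
            String[] kv = rowMutation.split("=", 2);
            store.put(table + "/" + key + "/" + kv[0], kv.length > 1 ? kv[1] : "");
        }
        public String get(String table, String key, String columnName) {
            return store.get(table + "/" + key + "/" + columnName);
        }
        public void delete(String table, String key, String columnName) {
            store.remove(table + "/" + key + "/" + columnName);
        }
    }
}
```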
Architecture
• Partitioning
• Replication
• Membership and Failure Detection
• Bootstrapping
• Scaling the cluster
• Local Persistence
Partitioning
• Data is partitioned dynamically over the nodes to aid scaling.
• Implements order-preserving consistent hashing (CH).
• Consistent hashing determines the coordinator node for each data key.
• Advantage of CH: the departure or arrival of a node affects only its immediate neighbours on the ring.
• Disadvantages of CH: non-uniform data distribution, and the hashing is unaware of the heterogeneity in node performance.
• Cassandra's solution: lightly loaded nodes move on the ring to alleviate heavily loaded nodes.
(A toy ring is sketched below.)
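A minimal consistent-hashing sketch under simplifying assumptions: positions are plain integer hashes (not order preserving, unlike the paper's scheme), and all names are hypothetical. It shows the key property the slide relies on: the coordinator for a key is the first node clockwise from the key's position, so adding or removing a node only affects its neighbours.

```java
import java.util.TreeMap;

// Toy consistent-hash ring: node and key positions are non-negative ints on a
// circle; the coordinator for a key is the first node clockwise from the key.
public class ConsistentHashSketch {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node)    { ring.put(position(node), node); }
    void removeNode(String node) { ring.remove(position(node)); }

    String coordinatorFor(String key) {
        Integer pos = ring.ceilingKey(position(key));
        if (pos == null) pos = ring.firstKey();   // wrap around the ring
        return ring.get(pos);
    }

    private static int position(String s) {
        // Stand-in hash; NOT order preserving, unlike the scheme described in the paper.
        return s.hashCode() & Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        ConsistentHashSketch cluster = new ConsistentHashSketch();
        for (String n : new String[]{"nodeA", "nodeB", "nodeC"}) cluster.addNode(n);
        System.out.println("key1 -> " + cluster.coordinatorFor("key1"));
        // Removing one node only reassigns the keys it coordinated to its successor;
        // all other key-to-node assignments are unaffected.
        cluster.removeNode("nodeB");
        System.out.println("key1 -> " + cluster.coordinatorFor("key1"));
    }
}
```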
Replication
• Required to ensure high availability and durability.
• Replication factor N.
• The coordinator node is responsible for replicating its data to N-1 other nodes.
• Replication policies:
o Rack Unaware: data is replicated to the N-1 successors of the coordinator on the ring.
o Rack Aware / Datacenter Aware: a leader elected via ZooKeeper informs the nodes which ranges they are replicas for.
• Metadata about the ranges a node is responsible for is stored in ZooKeeper as well as at the node itself.
(The Rack Unaware placement is sketched below.)
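Building on the ring sketch above, a hedged illustration of the Rack Unaware policy only: the replicas for a key are the coordinator plus its N-1 distinct successors on the ring. Rack Aware and Datacenter Aware placement need topology information and the ZooKeeper-elected leader, which this toy deliberately omits; all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy "Rack Unaware" replica placement: walk the ring clockwise from the
// key's position and take the first N distinct nodes.
public class ReplicationSketch {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) {
        ring.put(node.hashCode() & Integer.MAX_VALUE, node);
    }

    List<String> replicasFor(String key, int n) {
        List<String> replicas = new ArrayList<>();
        Integer pos = ring.ceilingKey(key.hashCode() & Integer.MAX_VALUE);
        if (pos == null) pos = ring.firstKey();               // wrap around
        while (replicas.size() < n && replicas.size() < ring.size()) {
            replicas.add(ring.get(pos));                      // coordinator first, then successors
            pos = ring.higherKey(pos);
            if (pos == null) pos = ring.firstKey();           // wrap around
        }
        return replicas;
    }

    public static void main(String[] args) {
        ReplicationSketch cluster = new ReplicationSketch();
        for (String node : new String[]{"nodeA", "nodeB", "nodeC", "nodeD", "nodeE"}) {
            cluster.addNode(node);
        }
        // With N = 3: the coordinator and its two ring successors.
        System.out.println(cluster.replicasFor("user:42", 3));
    }
}
```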
Membership and Failure Detection
• Membership is based on Scuttlebutt, a gossip-based mechanism with:
o Efficient CPU utilization
o Efficient utilization of the gossip channel
• Gossip is used both for membership and to disseminate system-related control state.
• Failure detection: checks whether a node is up, so that attempts to communicate with unreachable nodes can be avoided. Cassandra uses a modified Φ Accrual Failure Detector.
• The failure detector emits a suspicion level Φ instead of a Boolean up/down value.
(A toy accrual detector is sketched below.)
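A hedged sketch of the accrual idea, not the exact modified detector from the paper: heartbeat inter-arrival times are tracked, and Φ grows as the current silence stretches past what that history predicts (here assuming exponentially distributed intervals, so Φ = -log10 of the probability that the node is still alive). The application then chooses its own Φ threshold instead of getting a hard Boolean. Names and window size are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy accrual failure detector: Phi is derived from how far the current
// silence exceeds the historical mean heartbeat interval.
public class PhiAccrualSketch {
    private final Deque<Long> intervals = new ArrayDeque<>();
    private long lastHeartbeatMillis = -1;
    private final int windowSize = 100;

    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > windowSize) intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    double phi(long nowMillis) {
        if (intervals.isEmpty()) return 0.0;
        double mean = intervals.stream().mapToLong(Long::longValue).average().orElse(1.0);
        double silence = nowMillis - lastHeartbeatMillis;
        // P(still alive given this silence) = exp(-silence/mean); Phi = -log10 of that.
        return (silence / mean) * Math.log10(Math.E);
    }

    public static void main(String[] args) {
        PhiAccrualSketch detector = new PhiAccrualSketch();
        long t = 0;
        for (int i = 0; i < 10; i++) { t += 1000; detector.heartbeat(t); }  // steady 1 s heartbeats
        System.out.printf("Phi after 1s of silence:  %.2f%n", detector.phi(t + 1_000));
        System.out.printf("Phi after 10s of silence: %.2f%n", detector.phi(t + 10_000));
        // The application marks the node suspect once Phi exceeds its chosen threshold.
    }
}
```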
Bootstrapping & Scaling
• The token assigned to a new node is gossiped to all the nodes.
• A new node is assigned a token such that it alleviates a heavily loaded node.
• A new node reads its configuration (seed nodes) from ZooKeeper.
• Node outages are usually transient, so rebalancing of partition assignments or repair of unreachable replicas should be avoided.
• Changes to node membership are made manually.
• The heavily loaded node splits its data and hands part of its range (and responsibility) to the new node.
• Operational experience shows that data can be transferred at about 40 MB/s from a single node.
Local Persistence
• Relies on the local file system.
• A dedicated disk on each machine holds the commit log, to maximise disk throughput.
• Write path: data is first written to the commit log and then to an in-memory data structure.
• When the in-memory data structure crosses a size threshold, it is dumped to a file on disk.
• An index is created for efficient lookup.
• Many files accumulate on disk over time; a merge process collates them into one file, similar to the compaction process in Bigtable.
• An index is generated for every 256 KB block for efficient lookup within the columns of a key.
(The write path is sketched below.)
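A hedged sketch of the write path described above, with hypothetical file names and a toy text format (real Cassandra's commit log and data-file formats are far more involved): each write is appended to a commit log for durability, applied to an in-memory map, and once the map crosses a size threshold it is dumped to a new immutable file on disk.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.TreeMap;

// Toy write path: append to a commit log, update an in-memory memtable,
// and flush the memtable to a new immutable data file once it grows too big.
public class WritePathSketch {
    private final Path commitLog;
    private final Path dataDir;
    private final TreeMap<String, String> memtable = new TreeMap<>();
    private final int flushThreshold = 4;     // entries; tiny, just for the demo
    private int flushCount = 0;

    WritePathSketch(Path dir) throws IOException {
        this.dataDir = Files.createDirectories(dir);
        this.commitLog = dir.resolve("commitlog.txt");
    }

    void write(String key, String value) throws IOException {
        // 1. Durability first: append the mutation to the commit log.
        Files.writeString(commitLog, key + "=" + value + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        // 2. Then apply it to the in-memory data structure.
        memtable.put(key, value);
        // 3. Flush to an immutable on-disk file when the threshold is crossed.
        if (memtable.size() >= flushThreshold) flush();
    }

    private void flush() throws IOException {
        Path dataFile = dataDir.resolve("data-" + (flushCount++) + ".txt");
        StringBuilder out = new StringBuilder();
        memtable.forEach((k, v) -> out.append(k).append('=').append(v).append('\n'));
        Files.writeString(dataFile, out.toString(), StandardOpenOption.CREATE_NEW);
        memtable.clear();
        System.out.println("flushed " + dataFile.getFileName());
    }

    public static void main(String[] args) throws IOException {
        WritePathSketch store = new WritePathSketch(Files.createTempDirectory("sketch"));
        for (int i = 0; i < 10; i++) store.write("key" + i, "value" + i);
    }
}
```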
Local Persistence (contd.)
• Read path: query the in-memory data structure first, then look up the files on disk.
• Files are examined in order from newest to oldest.
• A per-file Bloom filter is checked to see whether the key can exist in that file.
• Column indices avoid scanning every column on disk.
(The read path is sketched below.)
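A hedged sketch of the read side, with a toy Bloom filter whose sizes and hash choices are made up for illustration (not Cassandra's): check the memtable first, then consult each on-disk file from newest to oldest, skipping any file whose Bloom filter says the key is definitely absent.

```java
import java.util.*;

// Toy read path: memtable first, then data files from newest to oldest,
// consulting a small per-file Bloom filter to skip files that cannot contain the key.
public class ReadPathSketch {

    // Minimal Bloom filter: k hash positions in a fixed bit set.
    // False positives are possible; false negatives are not.
    static class BloomFilter {
        private final BitSet bits = new BitSet(1024);
        void add(String key) { for (int h : positions(key)) bits.set(h); }
        boolean mightContain(String key) {
            for (int h : positions(key)) if (!bits.get(h)) return false;
            return true;
        }
        private int[] positions(String key) {
            int h1 = key.hashCode(), h2 = Integer.reverse(h1) | 1;
            int[] p = new int[3];
            for (int i = 0; i < 3; i++) p[i] = Math.floorMod(h1 + i * h2, 1024);
            return p;
        }
    }

    // An immutable on-disk file, modeled as a sorted map plus its Bloom filter.
    record DataFile(TreeMap<String, String> rows, BloomFilter filter) {}

    public static void main(String[] args) {
        TreeMap<String, String> memtable = new TreeMap<>();
        List<DataFile> filesNewestFirst = new ArrayList<>();

        // Pretend an older flush wrote key1; the memtable holds the newer key2.
        TreeMap<String, String> flushed = new TreeMap<>(Map.of("key1", "old-value"));
        BloomFilter f = new BloomFilter();
        flushed.keySet().forEach(f::add);
        filesNewestFirst.add(new DataFile(flushed, f));
        memtable.put("key2", "new-value");

        for (String key : new String[]{"key1", "key2", "missing"}) {
            String v = memtable.get(key);                          // 1. memtable first
            if (v == null) {
                for (DataFile df : filesNewestFirst) {             // 2. newest file to oldest
                    if (!df.filter().mightContain(key)) continue;  // Bloom-filter skip
                    v = df.rows().get(key);
                    if (v != null) break;
                }
            }
            System.out.println(key + " -> " + v);
        }
    }
}
```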
Reads and Writes
• A request for a key is routed to any node in the cluster.
• That node determines the replicas for the key and routes the request to them.
• The request fails if replies are not received within a time limit.
• Writes: the request is routed to the replicas, and the system waits for a quorum of replicas to acknowledge completion of the write.
• Reads: depending on the consistency guarantee chosen by the client, the request is routed either to the closest replica, or to all replicas while waiting for a quorum of responses.
(Quorum coordination is sketched below.)
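A hedged sketch of quorum coordination, synchronous and single-threaded for clarity (real Cassandra contacts replicas in parallel and enforces timeouts); replica "nodes" are just in-memory maps and all names are hypothetical. A write succeeds once a majority of replicas acknowledge it, and a quorum read returns the value with the newest timestamp among the replies.

```java
import java.util.*;

// Toy quorum coordinator: a write needs acks from a majority of replicas;
// a read collects a majority of replies and keeps the newest-timestamped value.
public class QuorumSketch {

    record Versioned(String value, long timestamp) {}

    static final int REPLICATION_FACTOR = 3;
    static final int QUORUM = REPLICATION_FACTOR / 2 + 1;          // 2 of 3

    static final List<Map<String, Versioned>> replicas = new ArrayList<>();
    static final Set<Integer> downReplicas = new HashSet<>();

    static boolean write(String key, String value, long ts) {
        int acks = 0;
        for (int i = 0; i < replicas.size(); i++) {
            if (downReplicas.contains(i)) continue;                // unreachable replica: no ack
            replicas.get(i).put(key, new Versioned(value, ts));
            acks++;
        }
        return acks >= QUORUM;                                     // success only with a quorum of acks
    }

    static Versioned read(String key) {
        Versioned newest = null;
        int replies = 0;
        for (int i = 0; i < replicas.size() && replies < QUORUM; i++) {
            if (downReplicas.contains(i)) continue;
            Versioned v = replicas.get(i).get(key);
            replies++;
            // Quorum read: keep the value with the newest timestamp among the replies.
            if (v != null && (newest == null || v.timestamp() > newest.timestamp())) newest = v;
        }
        return (replies >= QUORUM) ? newest : null;                // fail without a quorum of replies
    }

    public static void main(String[] args) {
        for (int i = 0; i < REPLICATION_FACTOR; i++) replicas.add(new HashMap<>());
        System.out.println("write ok: " + write("user:42/color", "red", 1L));
        downReplicas.add(0);                                       // one replica becomes unreachable
        System.out.println("write ok: " + write("user:42/color", "blue", 2L));  // still 2 of 3
        System.out.println("read: " + read("user:42/color"));
    }
}
```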
Implementation
• The Cassandra process on each machine comprises a partitioning module, a cluster membership and failure detection module, and a storage engine.
• Implemented from the ground up in Java.
• Commit log entries are purged using a rolling commit log, rolled over in 128 MB segments (sketched below).
• There is one in-memory data structure and one data file per column family.
• All writes to disk are sequential, to maximize throughput.
• No locks are needed, since files dumped to disk are never mutated.
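A hedged sketch of the rolling commit log idea, with hypothetical file naming and a tiny segment size so the rollover is visible (the real mechanism uses 128 MB segments and tracks per-column-family flush state before purging): entries are appended sequentially to the current segment, and when it reaches its size limit a new segment is started so older segments can later be deleted.

```java
import java.io.IOException;
import java.nio.file.*;

// Toy rolling commit log: sequential appends to the current segment; when the
// segment exceeds its size limit, roll over to a new segment file. Old segments
// can be purged once the data they cover has been flushed (not modeled here).
public class RollingCommitLogSketch {
    private final Path dir;
    private final long segmentSizeLimit;
    private int segmentIndex = 0;
    private Path currentSegment;

    RollingCommitLogSketch(Path dir, long segmentSizeLimit) throws IOException {
        this.dir = Files.createDirectories(dir);
        this.segmentSizeLimit = segmentSizeLimit;
        this.currentSegment = dir.resolve("commitlog-0.log");
    }

    void append(String entry) throws IOException {
        Files.writeString(currentSegment, entry + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        if (Files.size(currentSegment) >= segmentSizeLimit) {
            segmentIndex++;
            currentSegment = dir.resolve("commitlog-" + segmentIndex + ".log");
            System.out.println("rolled over to " + currentSegment.getFileName());
        }
    }

    public static void main(String[] args) throws IOException {
        // 256-byte segments here instead of 128 MB, just to make the rollover visible.
        RollingCommitLogSketch log =
                new RollingCommitLogSketch(Files.createTempDirectory("commitlog"), 256);
        for (int i = 0; i < 50; i++) log.append("key" + i + "=value" + i);
    }
}
```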
The After Story
• Cassandra was released as an open source project on Google Code in July 2008 and is now developed under the Apache Software Foundation as Apache Cassandra (henceforth simply "Cassandra" in these slides).
• In Apache Cassandra, super columns were removed due to performance issues; composite columns were introduced instead.
• The Cassandra Query Language (CQL) presents a data model familiar to relational database users.
• Partitioning is still based on consistent hashing, but the original load-balancing scheme of moving nodes on the ring has given way to virtual nodes.
• The order-preserving hash function was dropped in favor of a true OrderPreservingPartitioner (later superseded by ByteOrderedPartitioner).
The After Story (contd.)
• In modern Cassandra terminology, the coordinator is the node that processes a given client's request and routes it to the appropriate replicas; it is not necessarily itself a replica.
• ZooKeeper usage was restricted to Facebook's in-house Cassandra branch.
• Modern Cassandra management tools include DataStax's OpsCenter and Netflix's Priam.
Big Players
• Facebook's Inbox Search feature was implemented on Cassandra, where every user is an index and recipients and messages are stored as columns. The system stores more than 50 TB of data on a 150-node cluster with a median search latency of approximately 15 ms.
• Netflix, a video streaming company, stores 95% of its data in Cassandra.
• eBay uses Cassandra for features such as the "own", "want", and "like" counts on its pages.
• Coursera, an online education provider, uses Cassandra for its mobile applications.
References:
• http://www.ebaytechblog.com/author/jhpatel/#.VSPslfnF8SM
• http://www.divconq.com/2010/cassandra-columns-and-supercolumns-and-rows/
• http://docs.datastax.com/en/articles/cassandra/cassandrathenandnow.html
QUESTIONS?

Editor's Notes

  • #8 Row keys and super column keys do not have any values. Column keys and super column keys are indexed and sorted by a specific type. Super column keys in different rows do not have to match and often will not.
  • #10 with Cassandra you need to think about what queries you want to support efficiently ahead of time, and model appropriately. Since there are no automatically-provided indexes, you will be much closer to one ColumnFamily per query than you would have been with tables:queries relationally. Don't be afraid to denormalize accordingly; Cassandra is much, much faster at writes than relational systems.