A Decentralized Structured Storage Model
- Avinash Lakshman & Prashant Malik
- Presented by
Srinidhi Katla
CASSANDRA
Topics covered:
• What is Cassandra
• Motive
• Data Model
• Architecture
• The After Story
• Applications
Features of Cassandra
• Distributed storage system
• Manages very large amounts of data
• Highly available
• No single point of failure
• Simple data model
• Dynamic control over data layout and format
• Designed to run on cheap commodity hardware
• Handles high write throughput without sacrificing read efficiency
Motives behind Cassandra
• Storage needs of the Inbox Search problem:
o High write throughput
o Ever-increasing number of users
o Keeping search latencies low despite geographically distributed data
• Operational requirements:
o Scalability
o Handling hardware failures
• Inbox Search was launched in 2008 for 100 million users
• Deployed as the backend storage system for multiple services within Facebook
Data Model
• Based on ideas from Amazon's Dynamo and Google's Bigtable.
• Table: a distributed multi-dimensional map indexed by a key.
• Consists of: row key, column, column family, super column family.
• Row key: roughly equivalent to the primary index of an RDBMS.
• Column: a (name, value, timestamp) triple (e.g., "color = red").
• Column family: a set of columns grouped together. Two kinds:
o Simple column family
o Super column family: a column family within a column family
(A sketch of this nesting follows below.)
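A minimal sketch of the nesting described above, using plain nested maps; the class and field names here are hypothetical stand-ins, not Cassandra's actual internal representation. A simple column family maps a row key to columns, and a super column family adds one more level of nesting.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: models the conceptual nesting, not Cassandra's real classes.
public class DataModelSketch {

    // A column is a (name, value, timestamp) triple.
    record Column(String name, String value, long timestamp) {}

    public static void main(String[] args) {
        // Simple column family: rowKey -> columnName -> Column
        Map<String, Map<String, Column>> userProfile = new TreeMap<>();
        userProfile
                .computeIfAbsent("user:42", k -> new TreeMap<>())
                .put("color", new Column("color", "red", System.currentTimeMillis()));

        // Super column family: rowKey -> superColumnName -> columnName -> Column
        Map<String, Map<String, Map<String, Column>>> inbox = new TreeMap<>();
        inbox
                .computeIfAbsent("user:42", k -> new TreeMap<>())
                .computeIfAbsent("cassandra", k -> new TreeMap<>())   // search term as super column
                .put("msg:1001", new Column("msg:1001", "", System.currentTimeMillis()));

        System.out.println(userProfile);
        System.out.println(inbox);
    }
}
```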
Column Family
Image courtesy: http://www.ebaytechblog.com/author/jhpatel/#.VSPslfnF8SM
Column Family (contd.)
• A column is accessed using the convention:
column_family:column
• A super column is accessed as:
column_family:supercolumn:column
Facebook super column abstraction
• Term search:
User ID = row key
Terms searched = super columns
Message identifiers of the messages containing the word = columns
• Interactions:
User ID = row key
Recipients' IDs = super columns
Individual message identifiers = columns
(The term-search layout is sketched in code below.)
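Continuing the nested-map sketch above (same hypothetical representation), the term-search case can be written down directly: the user ID is the row key, each searched term is a super column, and the identifiers of messages containing that term are the columns under it. The column_family:supercolumn:column access convention becomes a three-level map lookup.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: the term-search layout from the slide, modeled with plain
// nested maps rather than Cassandra's real storage classes.
public class TermSearchSketch {
    public static void main(String[] args) {
        // row key (user ID) -> super column (search term) -> column (message ID) -> value
        Map<String, Map<String, Map<String, String>>> termSearch = new TreeMap<>();

        String userId = "user:42";
        termSearch
                .computeIfAbsent(userId, k -> new TreeMap<>())
                .computeIfAbsent("cassandra", k -> new TreeMap<>())
                .put("msg:1001", "");
        termSearch.get(userId).get("cassandra").put("msg:1007", "");

        // The column_family:supercolumn:column convention becomes a three-level lookup,
        // e.g. TermSearch:cassandra:msg:1001.
        System.out.println("messages containing 'cassandra': "
                + termSearch.get(userId).get("cassandra").keySet());
    }
}
```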
API
• Cassandra exposes a Thrift API (sketched below):
insert(table, key, rowMutation)
get(table, key, columnName)
delete(table, key, columnName)
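A hedged sketch of what calling these three operations might look like from Java. The CassandraClient interface and InMemoryClient class below are hypothetical stand-ins; the real Thrift-generated client has different types and signatures, so treat this purely as runnable pseudocode for the call shapes listed above.

```java
// Hypothetical client interface mirroring the three calls on the slide.
// The real Thrift-generated API differs; this only shows the shape of the calls.
public class ApiSketch {

    interface CassandraClient {
        void insert(String table, String key, String rowMutation);   // write a column mutation
        String get(String table, String key, String columnName);     // read one column
        void delete(String table, String key, String columnName);    // remove one column
    }

    public static void main(String[] args) {
        CassandraClient client = new InMemoryClient();                // stand-in implementation
        client.insert("Inbox", "user:42", "cassandra:msg:1001=1");
        System.out.println(client.get("Inbox", "user:42", "cassandra:msg:1001"));
        client.delete("Inbox", "user:42", "cassandra:msg:1001");
    }

    // Trivial in-memory stand-in so the sketch runs end to end.
    static class InMemoryClient implements CassandraClient {
        private final java.util.Map<String, String> store = new java.util.HashMap<>();
        public void insert(String table, String key, String rowMutation) {
            String[] kv = rowMutation.split("=", 2);
            store.put(table + "/" + key + "/" + kv[0], kv.length > 1 ? kv[1] : "");
        }
        public String get(String table, String key, String columnName) {
            return store.get(table + "/" + key + "/" + columnName);
        }
        public void delete(String table, String key, String columnName) {
            store.remove(table + "/" + key + "/" + columnName);
        }
    }
}
```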
Architecture
• Partitioning
• Replication
• Membership and Failure Detection
• Bootstrapping
• Scaling the cluster
• Local Persistence
Partitioning
• Data is partitioned dynamically over the nodes to aid scaling.
• Implements order-preserving consistent hashing (CH).
• Consistent hashing determines the coordinator node for each data key.
• Advantage of CH: the departure or arrival of a node affects only its immediate neighbours on the ring.
• Disadvantages of CH: non-uniform data distribution, and the hashing is unaware of the heterogeneity in node performance.
• Cassandra's solution: lightly loaded nodes move on the ring to alleviate heavily loaded nodes.
(A toy ring is sketched below.)
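A minimal consistent-hashing sketch under simplifying assumptions: positions are plain integer hashes (not order preserving, unlike the paper's scheme), and all names are hypothetical. It shows the key property the slide relies on: the coordinator for a key is the first node clockwise from the key's position, so adding or removing a node only affects its neighbours.

```java
import java.util.TreeMap;

// Toy consistent-hash ring: node and key positions are non-negative ints on a
// circle; the coordinator for a key is the first node clockwise from the key.
public class ConsistentHashSketch {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node)    { ring.put(position(node), node); }
    void removeNode(String node) { ring.remove(position(node)); }

    String coordinatorFor(String key) {
        Integer pos = ring.ceilingKey(position(key));
        if (pos == null) pos = ring.firstKey();   // wrap around the ring
        return ring.get(pos);
    }

    private static int position(String s) {
        // Stand-in hash; NOT order preserving, unlike the scheme described in the paper.
        return s.hashCode() & Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        ConsistentHashSketch cluster = new ConsistentHashSketch();
        for (String n : new String[]{"nodeA", "nodeB", "nodeC"}) cluster.addNode(n);
        System.out.println("key1 -> " + cluster.coordinatorFor("key1"));
        // Removing one node only reassigns the keys it coordinated to its successor;
        // all other key-to-node assignments are unaffected.
        cluster.removeNode("nodeB");
        System.out.println("key1 -> " + cluster.coordinatorFor("key1"));
    }
}
```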
Replication
• Required to ensure high availability and durability.
• Replication factor N.
• The coordinator node is responsible for replicating its data to N-1 other nodes.
• Replication policies:
o Rack Unaware: data is replicated to the N-1 successors of the coordinator on the ring.
o Rack Aware / Datacenter Aware: a leader elected via ZooKeeper informs the nodes which ranges they are replicas for.
• Metadata about the ranges a node is responsible for is stored in ZooKeeper as well as at the node itself.
(The Rack Unaware placement is sketched below.)
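Building on the ring sketch above, a hedged illustration of the Rack Unaware policy only: the replicas for a key are the coordinator plus its N-1 distinct successors on the ring. Rack Aware and Datacenter Aware placement need topology information and the ZooKeeper-elected leader, which this toy deliberately omits; all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy "Rack Unaware" replica placement: walk the ring clockwise from the
// key's position and take the first N distinct nodes.
public class ReplicationSketch {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) {
        ring.put(node.hashCode() & Integer.MAX_VALUE, node);
    }

    List<String> replicasFor(String key, int n) {
        List<String> replicas = new ArrayList<>();
        Integer pos = ring.ceilingKey(key.hashCode() & Integer.MAX_VALUE);
        if (pos == null) pos = ring.firstKey();               // wrap around
        while (replicas.size() < n && replicas.size() < ring.size()) {
            replicas.add(ring.get(pos));                      // coordinator first, then successors
            pos = ring.higherKey(pos);
            if (pos == null) pos = ring.firstKey();           // wrap around
        }
        return replicas;
    }

    public static void main(String[] args) {
        ReplicationSketch cluster = new ReplicationSketch();
        for (String node : new String[]{"nodeA", "nodeB", "nodeC", "nodeD", "nodeE"}) {
            cluster.addNode(node);
        }
        // With N = 3: the coordinator and its two ring successors.
        System.out.println(cluster.replicasFor("user:42", 3));
    }
}
```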
Membership and Failure Detection
• Membership is based on Scuttlebutt, a gossip-based mechanism with:
o Efficient CPU utilization
o Efficient utilization of the gossip channel
• Gossip is used both for membership and to disseminate system-related control state.
• Failure detection: checks whether a node is up, so that attempts to communicate with unreachable nodes can be avoided. Cassandra uses a modified Φ Accrual Failure Detector.
• The failure detector emits a suspicion level Φ instead of a Boolean up/down value.
(A toy accrual detector is sketched below.)
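A hedged sketch of the accrual idea, not the exact modified detector from the paper: heartbeat inter-arrival times are tracked, and Φ grows as the current silence stretches past what that history predicts (here assuming exponentially distributed intervals, so Φ = -log10 of the probability that the node is still alive). The application then chooses its own Φ threshold instead of getting a hard Boolean. Names and window size are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy accrual failure detector: Phi is derived from how far the current
// silence exceeds the historical mean heartbeat interval.
public class PhiAccrualSketch {
    private final Deque<Long> intervals = new ArrayDeque<>();
    private long lastHeartbeatMillis = -1;
    private final int windowSize = 100;

    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > windowSize) intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    double phi(long nowMillis) {
        if (intervals.isEmpty()) return 0.0;
        double mean = intervals.stream().mapToLong(Long::longValue).average().orElse(1.0);
        double silence = nowMillis - lastHeartbeatMillis;
        // P(still alive given this silence) = exp(-silence/mean); Phi = -log10 of that.
        return (silence / mean) * Math.log10(Math.E);
    }

    public static void main(String[] args) {
        PhiAccrualSketch detector = new PhiAccrualSketch();
        long t = 0;
        for (int i = 0; i < 10; i++) { t += 1000; detector.heartbeat(t); }  // steady 1 s heartbeats
        System.out.printf("Phi after 1s of silence:  %.2f%n", detector.phi(t + 1_000));
        System.out.printf("Phi after 10s of silence: %.2f%n", detector.phi(t + 10_000));
        // The application marks the node suspect once Phi exceeds its chosen threshold.
    }
}
```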
Bootstrapping & Scaling
• The token assigned to a new node is gossiped to all the nodes.
• A new node is assigned a token such that it alleviates a heavily loaded node.
• A new node reads its configuration (seed nodes) from ZooKeeper.
• Node outages are usually transient, so rebalancing of partition assignments or repair of unreachable replicas should be avoided.
• Changes to node membership are made manually.
• The heavily loaded node splits its data and hands part of its range (and responsibility) to the new node.
• Operational experience shows that data can be transferred at about 40 MB/s from a single node.
Local Persistence
• Relies on the local file system.
• A dedicated disk on each machine holds the commit log, to maximise disk throughput.
• Write path: data is first written to the commit log and then to an in-memory data structure.
• When the in-memory data structure crosses a size threshold, it is dumped to a file on disk.
• An index is created for efficient lookup.
• Many files accumulate on disk over time; a merge process collates them into one file, similar to the compaction process in Bigtable.
• An index is generated for every 256 KB block for efficient lookup within the columns of a key.
(The write path is sketched below.)
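A hedged sketch of the write path described above, with hypothetical file names and a toy text format (real Cassandra's commit log and data-file formats are far more involved): each write is appended to a commit log for durability, applied to an in-memory map, and once the map crosses a size threshold it is dumped to a new immutable file on disk.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.TreeMap;

// Toy write path: append to a commit log, update an in-memory memtable,
// and flush the memtable to a new immutable data file once it grows too big.
public class WritePathSketch {
    private final Path commitLog;
    private final Path dataDir;
    private final TreeMap<String, String> memtable = new TreeMap<>();
    private final int flushThreshold = 4;     // entries; tiny, just for the demo
    private int flushCount = 0;

    WritePathSketch(Path dir) throws IOException {
        this.dataDir = Files.createDirectories(dir);
        this.commitLog = dir.resolve("commitlog.txt");
    }

    void write(String key, String value) throws IOException {
        // 1. Durability first: append the mutation to the commit log.
        Files.writeString(commitLog, key + "=" + value + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        // 2. Then apply it to the in-memory data structure.
        memtable.put(key, value);
        // 3. Flush to an immutable on-disk file when the threshold is crossed.
        if (memtable.size() >= flushThreshold) flush();
    }

    private void flush() throws IOException {
        Path dataFile = dataDir.resolve("data-" + (flushCount++) + ".txt");
        StringBuilder out = new StringBuilder();
        memtable.forEach((k, v) -> out.append(k).append('=').append(v).append('\n'));
        Files.writeString(dataFile, out.toString(), StandardOpenOption.CREATE_NEW);
        memtable.clear();
        System.out.println("flushed " + dataFile.getFileName());
    }

    public static void main(String[] args) throws IOException {
        WritePathSketch store = new WritePathSketch(Files.createTempDirectory("sketch"));
        for (int i = 0; i < 10; i++) store.write("key" + i, "value" + i);
    }
}
```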
Local Persistence (contd.)
• Read path: query the in-memory data structure first, then look up the files on disk.
• Files are examined in order from newest to oldest.
• A per-file Bloom filter is checked to see whether the key can exist in that file.
• Column indices avoid scanning every column on disk.
(The read path is sketched below.)
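A hedged sketch of the read side, with a toy Bloom filter whose sizes and hash choices are made up for illustration (not Cassandra's): check the memtable first, then consult each on-disk file from newest to oldest, skipping any file whose Bloom filter says the key is definitely absent.

```java
import java.util.*;

// Toy read path: memtable first, then data files from newest to oldest,
// consulting a small per-file Bloom filter to skip files that cannot contain the key.
public class ReadPathSketch {

    // Minimal Bloom filter: k hash positions in a fixed bit set.
    // False positives are possible; false negatives are not.
    static class BloomFilter {
        private final BitSet bits = new BitSet(1024);
        void add(String key) { for (int h : positions(key)) bits.set(h); }
        boolean mightContain(String key) {
            for (int h : positions(key)) if (!bits.get(h)) return false;
            return true;
        }
        private int[] positions(String key) {
            int h1 = key.hashCode(), h2 = Integer.reverse(h1) | 1;
            int[] p = new int[3];
            for (int i = 0; i < 3; i++) p[i] = Math.floorMod(h1 + i * h2, 1024);
            return p;
        }
    }

    // An immutable on-disk file, modeled as a sorted map plus its Bloom filter.
    record DataFile(TreeMap<String, String> rows, BloomFilter filter) {}

    public static void main(String[] args) {
        TreeMap<String, String> memtable = new TreeMap<>();
        List<DataFile> filesNewestFirst = new ArrayList<>();

        // Pretend an older flush wrote key1; the memtable holds the newer key2.
        TreeMap<String, String> flushed = new TreeMap<>(Map.of("key1", "old-value"));
        BloomFilter f = new BloomFilter();
        flushed.keySet().forEach(f::add);
        filesNewestFirst.add(new DataFile(flushed, f));
        memtable.put("key2", "new-value");

        for (String key : new String[]{"key1", "key2", "missing"}) {
            String v = memtable.get(key);                          // 1. memtable first
            if (v == null) {
                for (DataFile df : filesNewestFirst) {             // 2. newest file to oldest
                    if (!df.filter().mightContain(key)) continue;  // Bloom-filter skip
                    v = df.rows().get(key);
                    if (v != null) break;
                }
            }
            System.out.println(key + " -> " + v);
        }
    }
}
```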
Reads and Writes
• A request for a key is routed to any node in the cluster.
• That node determines the replicas for the key and routes the request to them.
• The request fails if replies are not received within a time limit.
• Writes: the request is routed to the replicas, and the system waits for a quorum of replicas to acknowledge completion of the write.
• Reads: depending on the consistency guarantee chosen by the client, the request is routed either to the closest replica, or to all replicas while waiting for a quorum of responses.
(Quorum coordination is sketched below.)
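A hedged sketch of quorum coordination, synchronous and single-threaded for clarity (real Cassandra contacts replicas in parallel and enforces timeouts); replica "nodes" are just in-memory maps and all names are hypothetical. A write succeeds once a majority of replicas acknowledge it, and a quorum read returns the value with the newest timestamp among the replies.

```java
import java.util.*;

// Toy quorum coordinator: a write needs acks from a majority of replicas;
// a read collects a majority of replies and keeps the newest-timestamped value.
public class QuorumSketch {

    record Versioned(String value, long timestamp) {}

    static final int REPLICATION_FACTOR = 3;
    static final int QUORUM = REPLICATION_FACTOR / 2 + 1;          // 2 of 3

    static final List<Map<String, Versioned>> replicas = new ArrayList<>();
    static final Set<Integer> downReplicas = new HashSet<>();

    static boolean write(String key, String value, long ts) {
        int acks = 0;
        for (int i = 0; i < replicas.size(); i++) {
            if (downReplicas.contains(i)) continue;                // unreachable replica: no ack
            replicas.get(i).put(key, new Versioned(value, ts));
            acks++;
        }
        return acks >= QUORUM;                                     // success only with a quorum of acks
    }

    static Versioned read(String key) {
        Versioned newest = null;
        int replies = 0;
        for (int i = 0; i < replicas.size() && replies < QUORUM; i++) {
            if (downReplicas.contains(i)) continue;
            Versioned v = replicas.get(i).get(key);
            replies++;
            // Quorum read: keep the value with the newest timestamp among the replies.
            if (v != null && (newest == null || v.timestamp() > newest.timestamp())) newest = v;
        }
        return (replies >= QUORUM) ? newest : null;                // fail without a quorum of replies
    }

    public static void main(String[] args) {
        for (int i = 0; i < REPLICATION_FACTOR; i++) replicas.add(new HashMap<>());
        System.out.println("write ok: " + write("user:42/color", "red", 1L));
        downReplicas.add(0);                                       // one replica becomes unreachable
        System.out.println("write ok: " + write("user:42/color", "blue", 2L));  // still 2 of 3
        System.out.println("read: " + read("user:42/color"));
    }
}
```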
Implementation
• The Cassandra process on each machine comprises a partitioning module, a cluster membership and failure detection module, and a storage engine.
• Implemented from the ground up in Java.
• Commit log entries are purged using a rolling commit log, rolled over in 128 MB segments (sketched below).
• There is one in-memory data structure and one data file per column family.
• All writes to disk are sequential, to maximize throughput.
• No locks are needed, since files dumped to disk are never mutated.
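A hedged sketch of the rolling commit log idea, with hypothetical file naming and a tiny segment size so the rollover is visible (the real mechanism uses 128 MB segments and tracks per-column-family flush state before purging): entries are appended sequentially to the current segment, and when it reaches its size limit a new segment is started so older segments can later be deleted.

```java
import java.io.IOException;
import java.nio.file.*;

// Toy rolling commit log: sequential appends to the current segment; when the
// segment exceeds its size limit, roll over to a new segment file. Old segments
// can be purged once the data they cover has been flushed (not modeled here).
public class RollingCommitLogSketch {
    private final Path dir;
    private final long segmentSizeLimit;
    private int segmentIndex = 0;
    private Path currentSegment;

    RollingCommitLogSketch(Path dir, long segmentSizeLimit) throws IOException {
        this.dir = Files.createDirectories(dir);
        this.segmentSizeLimit = segmentSizeLimit;
        this.currentSegment = dir.resolve("commitlog-0.log");
    }

    void append(String entry) throws IOException {
        Files.writeString(currentSegment, entry + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        if (Files.size(currentSegment) >= segmentSizeLimit) {
            segmentIndex++;
            currentSegment = dir.resolve("commitlog-" + segmentIndex + ".log");
            System.out.println("rolled over to " + currentSegment.getFileName());
        }
    }

    public static void main(String[] args) throws IOException {
        // 256-byte segments here instead of 128 MB, just to make the rollover visible.
        RollingCommitLogSketch log =
                new RollingCommitLogSketch(Files.createTempDirectory("commitlog"), 256);
        for (int i = 0; i < 50; i++) log.append("key" + i + "=value" + i);
    }
}
```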
The After Story
• Cassandra was released as an open source project on Google Code in July 2008 and is now developed under the Apache Software Foundation as Apache Cassandra (henceforth simply "Cassandra" in these slides).
• In Apache Cassandra, super columns were removed due to performance issues; composite columns were introduced instead.
• The Cassandra Query Language (CQL) presents a data model familiar to relational database users.
• Partitioning is still based on consistent hashing, but the original load-balancing scheme of moving nodes on the ring has given way to virtual nodes.
• The order-preserving hash function was dropped in favor of a true OrderPreservingPartitioner (later superseded by ByteOrderedPartitioner).
The After Story (contd.)
• In modern Cassandra terminology, the coordinator is the node that processes a given client's request and routes it to the appropriate replicas; it is not necessarily itself a replica.
• ZooKeeper usage was restricted to Facebook's in-house Cassandra branch.
• Modern Cassandra management tools include DataStax's OpsCenter and Netflix's Priam.
Big Players
• Facebook's Inbox Search feature was implemented on Cassandra, where every user is an index and recipients and messages are stored as columns. The system stores more than 50 TB of data on a 150-node cluster with a median search latency of approximately 15 ms.
• Netflix, a video streaming company, stores 95% of its data in Cassandra.
• eBay uses Cassandra for features such as the "own", "want", and "like" counts on its pages.
• Coursera, an online education provider, uses Cassandra for its mobile applications.
References:
• http://www.ebaytechblog.com/author/jhpatel/#.VSPslfnF8SM
• http://www.divconq.com/2010/cassandra-columns-and-supercolumns-and-rows/
• http://docs.datastax.com/en/articles/cassandra/cassandrathenandnow.html
QUESTIONS?

Editor's Notes

  • #8 Row keys and super column keys do not have any values. Column keys and super column keys are indexed and sorted by a specific type. Super column keys in different rows do not have to match and often will not.
  • #10 with Cassandra you need to think about what queries you want to support efficiently ahead of time, and model appropriately. Since there are no automatically-provided indexes, you will be much closer to one ColumnFamily per query than you would have been with tables:queries relationally. Don't be afraid to denormalize accordingly; Cassandra is much, much faster at writes than relational systems.