Cassandra
Where Did Cassandra Come From
• Cassandra originated at Facebook in 2007 to
  solve that company’s inbox search problem
  – large volumes of data
  – many random reads
  – many simultaneous random writes
• It was released as an open source Google Code
  project in July 2008
• In March 2009 it moved to the Apache Incubator
• On February 17, 2010 it was voted into a top-level
  Apache project
Cassandra in 50 Words or Less
• Apache Cassandra is an
    –   open source
    –   distributed
    –   decentralized
    –   elastically scalable
    –   highly available
    –   fault-tolerant
    –   tuneably consistent
    –   column-oriented
•   database that
    –   bases its distribution design on Amazon’s Dynamo
    –   and its data model on Google’s Bigtable
•   Created at Facebook, it is now used at some of the most popular sites on the Web
Who Is Using Cassandra
• Twitter is using Cassandra for analytics.
• Mahalo uses it for its primary near-time data store.
• Facebook still uses it for inbox search, though they are using a
  proprietary fork.
• Digg uses it for its primary near-time data store.
• Rackspace uses it for its cloud service, monitoring, and logging.
• Reddit uses it as a persistent cache.
• Cloudkick uses it for monitoring statistics and analytics.
• Ooyala uses it to store and serve near real-time video analytics
  data.
• SimpleGeo uses it as the main data store for its real-time location
  infrastructure.
• Onespot uses it for a subset of its main data store.
Decentralized
• Decentralized: all nodes are the same; the failure
  of a node won’t disrupt service
• Master/slave: if the master node fails, the whole
  database is in jeopardy
Elastic Scalability
• Add another machine and Cassandra will find it
  and start sending it work
High Availability and Fault Tolerance
ACID
• Atomic
  – All or nothing
• Consistent
  – Data moves from one valid state to another
• Isolated
  – Two transactions modifying the same data do not interfere with each other
• Durable
  – Committed writes survive failure
Brewer’s CAP Theorem
• You can strongly support only two of the three:
  – Consistency
     • All database clients will read the same value for the same
       query, even given concurrent updates
  – Availability
     • All database clients will always be able to read and write
       data
  – Partition Tolerance
     • The database can be split across multiple machines
     • It can continue functioning in the face of network
       partitions
CAP
• (diagram: the CAP triangle, pairing two of Consistency, Availability, and Partition Tolerance)
Usage
•   connect localhost/9160;
•   show cluster name;
•   show keyspaces;
•   create keyspace XXXXX;
•   use XXXXX;
•   create column family YYYYY;
•   describe keyspace XXXXX;
• set YYYYY['XiaoMing']['name'] = '小明';
• get YYYYY['XiaoMing'];
• List
• Map
• Map<row_id, Map<column_name, value>>
• Column Family 列簇
• create column family User
  with key_validation_class=UTF8Type
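The nested set/get and the Map<row_id, Map> view above can be sketched in plain Java. This is an illustrative model only, not Cassandra's actual storage code; the class and method names are made up:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a column family modeled as nested sorted maps.
// Outer key = row key; inner map = column name -> value, kept in
// comparator order like Cassandra's columns.
public class ColumnFamilySketch {
    private final Map<String, TreeMap<String, String>> rows = new TreeMap<>();

    // like: set CF[rowKey][columnName] = value
    public void set(String rowKey, String columnName, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnName, value);
    }

    // like: get CF[rowKey] -> all columns for that row
    public Map<String, String> get(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    public static void main(String[] args) {
        ColumnFamilySketch user = new ColumnFamilySketch();
        user.set("XiaoMing", "name", "小明"); // like: set User['XiaoMing']['name'] = '小明'
        System.out.println(user.get("XiaoMing")); // {name=小明}
    }
}
```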
Column family
• A container of rows; each row key maps to an ordered set of columns
Super column family
• A column family whose columns are grouped into super columns, each holding sub-columns
Clusters (Ring)
• If the first node goes down, a replica can
  respond to queries. The peer-to-peer protocol
  allows the data to replicate across nodes in a
  manner transparent to the user

• Replication factor
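The ring placement described above can be sketched as follows. This is a toy model of SimpleStrategy-style placement (hash the key to a token, then walk clockwise collecting nodes); the class name, integer tokens, and node labels are all illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: the key's token owner plus the next
// (replicationFactor - 1) nodes clockwise around the ring hold replicas.
public class RingSketch {
    private final TreeMap<Integer, String> ring = new TreeMap<>(); // token -> node

    public void addNode(int token, String name) { ring.put(token, name); }

    public List<String> replicasFor(int keyToken, int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        // Start at the first node whose token >= keyToken, wrapping around.
        Integer t = ring.ceilingKey(keyToken);
        if (t == null) t = ring.firstKey();
        while (replicas.size() < Math.min(replicationFactor, ring.size())) {
            replicas.add(ring.get(t));
            t = ring.higherKey(t);
            if (t == null) t = ring.firstKey(); // wrap past the end of the ring
        }
        return replicas;
    }
}
```

With replication factor 2, a key falling between two nodes is served by the next two nodes clockwise, so losing one of them still leaves a replica to answer queries.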
Keyspaces
• Don’t create too many keyspaces

• (a keyspace is analogous to a database in an RDBMS)
Gossip protocols
• Intra-ring communication so that each node
  can have state information about the other nodes
• Runs every second
• Gossip messages:
  – Send: GossipDigestSynMessage
  – Ack: GossipDigestAckMessage
  – Send: GossipDigestAck2Message
• Failure-detection algorithm:
  – Phi Accrual Failure Detection
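The core idea of accrual failure detection can be sketched in a few lines. Instead of a boolean alive/dead verdict, each node computes a suspicion level phi that grows the longer a peer stays silent relative to its usual heartbeat interval. This sketch assumes exponentially distributed heartbeat inter-arrivals for simplicity; the real detector estimates the distribution from observed samples:

```java
// Hypothetical sketch of the Phi Accrual idea:
// phi = -log10(P(a heartbeat is still coming)).
// Under an exponential model, P = exp(-t / mean), so
// phi = (t / mean) * log10(e): it rises smoothly with silence.
public class PhiSketch {
    public static double phi(double millisSinceLastHeartbeat, double meanIntervalMillis) {
        return (millisSinceLastHeartbeat / meanIntervalMillis) * Math.log10(Math.E);
    }

    public static void main(String[] args) {
        // A node that usually gossips every 1000 ms has been silent for 8 s.
        System.out.println(phi(8000, 1000)); // high suspicion
        System.out.println(phi(500, 1000));  // low suspicion
    }
}
```

A node is marked down once phi crosses a configurable threshold, which trades detection speed against false positives.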
Anti-entropy
• Anti-entropy is the replica synchronization
  mechanism in Cassandra for ensuring that
  data on different nodes is updated to the
  newest version
• Merkle tree
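The Merkle-tree comparison behind anti-entropy can be sketched as follows. The toy hash, fixed leaf layout, and class name are illustrative assumptions, not Cassandra's actual tree; the point is that matching root hashes prove the replicas agree without comparing every row:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of anti-entropy: hash ranges of data into leaves,
// combine them up to a root, compare roots, and repair only differing leaves.
public class MerkleSketch {
    // Leaf hashes for equally sized ranges (toy hash for illustration).
    static int[] leaves(String[] data) {
        int[] h = new int[data.length];
        for (int i = 0; i < data.length; i++) h[i] = Objects.hashCode(data[i]);
        return h;
    }

    // Root hash: combine leaf hashes pairwise up the tree.
    static int root(int[] h) {
        if (h.length == 1) return h[0];
        int[] up = new int[(h.length + 1) / 2];
        for (int i = 0; i < up.length; i++) {
            int right = (2 * i + 1 < h.length) ? h[2 * i + 1] : 0;
            up[i] = 31 * h[2 * i] + right;
        }
        return root(up);
    }

    // If roots match, replicas agree; otherwise list the leaf ranges to repair.
    public static List<Integer> rangesToRepair(String[] a, String[] b) {
        List<Integer> out = new ArrayList<>();
        int[] ha = leaves(a), hb = leaves(b);
        if (root(ha) == root(hb)) return out; // in sync, nothing to stream
        for (int i = 0; i < ha.length; i++) if (ha[i] != hb[i]) out.add(i);
        return out;
    }
}
```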
Memtable & SSTable & Commit Log
• Memtable
  – Value is written to a memory-resident data structure
• SSTable
  – Include: Data, Index, and Filter
  – concept borrowed from Google’s Bigtable
  – When the memtable reaches a threshold, it is flushed to disk
• Commit log
  – Flush status flag: 0 / 1
     • 1: flush started
     • 0: flush succeeded
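The memtable-to-SSTable flow above can be sketched in a few lines. This is a toy model under assumed names; the commit log, indexes, and filters are elided, and the "SSTable" here is just a frozen in-memory copy:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of the write path: writes land in a sorted in-memory
// memtable; when it reaches a size threshold it is sealed off as an
// immutable, sorted table and a fresh memtable takes new writes.
public class MemtableSketch {
    private final int flushThreshold;
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<TreeMap<String, String>> sstables = new ArrayList<>();

    public MemtableSketch(int flushThreshold) { this.flushThreshold = flushThreshold; }

    public void write(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) flush();
    }

    private void flush() {
        sstables.add(memtable);     // becomes an immutable "on-disk" table
        memtable = new TreeMap<>(); // fresh memtable for new writes
    }

    public int sstableCount() { return sstables.size(); }
}
```

Because the memtable is already sorted, each flushed table is sorted too, which is what makes the later compaction merge cheap.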
hinted handoff & Compaction
• Hinted handoff
  – When a write's target node is unavailable
  – the coordinator creates a hint and delivers it to the node
    when it comes back online


• Compaction:
  – merges SSTables
  – the merged data is sorted
  – a new index is created over the sorted data
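The merge step can be sketched on top of the sorted tables from the flush. A sketch under assumed names, with newest-wins conflict resolution standing in for timestamp reconciliation:

```java
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of compaction: merge several sorted SSTables into
// one, keeping the newest value per key (later tables win here).
public class CompactionSketch {
    public static TreeMap<String, String> compact(List<TreeMap<String, String>> sstables) {
        TreeMap<String, String> merged = new TreeMap<>(); // result stays sorted
        for (TreeMap<String, String> table : sstables) {
            merged.putAll(table); // later (newer) tables overwrite older values
        }
        return merged; // a real compaction would also rebuild the index here
    }
}
```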
Major compaction
• During a major compaction, SSTables are merged and Bloom filters rebuilt
• Bloom filters are stored in memory and used to improve
  performance by reducing disk access on key lookups
Tombstones 墓碑
• Known as a “soft delete”
• Data is not immediately deleted when a delete
  operation executes
• Garbage Collection Grace Seconds:
  – GCGraceSeconds
     • Default: 10 days (864000 sec)
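The tombstone lifecycle can be sketched as follows. A toy model under assumed names: a delete writes a timestamped marker rather than removing the value, and garbage collection purges only markers older than GCGraceSeconds:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of tombstones ("soft deletes").
public class TombstoneSketch {
    static final long GC_GRACE_SECONDS = 864_000; // default: 10 days

    record Cell(String value, boolean tombstone, long writeTimeSeconds) {}

    private final Map<String, Cell> cells = new HashMap<>();

    public void put(String key, String value, long nowSeconds) {
        cells.put(key, new Cell(value, false, nowSeconds));
    }

    public void delete(String key, long nowSeconds) {
        cells.put(key, new Cell(null, true, nowSeconds)); // a marker, not a removal
    }

    // Garbage collection: drop only tombstones past the grace period.
    public void gc(long nowSeconds) {
        cells.values().removeIf(c ->
            c.tombstone() && nowSeconds - c.writeTimeSeconds() > GC_GRACE_SECONDS);
    }

    public boolean physicallyPresent(String key) { return cells.containsKey(key); }
}
```

The grace period gives failed replicas time to learn about the delete; purging the marker too early would let a stale replica resurrect the value.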
Staged Event-Driven Architecture
                (SEDA)
• originally proposed in a 2001 paper called “SEDA: An
  Architecture for Well-Conditioned, Scalable Internet
  Services”
• A stage consists of an incoming event queue, an event handler,
  and a thread pool; Cassandra’s stages include:
   –   Read
   –   Mutation
   –   Gossip
   –   Response
   –   Anti-Entropy
   –   Load Balance
   –   Migration
   –   Streaming
   –   …
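The stage structure above can be sketched minimally: an event queue drained by a dedicated worker, decoupling producers from the handler. A sketch under assumed names, with a single worker thread standing in for the stage's thread pool:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch of a SEDA stage: submitters enqueue events and
// return immediately; the stage's worker drains the queue in order.
public class StageSketch {
    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();
    private final Thread worker;

    public StageSketch(String name, Consumer<String> handler) {
        worker = new Thread(() -> {
            try {
                while (true) handler.accept(events.take()); // drain the queue
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop signal
            }
        }, name);
        worker.setDaemon(true);
        worker.start();
    }

    public void submit(String event) { events.add(event); }

    public void shutdown() { worker.interrupt(); }
}
```

The queue gives each stage backpressure and an observable backlog, which is the "well-conditioned" property the SEDA paper is after.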
Custom FactoryUtil
• Prevents version incompatibilities
Configuring Cassandra
• system_add_keyspace
   – Creates a keyspace.
• system_rename_keyspace
   – Changes the name of a keyspace after taking a snapshot of it. Note that this
     method blocks until its work is done.
• system_drop_keyspace
   – Deletes an entire keyspace after taking a snapshot of it.
• system_add_column_family
   – Creates a column family.
• system_drop_column_family
   – Deletes a column family after taking a snapshot of it.
• system_rename_column_family
   – Changes the name of a column family after taking a snapshot of it. Note that
     this method blocks until its work is done.
Creating a Column Family
•   column_type
      – Either Super or Standard.
•   clock_type
      – The only valid value is Timestamp.
•   comparator
– Valid options include AsciiType, BytesType, LexicalUUIDType, LongType, TimeUUIDType, and UTF8Type.
•   subcomparator
      – Name of comparator used for subcolumns when the column_type is Super. Valid options are the same as comparator.
•   reconciler
      – Name of the class that will reconcile conflicting column versions. The only valid value at this time is Timestamp.
•   comment
      – Any human-readable comment in the form of a string.
•   rows_cached
      – The number of rows to cache.
•   preload_row_cache
      – Set this to true to automatically load the row cache.
•   key_cache_size
      – The number of keys to pull into the cache.
•   read_repair_chance
      – Valid values are a number between 0.0 and 1.0.
Replicas
• Simple Strategy
  – RackUnawareStrategy
• Old Network Topology Strategy
  – RackAwareStrategy
• Network Topology Strategy
  – DataCenterShardStrategy
  – datacenter.properties
Replication Factor
• specifies how many copies of each piece of
  data will be stored and distributed throughout
  the Cassandra cluster
• Factor = 1: your data will exist only on a single
  node in the cluster. Losing that node means
  the data becomes unavailable
Increasing the Replication Factor
• As the cluster grows, you may need to increase the replication factor
• How to do it:
  – Ensure that all the data is flushed to the SSTables
     • nodetool -h 192.168.1.1 -p 9160 flush
  – Stop that node
  – Copy the data files from your keyspaces
  – Place those data files on the new node
Replica Placement Strategies
• Simple Strategy
• Old Network Topology Strategy
• Network Topology Strategy
Adding Nodes to a Cluster
• If you want to add a new seed node, then you should
  autobootstrap it first, and then change it to a seed
  afterward

• Node1:
   – listen_address: 192.168.1.1
   – rpc_address: 0.0.0.0
• Node2:
   – auto_bootstrap: true
   – listen_address: 192.168.2.34
   – rpc_address: 0.0.0.0
Hector
• Cluster myCluster = HFactory.getOrCreateCluster("Test Cluster", "192.168.2.3:9160");

• ThriftCfDef columnFamilyDefinition = new ThriftCfDef("s3", "nb", ComparatorType.UTF8TYPE);
• columnFamilyDefinition.setReplicateOnWrite(true);
Hector
• ThriftCfDef columnFamilyDefinition = new ThriftCfDef("s3", "bb", ComparatorType.UTF8TYPE);
• columnFamilyDefinition.setKeyValidationClass("org.apache.cassandra.db.marshal.UTF8Type");
• columnFamilyDefinition.setDefaultValidationClass("org.apache.cassandra.db.marshal.UTF8Type");
• //myCluster.addColumnFamily(columnFamilyDefinition);
• columnFamilyDefinition.setId(1013);
• myCluster.updateColumnFamily(columnFamilyDefinition);
Hector
• Keyspace myKeyspace = HFactory.createKeyspace("s3", myCluster);
• Mutator<String> mutator = HFactory.createMutator(myKeyspace, StringSerializer.get());

• mutator.insert("b", "bb", HFactory.createStringColumn("column1", "你好在"));
Hector
• ColumnQuery<String, String, String> q = HFactory.createColumnQuery(myKeyspace,
  StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
• // set key, name, cf and execute
• QueryResult<HColumn<String, String>> r = q
•      .setColumnFamily("bb")
•      .setKey("b")
•      .setName("column1")
•      .execute();
• // read the value from the result
• HColumn<String, String> c = r.get();
• String value = c.getValue();
• System.out.println(value);
