Introduction to Apache Cassandra for Java Developers (JavaOne)

1.
Apache Cassandra: An Introduction for Java Developers. Nate McCall [email_address] @zznate

2.
What is Apache Cassandra?

3.
CAP Theorem: Consistency, Availability, Partition Tolerance. “Thou shalt have but 2”
- Conjecture made by Eric Brewer in 2000
- Published as formal proof in 2002
- See: http://en.wikipedia.org/wiki/CAP_theorem for more

4.
Apache Cassandra Concepts
- Explicit choice of partition tolerance and availability. Consistency is tunable.
- No read before write
- Merge on read
- Idempotent
- Schema optional
- All nodes share the same role
- Still performs well with larger-than-memory data sets

5.
Generally complements another system(s) (not intended to be one-size-fits-all) *** You should always use the right tool for the right job anyway

6.
How does this differ from an RDBMS?

7.
How does this differ from an RDBMS? Substantially.

8.
vs. RDBMS - No Joins, unless: - you do them on the client - you do them via Map/Reduce

9.
vs. RDBMS - Schema Optional (though you can add meta information for validation and type checking) *** Supports secondary indexes too: “… WHERE state = 'TX'”

10.
vs. RDBMS - Prematerialized and Transaction-less - No ACID transactions - Limited support for ad-hoc queries

11.
vs. RDBMS - Prematerialized and Transaction-less - No ACID transactions - Limited support for ad-hoc queries *** You are going to give up both of these anyway when you shard an RDBMS ***

12.
vs. RDBMS - Facilitates Consolidation
It can be your caching layer
* Off-heap cache (provided you install JNA)
It can be your analytics infrastructure
* true map/reduce
* pig driver
* hive driver coming soon

13.
vs. RDBMS - Shared-Nothing Architecture. Every node plays the same role: no masters, no slaves, no special nodes *** No single point of failure

14.
vs. RDBMS - Real Linear Scalability. Want 2x performance? Add 2x nodes. *** 'No downtime' included!

15.
vs. RDBMS - Performance: Reads on par with writes

16.
Clustering

17.
Clustering Single node cluster (easy development setup) - one node owns the whole hash range

18.
Clustering Two node cluster - Key range divided between nodes

19.
Clustering Consistent Hashing: md5(“zznate”) = “C”

20.
Clustering Consistent Hashing FTW:
- Ring ownership continuously “gossiped” between nodes
- Any node can act as a “coordinator” to service client requests for any key
* requests forwarded to the appropriate nodes by coordinator transparently to the client

21.
Clustering Client Read: get(“zznate”) md5 = “C”
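
The ring lookup described on these slides can be sketched in plain Java. This is a toy model, not Cassandra's partitioner: the class name TokenRing, the node names, and the tokens are all hypothetical, and it assumes each node owns the range of hashes ending at its own token.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Toy token ring: each node owns the hash range that ends at its token. */
class TokenRing {
    // token -> node name, kept sorted so the ring can be walked in order
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    void addNode(String name, BigInteger token) {
        ring.put(token, name);
    }

    /** MD5 of the key as a non-negative integer (illustrative helper). */
    static BigInteger md5(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return new BigInteger(1, md.digest(key.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** First node whose token is >= the given token, wrapping around the ring. */
    String ownerOfToken(BigInteger token) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<BigInteger, String> e = ring.ceilingEntry(token);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    /** Any coordinator can compute this locally and forward the request. */
    String ownerOf(String key) {
        return ownerOfToken(md5(key));
    }
}
```

With a single node, every key maps to that node. Adding a second node only reassigns the keys whose hashes fall in the new node's range, which is the property that makes scale-out cheap.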

22.
Clustering – Scale Out

23.

24.

25.
Clustering - Multi-DC

26.
Clustering - Reliability

27.

28.

29.

30.
Clustering - Multi-Datacenter

31.
Clustering – Multi-DC Reliability

32.
Storage (Briefly)

33.
Storage (Briefly) Understanding the on-disk format is extremely helpful in designing your data model correctly

34.
Storage - SSTable
- SSTables are immutable (“Merge on read”)
- Newest timestamp wins
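
As a rough illustration of "merge on read", the sketch below reconciles one row's columns across several immutable tables by keeping the version with the newest timestamp. The names Cell and MergeOnRead are hypothetical, and each table is modeled as a plain in-memory map rather than the real on-disk format.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One column version as written to a table. */
final class Cell {
    final String value;
    final long timestamp;

    Cell(String value, long timestamp) {
        this.value = value;
        this.timestamp = timestamp;
    }
}

class MergeOnRead {
    /** Reconcile a row's columns across tables: the newest timestamp wins. */
    static Map<String, Cell> read(List<Map<String, Cell>> sstables) {
        Map<String, Cell> merged = new HashMap<>();
        for (Map<String, Cell> table : sstables) {
            for (Map.Entry<String, Cell> e : table.entrySet()) {
                // Keep whichever version carries the larger timestamp.
                merged.merge(e.getKey(), e.getValue(),
                        (a, b) -> b.timestamp > a.timestamp ? b : a);
            }
        }
        return merged;
    }
}
```

Because the result depends only on timestamps, the order in which the tables are visited does not matter. That is also why fewer tables (after compaction) makes the same read cheaper.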

35.
Storage – Compaction
- Merges SSTables, keeping the count down and making merge on read more efficient
- Discards tombstones (more on this later!)

36.
Data Model

37.
Data Model "...sparse, persistent, distributed, multi-dimensional sorted map." (The “Bigtable” paper)

38.
Data Model Keyspace - Collection of Column Families

39.
- Controls replication

40.
Column Family

41.
- Similar to a table

42.
- Columns ordered by name
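
One way to picture a column family is as a map of row keys to sorted maps of columns. The sketch below is an in-memory analogy only; ToyColumnFamily and its methods are hypothetical names, and columns are sorted as plain strings where a real column family would use its configured comparator.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy column family: row key -> (column name -> value), columns sorted by name. */
class ToyColumnFamily {
    // Rows are looked up by key; only the columns inside a row are ordered.
    private final Map<String, TreeMap<String, String>> rows = new HashMap<>();

    void insert(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    /** Columns of one row with names in [from, to), returned in column-name order. */
    SortedMap<String, String> slice(String rowKey, String from, String to) {
        TreeMap<String, String> row = rows.get(rowKey);
        if (row == null) {
            return new TreeMap<>();
        }
        return row.subMap(from, to);
    }
}
```

Because columns are stored in name order, a range of columns within one row is a contiguous read. Choosing column names so that the ranges you need are adjacent is the core of the modeling advice in the slides that follow.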

43.
Data Model – Column Family Static Column Family - Model my object data

44.
Dynamic Column Family

45.
- Pre-calculated query results

46.
Nothing stopping you from mixing them!

47.
Data Model – Static CF: Users
zznate: name: Nate | password: *
driftx: name: Brandon | password: *
thobbs: name: Tyler | password: *
jbellis: name: Jonathan | password: * | site: datastax.com

48.
Data Model – Prematerialized Query: Following zznate driftx thobbs jbellis driftx: thobbs: driftx: thobbs: mdennis: zznate zznate: pcmanus xedin:

49.
Data Model – Prematerialized Query. Additional examples:
- Timeline of tweets by a user
- Timeline of tweets by all of the people a user is following
- List of comments sorted by score
- List of friends grouped by state
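
To make "prematerialized" concrete, here is a small in-memory sketch of the second example, a timeline of tweets by everyone a user follows. The query result is built at write time by copying each tweet into every follower's row. The class Timelines and its methods are hypothetical, and the sketch assumes timestamps are unique per reader.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy timeline rows: the query result is assembled on write, not on read. */
class Timelines {
    private final Map<String, Set<String>> followers = new HashMap<>();
    // row key = reader; column name = tweet timestamp; column value = tweet text
    private final Map<String, TreeMap<Long, String>> timeline = new HashMap<>();

    void follow(String follower, String followed) {
        followers.computeIfAbsent(followed, k -> new HashSet<>()).add(follower);
    }

    /** Fan the tweet out to the row of every follower of the author. */
    void tweet(String author, long timestamp, String text) {
        for (String reader : followers.getOrDefault(author, new HashSet<String>())) {
            timeline.computeIfAbsent(reader, k -> new TreeMap<>())
                    .put(timestamp, author + ": " + text);
        }
    }

    /** Reading is a single-row lookup, newest first; no join is needed. */
    List<String> read(String reader) {
        TreeMap<Long, String> row = timeline.get(reader);
        if (row == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(row.descendingMap().values());
    }
}
```

The trade is more writes and duplicated data in exchange for reads that touch one row.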

50.
API Operations

51.
General categories:
- Retrieving
- Writing/Updating/Removing (all the same op!)
- Increment counters
- Meta Information
- Schema Manipulation
- CQL Execution

52.
Using a Client
Hector Client: http://hector-client.org
- Most popular Java client
- In use at very large installations
- A number of tools and utilities built on top
- Very active community
- MIT Licensed
*** Like any open source project fully dependent on another open source project, it has its warts

53.
Sample Project for Experimenting
https://github.com/zznate/cassandra-tutorial
https://github.com/zznate/hector-examples
- Built using Hector
- Really basic: designed to be beginner level w/ very few moving parts
- Modify/abuse/alter as needed
*** Descriptions of what is going on and how to run each example are in the Javadoc comments.

54.
ColumnFamilyTemplate: familiar, type-safe approach
- based on template-method design pattern
- generic: ColumnFamilyTemplate<K,N> (K is the key type, N the column name type)
ColumnFamilyTemplate template = new ThriftColumnFamilyTemplate(keyspaceName, columnFamilyName, StringSerializer.get(), StringSerializer.get());
*** (no generics for clarity)

55.
ColumnFamilyTemplate
new ThriftColumnFamilyTemplate(keyspaceName, columnFamilyName, StringSerializer.get(), StringSerializer.get());
- First serializer: key format
- Second serializer: column name format
- Cassandra calls this a “comparator”
- Remember: defines column order in on-disk format

56.
ColumnFamilyTemplate
ColumnFamilyResult<String, String> res = cft.queryColumns("zznate");
String value = res.getString("email");
Date startDate = res.getDate("startDate");
(<String, String>: key format, column name format)

57.
ColumnFamilyTemplate: querying multiple rows and iterating over results
ColumnFamilyResult wrapper = template.queryColumns("zznate", "patricioe", "thobbs");
String nateEmail = wrapper.getString("email");
wrapper.next();
String patoEmail = wrapper.getString("email");
wrapper.next();
String tylerEmail = wrapper.getString("email");

58.
ColumnFamilyTemplate: inserting data with ColumnFamilyUpdater
ColumnFamilyUpdater updater = template.createUpdater("zznate");
updater.setString("companyName", "DataStax");
updater.addKey("sergek");
updater.setString("companyName", "PrestoSports");
template.update(updater);

59.
ColumnFamilyTemplate: deleting data with ColumnFamilyTemplate
template.deleteColumn("zznate", "notNeededStuff");
template.deleteColumn("zznate", "somethingElse");
template.deleteColumn("patricioe", "aDifferentColumnName");
...
template.deleteRow("someuser");
template.executeBatch();

60.
Deletion

61.
Deletion Again: Every mutation is an insert!

62.
- Merge on read

63.
- SSTables are immutable

64.
- Highest timestamp wins
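
The points above imply that a delete is just another write: a marker (tombstone) with a timestamp that shadows older values at read time. The sketch below models one row as a list of immutable tables. All names (Version, AppendOnlyRow) are hypothetical, a null value stands in for the tombstone, and compact() drops tombstones immediately, whereas Cassandra keeps them for a configurable grace period (gc_grace_seconds) first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A column version; a null value marks a tombstone. */
final class Version {
    final String value;
    final long timestamp;

    Version(String value, long timestamp) {
        this.value = value;
        this.timestamp = timestamp;
    }

    boolean isTombstone() {
        return value == null;
    }
}

/** Toy append-only store for a single row. */
class AppendOnlyRow {
    private final List<Map<String, Version>> sstables = new ArrayList<>();

    void insert(String column, String value, long timestamp) {
        write(column, new Version(value, timestamp));
    }

    /** A delete writes a tombstone, even if the column never existed. */
    void delete(String column, long timestamp) {
        write(column, new Version(null, timestamp));
    }

    private void write(String column, Version v) {
        Map<String, Version> table = new HashMap<>();
        table.put(column, v);
        sstables.add(table); // existing tables are never modified
    }

    private Map<String, Version> merge() {
        Map<String, Version> merged = new HashMap<>();
        for (Map<String, Version> table : sstables) {
            for (Map.Entry<String, Version> e : table.entrySet()) {
                merged.merge(e.getKey(), e.getValue(),
                        (a, b) -> b.timestamp > a.timestamp ? b : a);
            }
        }
        return merged;
    }

    /** Returns null when the column is missing or its newest version is a tombstone. */
    String get(String column) {
        Version v = merge().get(column);
        return (v == null || v.isTombstone()) ? null : v.value;
    }

    boolean hasTombstone(String column) {
        Version v = merge().get(column);
        return v != null && v.isTombstone();
    }

    /** Rewrite all tables as one and drop tombstones (no grace period in this sketch). */
    void compact() {
        Map<String, Version> merged = merge();
        merged.values().removeIf(Version::isTombstone);
        sstables.clear();
        sstables.add(merged);
    }

    int tableCount() {
        return sstables.size();
    }
}
```

Until compaction runs, the tombstone is still stored and still read, which is why a row can show up in a listing with no columns.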

65.
Deletion – As Seen by CLI
[default@Tutorial] list StateCity;
Using default limit of 100
-------------------
RowKey: CA Burlingame
=> (column=650, value=33372e3537783132322e3334, timestamp=1310340410528000)
-------------------
RowKey: TX Austin

71.

72.

73.

74.

75.

76.
Deletion – As Seen by CLI
[default@Tutorial] list StateCity;
Using default limit of 100
-------------------
RowKey: CA Burlingame
-------------------
RowKey: TX Austin

81.

82.

83.

84.

85.

86.
Deletion – FYI
mutator.addDeletion("202230", "Npanxx", "city", stringSerializer);
Does not exist? You just inserted a tombstone!
Sending a deletion for a non-existing row:
[default@Tutorial] list Npanxx;
Using default limit of 100
. . .
-------------------
RowKey: 202230
-------------------
. . .

92.
Integrating with existing patterns

93.
Integrating with existing patterns “Yes.”

94.
Integrating with existing patterns
<bean id="cassandraHostConfigurator"
      class="me.prettyprint.cassandra.service.CassandraHostConfigurator">
  <constructor-arg value="localhost:9170"/>
</bean>
<bean id="cluster" class="me.prettyprint.cassandra.service.ThriftCluster">
  <constructor-arg value="TestCluster"/>
  <constructor-arg ref="cassandraHostConfigurator"/>
</bean>
<bean id="consistencyLevelPolicy" class="me.prettyprint.cassandra.model.ConfigurableConsistencyLevel">
  <property name="defaultReadConsistencyLevel" value="ONE"/>
</bean>
<bean id="keyspaceOperator" class="me.prettyprint.hector.api.factory.HFactory"
      factory-method="createKeyspace">
  <constructor-arg value="Keyspace1"/>
  <constructor-arg ref="cluster"/>
  <constructor-arg ref="consistencyLevelPolicy"/>
</bean>
<bean id="simpleCassandraDao" class="me.prettyprint.cassandra.dao.SimpleCassandraDao">
  <property name="keyspace" ref="keyspaceOperator"/>
  <property name="columnFamilyName" value="Standard1"/>
</bean>

115.
Integrating with existing patterns: Hector Object Mapper:

116.
https://github.com/rantav/hector/wiki/Hector-Object-Mapper-%28HOM%29

117.
Hector JPA:

118.
https://github.com/riptano/hector-jpa

119.
Integrating with existing patterns: CQL: JDBC Driver and Pool in 1.0!

120.
JdbcTemplate FTW!

121.
Development Resources: Hector Documentation http://hector-client.org Cassandra Maven Plugin http://mojo.codehaus.org/cassandra-maven-plugin/

122.
CCM localhost Cassandra cluster https://github.com/pcmanus/ccm

123.
OpsCenter http://www.datastax.com/products/opscenter Cassandra AMIs https://github.com/riptano/CassandraClusterAMI

124.
Putting it Together

125.
Take control of consistency: If you do need a high degree of consistency, use thresholds to trigger different behavior

126.
- Bank account:

127.
“on values over $10,000, wait to hear from all replicas”

128.
- Distributed Shopping Cart:

129.
Show a confirmation page to verify order resolution

130.
*** What is your appetite for risk?
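
A threshold rule like the bank-account example could be sketched as a small policy class. Everything here is illustrative: ConsistencyPolicy is a hypothetical name, the choice of QUORUM below the threshold is an assumption, and the enum only mirrors the names of Cassandra's ONE, QUORUM and ALL levels rather than using a client library.

```java
/** Toy policy: pick a consistency level per operation from the amount at stake. */
class ConsistencyPolicy {
    enum Level { ONE, QUORUM, ALL }

    // $10,000 expressed in cents, the figure used on the slide
    static final long THRESHOLD_CENTS = 10_000_00L;

    /** Above the threshold, wait for every replica; otherwise settle for a quorum (assumed). */
    static Level forTransfer(long amountCents) {
        return amountCents > THRESHOLD_CENTS ? Level.ALL : Level.QUORUM;
    }

    /** Number of replicas that must respond before the operation is acknowledged. */
    static int requiredReplicas(Level level, int replicationFactor) {
        switch (level) {
            case ONE:
                return 1;
            case QUORUM:
                return replicationFactor / 2 + 1;
            default:
                return replicationFactor;
        }
    }
}
```

With a replication factor of 3, a quorum needs 2 replicas and ALL needs 3, so the stricter level trades availability (one node down blocks the operation) for certainty.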

131.
Uniquely identify operations in the application: facilitates idempotent behavior and out-of-order execution
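
A minimal sketch of the idea, assuming the client generates a unique ID for each operation and the store keys each write by that ID (in Cassandra terms, roughly one column per operation). IdempotentLedger is a hypothetical name, and the in-memory map stands in for a row.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Toy ledger: each operation carries a client-generated ID, so replays are harmless. */
class IdempotentLedger {
    // operation ID -> amount; writing the same ID again stores the same value
    private final Map<UUID, Long> operations = new HashMap<>();

    void apply(UUID operationId, long amountCents) {
        operations.put(operationId, amountCents);
    }

    /** The balance is derived from the set of operations, not from their arrival order. */
    long balance() {
        long sum = 0;
        for (long amount : operations.values()) {
            sum += amount;
        }
        return sum;
    }
}
```

A retry after a timeout can resend the same operation safely, and two clients applying operations in different orders still converge on the same balance.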

132.
Denormalization: The point of normalization is to avoid update anomalies

133.
*** But in an append-only system, we don't do updates

134.
Summary - Take advantage of strengths

135.
- Look for idempotence and asynchronicity in your business processes

136.
- If it's not in the API, you are probably doing it wrong

137.
- Seek death is still possible if you model incorrectly

138.
Questions Nate McCall [email_address] @zznate

139.
Additional Resources DataStax Documentation: http://www.datastax.com/docs/0.8/index

140.
Apache Cassandra project wiki: http://wiki.apache.org/cassandra/

141.
“The Dynamo Paper”

142.
http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

143.
P. Helland. Building on Quicksand

144.
http://arxiv.org/pdf/0909.1788

145.
P. Helland. Life Beyond Distributed Transactions

146.
http://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf

147.
S. Anand. “Netflix's Transition to High-Availability Storage Systems”

148.
http://media.amazonwebservices.com/Netflix_Transition_to_a_Key_v3.pdf

149.
“The Megastore Paper”

150.
http://research.google.com/pubs/archive/36971.pdf
