C* Summit 2013: Can't we all just get along? MariaDB and Cassandra by Colin Charles

#CASSANDRA13
Colin Charles | colin@mariadb.org | SkySQL Ab | http://mariadb.org/
@bytebot on Twitter | http://bytebot.net/blog/
MariaDB and Cassandra Interoperability

#CASSANDRA13
whoami
*Work on MariaDB today
*Formerly of MySQL AB (USD$1bil acquisition -> Sun Microsystems)
*Worked on The Fedora Project & OpenOffice.org previously
*Monty Program Ab is a major sponsor of MariaDB
*SkySQL & Monty Program Ab merge
*MariaDB governed by MariaDB Foundation

#CASSANDRA13
What we will discuss today...
*What is MariaDB?
*MariaDB Architecture
*The Cassandra Storage Engine (CassandraSE)
*Data & Command Mapping
*Use Cases
*Benchmarks
*Conclusions

#CASSANDRA13
What is MariaDB?
*Community developed, feature enhanced, backward compatible MySQL
*Drop-in replacement to MySQL
*Shipped in many Linux distributions as a default
*Enhanced features: threadpool, table elimination, optimizer changes
(subqueries materialize!), group commit in the replication binary log,
HandlerSocket, SphinxSE, multi-source replication, dynamic columns

#CASSANDRA13
MariaDB/MySQL and NoSQL
*HandlerSocket
*memcached access to InnoDB
*Hadoop Applier
*LevelDB Storage Engine
*Cassandra Storage Engine
*CONNECT Engine

#CASSANDRA13
Dynamic Columns
*Store a different set of columns for every row in the table
*Basically a blob with handling functions (GET, CREATE, ADD, DELETE,
EXISTS, LIST, JSON)
*Dynamic columns can be nested
*You can request rows in JSON format
*You can now name dynamic columns as well
INSERT INTO tbl SET
dyncol_blob=COLUMN_CREATE("column_name", "value");

#CASSANDRA13
Cassandra background
*Distributed key/value store (limited range scan support),
optionally flexible schema (pre-defined “static” columns,
ad-hoc dynamic columns), automatic sharding/
replication, eventual consistency
*Column families are like “tables”
*Row key -> column mapping
*Supercolumns are not supported

#CASSANDRA13
CQL at work
cqlsh> CREATE KEYSPACE mariadbtest
... WITH REPLICATION ={'class':'SimpleStrategy','replication_factor':1};
cqlsh> use mariadbtest;
cqlsh:mariadbtest> create columnfamily cf1 ( pk varchar primary key, data1 varchar, data2 bigint ) with compactstorage;
cqlsh:mariadbtest> insert into cf1 (pk, data1,data2) values ('row1', 'data-in-cassandra', 1234);
cqlsh:mariadbtest> select * from cf1;
pk | data1 | data2
------+-------------------+-------
row1 | data-in-cassandra | 1234
cqlsh:mariadbtest> select * from cf1 where pk='row1';
pk | data1 | data2
------+-------------------+-------
cqlsh:mariadbtest> select * from cf1 where data2=1234;
Bad Request: No indexed columns present in by-columns clause with Equal operator
cqlsh:mariadbtest> select * from cf1 where pk='row1' or pk='row2';
Bad Request: line 1:34 missing EOF at 'or'

#CASSANDRA13
CQL
*Looks like SQL at first glance
*No joins or subqueries
*No GROUP BY, ORDER BY must be able to use available indexes
*WHERE clause must represent an index lookup
*Simple goal of the Cassandra Storage Engine? Provide a “view” of
Cassandra’s data from MariaDB

#CASSANDRA13
Getting started
*Get MariaDB 10.0.2 from https://downloads.mariadb.org/
*Load the Cassandra plugin
- From SQL:
MariaDB [(none)]> install plugin cassandra soname 'ha_cassandra.so';
- Or start it from my.cnf
[mysqld]
...
plugin-load=ha_cassandra.so

#CASSANDRA13
Create an SQL table which is a view of a column family
MariaDB [test]> set global cassandra_default_thrift_host='10.196.2.113';
MariaDB [test]> create table t2 (pk varchar(36) primary key,
-> data1 varchar(60),
-> data2 bigint
-> ) engine=cassandra
-> keyspace='mariadbtest'
-> thrift_host='10.196.2.113'
-> column_family='cf1';
*thrift_host can be set per-table
*@@cassandra_default_thrift_host allows to re-point the table to different node
dynamically, and not change table DDL when Cassandra IP changes

#CASSANDRA13
Potential issues
*SELinux blocks the connection
ERROR 1429 (HY000): Unable to connect to foreign data source:
connect() failed: Permission denied [1]
*Disable SELinux: echo 0 > /selinux/enforce
*Cassandra 1.2 with Column Families (CFs) without “COMPACT
STORAGE”
ERROR 1429 (HY000): Unable to connect to foreign data source:
Column family cf1 not found in keyspace mariadbtest
*Change in Cassandra 1.2 that broke Pig as well; we’ll update this soon

#CASSANDRA13
Data mapping between Cassandra and SQL
create table tbl (
pk varchar(36) primary key,
data1 varchar(60),
data2 bigint
) engine=cassandra keyspace='ks1' column_family='cf1'
*MariaDB table represents Cassandra’s Column Family
- can use any table name, column_family=... specifies CF

#CASSANDRA13
create table tbl (
data1 varchar(60),
data2 bigint
*Table must have a primary key
- name/type must match Cassandra’s rowkey

#CASSANDRA13
create table tbl (
data1 varchar(60),
data2 bigint
*Table must have a primary key
- name/type must match Cassandra’s rowkey
*Columns map to Cassandra’s static columns
- name must be same as in Cassandra, datatypes must match, can be subset of CF’s columns

#CASSANDRA13
Datatype mapping
Cassandra MariaDB
blob BLOB, VARBINARY(n)
ascii BLOB, VARCHAR(n), use charset=latin1
text BLOB, VARCHAR(n), use charset=utf8
varint VARBINARY(n)
int INT
bigint BIGINT, TINY, SHORT
uuid CHAR(36) (text in MariaDB)
timestamp TIMESTAMP (second), TIMESTAMP(6) (microsecond), BIGINT
boolean BOOL
float FLOAT
double DOUBLE
decimal VARBINARY(n)
counter BIGINT

#CASSANDRA13
Dynamic columns revisited
*Cassandra supports “dynamic column families”, can access ad-hoc
columns
create table tbl
(
rowkey type PRIMARY KEY
column1 type,
...
dynamic_cols blob DYNAMIC_COLUMN_STORAGE=yes
) engine=cassandra keyspace=... column_family=...;
insert into tbl values (1, column_create('col1', 1, 'col2', 'value-2'));
select rowkey, column_get(dynamic_cols, 'uuidcol' as char) from tbl;

#CASSANDRA13
All data mapping is safe
*CassandraSE will refuse incorrect mappings (throw errors)
create table t3 (pk varchar(60) primary key, no_such_field int)
engine=cassandra `keyspace`='mariadbtest' `column_family`='cf1';
ERROR 1928 (HY000): Internal error: 'Field `no_such_field` could not be mapped to any field in Cassandra'
create table t3 (pk varchar(60) primary key, data1 double)
engine=cassandra `keyspace`='mariadbtest' `column_family`='cf1';
ERROR 1928 (HY000): Internal error: 'Failed to map column data1 to datatype org.apache.cassandra.db.marshal.UTF8Type'

#CASSANDRA13
Command Mapping
*Cassandra commands
- PUT (upsert)
- GET (performs a scan)
- DELETE (if exists)
*SQL commands
- SELECT -> GET/Scan
- INSERT -> PUT (upsert)
- UPDATE/DELETE -> read/write

#CASSANDRA13
SELECT command mapping
*MariaDB has an SQL interpreter
*CassandraSE supports lookups and scans
*Can now do:
- arbitrary WHERE clauses
- JOINs between Cassandra tables and MariaDB tables (BKA
supported)

#CASSANDRA13
Batched Key Access is fast!
select max(l_extendedprice) from orders, lineitem where
o_orderdate between $DATE1 and $DATE2 and
l_orderkey=o_orderkey

#CASSANDRA13
DML command mapping
*No SQL semantics
- INSERT overwrites rows
- UPDATE reads, then writes (have you updated what you read?)
- DELETE reads, then writes (can’t be sure if/what you’ve deleted)
*CassandraSE doesn’t make it SQL!

#CASSANDRA13
CassandraSE use cases
*Collect massive amounts of data like web page hits
*Collect massive amounts of data from sensors
*Updates are non-conflicting
- keyed by UUIDs, timestamps
*Reads are served with one lookup
*Good for certain kinds of data (though moving from SQL entirely may be
difficult)

#CASSANDRA13
Access Cassandra data from SQL
*Send an update to Cassandra
- be a sensor
*Get a piece of data from Cassandra
- This webpage was last viewed by...
- Last known position of this user was...
- You are user number n of n-thousands...

#CASSANDRA13
From MariaDB...
*Want a table that is:
- auto-replicated
- fault-tolerant
- very fast
*Get Cassandra and create a CassandraSE table

#CASSANDRA13
CassandraSE non-use cases
*Huge, sift through all data joins?
- use Pig
*Bulk data transfer to/from Cassandra Cluster?
- use Sqoop
*A replacement for InnoDB?
- remember no full SQL semantics, InnoDB is useful for myriad
reasons

#CASSANDRA13
A tiny benchmark
*One table
*Amazon EC2 environment
- m1.large nodes
- ephemeral disks
*Stream of single-line INSERTs
*Tried InnoDB & CassandraSE
*No tuning

#CASSANDRA13
A tiny benchmark II
*InnoDB with tuning, same
setup as before

#CASSANDRA13
Conclusions
*CassandraSE can be used to peek at data in Cassandra from MariaDB
*It is not a replacement for Pig/Hive
*It is really easy to setup & use

#CASSANDRA13
Roadmap
*Do you want support for:
- fast counter column updates?
- awareness/discovery of Cassandra cluster topology?
- secondary indexes?
- ... ?

#CASSANDRA13
Resources
*https://kb.askmonty.org/en/cassandrase/
*http://wiki.apache.org/cassandra/DataModel
*http://cassandra.apache.org/
*http://www.datastax.com/docs/1.1/ddl/column_family

#CASSANDRA13
THANK YOU
Colin Charles | colin@mariadb.org | SkySQL Ab | http://mariadb.org/
@bytebot on Twitter | http://bytebot.net/blog/

#CASSANDRA13
Cassandra SE internals
*Developed against Cassandra 1.1
*Uses Thrift API
- cannot stream CQL resultset in 1.1
- cannot use secondary indexes
*Only supports AllowAllAuthenticator
*In Cassandra 1.2
- “CQL Binary Protocol” with streaming
- CASSANDRA-5234: Thrift can only read CFs “WITH COMPACT STORAGE”

C* Summit 2013: Can't we all just get along? MariaDB and Cassandra by Colin Charles

More Related Content

What's hot

Similar to C* Summit 2013: Can't we all just get along? MariaDB and Cassandra by Colin Charles

More from DataStax Academy

Recently uploaded

C* Summit 2013: Can't we all just get along? MariaDB and Cassandra by Colin Charles