#CASSANDRA13
Colin Charles | colin@mariadb.org | SkySQL Ab | http://mariadb.org/
@bytebot on Twitter | http://bytebot.net/blog/
MariaDB and Cassandra Interoperability
#CASSANDRA13
whoami
*Work on MariaDB today
*Formerly of MySQL AB (USD$1bil acquisition -> Sun Microsystems)
*Worked on The Fedora Project & OpenOffice.org previously
*Monty Program Ab is a major sponsor of MariaDB
*SkySQL & Monty Program Ab merge
*MariaDB governed by MariaDB Foundation
#CASSANDRA13
What we will discuss today...
*What is MariaDB?
*MariaDB Architecture
*The Cassandra Storage Engine (CassandraSE)
*Data & Command Mapping
*Use Cases
*Benchmarks
*Conclusions
#CASSANDRA13
What is MariaDB?
*Community developed, feature enhanced, backward compatible MySQL
*Drop-in replacement to MySQL
*Shipped in many Linux distributions as a default
*Enhanced features: threadpool, table elimination, optimizer changes
(subqueries materialize!), group commit in the replication binary log,
HandlerSocket, SphinxSE, multi-source replication, dynamic columns
#CASSANDRA13
#CASSANDRA13
MariaDB/MySQL and NoSQL
*HandlerSocket
*memcached access to InnoDB
*Hadoop Applier
*LevelDB Storage Engine
*Cassandra Storage Engine
*CONNECT Engine
#CASSANDRA13
Dynamic Columns
*Store a different set of columns for every row in the table
*Basically a blob with handling functions (GET, CREATE, ADD, DELETE,
EXISTS, LIST, JSON)
*Dynamic columns can be nested
*You can request rows in JSON format
*You can now name dynamic columns as well
INSERT INTO tbl SET
dyncol_blob=COLUMN_CREATE("column_name", "value");
#CASSANDRA13
Cassandra background
*Distributed key/value store (limited range scan support),
optionally flexible schema (pre-defined “static” columns,
ad-hoc dynamic columns), automatic sharding/
replication, eventual consistency
*Column families are like “tables”
*Row key -> column mapping
*Supercolumns are not supported
#CASSANDRA13
CQL at work
cqlsh> CREATE KEYSPACE mariadbtest
... WITH REPLICATION ={'class':'SimpleStrategy','replication_factor':1};
cqlsh> use mariadbtest;
cqlsh:mariadbtest> create columnfamily cf1 ( pk varchar primary key, data1 varchar, data2 bigint ) with compactstorage;
cqlsh:mariadbtest> insert into cf1 (pk, data1,data2) values ('row1', 'data-in-cassandra', 1234);
cqlsh:mariadbtest> select * from cf1;
pk | data1 | data2
------+-------------------+-------
row1 | data-in-cassandra | 1234
cqlsh:mariadbtest> select * from cf1 where pk='row1';
pk | data1 | data2
------+-------------------+-------
row1 | data-in-cassandra | 1234
cqlsh:mariadbtest> select * from cf1 where data2=1234;
Bad Request: No indexed columns present in by-columns clause with Equal operator
cqlsh:mariadbtest> select * from cf1 where pk='row1' or pk='row2';
Bad Request: line 1:34 missing EOF at 'or'
#CASSANDRA13
CQL
*Looks like SQL at first glance
*No joins or subqueries
*No GROUP BY, ORDER BY must be able to use available indexes
*WHERE clause must represent an index lookup
*Simple goal of the Cassandra Storage Engine? Provide a “view” of
Cassandra’s data from MariaDB
#CASSANDRA13
Getting started
*Get MariaDB 10.0.2 from https://downloads.mariadb.org/
*Load the Cassandra plugin
- From SQL:
MariaDB [(none)]> install plugin cassandra soname 'ha_cassandra.so';
- Or start it from my.cnf
[mysqld]
...
plugin-load=ha_cassandra.so
#CASSANDRA13
Is everything ok?
*Check to see that it is loaded - SHOW PLUGINS
MariaDB [(none)]> show plugins;
+--------------------+--------+-----------------+-----------------+---------+
| Name | Status | Type | Library | License |
+--------------------+--------+-----------------+-----------------+---------+
...
| CASSANDRA | ACTIVE | STORAGE ENGINE | ha_cassandra.so | GPL |
+--------------------+--------+-----------------+-----------------+---------+
#CASSANDRA13
Create an SQL table which is a view of a column family
MariaDB [test]> set global cassandra_default_thrift_host='10.196.2.113';
MariaDB [test]> create table t2 (pk varchar(36) primary key,
-> data1 varchar(60),
-> data2 bigint
-> ) engine=cassandra
-> keyspace='mariadbtest'
-> thrift_host='10.196.2.113'
-> column_family='cf1';
*thrift_host can be set per-table
*@@cassandra_default_thrift_host allows to re-point the table to different node
dynamically, and not change table DDL when Cassandra IP changes
#CASSANDRA13
Potential issues
*SELinux blocks the connection
ERROR 1429 (HY000): Unable to connect to foreign data source:
connect() failed: Permission denied [1]
*Disable SELinux: echo 0 > /selinux/enforce
*Cassandra 1.2 with Column Families (CFs) without “COMPACT
STORAGE”
ERROR 1429 (HY000): Unable to connect to foreign data source:
Column family cf1 not found in keyspace mariadbtest
*Change in Cassandra 1.2 that broke Pig as well; we’ll update this soon
#CASSANDRA13
Accessing Cassandra data from MariaDB
*Get data from Cassandra
MariaDB [test]> select * from t2;
+------+-------------------+-------+
| pk | data1 | data2 |
+------+-------------------+-------+
| row1 | data-in-cassandra | 1234 |
+------+-------------------+-------+
*Insert data into Cassandra
MariaDB [test]> insert into t2 values ('row2','data-from-mariadb', 123);
*Ensure Cassandra sees inserted data
cqlsh:mariadbtest> select * from cf1;
pk | data1 | data2
------+-------------------+-------
row1 | data-in-cassandra | 1234
row2 | data-from-mariadb | 123
#CASSANDRA13
Data mapping between Cassandra and SQL
create table tbl (
pk varchar(36) primary key,
data1 varchar(60),
data2 bigint
) engine=cassandra keyspace='ks1' column_family='cf1'
*MariaDB table represents Cassandra’s Column Family
- can use any table name, column_family=... specifies CF
#CASSANDRA13
Data mapping between Cassandra and SQL
create table tbl (
pk varchar(36) primary key,
data1 varchar(60),
data2 bigint
) engine=cassandra keyspace='ks1' column_family='cf1'
*MariaDB table represents Cassandra’s Column Family
- can use any table name, column_family=... specifies CF
*Table must have a primary key
- name/type must match Cassandra’s rowkey
#CASSANDRA13
Data mapping between Cassandra and SQL
create table tbl (
pk varchar(36) primary key,
data1 varchar(60),
data2 bigint
) engine=cassandra keyspace='ks1' column_family='cf1'
*MariaDB table represents Cassandra’s Column Family
- can use any table name, column_family=... specifies CF
*Table must have a primary key
- name/type must match Cassandra’s rowkey
*Columns map to Cassandra’s static columns
- name must be same as in Cassandra, datatypes must match, can be subset of CF’s columns
#CASSANDRA13
Datatype mapping
Cassandra MariaDB
blob BLOB, VARBINARY(n)
ascii BLOB, VARCHAR(n), use charset=latin1
text BLOB, VARCHAR(n), use charset=utf8
varint VARBINARY(n)
int INT
bigint BIGINT, TINY, SHORT
uuid CHAR(36) (text in MariaDB)
timestamp TIMESTAMP (second), TIMESTAMP(6) (microsecond), BIGINT
boolean BOOL
float FLOAT
double DOUBLE
decimal VARBINARY(n)
counter BIGINT
#CASSANDRA13
Dynamic columns revisited
*Cassandra supports “dynamic column families”, can access ad-hoc
columns
create table tbl
(
rowkey type PRIMARY KEY
column1 type,
...
dynamic_cols blob DYNAMIC_COLUMN_STORAGE=yes
) engine=cassandra keyspace=... column_family=...;
insert into tbl values (1, column_create('col1', 1, 'col2', 'value-2'));
select rowkey, column_get(dynamic_cols, 'uuidcol' as char) from tbl;
#CASSANDRA13
All data mapping is safe
*CassandraSE will refuse incorrect mappings (throw errors)
create table t3 (pk varchar(60) primary key, no_such_field int)
engine=cassandra `keyspace`='mariadbtest' `column_family`='cf1';
ERROR 1928 (HY000): Internal error: 'Field `no_such_field` could not be mapped to any field in Cassandra'
create table t3 (pk varchar(60) primary key, data1 double)
engine=cassandra `keyspace`='mariadbtest' `column_family`='cf1';
ERROR 1928 (HY000): Internal error: 'Failed to map column data1 to datatype org.apache.cassandra.db.marshal.UTF8Type'
#CASSANDRA13
Command Mapping
*Cassandra commands
- PUT (upsert)
- GET (performs a scan)
- DELETE (if exists)
*SQL commands
- SELECT -> GET/Scan
- INSERT -> PUT (upsert)
- UPDATE/DELETE -> read/write
#CASSANDRA13
SELECT command mapping
*MariaDB has an SQL interpreter
*CassandraSE supports lookups and scans
*Can now do:
- arbitrary WHERE clauses
- JOINs between Cassandra tables and MariaDB tables (BKA
supported)
#CASSANDRA13
Batched Key Access is fast!
select max(l_extendedprice) from orders, lineitem where
o_orderdate between $DATE1 and $DATE2 and
l_orderkey=o_orderkey
#CASSANDRA13
DML command mapping
*No SQL semantics
- INSERT overwrites rows
- UPDATE reads, then writes (have you updated what you read?)
- DELETE reads, then writes (can’t be sure if/what you’ve deleted)
*CassandraSE doesn’t make it SQL!
#CASSANDRA13
CassandraSE use cases
*Collect massive amounts of data like web page hits
*Collect massive amounts of data from sensors
*Updates are non-conflicting
- keyed by UUIDs, timestamps
*Reads are served with one lookup
*Good for certain kinds of data (though moving from SQL entirely may be
difficult)
#CASSANDRA13
Access Cassandra data from SQL
*Send an update to Cassandra
- be a sensor
*Get a piece of data from Cassandra
- This webpage was last viewed by...
- Last known position of this user was...
- You are user number n of n-thousands...
#CASSANDRA13
From MariaDB...
*Want a table that is:
- auto-replicated
- fault-tolerant
- very fast
*Get Cassandra and create a CassandraSE table
#CASSANDRA13
CassandraSE non-use cases
*Huge, sift through all data joins?
- use Pig
*Bulk data transfer to/from Cassandra Cluster?
- use Sqoop
*A replacement for InnoDB?
- remember no full SQL semantics, InnoDB is useful for myriad
reasons
#CASSANDRA13
A tiny benchmark
*One table
*Amazon EC2 environment
- m1.large nodes
- ephemeral disks
*Stream of single-line INSERTs
*Tried InnoDB & CassandraSE
*No tuning
#CASSANDRA13
A tiny benchmark II
*InnoDB with tuning, same
setup as before
#CASSANDRA13
Conclusions
*CassandraSE can be used to peek at data in Cassandra from MariaDB
*It is not a replacement for Pig/Hive
*It is really easy to setup & use
#CASSANDRA13
Roadmap
*Do you want support for:
- fast counter column updates?
- awareness/discovery of Cassandra cluster topology?
- secondary indexes?
- ... ?
#CASSANDRA13
Resources
*https://kb.askmonty.org/en/cassandrase/
*http://wiki.apache.org/cassandra/DataModel
*http://cassandra.apache.org/
*http://www.datastax.com/docs/1.1/ddl/column_family
#CASSANDRA13
THANK YOU
Colin Charles | colin@mariadb.org | SkySQL Ab | http://mariadb.org/
@bytebot on Twitter | http://bytebot.net/blog/
#CASSANDRA13
Cassandra SE internals
*Developed against Cassandra 1.1
*Uses Thrift API
- cannot stream CQL resultset in 1.1
- cannot use secondary indexes
*Only supports AllowAllAuthenticator
*In Cassandra 1.2
- “CQL Binary Protocol” with streaming
- CASSANDRA-5234: Thrift can only read CFs “WITH COMPACT STORAGE”

C* Summit 2013: Can't we all just get along? MariaDB and Cassandra by Colin Charles

  • 1.
    #CASSANDRA13 Colin Charles |colin@mariadb.org | SkySQL Ab | http://mariadb.org/ @bytebot on Twitter | http://bytebot.net/blog/ MariaDB and Cassandra Interoperability
  • 2.
    #CASSANDRA13 whoami *Work on MariaDBtoday *Formerly of MySQL AB (USD$1bil acquisition -> Sun Microsystems) *Worked on The Fedora Project & OpenOffice.org previously *Monty Program Ab is a major sponsor of MariaDB *SkySQL & Monty Program Ab merge *MariaDB governed by MariaDB Foundation
  • 3.
    #CASSANDRA13 What we willdiscuss today... *What is MariaDB? *MariaDB Architecture *The Cassandra Storage Engine (CassandraSE) *Data & Command Mapping *Use Cases *Benchmarks *Conclusions
  • 4.
    #CASSANDRA13 What is MariaDB? *Communitydeveloped, feature enhanced, backward compatible MySQL *Drop-in replacement to MySQL *Shipped in many Linux distributions as a default *Enhanced features: threadpool, table elimination, optimizer changes (subqueries materialize!), group commit in the replication binary log, HandlerSocket, SphinxSE, multi-source replication, dynamic columns
  • 5.
  • 6.
    #CASSANDRA13 MariaDB/MySQL and NoSQL *HandlerSocket *memcachedaccess to InnoDB *Hadoop Applier *LevelDB Storage Engine *Cassandra Storage Engine *CONNECT Engine
  • 7.
    #CASSANDRA13 Dynamic Columns *Store adifferent set of columns for every row in the table *Basically a blob with handling functions (GET, CREATE, ADD, DELETE, EXISTS, LIST, JSON) *Dynamic columns can be nested *You can request rows in JSON format *You can now name dynamic columns as well INSERT INTO tbl SET dyncol_blob=COLUMN_CREATE("column_name", "value");
  • 8.
    #CASSANDRA13 Cassandra background *Distributed key/valuestore (limited range scan support), optionally flexible schema (pre-defined “static” columns, ad-hoc dynamic columns), automatic sharding/ replication, eventual consistency *Column families are like “tables” *Row key -> column mapping *Supercolumns are not supported
  • 9.
    #CASSANDRA13 CQL at work cqlsh>CREATE KEYSPACE mariadbtest ... WITH REPLICATION ={'class':'SimpleStrategy','replication_factor':1}; cqlsh> use mariadbtest; cqlsh:mariadbtest> create columnfamily cf1 ( pk varchar primary key, data1 varchar, data2 bigint ) with compactstorage; cqlsh:mariadbtest> insert into cf1 (pk, data1,data2) values ('row1', 'data-in-cassandra', 1234); cqlsh:mariadbtest> select * from cf1; pk | data1 | data2 ------+-------------------+------- row1 | data-in-cassandra | 1234 cqlsh:mariadbtest> select * from cf1 where pk='row1'; pk | data1 | data2 ------+-------------------+------- row1 | data-in-cassandra | 1234 cqlsh:mariadbtest> select * from cf1 where data2=1234; Bad Request: No indexed columns present in by-columns clause with Equal operator cqlsh:mariadbtest> select * from cf1 where pk='row1' or pk='row2'; Bad Request: line 1:34 missing EOF at 'or'
  • 10.
    #CASSANDRA13 CQL *Looks like SQLat first glance *No joins or subqueries *No GROUP BY, ORDER BY must be able to use available indexes *WHERE clause must represent an index lookup *Simple goal of the Cassandra Storage Engine? Provide a “view” of Cassandra’s data from MariaDB
  • 11.
    #CASSANDRA13 Getting started *Get MariaDB10.0.2 from https://downloads.mariadb.org/ *Load the Cassandra plugin - From SQL: MariaDB [(none)]> install plugin cassandra soname 'ha_cassandra.so'; - Or start it from my.cnf [mysqld] ... plugin-load=ha_cassandra.so
  • 12.
    #CASSANDRA13 Is everything ok? *Checkto see that it is loaded - SHOW PLUGINS MariaDB [(none)]> show plugins; +--------------------+--------+-----------------+-----------------+---------+ | Name | Status | Type | Library | License | +--------------------+--------+-----------------+-----------------+---------+ ... | CASSANDRA | ACTIVE | STORAGE ENGINE | ha_cassandra.so | GPL | +--------------------+--------+-----------------+-----------------+---------+
  • 13.
    #CASSANDRA13 Create an SQLtable which is a view of a column family MariaDB [test]> set global cassandra_default_thrift_host='10.196.2.113'; MariaDB [test]> create table t2 (pk varchar(36) primary key, -> data1 varchar(60), -> data2 bigint -> ) engine=cassandra -> keyspace='mariadbtest' -> thrift_host='10.196.2.113' -> column_family='cf1'; *thrift_host can be set per-table *@@cassandra_default_thrift_host allows to re-point the table to different node dynamically, and not change table DDL when Cassandra IP changes
  • 14.
    #CASSANDRA13 Potential issues *SELinux blocksthe connection ERROR 1429 (HY000): Unable to connect to foreign data source: connect() failed: Permission denied [1] *Disable SELinux: echo 0 > /selinux/enforce *Cassandra 1.2 with Column Families (CFs) without “COMPACT STORAGE” ERROR 1429 (HY000): Unable to connect to foreign data source: Column family cf1 not found in keyspace mariadbtest *Change in Cassandra 1.2 that broke Pig as well; we’ll update this soon
  • 15.
    #CASSANDRA13 Accessing Cassandra datafrom MariaDB *Get data from Cassandra MariaDB [test]> select * from t2; +------+-------------------+-------+ | pk | data1 | data2 | +------+-------------------+-------+ | row1 | data-in-cassandra | 1234 | +------+-------------------+-------+ *Insert data into Cassandra MariaDB [test]> insert into t2 values ('row2','data-from-mariadb', 123); *Ensure Cassandra sees inserted data cqlsh:mariadbtest> select * from cf1; pk | data1 | data2 ------+-------------------+------- row1 | data-in-cassandra | 1234 row2 | data-from-mariadb | 123
  • 16.
    #CASSANDRA13 Data mapping betweenCassandra and SQL create table tbl ( pk varchar(36) primary key, data1 varchar(60), data2 bigint ) engine=cassandra keyspace='ks1' column_family='cf1' *MariaDB table represents Cassandra’s Column Family - can use any table name, column_family=... specifies CF
  • 17.
    #CASSANDRA13 Data mapping betweenCassandra and SQL create table tbl ( pk varchar(36) primary key, data1 varchar(60), data2 bigint ) engine=cassandra keyspace='ks1' column_family='cf1' *MariaDB table represents Cassandra’s Column Family - can use any table name, column_family=... specifies CF *Table must have a primary key - name/type must match Cassandra’s rowkey
  • 18.
    #CASSANDRA13 Data mapping betweenCassandra and SQL create table tbl ( pk varchar(36) primary key, data1 varchar(60), data2 bigint ) engine=cassandra keyspace='ks1' column_family='cf1' *MariaDB table represents Cassandra’s Column Family - can use any table name, column_family=... specifies CF *Table must have a primary key - name/type must match Cassandra’s rowkey *Columns map to Cassandra’s static columns - name must be same as in Cassandra, datatypes must match, can be subset of CF’s columns
  • 19.
    #CASSANDRA13 Datatype mapping Cassandra MariaDB blobBLOB, VARBINARY(n) ascii BLOB, VARCHAR(n), use charset=latin1 text BLOB, VARCHAR(n), use charset=utf8 varint VARBINARY(n) int INT bigint BIGINT, TINY, SHORT uuid CHAR(36) (text in MariaDB) timestamp TIMESTAMP (second), TIMESTAMP(6) (microsecond), BIGINT boolean BOOL float FLOAT double DOUBLE decimal VARBINARY(n) counter BIGINT
  • 20.
    #CASSANDRA13 Dynamic columns revisited *Cassandrasupports “dynamic column families”, can access ad-hoc columns create table tbl ( rowkey type PRIMARY KEY column1 type, ... dynamic_cols blob DYNAMIC_COLUMN_STORAGE=yes ) engine=cassandra keyspace=... column_family=...; insert into tbl values (1, column_create('col1', 1, 'col2', 'value-2')); select rowkey, column_get(dynamic_cols, 'uuidcol' as char) from tbl;
  • 21.
    #CASSANDRA13 All data mappingis safe *CassandraSE will refuse incorrect mappings (throw errors) create table t3 (pk varchar(60) primary key, no_such_field int) engine=cassandra `keyspace`='mariadbtest' `column_family`='cf1'; ERROR 1928 (HY000): Internal error: 'Field `no_such_field` could not be mapped to any field in Cassandra' create table t3 (pk varchar(60) primary key, data1 double) engine=cassandra `keyspace`='mariadbtest' `column_family`='cf1'; ERROR 1928 (HY000): Internal error: 'Failed to map column data1 to datatype org.apache.cassandra.db.marshal.UTF8Type'
  • 22.
    #CASSANDRA13 Command Mapping *Cassandra commands -PUT (upsert) - GET (performs a scan) - DELETE (if exists) *SQL commands - SELECT -> GET/Scan - INSERT -> PUT (upsert) - UPDATE/DELETE -> read/write
  • 23.
    #CASSANDRA13 SELECT command mapping *MariaDBhas an SQL interpreter *CassandraSE supports lookups and scans *Can now do: - arbitrary WHERE clauses - JOINs between Cassandra tables and MariaDB tables (BKA supported)
  • 24.
    #CASSANDRA13 Batched Key Accessis fast! select max(l_extendedprice) from orders, lineitem where o_orderdate between $DATE1 and $DATE2 and l_orderkey=o_orderkey
  • 25.
    #CASSANDRA13 DML command mapping *NoSQL semantics - INSERT overwrites rows - UPDATE reads, then writes (have you updated what you read?) - DELETE reads, then writes (can’t be sure if/what you’ve deleted) *CassandraSE doesn’t make it SQL!
  • 26.
    #CASSANDRA13 CassandraSE use cases *Collectmassive amounts of data like web page hits *Collect massive amounts of data from sensors *Updates are non-conflicting - keyed by UUIDs, timestamps *Reads are served with one lookup *Good for certain kinds of data (though moving from SQL entirely may be difficult)
  • 27.
    #CASSANDRA13 Access Cassandra datafrom SQL *Send an update to Cassandra - be a sensor *Get a piece of data from Cassandra - This webpage was last viewed by... - Last known position of this user was... - You are user number n of n-thousands...
  • 28.
    #CASSANDRA13 From MariaDB... *Want atable that is: - auto-replicated - fault-tolerant - very fast *Get Cassandra and create a CassandraSE table
  • 29.
    #CASSANDRA13 CassandraSE non-use cases *Huge,sift through all data joins? - use Pig *Bulk data transfer to/from Cassandra Cluster? - use Sqoop *A replacement for InnoDB? - remember no full SQL semantics, InnoDB is useful for myriad reasons
  • 30.
    #CASSANDRA13 A tiny benchmark *Onetable *Amazon EC2 environment - m1.large nodes - ephemeral disks *Stream of single-line INSERTs *Tried InnoDB & CassandraSE *No tuning
  • 31.
    #CASSANDRA13 A tiny benchmarkII *InnoDB with tuning, same setup as before
  • 32.
    #CASSANDRA13 Conclusions *CassandraSE can beused to peek at data in Cassandra from MariaDB *It is not a replacement for Pig/Hive *It is really easy to setup & use
  • 33.
    #CASSANDRA13 Roadmap *Do you wantsupport for: - fast counter column updates? - awareness/discovery of Cassandra cluster topology? - secondary indexes? - ... ?
  • 34.
  • 35.
    #CASSANDRA13 THANK YOU Colin Charles| colin@mariadb.org | SkySQL Ab | http://mariadb.org/ @bytebot on Twitter | http://bytebot.net/blog/
  • 36.
    #CASSANDRA13 Cassandra SE internals *Developedagainst Cassandra 1.1 *Uses Thrift API - cannot stream CQL resultset in 1.1 - cannot use secondary indexes *Only supports AllowAllAuthenticator *In Cassandra 1.2 - “CQL Binary Protocol” with streaming - CASSANDRA-5234: Thrift can only read CFs “WITH COMPACT STORAGE”