Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Other NoSQL Data Systems
The document presents an overview of using Apache Calcite to enable SQL access to NoSQL data systems, specifically focusing on Apache Geode. It discusses the architecture, key features, and components of Apache Geode and its integration with Calcite for processing SQL queries. Additionally, it highlights use cases, query optimizations, and future development directions for improved support and performance.
Introduction to utilizing Apache Calcite for SQL access to NoSQL systems like Apache Geode. Speaker's credentials and disclaimer noted.
Overview of Big Data landscape attributes such as Volume, Velocity, Variety, Scalability, Latency, and the CAP theorem concerning consistency vs. availability.
Exploration of data access methods including SQL and custom APIs like Key/Value, Fluid API, and REST API, while questioning the costs of unified access.
Listing various Apache tools and frameworks capable of SQL access, highlighting Apache Geode as a significant player in the SQL landscape.
Introduction to the Geode adapter for Apache Calcite, detailing how SQL is converted into OQL queries for data interaction with Geode.
Explanation of relational expressions in SQL, their optimization, and Geode's support for various relational operators.
Description of Apache Geode as an in-memory database solution, showcasing its usage in large-scale applications with significant numerical data.
Detailed features of Apache Geode including in-memory data storage capabilities, ACID compliance, strong consistency, and scalability.
Introduction to key concepts within Apache Geode, including caches, regions, clients, and event listeners.
Overview of various Geode deployment topologies including peer-to-peer and client-server architectures.
Description of Geode Client API, focusing on key-value operations and the OQL querying service.
Discussion on Geode's data types, including PDX serialization for efficient data handling and querying.
Live demo connecting to Geode cluster and executing OQL queries on available regions.
Introduction to Apache Calcite as a framework for SQL querying, highlighting its components and capabilities.
Exposition on Calcite's data types and structures including catalog, schemas, tables, and relational data types.
Details on how Geode cache regions map to Calcite schemas and tables, illustrating data integration.
Description of the initialization flow for Calcite model, from JSON configuration to schema creation.
Insight into implementation of schema factory for Geode within the Calcite framework, detailing schema creation.
Technical overview of the Geode Scannable Table and its integration with Calcite for row-type handling.
Live demonstration of executing SQL commands through the Geode implementation with scannable tables.
Introduction to non-relational tables in Calcite, emphasizing various table types and query capabilities.
Discussion on the ecosystem surrounding Calcite, including its various components and interaction with different data stores.
Exposition of the execution flow for SQL queries in Calcite, showing step-by-step processing from parsing to executing.
Introduction to Linq4j framework and its relationship with Expression Trees that facilitate query execution in Calcite.
Overview of how Calcite converts expressions into Java code for bindable queries.
Description of relational expressions in Calcite, detailing how they are structured, optimized, and managed.
Exploration of adapter patterns within Calcite, outlining how custom implementations can be integrated.
Examples of SQL implementation with and without Calcite and Geode, showcasing their capabilities.
Steps to establish a JDBC connection for accessing Calcite services.
Discussion on testing strategies for integration with Calcite and Geode.
Outline of potential enhancements for Geode and Calcite integration, focusing on improved support and new features.
List of references and resources for further exploration of Apache Geode and Calcite implementation.
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Other NoSQL Data Systems
1.
Apache Calcite forEnabling SQL
Access to NoSQL Data Systems
such as Apache Geode
Christian Tzolov
2.
2
Christian Tzolov
Engineer atPivotal,
Big-Data, Hadoop,
Spring Cloud Dataflow,
Apache Geode, Apache HAWQ,
Apache Committer,
Apache Crunch PMC member
ctzolov@pivotal.io
blog.tzolov.net
Twitter: @christzolov
https://nl.linkedin.com/in/tzolov
Whoami
Disclaimer
This talk expresses my personal opinions. It is not read or approved by Pivotal
and does not necessarily reflect the views and opinions of Pivotal nor does it
constitute any official communication of Pivotal.
Pivotal does not support any of the code shared here.
3.
3
Big Data Landscape2016
• Volume
• Velocity
• Varity
• Scalability
• Latency
• CAP - Consistency vs. Availability
4.
4
• SQL
• CustomAPIs
– Key / Value
– Fluid API
– REST API
• {My} Query Language
Unified Access?
At What Cost?
Data Access
Calcite Geode Adapter- Overview
Geode API and OQL
SQL/JDBC/ODBC
Convert SQL relational
expressions into OQL queriesGeode Adapter
(Geode Client)
Geode ServerGeode ServerGeode Server
Data Data Data
Push down the relational
expressions supported by Geode
OQL and falls back to the Calcite
Enumerable Adapter for the rest
Enumerable
Adapter
Apache Calcite
Spring Data
Geode
Spring Data API for
interacting with Geode
Parse SQL, converts into
relational expression and
optimizes
7.
Relational Expressions andOptimization
7
Scan Scan
Join
Filter
Project
Customer (c) BookOrder (b)
on customerNumber
b.totalPrice > 0
c.firstName, b.totalPrice
SELECT b."totalPrice", c."firstName” FROM "BookOrder" as b
INNER JOIN "Customer" as c ON b."customerNumber" = c."customerNumber”
WHERE b."totalPrice" > 0;
Scan Scan
Join
Project
Customer (c) BookOrder (b)
on customerNumber
totalPrice > 0
c.firstName, b.totalPrice
Project
firstName,
customerNumber
Filter
totalPrice,
customerNumberProject
Optimiz
8.
Push Down Candidates
8
RelationalOperator Geode Support
LIMIT Supported without FETCH
PROJECTION Supported
FILTER Supported
JOIN Only for collocated data
AGGREGATE Only for MAX, MIN, SUM, AVG
SORT Requires DISTINCT statement
Implemented
9.
Apache Geode?
“… in-memory,distributed database
with strong consistency built to
support low latency transactional
applications at extreme scale”
10.
Why Apache Geode?
10
5,700train stations
4.5 million tickets per day
20 million daily users
1.4 billion page views per day
40,000 visits per second
7,000 stations
72,000 miles of track
23 million passengers daily
120,000 concurrent users
10,000 transactions per minute
https://pivotal.io/big-data/case-study/distributed-in-memory-data-management-solution
https://pivotal.io/big-data/case-study/scaling-online-sales-for-the-largest-railway-in-the-world-china-railway-corporation
China Railway
11.
11
• In-Memory DataStorage
– > 100TB memory
– JVM Heap + Off Heap
• Any Data Format
– Key-Value/Object Store
• ACID Compliant Transactions
• HA and Linear Scalability
• Strong Consistency
• Streaming and Event Processing
– Listeners
– Distributed Functions
– Continuous OQL Queries
• Multi-site / Inter-cluster
• Embedded and Standalone
• Top Level Apache Project
Apache Geode Features
12.
Apache Geode Concepts
CacheServer (member)
Cache
Region 1
Region N
ValKe
y
v1k1
v2k2
…
Cache - In-memory collection
of Regions
Region - consistent, distributed
Map (key-value),
Partitioned or Replicated
CacheServer – process
connected to the distributed
system with created Cache
Client (member)Locator (member)
Client –read and modify the
content of the distributed
system
Locator – tracks
system members and
provides membership
information
…
Listeners
Functions
Functions – distributed,
concurrent data
processing
Listener – event handler.
Registers for one or
more events and notified
when they occur
13.
Geode Topologies
Cache ServerCacheServerCache Server
Cache Data Cache Data Cache Data Peer-to-Peer
Cache ServerCache ServerCache Server
Cache Data Cache Data Cache Data
Client
Local Cache
pool
Client-Server
Cache Server
Cache Server
Gateway Sender
…
Cache Server
Gateway Receiver
Cache ServerCache Server
Cache Data Cache Data Cache Data Cache Data
Gateway Receiver
Cache Server
… Gateway Sender
Cache ServerCache Server
Cache Data Cache Data Cache Data Cache Data
WAN Multi-site Boundary Multi-Site
14.
Geode Client API
• Client Cache
• Key / Value - Region GET, PUT, REMOVE
• OQL – QueryService
15.
Geode Data Types& Serialization
• Key-Value with complex value formats
• Portable Data eXchange (PDX) Serialization – Delta propagation, schema
evolution, polyglot support …
• Object Query Language (OQL)
SELECT p.name
FROM /Person p
WHERE p.pet.type = “dino”
{
id: 1,
name: “Fred”,
age: 42,
pet: {
name: “Barney”,
type: “dino”
}
}
single field deserialization
nested fields
16.
Geode Demo (GFSHand OQL)
• Connect to Geode cluster,
• List available Regions
• Run OQL query
17.
Apache Calcite?
Java frameworkthat allows SQL interface and advanced query
optimization, for virtually any data system
• Query Parser, Validator and Optimizer(s)
• Local/Remote JDBC drivers
• Streaming
• Agnostic to how data is stored and process
• Balance SQL completes vs. integrity of Data system native capabilities
18.
Apache Calcite DataTypes
• Catalog – namespaces accessed in queries
• Schema - collection of schemas and tables
• Table - single data set, collection of rows
• RelDataType – SQL fields types in a Table
Your Data System
Data System Data Types
Calcite
Schema
SQL Engine
Table Table
JDBC
Table…
Data Type
Mapping SELECT title, author FROM test.BookMaster
Data Type Fields Schema Table
19.
Calcite Data Types:RelDataType
19
Type of a scalar expression or row
• RelDataTypeFactory – RelDataType factory
• JavaTypeFactory - registers Java classes as record types
• JavaTypeFactoryImpl - Java Reflection to build RelDataTypes
• SqlTypeFactoryImpl - Default implementation with all SQL types
20.
Geode to CalciteData Types Mapping
20
Geode Cache
Region 1
Region K
ValKey
v1k1
v2k2
…
Calcite Schema
Table 1
Table K
Col1 Col2 ColN
V(M,1)RowM V(M,2) V(M,N)
V(2,1)Row2 V(2,2) V(2,N)
V(1,1)Row1 V(1,2) V(1,N)
…
Regions are mapped into Tables
Geode Cache is mapped into Calcite Schema
Geode Key/Value is mapped
into Table Row
Create Column Types
(RelDataType) from Geode
Value class
(JavaTypeFactoryImpl)
Calcite Model
{
version: '1.0',
defaultSchema:'TEST',
schemas: [ {
name: 'TEST',
type: 'custom',
factory: 'org.apache.calcite.adapter.geode.simple.GeodeSchemaFactory',
operand: {
locatorHost: 'localhost',
locatorPort: '10334',
regions: 'BookMaster',
pdxSerializablePackagePath: 'net.tzolov.geode.bookstore.domain.*'
}
}]
}
Reference to your adapter
schema factory implementation
class
Parameters to be passed to
your adapter schema factory
implementation
The path to <my-model>.json is passed as JDBC connection argument:
!connect jdbc:calcite:model=target/test-classes/<my-model-path>.json︎
Schema Name
23.
Geode Calcite Schemaand Schema Factory
public class GeodeSchemaFactory implements SchemaFactory {
public Schema create(SchemaPlus parentSchema, String schemaName, Map<String, Object> operand) {
String locatorHost = (String) operand.get(“locatorHost”);
int locatorPort = …
String[] regionNames = …
String pdxPackagePath = …
return new GeodeSchema(locatorHost, locatorPort, regionNames, pdxPackagePath);
}
}
public class GeodeSchema extends AbstractSchema {
private String regionName = ..
protected Map<String, Table> getTableMap() {
final ImmutableMap.Builder<String, Table> builder = ImmutableMap.builder();
Region region = … Get Geode Region by region name …
Class valueClass= … Find region’s value type …
builder.put(regionName, new GeodeScannableTable(regionName, valueClass, clientCache));
return tableMap;
}
Retrieves the parameters set in
the model.json
Create an Adapter Schema
instance with the provided
parameters.
Create GeodeScannableTable
instance for each Geode Region
24.
Geode Scannable Table
publicclass GeodeScannableTable extends AbstractTable implements ScannableTable {
public RelDataType getRowType(RelDataTypeFactory typeFactory) {
return new JavaTypeFactoryImpl().createStructType(valueClass);
}
public Enumerable<Object[]> scan(DataContext root) {
return new AbstractEnumerable<Object[]>() {
public Enumerator<Object[]> enumerator() { return new GeodeEnumerator<Object[]>(clientCache, regionName); }
}
public class GeodeEnumerator<E> implements Enumerator<E> {
private E current;
private SelectResults geodeIterator;
public GeodeEnumerator(ClientCache clientCache, String regionName) {
geodeterator = clientCache.getQueryService().newQuery("select * from /" + regionName).execute().iterator();
}
public boolean moveNext() { current = convert(geodeIterator.next()); return true;}
public E current() {return current;}
public abstract E convert(Object geodeValue) {
Convert PDX value into RelDataType row
}
Defined in the Linq4j sub-project
Retrieves the entire Region!!
Converts Geode value response into Calcite row data
Uses reflection (or pdx-instance) to builds
RelDataType from value’s class type
Returns an Enumeration over the entire
target data store
25.
Geode Demo (ScannableTables)
$ ./sqlline ︎
sqlline> !connect jdbc:calcite:model=target/test-classes/model2.json admin admin︎
︎
jdbc:calcite> !tables ︎
jdbc:calcite> SELECT * FROM "BookMaster”;︎
jdbc:calcite> SELECT "yearPublished", AVG("retailCost") AS “AvgRetailCost” FROM "BookMaster" GROUP BY "yearPublished";︎
jdbc:calcite> SELECT b."totalPrice", c."firstName” FROM "BookOrder" AS b INNER JOIN "Customer" AS c ON b."customerNumber" = c."customerNumber” WHERE b."totalPrice" > 0;︎
︎
︎
Without and With Implementation
26.
Non-Relational Tables
26
Scanned withoutintermediate relational expression.
• ScannableTable - can be scanned
• FilterableTable - can be scanned, applying supplied filter expressions
• ProjectableFilterableTable - can be scanned, applying supplied filter expressions
and projecting a given list of columns
Enumerable<Object[]> scan(DataContext root, List<RexNode> filters, int[] projects);
Enumerable<Object[]> scan(DataContext root, List<RexNode> filters);
Enumerable<Object[]> scan(DataContext root);
27.
Calcite Ecosystem
27
Several “semi-independent”projects.
JDBC and Avatica Linq4j
Expression
Tree
Enumerable
Adapter
Relational
• Relational Expressions
• Row Expression
• Optimization Rules
• Planner …
SQL Parser & AST
Port of LINQ (Language-Integrated Query)
to Java.
Local and Remote
JDBC driver
Converts SQL
queries Into AST
(SqlNode …)
3rd party Adapters
Method for translating
executable code into data
(LINQ/MSN port)
Default (In-memory) Data
Store Adapter
implementation.
Leverages Linq4j
Relational Algebra,
expression,
optimizations …
Interpreter
Complies Java code
generated from linq4j
“Expressions”. Part of the
physical plan implementer
28.
Calcite SQL QueryExecution Flow
28
Enumerable
Interpreter
Prepare
SQL,
Relational,
Planner
Geode
Adapter
Binder
JDBC
Geode
Cluster
1
2
3
4
5
6 7
7
7
2. Parse SQL, convert to rel.
expressions. Validate and Optimize
them
3. Start building a physical plan from
the relation expressions
4. Implement the Geode relations and
encode them as Expression tree
5. Pass the Expression tree to the
Interpreter to generate Java code
6. Generate and Compile a Binder
instance that on ‘bind()’ call runs
Geodes’ query method
1. On new SQL query JDBC delegates
to Prepare to prepare the query
execution
7. JDBC uses the newly compiled
Binder to perform the query on the
Geode Cluster
Calcite Framework
Geode Adapter
2
Calcite Relational Expressions(2)
32
RelNode
+ register(RelOptPlander)
+ List<RelNode> getInputs();
RelOptPlanner
+findBestExp():RelNode
RexNode
RelTrait Convention
NONE
*
*
EnumberableConvention
RelOptRule
+ onMatch(call)
<<register>>
<<create>>
MyDBConvention
ConverterRule
+ RelNode convert(RelNode)
Converts from one calling
convention to another
Convertor
Indicate that it converts a
physical attribute only!
<<rules>>
*
<<inputs>>
*
<<root>>
Query optimizer: Transforms a
relational expression according to
a given set of rules and a cost model.
RelOptCluster
Rule transforms an expression into another. It has a list of
Operands, which determine whether the rule can be applied to
a particular section of the tree.
RelOptRuleOperand
*<<fire criteria>>
Calling convention used
to represent a single
data source.
Inputs to a relational
expression must be in
the same convention
33.
Calcite Adapter Patterns
33
MyAdapterRel
+implement(implContext)
MyAdapterConvention
Convention.Impl(“MyAdapter”)
Common interface for all MyAdapter
Relation Expressions. Provides
implementation callback method called
as part of physical plan implementation
ImplContext
+ implParm1
+ implParm2 …
RelNode
MyAdapterTable
+ toRel(optTable)
+ asQueryable(provider,…)
MyAdapterQueryable
+ myQuery(params) :
Enumetator
TranslatableTable
<<instance of>>
AbstractQueryableTable
AbstractTableQueryable <<create>>
Can convert
queries in
Expression
myQuery() implements the call to your DB
It is called by the auto generated code. It
must return an Enumberable instance
MyAdapterScan
+ register(planer) {
Registers all MyAdapter Rules
}
<<create>>
MyAdapterToEnumerableConvertorRule
operands: (RelNode.class,
MyAdapterConvention, EnumerableConvention) ConverterRue
TableScan
MyAdapterToEnumerableConvertor
+ implement(EnumerableRelImplementor) {
ctx = new MyAdapterRel.ImplContext()
getImputs().implement(ctx)
return BlockBuild.append( MY_QUERY_REF,
Expressions.constant(ctx.implParms1),
Expressions.constant(ctx.implParms2) …
EnumerableRel
ConvertorImpl
<<create on match >>
MyAdapterProject
MyAdapterFilter
MyAdapterXXX
RelOptRule
MyAdapterProjectRu
MyAdapterFilterRule
MyAdapterXXXRule
<<create on match >>
Recursively call the implement on each
MyAdapter Relation Expression
Encode the myQuery(params) call as
Expressions
MY_QUERY_REF = Types.lookupMethod(
MyAdapterQueryable.class,
”myQuery”,
String.class
String.class);
1
3
4
5
2
6
7
8
9
Calcite Framework
MyAdapter components
34.
Calcite with Geode- Without Implementation
34
SELECT b."totalPrice", c."firstName"
FROM "BookOrder" as b
INNER JOIN "Customer" as c ON b."customerNumber" = c."customerNumber"
WHERE b."totalPrice" > 0;
35.
Calcite with Geode- With Implementation
35
SELECT b."totalPrice", c."firstName" FROM "BookOrder" as b INNER JOIN "Customer" as c
ON b."customerNumber" = c."customerNumber" WHERE b."totalPrice" > 0;
Future work
38
• Improvenested data structures support
• Push down Join for collocated data sets
• Push down the COUNT expression
• Push down the ORDER BY expression (requires adding a DISTINCT aggregate
expression to the query)
• Use new Geode OQL aggregation operations (http://bit.ly/2eKApd0) to push down
MIN, MAX, SUM, AVG aggregations
• Beyond OQL (e.g. implement Join, aggregations with custom functions)
• Leverage Calcite Streaming with Geode