Stratio's Cassandra Lucene index: Geospatial use cases
Andrés de la Peña
Jonathan Nappée
• Big Data Company
• Certified Spark distribution
• Founded in 2013
• 200+ employees
• Offices in Madrid, San Francisco and Bogotá
1 Lucene-based secondary indexes
2 Geospatial search features
3 Business use cases
Lucene-based Cassandra secondary indexes
Apache Lucene
• General purpose search library
• Created by Doug Cutting in 1999
• Core of popular search engines:
‒ Apache Nutch, Compass, Apache Solr, ElasticSearch
• Tons of features:
‒ Full-text search, inequalities, sorting, geospatial, aggregations…
• Rich implementation:
‒ Multiple index structures, smart query planning, cool merge policy…
A Lucene-based C* 2i implementation
• Each node indexes its own data
• Keep P2P architecture
• Distribution managed by C*
• Replication managed by C*
• Just a single pluggable JAR file
[Diagram: a client and three C* nodes, each node shown with its own JVM and Lucene index]
Creating Lucene indexes
CREATE TABLE tweets (
user text,
date timestamp,
message text,
hashtags set<text>,
PRIMARY KEY (user, date));
• Built in the background
• Dynamic updates
• Immutable mapping schema
• Many columns per index
• Many indexes per table
CREATE CUSTOM INDEX tweets_idx ON tweets()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds': '1',
'schema': '{fields : {
user : {type: "string"},
date : {type: "date", pattern: "yyyy-MM-dd"},
message : {type: "text", analyzer: "english"},
hashtags: {type: "string"}}}'};
Querying Lucene indexes
SELECT * FROM tweets WHERE expr(tweets_idx, '{
filter: {
must: {type: "phrase", field: "message", value: "cassandra is cool"},
not: {type: "wildcard", field: "hashtags", value: "*cassandra*"}
},
sort: {field: "date", reverse: true}
}') AND user = 'adelapena' AND date >= '2016-01-01';
• Custom JSON syntax
• Multiple query types
• Multivariable conditions
• Multivariable sorting
• Separate filtering and relevance queries
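The search expression is an ordinary JSON document embedded in a CQL string, so it can be assembled programmatically. The sketch below rebuilds the query shown above. The helper name `lucene_search_cql` is hypothetical, and it assumes the index accepts strict JSON with quoted keys as well as the relaxed syntax used on the slide.

```python
import json


def lucene_search_cql(table, index_name, search, extra_where=""):
    """Build a SELECT that passes a JSON search to expr() (hypothetical helper)."""
    # Serialize the search and escape single quotes for the CQL string literal.
    expr = json.dumps(search).replace("'", "''")
    cql = f"SELECT * FROM {table} WHERE expr({index_name}, '{expr}')"
    if extra_where:
        cql += f" AND {extra_where}"
    return cql + ";"


search = {
    "filter": {
        "must": {"type": "phrase", "field": "message", "value": "cassandra is cool"},
        "not": {"type": "wildcard", "field": "hashtags", "value": "*cassandra*"},
    },
    "sort": {"field": "date", "reverse": True},
}

query = lucene_search_cql("tweets", "tweets_idx", search, "user = 'adelapena'")
print(query)
```

In application code the literal values in the extra restriction would normally be bound parameters rather than concatenated text.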
Java query builder
import static com.datastax.driver.core.querybuilder.QueryBuilder.*;
import static com.stratio.cassandra.lucene.builder.Builder.*;
{…}
String search = search().filter(phrase("message", "cassandra is cool"))
.filter(not(wildcard("hashtags", "*cassandra*")))
.sort(field("date").reverse(true))
.build();
session.execute(select().from("tweets")
.where(eq("lucene", search))
.and(eq("user", "adelapena"))
.and(gte("date", "2016-01-01")));
• Available for JVM languages: Java, Scala, Groovy…
• Compatible with most Cassandra clients
Apache Spark integration
• Computes over large amounts of data
• Maximizes parallelism
• Pushes filtering down to the index
• Avoids full scans
[Diagram: a Spark master and three C* nodes, each node shown with its own JVM and Lucene index]
Geospatial search features
Geo point mapper
CREATE TABLE restaurants(
name text PRIMARY KEY,
stars bigint,
lat double,
lon double,
lucene text);
CREATE CUSTOM INDEX restaurants_idx
ON restaurants (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds' : '1',
'schema' : '{
fields : {
location : {
type : "geo_point",
latitude : "lat",
longitude : "lon"
},
stars: {type : "integer" }
}
}
'};
Bounding box search
SELECT * FROM restaurants
WHERE lucene =
'{
filter :
{
type : "geo_bbox",
field : "location",
min_latitude : 40.425978,
max_latitude : 40.445886,
min_longitude : -3.808252,
max_longitude : -3.770999
}
}';
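What the bounding box filter selects can be stated as a simple predicate. This is a minimal sketch of the semantics only, not the index's implementation. It assumes a box that does not cross the antimeridian, and the sample restaurants are made up for illustration.

```python
def in_bbox(lat, lon, min_lat, max_lat, min_lon, max_lon):
    """True when the point lies inside the box (box must not cross the antimeridian)."""
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


# Box taken from the query above; restaurant rows are hypothetical.
box = dict(min_lat=40.425978, max_lat=40.445886, min_lon=-3.808252, max_lon=-3.770999)
restaurants = [
    ("inside", 40.43, -3.79),
    ("too far north", 40.50, -3.79),
]

matches = [name for name, lat, lon in restaurants if in_bbox(lat, lon, **box)]
print(matches)
```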
Distance search
SELECT * FROM restaurants
WHERE lucene =
'{
filter :
{
type : "geo_distance",
field : "location",
latitude : 40.443270,
longitude : -3.800498,
min_distance : "100m",
max_distance : "2km"
}
}';
Distance sorting
SELECT * FROM restaurants
WHERE lucene =
'{
sort:
{
type : "geo_distance",
field : "location",
reverse : false,
latitude : 40.442163,
longitude : -3.784519
}
}' LIMIT 10;
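The distance filter and the distance sort can both be pictured in terms of great-circle distance. The sketch below uses the haversine formula on a spherical Earth, which is an assumption made for illustration and not necessarily the distance model used by the index. The sample rows are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


center = (40.443270, -3.800498)  # centre used by the distance search above
restaurants = [
    ("about 11 km north", 40.543270, -3.800498),
    ("same spot", 40.443270, -3.800498),
    ("about 1.1 km north", 40.453270, -3.800498),
]

# Filter: keep rows between min_distance (100 m) and max_distance (2 km).
in_ring = [name for name, lat, lon in restaurants
           if 100 <= haversine_m(lat, lon, *center) <= 2000]

# Sort: nearest first, as with reverse set to false.
nearest_first = sorted(restaurants, key=lambda r: haversine_m(r[1], r[2], *center))
print(in_ring, [name for name, _, _ in nearest_first])
```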
Indexing complex geospatial shapes
CREATE TABLE places(
id uuid PRIMARY KEY,
shape text -- WKT formatted
);
CREATE CUSTOM INDEX places_idx ON places()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'schema': '{
fields: {
shape: {
type: "geo_shape",
max_levels: 15,
transformations: []
}
}
}'
};
• Points, lines, polygons & multiparts
• JTS index-time transformations
Index-time shape transformations
• Example: Index only centroid of shapes
CREATE CUSTOM INDEX places_idx ON places()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds': '1',
'schema': '{
fields: {
shape: {
type: "geo_shape",
max_levels: 15,
transformations: [{type: "centroid"}]
}
}
}'
};
Index-time shape transformations
• Example: Index 50 km buffer zone around shapes
CREATE CUSTOM INDEX places_idx ON places()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'schema': '{
fields: {
shape: {
type: "geo_shape",
max_levels: 15,
transformations: [{
type: "buffer",
min_distance: "50km"}]
}
}
}'
};
Index-time shape transformations
• Example: Index the convex hull of the shape
CREATE CUSTOM INDEX places_idx ON places()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds': '1',
'schema': '{
fields: {
shape: {
type: "geo_shape",
max_levels: 8,
transformations:
[{type: "convex_hull"}]
}
}
}'
};
Search by geo shape
• Can search points and shapes using shapes
• Operations define how you search: Intersects, Is_within, Contains
• Can use transformations before searching
‒ Bounding box
‒ Buffer
‒ Centroid
‒ Convex Hull
‒ Difference
‒ Intersection
‒ Union
Geo Search
• Example: search within a polygon
SELECT * FROM cities
WHERE expr(cities_index, '{
filter: {
type: "geo_shape",
field: "place",
operation: "is_within",
shape: {
type: "wkt",
value: "POLYGON((-0.07 51.63,
0.03 51.54,
0.05 51.65,
-0.07 51.63))"
}
}
}');
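For an indexed point, the is_within operation amounts to a point-in-polygon test. The sketch below shows that test with planar ray casting on the polygon from the query above. It illustrates the semantics only and is not how the index evaluates the search. The WKT parser handles just a single-ring POLYGON, and the two test points are made up.

```python
def parse_wkt_polygon(wkt):
    """Parse a single-ring 'POLYGON((x y, ...))' into a list of (x, y) tuples."""
    inner = wkt[wkt.index("((") + 2: wkt.rindex("))")]
    return [tuple(float(v) for v in pair.split()) for pair in inner.split(",")]


def point_in_polygon(x, y, ring):
    """Ray casting: count how many edges a horizontal ray from (x, y) crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's y coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# WKT uses "x y" order, that is longitude then latitude.
ring = parse_wkt_polygon("POLYGON((-0.07 51.63, 0.03 51.54, 0.05 51.65, -0.07 51.63))")
print(point_in_polygon(0.0033, 51.6067, ring))  # point near the triangle's centre
print(point_in_polygon(0.2, 51.6, ring))        # point east of the triangle
```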
Business use cases
• Investment fund with large exposures to natural catastrophe insurance on properties
• Many geographical data sets:
‒ property details
‒ natural catastrophe event data
o Hurricane tracks and affected zones
o Earthquake impact zones
• Risks and portfolios
Use cases data set
• We indexed all US census block shapes from the Hazus database
‒ https://www.fema.gov/hazus
‒ These blocks contain revenue and building stats that are useful for pricing insurance premiums and potential losses
o Average revenue
o Number of stories
‒ Some of them are very complex
o First attempt with convex hull
o Composite indexing strategy with ±2km geohash and doc values in borders
• We also indexed all police and fire stations in the US
Use cases data set
CREATE TABLE blocks (
state text,
bucket int,
id int,
area double,
type text,
income_ratio double,
latitude double,
longitude double,
shape text,
...
lucene text,
PRIMARY KEY ((state, bucket),
id)
);
CREATE CUSTOM INDEX block_idx ON blocks(lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds': '1',
'schema': '{
fields : {
state : {type: "string"},
type : {type: "string"},
...
center: {type: "geo_point",
max_levels: 11,
latitude: "latitude",
longitude: "longitude"},
shape : {type: "geo_shape",
max_levels: 5}
}
}'};
Use cases data set
CREATE TABLE fire_stations(
state text,
id text,
city text,
latitude double,
longitude double,
shape text,
...
lucene text,
PRIMARY KEY (state, id)
);
CREATE TABLE police_stations(
state text,
id text,
city text,
latitude double,
longitude double,
shape text,
...
lucene text,
PRIMARY KEY (state, id)
);
• Analogous indexing for police and fire stations tables
Composite spatial strategy
• Meant for indexing complex polygons
• Two spatial strategies combined
‒ GeoHash recursive prefix tree for speed
‒ Serialized doc values for accuracy
• Reduced number of geohash terms
• Doc values only for polygon borders
David Smiley blog post:
http://opensourceconnections.com/blog/2014/04/11/indexing-polygons-in-lucene-with-accuracy
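The prefix tree half of the strategy relies on a property of geohashes: a cell's hash is a prefix of the hashes of every smaller cell it contains, so a few short terms can cover a polygon's interior. The sketch below is a plain geohash encoder written to show that property. It is not the Lucene spatial code, and the coordinates are an arbitrary sample point.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision):
    """Encode a point as a geohash of `precision` characters (5 bits each)."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


coarse = geohash_encode(57.64911, 10.40744, 5)
fine = geohash_encode(57.64911, 10.40744, 9)
print(coarse, fine, fine.startswith(coarse))
```

Fewer levels mean fewer and larger cells, which is where the reduced number of geohash terms comes from; the exact geometry check is then only needed for cells on the polygon border.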
Use cases: Search blocks in a shape
• We search which census blocks intersect with a shape
SELECT * FROM blocks
WHERE expr(blocks_index, '{
filter: {
type: "geo_shape",
field: "shape",
operation: "intersects",
shape: {
type: "buffer",
max_distance: "10km",
shape: {
type: "wkt",
value: "LINESTRING(-80.90 29.05...)"
}
}
}
}');
Use cases: Search blocks far from police and fire stations
• Proximity to police and fire stations can have an impact on damage when a natural catastrophe event happens
• We can use this information to search for blocks in our portfolio that are more than 8 miles from any station, to highlight their risk
Use cases: Search blocks far from fire stations
SELECT * FROM fire_stations WHERE lucene = '{
filter : {
type: "geo_shape",
field: "centroid",
shape: {value: "POLYGON(…)"}}
}';
SELECT * FROM blocks WHERE lucene = '{
filter : {
must: {
type: "geo_shape",
field: "shape",
shape: {value: "POLYGON(…)"}},
not: {
type: "geo_shape",
field: "shape",
shape: {
type: "buffer",
max_distance: "8mi",
shape: {value: "MULTIPOINT(…)"}}}
}}';
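The two queries above form a pipeline: the stations returned by the first query become the MULTIPOINT that the second query buffers and excludes. The sketch below shows that hand-off. The helper names, the area polygon, and the station coordinates are all hypothetical, and the JSON mirrors the structure on the slide using strict JSON keys.

```python
import json


def multipoint_wkt(points):
    """points: iterable of (latitude, longitude); WKT expects 'lon lat' order."""
    coords = ", ".join(f"{lon} {lat}" for lat, lon in points)
    return f"MULTIPOINT({coords})"


def far_from_stations_search(area_wkt, stations, max_distance="8mi"):
    """Blocks inside the area but outside the buffer around every station."""
    return {
        "filter": {
            "must": {"type": "geo_shape", "field": "shape",
                     "shape": {"value": area_wkt}},
            "not": {"type": "geo_shape", "field": "shape",
                    "shape": {"type": "buffer", "max_distance": max_distance,
                              "shape": {"value": multipoint_wkt(stations)}}},
        }
    }


# Step 1 result (hypothetical): station coordinates found inside the area.
stations = [(29.0, -81.0), (29.5, -81.2)]
area = "POLYGON((-82 28, -80 28, -80 30, -82 30, -82 28))"  # made-up area

# Step 2: build the search for the blocks table.
search = far_from_stations_search(area, stations)
cql = "SELECT * FROM blocks WHERE lucene = '" + json.dumps(search) + "';"
print(cql)
```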
Use cases: Find which blocks are affected by a moving hurricane and their maximum wind speed exposures
• If we are modelling a hurricane we end up with a changing shape every 6 hours, with a different location and wind speeds
• We want to find, for each state, which blocks are hit and at which maximum wind speed
• We use transformations to represent the moving hurricane and, within it, the different wind speeds
Use cases: Blocks affected by a moving hurricane
SELECT * FROM blocks WHERE expr(idx, '{
filter : {
type: "geo_shape",
field: "shape",
shape: {
type: "union",
shapes: [{
type: "convex_hull",
shape: {
type: "union",
shapes: [
{type: "buffer",
max_distance: "6mi",
shape: {value: "POINT(…)"}},
{type: "buffer",
max_distance: "3mi",
shape: {value: "POINT(…)"}}
]}},
...
]
}
}}');
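One way to read the nested shape above: each pair of consecutive hurricane positions is buffered by its own radius, the convex hull of the pair gives the swept corridor for that time step, and the union of all corridors gives the full footprint. The sketch below generates that structure from a track. This reading of the slide, the helper names, and the sample track are assumptions made for illustration.

```python
import json


def buffered_point(lat, lon, radius):
    """A point buffered by `radius` (e.g. "6mi"); WKT uses 'lon lat' order."""
    return {"type": "buffer", "max_distance": radius,
            "shape": {"value": f"POINT({lon} {lat})"}}


def hurricane_shape(track):
    """track: list of (lat, lon, radius) at consecutive time steps."""
    corridors = []
    for start, end in zip(track, track[1:]):
        # Convex hull of two buffered positions = corridor swept in one step.
        corridors.append({
            "type": "convex_hull",
            "shape": {"type": "union",
                      "shapes": [buffered_point(*start), buffered_point(*end)]},
        })
    return {"type": "union", "shapes": corridors}


# Hypothetical track: position and radius every 6 hours.
track = [(25.0, -80.0, "6mi"), (25.5, -80.4, "3mi"), (26.1, -80.9, "3mi")]
search = {"filter": {"type": "geo_shape", "field": "shape",
                     "shape": hurricane_shape(track)}}
print(json.dumps(search, indent=2))
```

Running one such search per wind speed band, with the radius of that band, is one possible way to derive the maximum wind speed per block.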
Conclusions
• New pluggable geospatial features in Cassandra
‒ Complex polygon search
‒ Geometrical transformations API
• Can be combined with other search predicates
• Compatible with MapReduce frameworks
• Preserves Cassandra's functionality
It's open source
github.com/stratio/cassandra-lucene-index
• Published as plugin for Apache Cassandra
• Apache License Version 2.0
THANK YOU
UNITED STATES
Tel: (+1) 408 5998830
EUROPE
Tel: (+34) 91 828 64 73
contact@stratio.com
www.stratio.com
