Scaling OpenSimulator 
Inventory using NoSQL 
DAVID DAESCHLER 
INWORLDZ, LLC 
OpenSimulator Community Conference 2014
Who are you? Why should I watch? 
You’re not the boss of me! 
Hello! I’m David Daeschler, also known as Tranquillity Dexler. I am a partner and software architect 
over at InWorldz, LLC. 
I designed and deployed an LSL compiler, virtual machine, and script runtime named Phlox to 
eliminate CPU and memory issues caused by user scripts. 
We developed PhysX physics integration for stable rigid body dynamics and vehicle functionality. 
I’ve designed scale-out asset services that now run across 11 servers (10 TB of data), and an inventory
system on top of Apache Cassandra that now runs on 8 nodes holding about 250 GB of data.
We routinely handle over 300 concurrent users on the grid and we’ve peaked out just shy of 500 
concurrent users without experiencing backend faults or load issues. 
We’ve experienced and conquered more than a few scaling problems while running our OpenSim-derived
grid over the past 5 years. It’s been a school of hard knocks, and we’d like to share some of our
experiences and solutions.
Oh noes! Inventory woes! 
You’re running an opensimulator grid, and everything is going just great! 
... Until concurrency and inventory size picks up, then out of the blue: 
Your users are having trouble logging in 
You’re starting to see timeouts on MySQL for inventory operations 
Your inventory tables are getting really huge. People insist on keeping 100,000+ items containing
at least 400 copies of “primitive”. You hit 100 GB of data and realize scaling up won’t be a viable
option much longer.
“My Inventory stopped downloading at 75,000 items, but I have 999,999!” 
“My pony avatar is missing!”
Federation only helps so much 
Even if your grid is part of the Hypergrid, it may still become wildly popular. In that case, manual
federation becomes a nonstarter: trying to predict growth or loss and setting up multiple “your grid-1,
your grid-2” instances just isn’t a workable way to keep up with fast changes in demand.
Manual self-federation to provide scale out for growth also would require advanced software 
tools to set up entirely new backend and frontend services. Shard keys would have to be chosen, 
and users would have to be manually distributed against the new set of servers every time your 
grid needed to scale again. Essentially, each shard instance that was spun up would require 
doing everything again that you had to do for your grid initially.
MySQL read slaves/scale out 
MySQL supports read-only slaves out of the box that can help you when your workload is 
dominated by reads. 
Unfortunately, this would only allow you to scale out until writes became the bottleneck for your 
master MySQL server, at which point the master and all slaves would have to be scaled UP with 
better hardware. 
We quickly started to see I/O wait numbers climb because of the volume of writes hitting MySQL.
People love to buy stuff and give stuff to each other. Once we got past a certain point, tuning was
no longer an option. It was either get a better master server, or replace the MySQL backend with
something else.
Apache 
TO THE RESCUE!
Apache Cassandra 
Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, 
Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more 
companies that have large, active data sets. 
It is a distributed, scale-out, fault tolerant database with tunable consistency. 
Benefits 
Your data is replicated onto multiple servers that can even span different datacenters 
You can lose one or more servers in a cluster and still stay up and running with zero downtime 
and zero data loss. This goes well beyond simple RAID. 
Seeing the load on the backend increase beyond your comfort level? Simply add new servers to the
cluster with ZERO downtime.
But WAIT! Cassandra is eventually 
consistent! What about ACID?! 
I’d hate to break it to you, but a traditional RDBMS scale out solution with a single master and 
one or more slaves is also eventually consistent. 
“LIES!” you say. No, seriously. There’s a metric you can query in a MySQL setup called slave lag.
This number tells you exactly how far behind a slave is from its master. The slave will never be
exactly up to date with the master as long as the master is taking constant writes, and reading
from the slave may return results from the past. Application designers need to keep this in mind as
much as they need to understand Cassandra’s eventual consistency.
It turns out that Cassandra has tunable consistency and can offer better guarantees to obtain a 
consistent read than traditional scale out on an RDBMS. This is because we can tell Cassandra to 
write to a set of nodes, and not return until a quorum of them have responded that they have 
written the new value. When we again read at quorum consistency, we are guaranteed to see 
the most up to date value!
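To put numbers on that: QUORUM means floor(RF / 2) + 1 replicas. With a replication factor of RF = 3, a quorum is 2 nodes, so a quorum write (W = 2) plus a quorum read (R = 2) gives R + W = 4 > RF = 3. The read set and the write set must overlap in at least one node, and that node holds the latest value.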
It is based on Dynamo
Dynamo was introduced by Amazon in a 2007 paper as a solution for a highly available distributed
data store. Amazon works at massive scale, and even a few minutes of downtime means they
lose a ton of money.
The Dynamo paper has a few important implementation details that Cassandra borrows:
Data is automatically sharded based on the consistent hash of the primary key and replicated to N
hosts in a hash ring.
Hinted handoff helps bring the dataset back into convergence during temporary failures.
Storage nodes can be added and removed without interruption of service.
Consistent hashing 
Your data is automatically divided up between storage nodes based on the value of the 
consistent hash of a row’s primary key 
Each of nodes a, b, c, d would own 25% of your primary keys and 25% of your data at Replication Factor (RF) = 1.
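Here is a minimal sketch of the idea in C#. This is not Cassandra’s actual partitioner (it ships Murmur3- and MD5-based partitioners); it just shows how hashing a key onto a fixed ring of tokens picks an owner with no lookup table:

using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class HashRing
{
    // Four nodes with tokens spaced evenly around a 64-bit ring,
    // matching the a, b, c, d example above
    static readonly Tuple<string, ulong>[] Ring =
    {
        Tuple.Create("a", 0UL),
        Tuple.Create("b", ulong.MaxValue / 4),
        Tuple.Create("c", ulong.MaxValue / 2),
        Tuple.Create("d", ulong.MaxValue / 4 * 3)
    };

    // A node owns the keys that hash into (previous token, its token], so the
    // owner is the first node whose token is >= the hash, wrapping past "d"
    // back around to "a". With RF > 1, the next RF-1 nodes on the ring would
    // hold the replicas.
    public static string OwnerOf(string primaryKey)
    {
        ulong hash;
        using (var md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(primaryKey));
            hash = BitConverter.ToUInt64(digest, 0);
        }

        var owner = Ring.FirstOrDefault(n => n.Item2 >= hash);
        return owner != null ? owner.Item1 : Ring[0].Item1;
    }
}

HashRing.OwnerOf(folderId.ToString()) always lands a given key on the same node, and adding a node only moves the keys in one slice of the ring.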
Using quorum reads and writes to achieve 
Consistency and Partition Tolerance 
When you write and read to/from a quorum of nodes you will get a consistent view of the data, 
and you will be able to tolerate a node or network outage. An example quorum is 2 out of 3 
nodes that form a majority. 
WRITE “HELLO!” TO A AND C
Node A dies → we still read “HELLO” from C, and we stay running!
Simple Cassandra setup with Docker 
If you haven’t heard of Docker yet, you need to check it out: https://www.docker.com 
Docker lets you package an app and all of its dependencies in a portable container. 
We’ll use the prebuilt Cassandra container from https://github.com/tobert/cassandra-docker to 
build our demo on. (By the way, Al Tobey is an awesome guy and you should follow him on 
twitter @AlTobey) 
Once Docker is downloaded and set up, starting a single-node Cassandra cluster is super easy. Assuming the image from that repo is tagged with the Cassandra version, it’s a one-liner along the lines of:
docker run -d tobert/cassandra:2.0.10
Alternatively, if you’re on Windows, grab the latest release from http://cassandra.apache.org/ and run cassandra.bat
CQL: Like SQL but different 
Originally when Cassandra made its debut, the only way to get at the data was to use Thrift calls 
that pulled and updated columns very much like working with a hash set. 
Cassandra then developed CQL (Cassandra Query Language) which is a familiar cousin of SQL 
with the following notable exceptions: 
No joins. No GROUP BY. Data in Cassandra is expected to be mostly denormalized. Cassandra
writes are extremely fast, faster than reads, which mitigates the extra write penalty.
Cassandra supports compound keys, and data is grouped together on disk by the partition key
(important! more on this later).
You cannot use a WHERE clause to filter on columns that aren’t part of the row key or a
secondary index. Partition keys must be queried using the = operator or IN statements.
These rules and features keep you from shooting yourself in the foot.
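For example, against the folder_contents table we’ll build later, the first query is fine and the second gets rejected (name is neither part of the primary key nor indexed):

SELECT name, inv_type FROM folder_contents WHERE folder_id = ?;        -- OK: partition key
SELECT name, inv_type FROM folder_contents WHERE name = 'primitive';   -- rejected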
CQL Inventory schema design 
Things to keep in mind: 
SL-based viewers don’t request subfolders individually during inventory fetch. The protocol CAN do
this, but instead all folders and subfolders are retrieved as part of the skeleton during login.
All items inside an individual folder are requested at once. We want to optimize reads based on 
this fact and not turn every item into an individual random IO. We can use a compound key to 
achieve this. 
Items are rezzed into the world by their UUID, so we need to map item IDs back to their parent
folder IDs. We’ll do this explicitly and avoid secondary indexes, which, as of this writing, appear
to have staleness issues based on mailing list traffic.
All folders have version numbers that get incremented when items or subfolders are changed,
created, moved, or deleted. We’ll use a special CQL column type called a counter for this.
Our CQL Schema
(The slide shows the CQL table definitions, with the compound primary keys called out.)
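A sketch of the schema, reconstructed from the column names used in the code later in this deck (the authoritative version lives in the github repo linked at the end; the column types here are educated guesses):

CREATE TABLE skeletons (
    user_id uuid,
    folder_id uuid,
    parent_id uuid,
    name text,
    type int,
    PRIMARY KEY (user_id, folder_id)
);

CREATE TABLE folder_contents (
    folder_id uuid,
    item_id uuid,
    name text,
    inv_type int,
    creation_date int,
    owner_id uuid,
    PRIMARY KEY (folder_id, item_id)
);

CREATE TABLE item_parents (
    item_id uuid PRIMARY KEY,
    parent_folder_id uuid
);

CREATE TABLE folder_versions (
    user_id uuid,
    folder_id uuid,
    version counter,
    PRIMARY KEY (user_id, folder_id)
);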
A bit more detail about the design 
You’ll notice that the design of the schema is geared around how the data will be queried. This 
is important because it runs contrary to how we’re used to setting up schemas in the relational 
world where the entities normally closely follow our class model. 
PRIMARY KEY (Partition Key, Clustering Column, Clustering Column, Clus...) 
The reason we’re using compound primary keys is due to the way Cassandra stores data. When 
you use a compound primary key, all the data matching the first component of the compound 
key, known as the partition key, is grouped together. This means that when we query using this 
key alone, or this key with a range of clustering columns, Cassandra is able to retrieve the data 
without seeking out each individual row for the clustering columns. 
This allows us to efficiently read the data from all items inside a folder without performing 
additional seeking for each item.
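That’s also why, in the schema sketched above, fetching everything in a folder boils down to one sequential partition read:

SELECT item_id, name, inv_type, creation_date, owner_id
FROM folder_contents
WHERE folder_id = ?;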
To the code! .. But first 
A few things to remember: 
Since we’re maintaining a denormalized dataset, we need to make sure updates to item/folder
parentage and versioning are reflected in all related tables. We can make these queries via
batches. As of Cassandra 1.2, batches are atomic by default, which means there is less of a
chance of inconsistencies slipping in.
Remember 
Moving a folder requires you to alter the skeletons table, and update the folder_versions table. 
Renaming a folder requires you to alter skeletons, folder_contents, and folder_versions tables. 
Moving or renaming an item requires you to alter folder_contents, folder_versions, and item_parents (a move is sketched below).
Deleting folders and items requires hits to all associated tables. 
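For instance, moving an item between folders might look like the following sketch. CONTENT_DELETE_STMT, CONTENT_INSERT_STMT, ITEM_PARENT_UPDATE_STMT, and the InventoryItem fields are hypothetical, following the same pattern as the real prepared statements on the next slides:

public void MoveItem(InventoryItem item, Guid newFolderId)
{
    // Hypothetical prepared statements; all three denormalized tables
    // must agree on the item's new parent
    var batch = new BatchStatement()
        // remove the row from the old folder's partition...
        .Add(CONTENT_DELETE_STMT.Bind(item.FolderId, item.ItemId))
        // ...rewrite it under the new folder's partition...
        .Add(CONTENT_INSERT_STMT.Bind(newFolderId, item.ItemId, item.Name,
                                      item.InvType, item.CreationDate, item.OwnerId))
        // ...and repoint the reverse item -> folder mapping
        .Add(ITEM_PARENT_UPDATE_STMT.Bind(newFolderId, item.ItemId));

    _session.Execute(batch);

    // folder_versions uses counters, which can't ride in the batch
    // (see the VersionInc slide), so bump both folders separately
    VersionInc(item.OwnerId, item.FolderId);
    VersionInc(item.OwnerId, newFolderId);
}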
OK! NOW TO the CODE and questions!
Some CQL samples 
FOLDER_ATTRIB_INSERT_STMT = _session.Prepare(
    "INSERT INTO folder_contents " +
    "(folder_id, item_id, name, inv_type, creation_date, owner_id) " +
    "VALUES (?, ?, ?, ?, ?, ?);");

FOLDER_ATTRIB_INSERT_STMT.SetConsistencyLevel(ConsistencyLevel.Quorum);
Example insert with batch 
Remember: to insert a folder, we need to insert into the skeletons table, the folder_versions table, and
the folder_contents table.

public void CreateFolder(InventoryFolder folder)
{
    // skelInsert and contentInsert are bound prepared INSERTs for the
    // skeletons and folder_contents tables, built like the sample on the
    // previous slide
    var batch = new BatchStatement()
        .Add(skelInsert)
        .Add(contentInsert);

    _session.Execute(batch);

    // folder_versions is a counter table, so it can't join the batch
    VersionInc(folder.OwnerId, folder.FolderId);
}
What up with VersionInc()?
We can’t include a counter table as part of a batch with other non-counter tables, so unfortunately
we need to increment the counter separately.

FOLDER_VERSION_INC_STMT = _session.Prepare(
    "UPDATE folder_versions SET version = version + 1 " +
    "WHERE user_id = ? AND folder_id = ?;");

FOLDER_VERSION_INC_STMT.SetConsistencyLevel(ConsistencyLevel.Quorum);

private void VersionInc(Guid ownerId, Guid folderId)
{
    var versionInc = FOLDER_VERSION_INC_STMT.Bind(ownerId, folderId);
    _session.Execute(versionInc);
}
Thank you 
The full source code with unit test coverage is available on github at: 
https://github.com/InWorldz/opensim-cql-inventory 
Thanks for stopping by!
David Daeschler (Tranquillity Dexler) 
Co-Founder 
InWorldz, LLC
