Scaling OpenSimulator 
Inventory using NoSQL 
DAVID DAESCHLER 
INWORLDZ, LLC 
OpenSimulator Community Conference 2014
Who are you? Why should I watch? 
You’re not the boss of me! 
Hello! I’m David Daeschler, also known as Tranquillity Dexler. I am a partner and software architect 
over at InWorldz, LLC. 
I designed and deployed an LSL compiler, virtual machine, and script runtime named Phlox to 
eliminate CPU and memory issues caused by user scripts. 
We developed PhysX physics integration for stable rigid body dynamics and vehicle functionality. 
I’ve designed scale-out asset services that now run across 11 servers (10 TB of data), and an inventory
system on top of Apache Cassandra that now runs on 8 nodes holding about 250 GB of data.
We routinely handle over 300 concurrent users on the grid and we’ve peaked out just shy of 500 
concurrent users without experiencing backend faults or load issues. 
We’ve experienced and conquered more than a few scaling problems while running our OpenSim-derived
grid over the past 5 years. It’s been a school of hard knocks, and we’d like to share some of our
experiences and solutions.
Oh noes! Inventory woes! 
You’re running an opensimulator grid, and everything is going just great! 
... Until concurrency and inventory size picks up, then out of the blue: 
Your users are having trouble logging in 
You’re starting to see timeouts on MySQL for inventory operations 
Your inventory tables are getting really huge. People insist on keeping 100,000+ items containing
at least 400 copies of “primitive”. You hit 100 GB of data and realize scaling up won’t be a viable
option much longer.
“My Inventory stopped downloading at 75,000 items, but I have 999,999!” 
“My pony avatar is missing!”
Federation only helps so much 
Even if your grid is part of the Hypergrid, it may still become wildly popular. In that case, manual
federation becomes a nonstarter: trying to predict growth or loss and setting up multiple “your grid-1,
your grid-2” instances just isn’t a workable way to keep up with fast changes in demand.
Manual self-federation to provide scale out for growth also would require advanced software 
tools to set up entirely new backend and frontend services. Shard keys would have to be chosen, 
and users would have to be manually distributed against the new set of servers every time your 
grid needed to scale again. Essentially, each shard instance that was spun up would require 
doing everything again that you had to do for your grid initially.
MySQL read slaves/scale out 
MySQL supports read-only slaves out of the box that can help you when your workload is 
dominated by reads. 
Unfortunately, this would only allow you to scale out until writes became the bottleneck for your 
master MySQL server, at which point the master and all slaves would have to be scaled UP with 
better hardware. 
We quickly started to see I/O wait numbers climb because of the volume of writes hitting MySQL.
People love to buy stuff and give stuff to each other. Once we got past a certain point, tuning was
no longer an option. It was either get a better master server, or replace the MySQL backend with
something else.
Apache 
TO THE RESCUE!
Apache Cassandra 
Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, 
Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more 
companies that have large, active data sets. 
It is a distributed, scale-out, fault tolerant database with tunable consistency. 
Benefits 
Your data is replicated onto multiple servers that can even span different datacenters 
You can lose one or more servers in a cluster and still stay up and running with zero downtime 
and zero data loss. This goes well beyond simple RAID. 
Seeing the load on the backend increase beyond your comfort level? Simply add new servers to the
cluster with ZERO downtime.
But WAIT! Cassandra is eventually 
consistent! What about ACID?! 
I’d hate to break it to you, but a traditional RDBMS scale out solution with a single master and 
one or more slaves is also eventually consistent. 
“LIES!” you say. No, seriously. There’s a metric you can query in a MySQL setup called slave lag.
This number tells you exactly how far behind a slave is from its master. The slave will never be
exactly up to date with the master as long as the master is taking constant writes, and reading
from the slave may return results from the past. Application designers need to keep this in mind as
much as they need to understand Cassandra’s eventual consistency.
It turns out that Cassandra has tunable consistency and can offer better guarantees to obtain a 
consistent read than traditional scale out on an RDBMS. This is because we can tell Cassandra to 
write to a set of nodes, and not return until a quorum of them have responded that they have 
written the new value. When we again read at quorum consistency, we are guaranteed to see 
the most up to date value!
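To put numbers on that: QUORUM means floor(RF / 2) + 1 replicas. With a replication factor of RF = 3, a quorum is 2 nodes, so a quorum write (W = 2) plus a quorum read (R = 2) gives R + W = 4 > RF = 3. The read set and the write set must overlap in at least one node, and that node holds the latest value.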
It is based on Dynamo
Dynamo was introduced by Amazon in a 2007 paper as a solution for a highly available distributed
data store. Amazon works at massive scale, and even a few minutes of downtime means they
lose a ton of money.
The Dynamo paper has a few important implementation details that Cassandra borrows:
Data is automatically sharded based on the consistent hash of the primary key and replicated to N
hosts in a hash ring.
Hinted handoff helps bring the dataset back into convergence during temporary failures.
Storage nodes can be added and removed without interruption of service.
Consistent hashing 
Your data is automatically divided up between storage nodes based on the value of the 
consistent hash of a row’s primary key 
Each of nodes a, b, c, d would own 25% of your primary keys and 25% of your data at Replication Factor (RF) = 1.
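Here is a minimal sketch of the idea in C#. This is not Cassandra’s actual partitioner (it ships Murmur3- and MD5-based partitioners); it just shows how hashing a key onto a fixed ring of tokens picks an owner with no lookup table:

using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class HashRing
{
    // Four nodes with tokens spaced evenly around a 64-bit ring,
    // matching the a, b, c, d example above
    static readonly Tuple<string, ulong>[] Ring =
    {
        Tuple.Create("a", 0UL),
        Tuple.Create("b", ulong.MaxValue / 4),
        Tuple.Create("c", ulong.MaxValue / 2),
        Tuple.Create("d", ulong.MaxValue / 4 * 3)
    };

    // A node owns the keys that hash into (previous token, its token], so the
    // owner is the first node whose token is >= the hash, wrapping past "d"
    // back around to "a". With RF > 1, the next RF-1 nodes on the ring would
    // hold the replicas.
    public static string OwnerOf(string primaryKey)
    {
        ulong hash;
        using (var md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(primaryKey));
            hash = BitConverter.ToUInt64(digest, 0);
        }

        var owner = Ring.FirstOrDefault(n => n.Item2 >= hash);
        return owner != null ? owner.Item1 : Ring[0].Item1;
    }
}

HashRing.OwnerOf(folderId.ToString()) always lands a given key on the same node, and adding a node only moves the keys in one slice of the ring.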
Using quorum reads and writes to achieve 
Consistency and Partition Tolerance 
When you write and read to/from a quorum of nodes you will get a consistent view of the data, 
and you will be able to tolerate a node or network outage. An example quorum is 2 out of 3 
nodes that form a majority. 
WRITE “HELLO!” TO A AND C
Node A dies → we still read “HELLO” from C, and we stay running!
Simple Cassandra setup with Docker 
If you haven’t heard of Docker yet, you need to check it out: https://www.docker.com 
Docker lets you package an app and all of its dependencies in a portable container. 
We’ll use the prebuilt Cassandra container from https://github.com/tobert/cassandra-docker to 
build our demo on. (By the way, Al Tobey is an awesome guy and you should follow him on 
twitter @AlTobey) 
Once Docker is downloaded and set up, starting a single-node Cassandra cluster is super easy. Assuming the image from that repo is tagged with the Cassandra version, it’s a one-liner along the lines of:
docker run -d tobert/cassandra:2.0.10
Alternatively, if you’re on Windows, grab the latest release from http://cassandra.apache.org/ and run cassandra.bat
CQL: Like SQL but different 
Originally when Cassandra made its debut, the only way to get at the data was to use Thrift calls 
that pulled and updated columns very much like working with a hash set. 
Cassandra then developed CQL (Cassandra Query Language) which is a familiar cousin of SQL 
with the following notable exceptions: 
No joins. No GROUP BY. Data in Cassandra is expected to be mostly denormalized. Cassandra
writes are extremely fast, faster than reads, which mitigates the extra write penalty.
Cassandra supports compound keys, and data is grouped together on disk by the partition key
(important! more on this later).
You cannot use a WHERE clause to filter on columns that aren’t part of the row key or a
secondary index. Partition keys must be queried using the = operator or IN statements.
These rules and features keep you from shooting yourself in the foot.
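For example, against the folder_contents table we’ll build later, the first query is fine and the second gets rejected (name is neither part of the primary key nor indexed):

SELECT name, inv_type FROM folder_contents WHERE folder_id = ?;        -- OK: partition key
SELECT name, inv_type FROM folder_contents WHERE name = 'primitive';   -- rejected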
CQL Inventory schema design 
Things to keep in mind: 
SL-based viewers don’t request subfolders individually during inventory fetch. The protocol CAN do
this, but instead all folders and subfolders are retrieved as part of the skeleton during login.
All items inside an individual folder are requested at once. We want to optimize reads based on 
this fact and not turn every item into an individual random IO. We can use a compound key to 
achieve this. 
Items are rezzed into the world by their UUID, so we need to map item IDs back to their parent
folder IDs. We’ll do this explicitly and avoid secondary indexes, which, as of this writing, appear
to have staleness issues based on mailing list traffic.
All folders have version numbers that get incremented when items or subfolders are changed,
created, moved, or deleted. We’ll use a special CQL column type called a counter for this.
Our CQL Schema
(The slide shows the CQL table definitions, with the compound primary keys called out.)
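A sketch of the schema, reconstructed from the column names used in the code later in this deck (the authoritative version lives in the github repo linked at the end; the column types here are educated guesses):

CREATE TABLE skeletons (
    user_id uuid,
    folder_id uuid,
    parent_id uuid,
    name text,
    type int,
    PRIMARY KEY (user_id, folder_id)
);

CREATE TABLE folder_contents (
    folder_id uuid,
    item_id uuid,
    name text,
    inv_type int,
    creation_date int,
    owner_id uuid,
    PRIMARY KEY (folder_id, item_id)
);

CREATE TABLE item_parents (
    item_id uuid PRIMARY KEY,
    parent_folder_id uuid
);

CREATE TABLE folder_versions (
    user_id uuid,
    folder_id uuid,
    version counter,
    PRIMARY KEY (user_id, folder_id)
);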
A bit more detail about the design 
You’ll notice that the design of the schema is geared around how the data will be queried. This 
is important because it runs contrary to how we’re used to setting up schemas in the relational 
world where the entities normally closely follow our class model. 
PRIMARY KEY (Partition Key, Clustering Column, Clustering Column, Clus...) 
The reason we’re using compound primary keys is due to the way Cassandra stores data. When 
you use a compound primary key, all the data matching the first component of the compound 
key, known as the partition key, is grouped together. This means that when we query using this 
key alone, or this key with a range of clustering columns, Cassandra is able to retrieve the data 
without seeking out each individual row for the clustering columns. 
This allows us to efficiently read the data from all items inside a folder without performing 
additional seeking for each item.
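That’s also why, in the schema sketched above, fetching everything in a folder boils down to one sequential partition read:

SELECT item_id, name, inv_type, creation_date, owner_id
FROM folder_contents
WHERE folder_id = ?;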
To the code! .. But first 
A few things to remember: 
Since we’re maintaining a denormalized dataset, we need to make sure updates to item/folder
parentage and versioning are reflected in all related tables. We can make these queries via
batches. As of Cassandra 1.2, batches are atomic by default, which means there is less of a
chance of inconsistencies slipping in.
Remember 
Moving a folder requires you to alter the skeletons table, and update the folder_versions table. 
Renaming a folder requires you to alter skeletons, folder_contents, and folder_versions tables. 
Moving or renaming an item requires you to alter folder_contents, folder_versions, and item_parents (a move is sketched below).
Deleting folders and items requires hits to all associated tables. 
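For instance, moving an item between folders might look like the following sketch. CONTENT_DELETE_STMT, CONTENT_INSERT_STMT, ITEM_PARENT_UPDATE_STMT, and the InventoryItem fields are hypothetical, following the same pattern as the real prepared statements on the next slides:

public void MoveItem(InventoryItem item, Guid newFolderId)
{
    // Hypothetical prepared statements; all three denormalized tables
    // must agree on the item's new parent
    var batch = new BatchStatement()
        // remove the row from the old folder's partition...
        .Add(CONTENT_DELETE_STMT.Bind(item.FolderId, item.ItemId))
        // ...rewrite it under the new folder's partition...
        .Add(CONTENT_INSERT_STMT.Bind(newFolderId, item.ItemId, item.Name,
                                      item.InvType, item.CreationDate, item.OwnerId))
        // ...and repoint the reverse item -> folder mapping
        .Add(ITEM_PARENT_UPDATE_STMT.Bind(newFolderId, item.ItemId));

    _session.Execute(batch);

    // folder_versions uses counters, which can't ride in the batch
    // (see the VersionInc slide), so bump both folders separately
    VersionInc(item.OwnerId, item.FolderId);
    VersionInc(item.OwnerId, newFolderId);
}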
OK! NOW TO the CODE and questions!
Some CQL samples 
FOLDER_ATTRIB_INSERT_STMT = _session.Prepare(
    "INSERT INTO folder_contents " +
    "(folder_id, item_id, name, inv_type, creation_date, owner_id) " +
    "VALUES (?, ?, ?, ?, ?, ?);");

FOLDER_ATTRIB_INSERT_STMT.SetConsistencyLevel(ConsistencyLevel.Quorum);
Example insert with batch 
Remember: to insert a folder, we need to insert into the skeletons table, the folder_versions table, and
the folder_contents table.

public void CreateFolder(InventoryFolder folder)
{
    // skelInsert and contentInsert are bound prepared INSERTs for the
    // skeletons and folder_contents tables, built like the sample on the
    // previous slide
    var batch = new BatchStatement()
        .Add(skelInsert)
        .Add(contentInsert);

    _session.Execute(batch);

    // folder_versions is a counter table, so it can't join the batch
    VersionInc(folder.OwnerId, folder.FolderId);
}
What up with VersionInc()?
We can’t include a counter table as part of a batch with other non-counter tables, so unfortunately
we need to increment the counter separately.

FOLDER_VERSION_INC_STMT = _session.Prepare(
    "UPDATE folder_versions SET version = version + 1 " +
    "WHERE user_id = ? AND folder_id = ?;");

FOLDER_VERSION_INC_STMT.SetConsistencyLevel(ConsistencyLevel.Quorum);

private void VersionInc(Guid ownerId, Guid folderId)
{
    var versionInc = FOLDER_VERSION_INC_STMT.Bind(ownerId, folderId);
    _session.Execute(versionInc);
}
Thank you 
The full source code with unit test coverage is available on github at: 
https://github.com/InWorldz/opensim-cql-inventory 
Thanks for stopping by!
David Daeschler (Tranquillity Dexler) 
Co-Founder 
InWorldz, LLC
