PHP & Continuous Data Processing
  • Michael Peacock, October 2011
No. Not milk floats (anymore)
  • All-electric commercial vehicles
  • Photo courtesy of kenjonbro: http://www.flickr.com/photos/kenjonbro/4037649210/in/set-72157623026469013
About Michael Peacock
  • Senior/Lead Web Developer
  • Web Systems Developer, Telemetry Team – Smith Electric Vehicles US Corp
  • Author: PHP 5 Social Networking, PHP 5 E-Commerce Development, Drupal Social Networking (6 & 7), Selling online with Drupal e-Commerce, Building Websites with TYPO3
  • PHPNE Volunteer
  • Occasional technical speaker: PHP North-East, PHPNW 2010, SuperMondays, PHPNW 2011 Unconference, ConFoo 2012
Smith Electric Vehicles & Telemetry
  • World’s largest manufacturer of commercial, all-electric vehicles
  • Smith Link: on-board vehicle telematics system, capturing over 2500 data points each second on the vehicle and broadcasting them over the mobile network
  • ~400 telemetry-enabled vehicles on the road
  • World’s largest telemetry project outside of F1
System Architecture
System Architecture
Problem #1: We Can’t Lose Any Data
  • Data is required as part of a $32 million grant from the US Department of Energy
  • Thousands of pieces of information collected every second from a range of remote collection devices
  • Unpredictable amounts of data at any one time
  • More vehicles rolling off the production line with telemetry enabled
  • What about system downtime, upgrades, roll-outs and connectivity problems?
Message Queuing
  • Solution: we use a fast, reliable, scalable, secure, hosted message queue
  • If our systems are offline, data builds up in the external message queue
  • If we are processing at full capacity, the surplus builds up in the message queue
  • If the vehicle loses GPRS signal, or the message queue is inaccessible, vehicles have an internal buffer of up to 7 days
Secret Weapon #1: StormMQ
  • Based on AMQP, an open standard
  • Secure: all data is encrypted and sent over SSL
  • Reliable: huge investment in server infrastructure
  • Hosted: backed up with an SLA
  • Scalable: capable of processing huge numbers of incoming messages, with capacity to store the messages when we perform maintenance on our systems
Problem #2: Processing data quickly
  • We utilise a dedicated server and a number of dedicated applications to pull these messages and process them
  • This needs to happen quickly enough for live data to be seen through the web interface
  • Data is rapidly converted into batch files, which are imported to MySQL via “LOAD DATA INFILE”
  • Results in a high number of inserts per second (20,000 – 80,000)
  • LOAD DATA INFILE isn’t enough on its own...
Secret Weapon #2: DBA
  • Sam Lambert – DBA Extraordinaire
  • Constantly tweaking the servers and configuration to get more and more performance
  • Pushing the capabilities of our SAN, tweaking configs where no DBA has gone before
  • www.samlambert.com
  • http://www.samlambert.com/2011/07/how-to-push-your-san-with-open-iscsi_13.html
  • http://www.samlambert.com/2011/07/diagnosing-and-fixing-mysql-io.html
  • sam.lambert@smithelectric.com
Sharding
  • Huge volumes of data being stored
  • We shard the data based on the truck it came from; each truck has its own database
  • Databases are held on one of many database servers in our cluster, each with ~100GB RAM
Live, Real Time Information
  • [live screen photo]
  • Real Time Status and Tracking
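The batch import and per-truck sharding described in the earlier slides can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `truck_00042` database naming scheme, the `datapoints` table and its column names are all hypothetical, since the deck does not show the real importer. It assumes field values contain no tabs or newlines.

```php
<?php
// Hypothetical helpers: none of these names come from the original system.

/** Map a truck to its own database (one shard per truck, as the slides describe). */
function shardDatabaseName(int $truckId): string
{
    return sprintf('truck_%05d', $truckId); // naming scheme is an assumption
}

/** Write one batch of readings as tab-separated lines; returns the row count. */
function writeBatchFile(string $path, array $rows): int
{
    $handle = fopen($path, 'wb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open batch file: $path");
    }
    foreach ($rows as $row) {
        // MySQL's default LOAD DATA format: tab-separated fields, one row per line.
        fwrite($handle, implode("\t", $row) . "\n");
    }
    fclose($handle);
    return count($rows);
}

/** Build the import statement that loads a batch file into a truck's shard. */
function buildLoadStatement(int $truckId, string $path): string
{
    return sprintf(
        "LOAD DATA INFILE '%s' INTO TABLE `%s`.`datapoints` (`recorded_at`, `variable`, `value`)",
        addslashes($path),
        shardDatabaseName($truckId)
    );
}
```

The statement string would then be run over a connection to whichever server holds that truck's database. One bulk import per batch file avoids issuing one INSERT per data point, which is where the high insert rates come from.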
Live, Real Time Information: Problem
  • Original database design dictated:
      • All data points were stored in the same table
      • Each type of data point required a separate query, sub-query or join to obtain
  • Workings of the remote device collecting the data, and the processing server, dictated:
      • GPS co-ordinates can be up to 6 separate data points, including: longitude; latitude; altitude; speed; number of satellites used to get location; direction
Real Time Information: Concurrent
  • Initial solution from the original developers: pull as many pieces of real time information through asynchronously

  • Involved the use of Flash based “widgets” which called separate PHP scripts to query the data
  • Data points took a little time to load
  • Not good enough
Real Time Information: Caching
  • High volumes of data, and varying levels of concurrent processing, mean query times are often not consistent
  • Memcache was used when processing the data from the message queue, keeping a copy of the most recent value of each data point for each truck
  • Live, real-time information is accessed directly from memcache, bypassing the database
Caching: Registry/DI is Ideal
  • Sporadic use of memcache within the web application: an ideal use case for a lazy loading registry or DI container
  • Give the registry or container details of memcache
  • Object only instantiated and connection made only when data is requested from memcache
Lazy Loading
      public function getObject( $key )
      {
          if( array_key_exists( $key, $this->objects ) )
          {
              // already instantiated: return the stored object
              return $this->objects[ $key ];
          }
          elseif( array_key_exists( $key, $this->objectSetup ) )
          {
              // known but not yet loaded: include its files, instantiate and store
              if( ! is_null( $this->objectSetup[ $key ]['abstract'] ) )
              {
                  require_once( FRAMEWORK_PATH . 'registry/aspects/' . $this->objectSetup[ $key ]['folder'] . '/' . $this->objectSetup[ $key ]['abstract'] . '.abstract.php' );
              }
              require_once( FRAMEWORK_PATH . 'registry/aspects/' . $this->objectSetup[ $key ]['folder'] . '/' . $this->objectSetup[ $key ]['file'] . '.class.php' );
              $o = new $this->objectSetup[ $key ]['class']( $this );
              $this->storeObject( $o, $key );
              return $o;
          }
          elseif( $key == 'memcache' )
          {
              // requesting memcache for the first time, instantiate, connect, store and return
              $mc = new Memcache();
              $mc->connect( MEMCACHE_SERVER, MEMCACHE_PORT );
              $this->storeObject( $mc, 'memcache' );
              return $mc;
          }
      }
  • Becomes the limit for the registry pattern; a DI container is more suitable
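The slide above says a DI container is a better fit than special-casing memcache inside the registry. A minimal sketch of that idea, assuming PHP 7.4 or later, might look like the following. The `Container` class and its `register`/`get` methods are illustrative names, not part of the original framework.

```php
<?php
// Minimal lazy-loading container sketch; class and method names are illustrative.
final class Container
{
    /** @var array<string, callable> factories keyed by service name */
    private array $factories = [];

    /** @var array<string, object> services that have already been built */
    private array $instances = [];

    /** Describe how to build a service without building it yet. */
    public function register(string $key, callable $factory): void
    {
        $this->factories[$key] = $factory;
    }

    /** Build the service on first request, then reuse the same object. */
    public function get(string $key): object
    {
        if (!isset($this->instances[$key])) {
            if (!isset($this->factories[$key])) {
                throw new InvalidArgumentException("Unknown service: $key");
            }
            $this->instances[$key] = ($this->factories[$key])($this);
        }
        return $this->instances[$key];
    }
}

// Example registration (commented out because it needs the Memcache extension):
// $container->register('memcache', function () {
//     $mc = new Memcache();
//     $mc->connect(MEMCACHE_SERVER, MEMCACHE_PORT);
//     return $mc;
// });
```

Because each service is described by a closure, adding another lazily connected resource needs no new branch in the lookup method, which is the limit the registry version runs into.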
Real Time Information: Extrapolate and Assume
  • Our telemetry unit broadcasts each data point once per second
  • Data doesn’t change every second, e.g.
      • Battery state of charge may take several minutes to lose a percentage point
      • Fault flags only change to 1 when there is a fault
  • We compare the data to the last known value… if it’s the same we don’t insert; instead we assume it was the same
  • Unfortunately, this requires us to put additional checks and balances in place
Extrapolate and Assume: “Interlation”
  • Built a special library which:
      • Accepted a number of arrays, each representing a collection of data points for one variable on the truck
      • Used key indicators and time differences to work out if/when the truck was off, and extrapolation should stop
      • For each time data was recorded, pull down data for other variables for consistency
Interlace
      // Add an array to the interlation
      public function addArray( $name, $array )
      // Get the time that we first received data in one of our arrays
      public function getFirst( $field )
      // Get the time that we last received data in any of our arrays
      public function getLast( $field )
      // Generate the interlaced array
      public function generate( $keyField, $valueField )
      // Break the interlaced array down into separate days
      public function dayBreak( $interlationArray )
      // Generate an interlaced array and fill for all timestamps within the range of _first_ to _last_
      public function generateAndFill( $keyField, $valueField )
      // Populate the new combined array with key fields using the common field
      public function populateKeysFromField( $field, $valueField=null )
  • http://www.michaelpeacock.co.uk/interlation-library
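The "compare to the last known value, then assume" approach can be sketched in two small functions. This is not the Interlation library's API: the function names are hypothetical, and the sketch leaves out the "was the truck switched off?" checks that the slides say are needed in practice.

```php
<?php
/**
 * Keep only the readings whose value differs from the previously stored one.
 * $readings maps unix timestamp => value, in ascending time order.
 */
function dropUnchanged(array $readings): array
{
    $kept = [];
    $last = null;
    $first = true;
    foreach ($readings as $time => $value) {
        if ($first || $value !== $last) {
            $kept[$time] = $value;   // value changed (or first reading): store it
            $last = $value;
            $first = false;
        }
    }
    return $kept;
}

/**
 * Rebuild a per-second series by carrying the last stored value forward.
 * Seconds before the first stored reading are filled with null.
 */
function fillForward(array $kept, int $from, int $to): array
{
    $filled = [];
    $current = null;
    for ($t = $from; $t <= $to; $t++) {
        if (array_key_exists($t, $kept)) {
            $current = $kept[$t];
        }
        $filled[$t] = $current;
    }
    return $filled;
}
```

Storing only the changes shrinks the number of inserts considerably for slow-moving values such as state of charge, at the cost of having to reconstruct the series when it is read.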
Real Time Information: Single Request
  • Currently, each piece of “live data” is loaded into a Flash graph or widget, which updates every 30 seconds using an AJAX request
  • The move from MySQL to Memcache reduces database load, but the large number of requests still adds strain to the web server
  • Moving to image and JavaScript widgets, which are updated from a single AJAX request
Lots of Data: Race Conditions
  • Sessions in PHP close at the end of the execution cycle
  • Unpredictable query times
  • Large number of concurrent requests per screen
Session Locking
  • Completely locks out a user’s session, as PHP hasn’t closed the session
Race Conditions: PHP & Sessions
  • session_write_close()
  • Added after each write to the $_SESSION array. Closes the current session.
  • (Requires a call to session_start() immediately before any further reads or writes)
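A rough sketch of the pattern described above, wrapping each session access so the lock is held only briefly. The helper names `setSessionValue` and `getSessionValue` are hypothetical; `session_status()`, `session_start()` and `session_write_close()` are standard PHP functions. As the deck notes elsewhere, re-opening the session only works while no output has been sent.

```php
<?php
/** Write one value to the session and release the session lock straight away. */
function setSessionValue(string $key, $value): void
{
    if (session_status() !== PHP_SESSION_ACTIVE) {
        session_start();        // (re)opens the session and takes the lock
    }
    $_SESSION[$key] = $value;
    session_write_close();      // saves the data and releases the lock
}

/** Read one value from the session without holding the lock afterwards. */
function getSessionValue(string $key)
{
    if (session_status() !== PHP_SESSION_ACTIVE) {
        session_start();
    }
    $value = $_SESSION[$key] ?? null;
    session_write_close();
    return $value;
}
```

With the lock released after each access, a slow query in one request no longer blocks the other concurrent requests from the same user's screen.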
Race Conditions: Use a ******* Template Engine
  • V1 of the system mixed PHP and HTML
  • You can’t re-initialise your session once output has been sent
  • All new code uses a template engine, so session interaction has no bearing on output. When the template is processed and output, all database and session work has been completed long before.
Race Conditions: Use a Single Entry Point
  • Race conditions are further exacerbated by the PHP timeout values
  • Certain exports, actions and processes take longer than 30 seconds, so the default execution time is longer
  • Initially the project lacked a single entry point, and execution flow was muddled
  • A single entry point makes it easier to enforce a lower timeout, which is overridden by intensive controllers or models
Intensive Queries & Calculations
  • How far did this vehicle travel?
      • Motor RPM x various vehicle-specific constants
      • Calculated for every RPM value held during the drive process
  • How much energy did the vehicle use?
      • Battery current x battery voltage x time
      • For every current and voltage value combination held during the driving process
  • How well was the vehicle driven?
      • Harshness of accelerator and brake pedal usage
      • Inappropriate duration of AC / heater on time?
  • What about for a customer’s fleet, or all of our vehicles sold?
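The energy calculation (battery current x battery voltage x time) amounts to summing power over the intervals between samples. A minimal sketch follows, assuming each sample's reading holds until the next sample arrives; the function name and the array layout are illustrative, not taken from the original system.

```php
<?php
/**
 * Integrate battery power over time.
 *
 * $samples is a list of ['time' => unix seconds, 'current' => amps, 'voltage' => volts],
 * sorted by time. Returns the energy used in kilowatt-hours.
 */
function energyUsedKwh(array $samples): float
{
    $joules = 0.0;
    for ($i = 1, $n = count($samples); $i < $n; $i++) {
        $previous = $samples[$i - 1];
        $seconds = $samples[$i]['time'] - $previous['time'];
        // Power (W) = current (A) x voltage (V), held constant until the next sample.
        $joules += $previous['current'] * $previous['voltage'] * $seconds;
    }
    return $joules / 3600000; // 1 kWh = 3.6 million joules
}
```

Every stored sample for the drive has to be visited once, which is why running this per vehicle, per day, across a whole fleet becomes expensive.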
Intensive Queries & Calculations
  • Involves a fair number of queries per vehicle
  • Calculations involve holding this data in memory
  • Processing required for every single record for that piece of data during that day
  • Takes a while!
  • Solution:
      • Calculate information overnight
      • Save it as a compiled report
      • Lookups and comparisons only need to look at the compiled / saved reports in the database
Reports
  • In addition to our calculated reports, we also need to export key bits of information to grant authorities
  • Initially our PHP based export scripts held one database connection per database (~400 databases)
  • Re-wrote to maintain only one connection per server, and switch the database used
  • Toggles to instruct the export to only apply to 1 of the servers at a time
  • Modulus magic to run multiple export scripts per server
Triggers and Events
  • Currently a work-in-progress R&D project, evaluating two options:
  • Golden hammer: use PHP
  • Run PHP as a daemon
  • Continually monitor for specific changes to memcache variables
  • Link into PHP based API to run triggers
The Future
  • More sharding
  • Based on time – keep the individual tables smaller
  • Currently investigating NoSQL solutions as alternatives
  • Do we need as much data as we collect?
  • We need to continually abstract concepts and ideas to make on-going maintenance and expansion easier, especially in terms of mapping code to database shards
  • Expand our DB cluster, more RAM, R&D
  • A much needed design refresh
Conclusions
  • Make the solution scalable from the start
  • Where data collection is critical, use a message queue, ideally hosted or “cloud based”
  • Hire a genius DBA to push your database engine
  • Make use of data caching systems to reduce strain on the database
  • Calculations and post-processing should be done during dead time and automated
  • Add more tools to your toolbox – PHP needs lots of friends in these situations
  • Watch out for session race conditions: where they can’t be avoided, use session_write_close, a template engine and a single entry point
  • Reduce the number of continuous AJAX calls
Q & A
  • Michael Peacock
  • Web Systems Developer – Telemetry Team – Smith Electric Vehicles US Corp
  • michael.peacock@smithelectric.com
  • Senior / Lead Developer, Author & Entrepreneur
  • me@michaelpeacock.co.uk
  • www.michaelpeacock.co.uk
  • @michaelpeacock
  • http://joind.in/3808
  • http://www.slideshare.net/michaelpeacock
Extra information!

Editor's Notes

  • #16 Imagine viewing a customer’s fleet of 30 vehicles on a map: 60 queries refreshing every 30 seconds