Last week I was in Baltimore attending OmniTI’s Surge Conference. I can’t remember exactly when I first met OmniTI CEO Theo Schlossnagle, but it was at an ApacheCon after he had delivered one of his 3 hour tutorials on Scalable Internet Architectures, back in the early 2000’s. Theo’s been at this scalability business for a long time, and I was sad to have missed the first Surge, which was held last year.
Ben Fried, Google’s CIO started the conference (and one of the major themes) with a “disaster porn” talk. He described a system that he built in a previous life, for a major wall street company. The system had to be very scalable to accommodate the needs of traders. One day, the system started failing, and ended up costing his employer a significant amount of money. In the ensuing effort to get the system working again, he ended up with all the people from the various specializations (development, operations, networking, etc) all stuck in a very large room with a lot of whiteboards. It turned out that no one really understood how the entire system worked, and that issues at the boundaries of the specialties were causing many of the problems. The way that they had scaled up their organization was to specialize, but that specialization caused them to lose an end to end view of the system. Their organization of their people had led to some of the problems they were experiencing, and was impeding their ability to solve the problems. The quote that I most remember was “specialization is an industrial age notion and needs to be discounted in spaces where we operate at the boundary of the known versus unknown”. The lessons that Fried learned on that project have influenced the way that Google works (Site Reliability Engineers as an example), and are similar to the ideas being espoused by the “DevOps” movement. His description of the solution was to “reward and recognize generalist skill and end to end knowledge”. There was a pretty lively Q&A around this notion of generalists.
Mark Imbriaco’s talk was titled “Anatomy of a Failure” in the program, but he actually presented a very detailed account of how Heroku responds to incidents. My background isn’t in operations, so I found this to be pretty interesting and useful. I particularly liked the idea of playbooks to be followed when incidents occur, and that alert messages actually contain links to the necessary playbooks. The best quote from Mark’s talk was probably “Automation is also a great way to distribute failure across an entire system”.
Raymond Blum presented the third of three Google talks that were shoe horned into a single session. He described the kind of problems involved in doing backups at Google scale. Backup is one of those problems that needs to be solved, but is mostly unglamourous. Unless you are Google, that is. Blum talked about how they actually read their backup tapes to be sure that they work, their strategy of backing up to data centers in different geographies, and clever usage of map reduce to parallelize the backup and restore process. He cited the Gmail outage earlier this year as a way of grasping the scale of the problem of backing up a service like GMail, much less all of Google. One way to know if a talk succeeds is if it provokes thoughts. Based on my conversations with other attendees, this one succeeded.
Fellow ASF member Geir Magnusson’s talk was named “When Business Models Attack”. The title alludes to the two systems that Geir described, both of which are designed specifically to handle extreme numbers of users. Geir was the VP of Platform and Architecture at Gilt Groupe, and one description of their model is that every day at Noon is Black Friday. So the Gilt system has to count on handling peak numbers of users every day at a particular time. Geir’s new employer, Function(x), also has a business model that depends on large numbers of users. The challenge is to design systems that will handle big usage spikes as a matter of course, not as a rarity. One of architectures that Geir described involved writing data into a Riak cluster in order to absorb the write traffic, and then using a Node.js based process to do a “write-behind” of that data into a relational database.
There were several technology themes that I encountered during the course of the 2 days:
- Many of the talks that I attended involved the use of some kind of messaging system (most frequently RabbitMQ). Messaging is an important component in connecting systems that are operating a different rates, which is frequently the case in systems operating at high scale.
- Many people are using Amazon EC2, and liking it, but there were a lot of jokes about the reliability of EC2.
One thing that I especially liked about Surge was the focus on learning from failure, otherwise known as a “fascination with disaster porn”. Most of the time you only hear about things that worked, but hearing about what didn’t work is at least as instructive, and in some case more instructive. This is something that (thus far) is unique to Surge.