Monthly Archive for March, 2012

Strata 2012

Here’s a roundup of last week’s Strata conference.


This year, the O’Reilly team introduced a new tutorial day track, called “Jumpstart”. This track was more oriented towards the business side of big data, and I think that the word MBA actually appeared in the marketing. I think that the track was a success, and was very appropriate. The effect of the next generation of data oriented technologies and applications is going to be very significant, and will have a big impact on the way that businesses operate. It’s very important that technologists and business people work closely in order to produce the best results.

There were two talks that stood out for me. The first was Avinash Kaushik’s What Marketers can learn from Analysis. Kaushik is a very entertaining and dynamic speaker, and he has had a lot of experience working to help companies use analytics effectively. In his world, processing and storage are 10% of what you need, and analysts – humans – are the other 90%. In other words, technology is not nearly as important as having people who can ask the right questions and verify hypotheses experimentally. And even good analysis is not enough. Organizations must be able to act on the results of analysis. I have been (and will continue to be) interested in the ability to use data as quickly as it is collected. Some people call this a “real-time” data capability, although in computer science terms, this is a misnomer. One of the best quotes from Kaushik’s talk was “If you do not have the capacity to take real time action, why do we need real time data?” Without the ability to act, all the data collection and analysis in the world is fruitless. Kaushik’s claim was that we must remove all humans from the process in order to achieve this. Back to analysis, Kaushik feels that the three key skills of data analysis are: the scientific method, design of experiments, and statistical analysis.

The second talk was 3 Skills of a Data Driven CEO by Diego Saenz. I liked his notion that a company’s data is a raw material, just like any other raw material that might be used by a company. Raw materials must be collected, mined, purified, and transformed before they can turn into a product, and so with a company’s data. The most important information that I got out of this talk was the case study that he presented on Bob McDonald, the CEO of Procter & Gamble. P&G has built a business wide real time information system called Business Sphere. One manifestation of Business Sphere is a pair of 8 foot high video screens that sit in the conference room used by the CEO for his regular staff meeting. Real time data on any aspect of the company’s operations can be displayed on these screens, discussed and acted upon at the CEO staff level. Also of note is that a data analyst attends the CEO staff meeting in order to facilitate discussion and questions about the data. I remember back in the 2000s when Cisco talked about how they could close their books in a day. Now we have the world’s largest consumer products company with a real time data dashboard in the CEO’s conference room. The bar is being raised on all companies in all industries.


I felt that the talks in the regular conference were weaker than last year. Part of that may be due to my talk selection – there were lots of tracks, and in some cases it was hard to figure out which talks to pick. I tend to seek out unusual content, which means more risk in terms of a “quality” talk. The advent of the O’Reilly all access pass has taken some of the risk out, since that pass gives you access to the full video archive of the entire conference. The topic of video archives is probably content for another blog post. I know that there are some talks that I missed that I want to watch the videos for, but apparently, I’ll need to wait several weeks. It will be interesting to contrast that with this week’s mostly volunteer run PyCon, which has a great track record of getting all their videos up on the web during the conference, for no fee.

Talks which were easy to remember included Sam Shah’s Collaborative Filtering with MapReduce, which included a description of how to implement collaborative filtering on Hadoop, but more importantly discussed many of the issues around building a production worthy version of such a system. It’s one thing to implement a core algorithm. It’s another to have all the rest of the infrastructure so that the algorithm can be used for production tasks.
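To make the “core algorithm” part concrete, here is a minimal sketch of item co-occurrence collaborative filtering written as a map step and a reduce step. It is plain Python rather than Hadoop, it is not the system Shah described, and the function names and toy data are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(user_items):
    """Emit ((item_a, item_b), 1) for every pair of items one user touched."""
    for items in user_items.values():
        for a, b in combinations(sorted(set(items)), 2):
            yield (a, b), 1


def reduce_phase(pairs):
    """Sum the counts for each item pair, as a reducer would do per key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def recommend(item, cooccurrence, top_n=3):
    """Rank other items by how often they co-occur with `item`."""
    scores = {}
    for (a, b), count in cooccurrence.items():
        if a == item:
            scores[b] = count
        elif b == item:
            scores[a] = count
    # Highest count first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda other: (-scores[other], other))[:top_n]


# Hypothetical toy data: user -> items viewed.
user_items = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "c"],
    "u4": ["a", "b", "d"],
}
cooccurrence = reduce_phase(map_phase(user_items))
```

Calling recommend("a", cooccurrence) ranks the items most often seen alongside "a". Everything that makes this production worthy (scheduling, monitoring, data cleaning, serving the results) is exactly the part that is missing here.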

A large portion of the data that people are interested in analyzing is coming from social networks. I attended Marcel Salathé’s Understanding Social Contagion in the hopes of gaining some greater insight into virality. Salathé works at an infectious disease center and he spent a long time comparing biological contagion with internet virality. I didn’t find this to be particularly enlightening. However, in the last third of the talk, he started talking about some of the experimental work that his group had done, which was a little more interesting. The code for his system is available on github.

I really enjoyed DJ Patil’s talk Data Jujitsu: The Art of Turning Data into Product. According to Patil, data jujitsu is using data elements in an iterative way to solve otherwise impossible data problems. A lot of his advice had to do with starting small and simple, and moving problems to where they were easiest to solve, particularly in conjunction with human input. As an example, he discussed the problem of entity resolution in one of the LinkedIn products, and described how they moved the problem from the server side, where it was hard, to the client side, where it was easy if you asked the user a simple question. The style he discussed was iterative, opportunistic, and “lazy”.

Jeremy Howard from Kaggle talked about From Predictive Modelling to Optimization: The Next Frontier. Many companies are now building a lifetime value model of a customer, and some companies are even starting to build predictive models. Howard’s claim was that the next steps in the progression are to take these models and use them to build simulations. Once we have simulations, we can then use optimization algorithms on the inputs to the simulation, and optimize the results in the direction


Last year, I was pretty unhappy with a number of the keynotes, which were basically vendor pitches. This year things were much better, although there were one or two offenders. Microsoft was NOT one of the offenders. Dave Campbell’s Do We Have The Tools We Need To Navigate The New World Of Data? was one of the better Microsoft keynotes that I’ve seen at an O’Reilly event (i.e. out of the Microsoft ecosystem). The talk included good non-Microsoft specific discussion of the problems, references to academic papers (each with at least one Microsoft author), and a friendly, collegial, non-patronizing tone. I hope that we’ll see more of this from Redmond.

Avinash Kaushik had a keynote spot, and one of the most entertaining, but insightful, slides was an infamous quote from Donald Rumsfeld:

[T]here are known knowns; there are things we know we know.

We also know there are known unknowns; that is to say we know there are some things we do not know.

But there are also unknown unknowns – there are things we do not know we don’t know.

Kaushik was very keen on “unknown unknowns”. These are the kind of things that we are looking to find, and which analytics and big data techniques might actually help discover. He demonstrated a way of sorting data which leaves out the extremes, and leaves the rest of the data, which is likely where the unknown unknowns are hiding.
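I didn’t capture the exact procedure Kaushik used, so the following is only my guess at the general idea: sort the values, drop a slice from each end, and look at what remains. The function name and the trim_fraction parameter are my own invention.

```python
def middle_of_sorted(values, trim_fraction=0.25):
    """Sort values and drop trim_fraction of them from each end.

    This is an assumed reading of "leave out the extremes", not
    Kaushik's actual method.
    """
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # how many to drop per end
    return ordered[k:len(ordered) - k]


# Hypothetical metric values with one obvious outlier at each end.
sample = [100, 1, 5, 7, 6, 4, 3, 0]
middle = middle_of_sorted(sample)
```

With the sample above, the very large and very small values are set aside, and the remaining middle band is what you would then dig into.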

I’ve been a fan of Hal Varian ever since I read his book “Information Rules: A Strategic Guide to the Network Economy” back during the dot-com boom. On the one hand, his talk, Using Google Data for Short-term Economic Forecasting, was basically a commercial for Google Insights for Search. On the other hand, the way that he used it and showed how it was pretty decent for economic data was interesting. There were several talks that included the use of Google Insights for Search. It’s a tool that I’ve never paid much attention to, but I think that I’m going to rectify that.

The App

This is the first O’Reilly conference I’ve attended where they had a mobile app. There were iPad, iPhone, and Android versions. I only installed the iPad version, and I really liked it. I used it a lot when I was sitting in sessions to retrieve information about speakers, leave ratings and so forth. I’d love to see links to supplemental materials appear there. I also liked the fact that the app synced to the O’Reilly site, so that my personal schedule was reflected there. I didn’t like the fact that the app synced to the O’Reilly website because the WiFi at the conference was slow, and I often found myself waiting for those updates to finish before I could use the app. The other interesting thing was that I preferred the daily paper schedule when I was walking the hall between sessions. Part of this was due to having to wait for those updates, but part of it was that there was no view in the app that corresponded to the grid/track view of the paper schedule. More work to do here, but a great start.

Final thoughts

This year’s attendance was over 2300, up from 1400 last year, and I saw badges from all sorts of companies. It is apparent to me that the use of data and analytics being discussed at Strata is going to be the new normal for business.