February | 2011 | Ted Leung on the Air

I spent three days last week at O’Reilly’s Strata Conference. This is the first year of the conference, which is focused on topics around data. The tag line of the conference was “Making Data Work”, but the focus of the content was on “Big Data”.

The state of the data field

Big Data as a term is kind of undefined in a “I’ll know it when I see it” kind of way. As an example,I saw tweets asking how much data one needed to have in order to qualify as having a Big Data problem. Whatever the complete meaning is, if one exists, there is a huge amount of interest in this area. O’Reilly planned for 1200 people, but actual attendance was 1400, and due to the level of interest, there will be another Strata in September 2011, this time in New York. Another term that was used frequently was data science, or more often data scientists, people who have a set of skill that make them well suited to dealing with data problems. These skills include programming, statistics, machine learning, and data visualization, and depending on who you ask, there will be additions or subtractions from that list. Moreover, this skill set is in high demand. There was a very full job board, and many presentations ended with the words “we’re hiring”. And as one might suspect, the venture capitalists are sniffing around — at the venture capital panel, one person said that he believed there was a 10-25 year run in data problems and the surrounding ecosystem.

The Strata community is a multi disciplinary community. There were talks on infrastructure for supporting big data (Hadoop, Cassandra, Esper, custom systems), algorithms for machine learning (although not as many as I would have liked), the business and ethics of possessing large data sets, and all kinds of visualizations. In the executive summit, there were also a number of presentations from traditional business intelligence, analytics, and data warehousing folks. It is very unusual to have all these communities in one place and talking to each other. One side effect of this, especially for a first time conference, is that it is difficult to assess the quality of speakers and talks. There were a number of talks which had good looking abstracts, but did not live up to those aspirations in the actual presentation. I suspect that it is going to take several iterations to identify the the best speakers and the right areas – par for a new conference in a multidisciplinary field.

General Observations

I did my graduate work in object databases, which is a mix of systems, databases, and programming languages. I also did a minor in AI, although it was in the days before machine learning really became statistically oriented. I’m looking forward to going a bit deeper into all these areas as I look around in the space.

One theme that appeared in many talks was the importance of good, clean data. In fact, Bob Page from eBay showed a chart comparing 5 different learning algorithms, and it was clear that having a lot of data made up for differences in the algorithms, making the quality and volume of the data more important than the details of the algorithms being used. That’s not to say that algorithms are unimportant, just that high quality data is more important. It seems obvious that having access to good data is really important.

Another theme that appeared in many talks was the combination of algorithms and humans. I remember this being said repeatedly in the panel on predicting the future. I think that there’s a great opportunity in figuring out how to make the algorithm and human collaboration work as pleasantly and efficiently as possible.

There were two talks that at least touched on building data science teams, and on Twitter it seemed that LinkedIn was viewed as having one of the best data science teams in the industry. Not to take anything away from the great job that the LinkedIn folks are doing, or the importance of helping people find good jobs, but I hope that in a few years, we are looking up to data science teams from healthcare, energy, and education.

It amused me to see tweets and have discussions on the power of Python as a tool in this space. With libraries like numpy, scipy, nltk, and scikits.learn, along with an interactive interpreter loop, Python is well suited for data science/big data tasks. It’s interesting to note that tools like R and Incanter have similar properties.

There were two areas that I am particularly interested in, and which I felt were somewhat under represented. The issue of doing analysis in low latency / “realtime” scenarios, and the notion of “personal analytics” (analytics around a single person’s data). I hope that we’ll see more on these topics in the future.

The talks

As is the case nowadays, the proceedings from the conference are available online in the form of slide decks, and in some cases video. Material will probably continue to show up over the course of the next week or so. Below are some of the talks I found noteworthy.

Day 1

I spent the tutorial day in the Executive Summit, looking for interesting problems or approaches that companies are taking with their data efforts. There were two talks that stood out to me. The first was Bob Page’s talk Building the Data Driven Organization, which was really about eBay. Bob shared from eBay’s experience over the last 10 years. Probably the most interesting thing he described was an internal social network like tool, which allowed people to discover and then bookmark analytics reports from other people.

Marilyn and Terence Craig presented Retail: Lessons Learned from the First Data-Driven Business and Future Directions, which was exactly how it sounded. It’s conventional wisdom among Internet people that retail as we know it is dead. I came away from this talk being impressed by the problems that retail logistics presents, and by how retail’s problems are starting to look like Internet problems. Or is that vice versa?

Day 2

The conference proper started with the usual slew of keynotes. I’ve been to enough O’Reilly conferences to know that some proportion of the keynotes are given in exchange for sponsorships, but some of the keynotes were egregiously commercial. The Microsoft keynote included a promotional video, and the EnterpriseDB keynote on Day 3 was a bald faced sales pitch. I understand that the sponsors want to get value for the money they paid (I helped sponsor several conferences during my time at Sun). The sponsors should look at the twitter chatter during their keynotes to realize that these advertising keynotes hurt them far more than they help them. Before Strata, I didn’t really know anything about EnterpriseDB except that they had something to do with Postgres. Now I all I know is that they wasted a bunch of my time during a keynote spot.

Day 2 was a little bit light on memorable talks. I went to Generating Dynamic Social Networks from Large Scale Unstructured Data which was in the vendor presentation track. Although I didn’t learn much about the actual techniques and technologies that were used, I did at least gain some appreciation for the issues involved. The panel Real World Applications Panel: Machine Learning and Decision Support only had two panelists. Jonathan Seidman and Robert Lancaster from Orbitz described how they use learning for sort optimization, intelligent caching, and personalization/segmentation, and Alasdair Allan from the University of Exeter described the use of learning and multiagent systems to control networks telescopes at observatories around the world. The telescope control left me with a vaguely SkyNet ish feeling. Matthew Russell has written a book called Mining the Social Web. I grabbed his code off of github and it looked interesting, so I dropped into his talk Unleashing Twitter Data for Fun and Insight. He’s also written 21 Recipes for Mining Twitter, and the code for that is on github as well.

Day 3

Day 3 produced a reprieve on the keynote front. Despite the aforementioned horrible EnterpriseDB keynote, there were 3 very good talks. LinkedIn’s keynote on Innovating Data Teams was good. They presented some data science on the Strata attendees and described how they recruited and organized their data team. They did launch a product, LinkedIn Skills, but it was done in such a way as to show off the data science relevant aspects of the product.

Scott Yara from EMC did a keynote called Your Data Rules the World. This is how a sponsor keynote should be done. No EMC products were promoted, and Scott did a great job of demonstrating a future filled with data, right down to still and video footage of him being stopped for a traffic violation. The keynote provoked you to really thing about where all this is heading, and what some of the big issues are going to be. I know that EMC make storage and other products. But more than that, I know that they employ Product Management people who have been thinking deeply about a future that is swimming with data.

The final keynote was titled Can Big Data Fix Healthcare?. Carol McCall has been working on data oriented healthcare solutions for quite some time now, and her talk was inspirational and gave me some hope that improvements can happen.

Day 3 was the day of the Where’s the Money in Big Data? panel, where a bunch of venture capitalists talked about how they see the market and where it might be headed. It was also the day of two really good sessions. In Present Tense: The Challenges and Trade-offs in Building a Web-scale Real-time Analytics System, Ben Black described Fast-IP’s journey to build a web-scale real-time analytics system. It was an honest story of attempts and failures as well as the technical lessons that they learned after each attempt. This was the most detailed technical talk I attended, although the terms distributed lower dimensional cuboid and word-aligned bitmap index were tossed around, but not covered in detail. It’s worth noting that Fast-IP’s system and Twitter’s Analytics system, Rainbird, are both based, to varying degrees, on Cassandra.

I ended up spending an extra night in San Jose so that I could stay for Predicting the Future: Anticipating the World with Data, which was in the last session block of the conference. I think that it was worth it. This was a panel format, but each panelist was well prepared. Recorded Future is building a search engine that uses the past to predict the future. They didn’t give out much of their secret sauce, but they did say that they have built a temporally based index as opposed to a keyword based one. Unfortunately their system is domain specific, with finance and geopolitics being the initial domains. Palantir Technologies is trying to predict terrorist attacks. In the abstract, this means predicting in the face of an adaptive adversary, and in contexts like this, the key is to stop thinking in terms of machine learning and start thinking in terms of game theory. It seems like there’s a pile of interesting stuff in that last statement. Finally, Rion Snow from Twitter took us through a number of academic papers where people have successfully made predictions about box office revenue, the stock market, and the flu, just from analyzing information available via Twitter. I had seen almost all of the papers before, but it was nice to feel that I hadn’t missed any of the important results.

Talks I missed but had twitter buzz

You can’t go to every talk at a conference (nor should you, probably), but here are some talks that I missed, but which had a lot of buzz on Twitter. MAD Skills: A Magnetic, Agile and Deep Approach to Scalable Analytics – the hotness of this talk seemed related more to the DataWrangler tool (for cleansing data) than the MAD library (scalable analytics engine running inside Postgres) itself. Big Data, Lean Startup: Data Science on a Shoestring seemed like it had a lot of just good commonsense about running in a startup in addition to know how to do data science without doing overkill. Joseph Turian’s New Developments in Large Data Techniques looked like a great talk. His slides are available online, as well as the papers that he referenced. It seemed like the demos were the topic of excitement in Data Journalism: Applied Interfaces, given jointly by folks from ReadWriteWeb, The Guardian, and The New York Times. Rainbird is Twitter’s analytics system, which was described in Real-time Analytics at Twitter. Notable news on that one is that Twitter will be open sourcing Rainbird once the requisite version of Cassandra is released.

Evening activities

There were events both evenings of the show, which made for very long days. On Day 1 there was a showcase of various startup companies, and on Day 2, there was a “science fair”. In all honesty, the experience was pretty much the same both nights. Walk your way around some tables/pedestals, and talk to people who are working on stuff that you might think is cool. The highlights for me were:

Jeremie Miller’s Locker project (github) (written on Node.js)
Impure, a visual tool for building data visualizations – here are some Strata/Twitter visualizations built with Impure
DrawnToScale, a big data database being developed here in Seattle

Links

Here is a bunch of miscellaneous interesting links from the conference:

Data BootCamp slides
The book Elements of Statistical Learning (in PDF form, too)
The paper The Sensemaking Process and Leverage Points for Analyst Technology as Identified Through Cognitive Task Analysis, recommended by Mark Madsen during his keynote
The book The Fourth Paradigm: Data-Intensive Scientific Discovery (in PDF form, too), recommended by Werner Vogels during his keynote
The $3M Heritage Health Prize to develop an algorithm to predict and prevent unnecessary hospitalizations
Public domain government information
All data covered by the Guardian since 2009
40 Fascinating blogs for the Ultimate Statistics Geek
A List of Data Markets

Tweet Mining

Finally, no conference on data should be without it’s own Twitter exhaust. So I’ll leave you with some analysis and visualizations done on the tweets from Strata.