Tag Archives: data

Strange Loop 2012

I think that the most ringing endorsement that I can give Strange Loop is that it has been a very long time since I experienced so much agony when trying to pick which talks to go to during any given block.

Emerging Languages Camp

This year Strange Loop hosted the Emerging Languages Camp (ELC), which previously had been hosted at OSCON. I liked the fact that it was its own event, not yet another track in the OSCON panoply. That, coupled with a very PLT oriented audience this year, made Strange Loop a much better match for ELC than OSCON.

I definitely went into ELC interested in a particular set of talks. There is a lot of buzz around big data, and some of the problems around big data and data management more generally. Also I did my graduate work around implementing “database programming languages”, so there was some academic interest to go along with the practical necessity. There were three talks that fell into that bucket: Bandicoot: code reuse for the relational model, The Reemergence of Datalog, and Julia: A Fast Dynamic Language for Technical Computing.

I found Bandicoot a little disappointing. I think that the mid 90′ work of Buneman’s group at UPenn on Structural Recursion as a Query Language and Comprehension Syntax would be a better basis for a modulary and reusable system for programming relations.

Logic Programming may be making a resurgence via the work on core.logic in Clojure and the influence of Datalog on Cascalog, Datomic and Bloom. The Reemergence of Datalog was tutorial on Datalog for those who had never seen it before, as well as a survey of Datalog usage in those modern day systems.

Julia is a language that sits in the same conceptual space as R, SAS, SPSS, and so forth. The problem with most of those systems is that they were designed by statisticians and not programmers. So while they are great for statistical analysis, they are less good for statistical programming. Julia aims to improve on this, while adding support for distributed compuation and a very high performance implementation. There’s no decisive winner in the technical computing space, and it seems like Julia might have a chance to shine.

There were, of course, some other interesting language talks at ELC.

Dave Herman from Mozilla talked about Rust for the first time (at least to a large group). Rust is being developed as a systems programming language. There are some interesting ideas in it, particularly a very Erlang like concurrency model. At the same time, there were some scary things. Part of what Rust is trying to do is achieve performance, and part of how this happens is via explicit specification of memory/variable lifetimes. Syntactically this is accomplished via punctuation prefixes, and I was wondering if the code was going to look very Perl-ish. servo is browser engine that is being written in Rust, and looking at the source code of a real application will help me to see whether my Perlishness concern is valid.

Elixir: Modern Programming for the Erlang VM looks like a very nice way to program atop BEAM (the Erlang VM). Eliminating the prolog inspired syntax goes a long way, and it appears that Elixir also addresses some of the issues around using strings in Erlang. It wasn’t clear to me that all of the string issues have been addressed, but I was definitely impressed with what I saw.

Strange Loop Talks and Unsessions

I’m going to cover these by themes. I’m not sure these are the actual themes of the conference, but they are the themes that emerged from the talks that I went to.

First, and unsurprisingly, a data theme. The opening keynote, In Memory Databases: the Future is Now! was by Mike Stonebraker. It’s been a long time since I saw Stonebraker speak – I think that the last time was when I was in graduate school. He was basically making the case that transaction processing (TP) is not going away, and that there might be applications for a new generation of TP systems in some of the places where the various NoSQL systems are now being used. Based on that hypothesis/assumption, he then went on to describe the trends in modern systems and how they would lead to different design, much of which is embodied in VoltDB. This was a very controversial talk, at for some people. I considered the trend/system analysis part to be reasonable in a TP setting. I’m not sure that I agree with his views on the applicability of TP, but I’m fairly sure that time will sort all of that out. I think that this is an important point for the NoSQL folks to keep in mind. When the original work on RDBMS was done, it was mocked, called impractical, not useful and so forth. It took many years of research and technology development. I think that we should expect to see something similar with NoSQL, although I have no idea how long that timeline will be.

Nathan Marz’s talk Runaway Complexity in Big Data… and a plan to stop it. was basically making the case for, and explaining the hybrid/combined batch/realtime architecture that he pioneered at BackType, and which is now in production at Twitter. That same architecture led to Cascading and Storm, which are pretty interesting systems. Marz is working on a book with Manning that will go into the details of his approach.

The other interesting data talks revolved around Datomic. Unfortunately, I was unable to attend Rich Hickey’s The Database as a Value, so I didn’t get to hear him speak directly about Datomic. There are several Datomic related videos floating around, so I’ll be catching up on those. I was able to attend the evening unsession Datomic Q&A / Hackfest. This session was at 9pm, and was standing room only. I didn’t have quite enough background on Datomic to follow all of what was said, but I was very interested by what I saw: the time model, the immutability of data which leads to interesting scalability, the use of Datalog. I’m definitely going to be looking into it some more. The one thing that troubles me is that it is not open source. I have no problem with a paid supported version, but it’s hard to make the argument for proprietary system or infrastructure software nowadays.

Another theme, which carried over from ELC was logic programming. I had already heard Friedman and Byrd speak at last fall’s Clojure/conj, and I was curious to see where they have taken miniKanren since then. In their talk, Relational Programming in miniKanren, they demonstrated some of what they showed previously, and then they ran out of material. So on the fly, they decided to implement a type inferencer for simple lambda terms live on stage. Not only were they able to finish it, but since it was a logic program, they were also able to run it in reverse, which was pretty impressive. I was hoping that they might have some additional work on constraints to talk about, but other than disequality constraints, they didn’t discuss anything. Afterwards in Twitter, Alex Payne pointed out that there are some usability issues with miniKanren’s API’s. I think that this is true, but it’s also true that this is a research system. You might look at something like Clojure’s core.logic for a system that’s being implemented for practitioners.

David Nolen did an unsession Core Logic: A Tutorial Reconstruction where he walked the audience through the operation of core.logic, and by extension, miniKanren, since the two systems are closely related. He pointed out that he read parts of “The Reasoned Schemer” 8 times until he understood it enough to implement it, and then he found that he didn’t really understand it until after the implementation was done. There was also a large crowd in this session, and Christopher Petrelli made a video recording on his phone, since InfoQ wasn’t recording the unsessions.

The final talk in the logic programming them was Oleg Kiselyov’s talk Guess lazily! Making a program guess and guess well. Kiselyov has been around for a long time and written or coauthored many important papers related to Scheme and continuations. I’ve be following (off and on) his work for a long time, but this is the first time I was at a conference where he was speaking. I was shocked to find that the room was packed. His talk was about how to defer making the inevitable choices required by non-determinism, especially in the context of logic type systems. His examples were in OCaml, which I had some trouble following, but after Friedman and Byrd the day before, he apparently felt compelled to write a type inferencer that could be run backwards as well. His code was a bit longer than the miniKanren version.

The next theme is what I’d call effective use of functional programming. The first talk was Stuart Sierra’s Functional Design Patterns. This was a very worthwhile talk, which I won’t attempt to summarize since the slides are available. Needless to say, he found a number of examples that could be called design patterns. This was one of the talks where I need to sit down and look at the patterns and think on them for a while. That’s hard to do during the talk (and the conference, really). Some things require pondering, and this is one of them.

The other talk in this category was Graph: composable production systems in Clojure, which described the Prismatic team’s approach to composing systems in Clojure. What they have is an abstraction that allows them to declaratively specify how the parts of the system are connected. For a while it just looked to me like a way to encode a data flow graph in a Clojure abstraction. The aha moment was when he showed how they use Clojure metadata to annotate the arguments or pipe connectors if you will. The graphs can be compiled in a variety of ways including Clojure lazy maps, which present some interesting possibilities. Unfortunately, I had to leave half way through the talk, so I missed the examples of how the apply this abstraction in their system.

Theme number four was programming environments. I hesitate to use the term IDE, because it connotes a class of tools that is loved by some, reviled by others, and when you throw that term around, it seems to limit people’s imagination. I contributed to the Kickstarter for Light Table, so I definitely wanted to attend Chris Granger’s talk Behind the Mirror: The birth of Light Table. Chris gave a philosophical preamble before showing off the current version of Light Table. He demonstrated adding support for Git in a short amount of code, and went on to demonstrate a mode for developing games. He said that they are planning to release version 1 sometime in May, and that Light Table will be open source. I also learned that Kickstarter money is counted as revenue, so they have lost a significant amount of the donations to taxes, which is part of the reason that Kodawa participated in Y Combinator, and is trying to raise some money to get a bigger team.

Not long after the Light Table kickstarter, this video by Bret Victor made the rounds. It went really well with all the buzz about Light Table, and Alex Miller, the organizer of Strange Loop, went out and persuaded Bret to come and talk. Bret’s title was Taking off the Blindfold, and I found this to be a very well motivated talk. In the talk, Bret talked about the kinds of proerties that our programming tools should have. The talk was vey philosophical despite the appearance of a number of toy demos of environment features.

During both of these talks there was a lot of chatter. Some was harking back to the Smalltalk (but sadly, not the Lisp Machine) environments,while some questioned the value of a more visual style of tools (those emacs and vi graybeards). When I first got into computers I read book called “Interactive Programming Environments” and ever since i’ve always been wishing for better tools. I am glad to see some experimentation come back into this space.

Some old friends are busy making hay in the Node.js and Javascript communities, and it probably horrifies theme that I have ClojureScript as a theme, but so be it. I went to two ClojureScript talks. One was David Nolen’s ClojureScript: Better Semantics at Low Prices!, which was really state of the union of ClojureScript. The second was Kevin Lynagh’s Building visual data driven UI’s with ClojureScript. Visualization is becoming more and more important and ClojureScript’s C2 library look really appealing.

It’s fitting that the last them should be Javascript. Well, maybe. I went to two Javascript talks, and both of them were keynotes, so I didn’t actually choose them. But Javascript is so important these days that it really is a theme. In fact, it’s so much of a theme, that I’ve been going to Javascript conferences for the last 2 years. It’s been several years since I saw Lars Bak speak. His talk on Pushing the Limits of Web Browsers was in two parts. Or so I think. I arrived just as he was finishing the first part which seemed like an account of the major things that the V8 team has learned during their amazing journey of speeding up Javascript. The second part of his talk was about Dart. I didn’t know that Bak was the lead of the Dart project, but that doesn’t change how I feel about Dart. I see the language, I understand the rationale, and I just can’t get excited about it.

I’ve been to enough of those Javascript only talks to hear Brendan Eich talk about The State of Javascript. Brendan opened by giving a brief history of how Javascript got to be the way it is, and then launched into a list of the improvement coming in EcmaScript 6 (ES6). That was all well and good, and towards the end, after the ES6 stuff, he threw in some items that were new, like the sweet.js hygienic macro project, and the lljs typed JavaScript project. It seemed like this was a good update for this audience, who seemed unaware of all the goings on over in JavaScript land. From a PLT point of view, I guess that’s understandable, but at the same time, JavaScript is too important to ignore.

Final Thoughts

Strange Loop has grown to over 1000 people, much larger than when I attended in 2010 (I had to miss 2011). I think that Alex Miller is doing a great job of running the conference, and of finding interesting and timely speakers. This was definitely the best conference that I attended this year, and probably the last 2-3 years as well.

If you’re looking for more information on what happened at Strange Loop 2012:

Slides: https://github.com/strangeloop/strangeloop2012/tree/master/slides

Other Strange Loop Coverage: https://github.com/strangeloop/strangeloop2012/wiki/Coverage

Strata 2012

Leave a reply

Here’s a roundup of last week’s Strata conference.

Jumpstart

This year, the O’Reilly team introduced a new tutorial day track, called “Jumpstart”. This track was more oriented towards the business side of big data, and I think that the word MBA actually appeared in the marketing. I think that the track was a success, and was very appropriate. The effect of the next generation of data oriented technologies and applications is going to be very significant, and will have a big impact on the way that business operate. It’s very important that technologists and business people work closely in order to produce the best results.

There were two talks that stood out for me. The first was Avinash Kaushik’s What Marketers can learn from Analysis. Kaushik is a very entertaining and dynamic speaker, and he has had a lot of experience working to help companies use analytics effectively. In his world, processing and storage is 10% of what you need, and analysts – humans are the other 90%. In other words, technology is not nearly as important as having people who can ask the right questions and verify hypotheses experimentally. And even good analysis is not enough. Organizations must be able to act on the results of analysis. I have been (and will continue to be) interested in the ability to use data as quickly as it is collected. Some people call this a “real-time” data capability, although in computer science terms, this is a misnomer. One of the best quotes from Kaushik’s talk was “If you do not have the capacity to take real time action, why do we need real time data?”. Without the ability to act, all the data collection and analysis in the world is fruitless. Kaushik’s claim was that we must remove all humans from the process in order to achieve this. Back to analysis, Kaushik feels that the three key skills of data analysis are: the scientific method, design of experiments, and statistical analysis.

The second talk was 3 Skills of a Data Driven CEO by Diego Saenz. I liked his notion that a company’s data is a raw material, just like any other raw material that might be used by a company. Raw materials must be collected, mined, purifed, and transformed before they can turn into a product, and so with a company’s data. The most important information that I got out of this talk was the case study that he presented on the Bob McDonald, the CEO of Proctor and Gamble. P&G has built a business wide real time information system called Business Sphere. One manifestation of Business Sphere is a pair of 8 foot high video screens that sit in the conference room used by the CEO for his regular staff meeting. Real time data on any aspect of the company’s operations can be displayed on these screens, discussed and acted upon at the CEO staff level. Also of note is that a data analyst attends the CEO staff meeting in order to facilitate discussion and questions about the data. I remember back in the 2000’s when Cisco talked about how they could close their books in a day. Now we have the worlds largest consumer products company with a real time data dashboard in the CEO’s conference room. The bar is being raised on all companies in all industries.

Talks

I felt that the talks In the regular conference were weaker than last year. Part of that may be due to my talk selection – there were lots of tracks, and in some cases it was hard to figure out which talks to pick. I tend to seek out unusual content, which means more risk in terms of a “quality” talk. The advent of the O’Reilly all access path has taken some of the risk out, since that pass gives you access to the full video archive of the entire conference. The topic of video archives is probably content for another blog post. I know that there are some talks that I missed that I want to watch the videos for, but apparently, I’ll need to wait several weeks. It will be interesting to contrast that with this week’s mostly volunteer run PyCon, which has a great track record of getting all their videos up on the web during the conference, for no fee.

Talks which were easy to remember included Sam Shah’s Collaborative Filtering with MapReduce, which included a description of how to implement collaborative filtering on Hadoop, but more importantly discussed many of the issues around building a production worthy version of such a system. It’s one thing the implement a core algorithm. It’s another to have all the rest of the infrastructure so that the algorithm can be used for production tasks.

A large portion of the data the people are interested in analyzing is coming from social networks. I attended Marcel Salathé’s Understanding Social Contagion in the hopes of gaining some greater insight into virality. Salathé works at an infectious disease center and he spent a long time comparing biological contagion with internet virality. I didn’t find this to be particularly enlightening. However, in the last third of the talk, he started talking about some of the experimental work that his group had done, which was a little more interesting. The code for his system is available on github.

I really enjoyed DJ Patil’s talk Data Jujitsu: The Art of Turning Data into Product. According to Patil, data jujitsu is using data elements in an iterative way to solve otherwise impossible data problems. A lot of his advice had to do with starting small and simple, and moving problems to where they were easiest to solve, particularly in conjunction with human input. As an example, he discussed the problem of entity resolution in one of the LinkedIn products, and described how they moved the problem from the server side, where it was hard, to the client side, where it was easy if you asked the user a simple question. The style he discussed was iterative, opportunistic, and “lazy”.

Jeremy Howard from Kaggle talked about From Predictive Modelling to Optimization: The Next Frontier. Many companies are now building a lifetime value model of a customer, and some companies are even starting to build predictive models. Howard’s claim was that the next steps in the progression are take these models and use them to build simulations. Once we have simulations, we can then use optimization algorithms on the inputs to the simulation, and optimize the results in the direction

Keynotes

Last year, I was pretty unhappy with a number of the keynotes, which were basically vendor pitches. This year things were much better, although there were one or two offenders. Microsoft was NOT one of the offenders. Dave Campbell’s Do We Have The Tools We Need To Navigate The New World Of Data? was one of the better Microsoft keynotes that I’ve seen at an O’Reilly event (i.e. out of the Microsoft ecosystem). The talk included good non-Microsoft specific discussion of the problems, references to academic papers (each with at least one Microsoft author), and a friendly, collegial, non-patronizing tone. I hope that we’ll see more of this from Redmond.

Avinash Kaushik had a keynote spot, and one of the most entertaining, but insightful slides was an infamous quote from Donald Rumsfeld

[T]here are known knowns; there are things we know we know.

We also know there are known unknowns; that is to say we know there are some things we do not know.

But there are also unknown unknowns – there are things we do not know we don’t know.

Kaushik was very keen on “unknown unknowns”. These are the kind of things that we are looking to find, and which analytics and big data techniques might actually help discover. He demonstrating a way of sorting data which leaves out the extremes, and leaves the rest of the data, which is likely where the unknown unknowns are hiding.

I’ve been a fan of Hal Varian ever since I read his book “Information Rules: A Strategic Guide to the Network Economy” back during the dot-com boom. One the one hand, his talk Using Google Data for Short-term Economic Forecasting, was basically a commercial for Google Insights for Search. On the other hand, the way that he used it and showed how it was pretty decent for economic data was interesting. There were several talks that included the use of Google Insights for Search. It’s a tool that I’ve never paid much attention to, but I think that I’m going to rectify that.

The App

This is the first O’Reilly conference I’ve attended where they had a mobile app. There were iPad, iPhone, and Android versions. I only installed the iPad version, and I really liked it. I used it a lot when I was sitting in sessions to retrieve information about speakers, leave ratings and so forth. I’d love to see links to supplemental materials appear there. I also liked the fact that the app synced to the O’Reilly site, so that my personal schedule was reflected there. I didn’t like the fact that the app synced to the O’Reilly website because the WiFi at the conference was slow, and I often found myself waiting for those updates to finish before I could use the app. The other interesting thing was that I preferred the daily paper schedule when I was walking the hall between sessions. Part of this was due to having to wait for those updates, but part of it was that there was no view in the app that corresponded to the grid/track view of the paper schedule. More work to do here, but a great start.

Final thoughts

This year’s attendance was over 2300, up from 1400 last year, and I saw badges from all sorts of companies. It is apparent to me that the use of data and analytics being discussed at Strata is going to be the new normal for business.

Strata 2011

11 Replies

I spent three days last week at O’Reilly’s Strata Conference. This is the first year of the conference, which is focused on topics around data. The tag line of the conference was “Making Data Work”, but the focus of the content was on “Big Data”.

The state of the data field

Big Data as a term is kind of undefined in a “I’ll know it when I see it” kind of way. As an example,I saw tweets asking how much data one needed to have in order to qualify as having a Big Data problem. Whatever the complete meaning is, if one exists, there is a huge amount of interest in this area. O’Reilly planned for 1200 people, but actual attendance was 1400, and due to the level of interest, there will be another Strata in September 2011, this time in New York. Another term that was used frequently was data science, or more often data scientists, people who have a set of skill that make them well suited to dealing with data problems. These skills include programming, statistics, machine learning, and data visualization, and depending on who you ask, there will be additions or subtractions from that list. Moreover, this skill set is in high demand. There was a very full job board, and many presentations ended with the words “we’re hiring”. And as one might suspect, the venture capitalists are sniffing around — at the venture capital panel, one person said that he believed there was a 10-25 year run in data problems and the surrounding ecosystem.

The Strata community is a multi disciplinary community. There were talks on infrastructure for supporting big data (Hadoop, Cassandra, Esper, custom systems), algorithms for machine learning (although not as many as I would have liked), the business and ethics of possessing large data sets, and all kinds of visualizations. In the executive summit, there were also a number of presentations from traditional business intelligence, analytics, and data warehousing folks. It is very unusual to have all these communities in one place and talking to each other. One side effect of this, especially for a first time conference, is that it is difficult to assess the quality of speakers and talks. There were a number of talks which had good looking abstracts, but did not live up to those aspirations in the actual presentation. I suspect that it is going to take several iterations to identify the the best speakers and the right areas – par for a new conference in a multidisciplinary field.

General Observations

I did my graduate work in object databases, which is a mix of systems, databases, and programming languages. I also did a minor in AI, although it was in the days before machine learning really became statistically oriented. I’m looking forward to going a bit deeper into all these areas as I look around in the space.

One theme that appeared in many talks was the importance of good, clean data. In fact, Bob Page from eBay showed a chart comparing 5 different learning algorithms, and it was clear that having a lot of data made up for differences in the algorithms, making the quality and volume of the data more important than the details of the algorithms being used. That’s not to say that algorithms are unimportant, just that high quality data is more important. It seems obvious that having access to good data is really important.

Another theme that appeared in many talks was the combination of algorithms and humans. I remember this being said repeatedly in the panel on predicting the future. I think that there’s a great opportunity in figuring out how to make the algorithm and human collaboration work as pleasantly and efficiently as possible.

There were two talks that at least touched on building data science teams, and on Twitter it seemed that LinkedIn was viewed as having one of the best data science teams in the industry. Not to take anything away from the great job that the LinkedIn folks are doing, or the importance of helping people find good jobs, but I hope that in a few years, we are looking up to data science teams from healthcare, energy, and education.

It amused me to see tweets and have discussions on the power of Python as a tool in this space. With libraries like numpy, scipy, nltk, and scikits.learn, along with an interactive interpreter loop, Python is well suited for data science/big data tasks. It’s interesting to note that tools like R and Incanter have similar properties.

There were two areas that I am particularly interested in, and which I felt were somewhat under represented. The issue of doing analysis in low latency / “realtime” scenarios, and the notion of “personal analytics” (analytics around a single person’s data). I hope that we’ll see more on these topics in the future.

The talks

As is the case nowadays, the proceedings from the conference are available online in the form of slide decks, and in some cases video. Material will probably continue to show up over the course of the next week or so. Below are some of the talks I found noteworthy.

Day 1

I spent the tutorial day in the Executive Summit, looking for interesting problems or approaches that companies are taking with their data efforts. There were two talks that stood out to me. The first was Bob Page’s talk Building the Data Driven Organization, which was really about eBay. Bob shared from eBay’s experience over the last 10 years. Probably the most interesting thing he described was an internal social network like tool, which allowed people to discover and then bookmark analytics reports from other people.

Marilyn and Terence Craig presented Retail: Lessons Learned from the First Data-Driven Business and Future Directions, which was exactly how it sounded. It’s conventional wisdom among Internet people that retail as we know it is dead. I came away from this talk being impressed by the problems that retail logistics presents, and by how retail’s problems are starting to look like Internet problems. Or is that vice versa?

Day 2

The conference proper started with the usual slew of keynotes. I’ve been to enough O’Reilly conferences to know that some proportion of the keynotes are given in exchange for sponsorships, but some of the keynotes were egregiously commercial. The Microsoft keynote included a promotional video, and the EnterpriseDB keynote on Day 3 was a bald faced sales pitch. I understand that the sponsors want to get value for the money they paid (I helped sponsor several conferences during my time at Sun). The sponsors should look at the twitter chatter during their keynotes to realize that these advertising keynotes hurt them far more than they help them. Before Strata, I didn’t really know anything about EnterpriseDB except that they had something to do with Postgres. Now I all I know is that they wasted a bunch of my time during a keynote spot.

Day 2 was a little bit light on memorable talks. I went to Generating Dynamic Social Networks from Large Scale Unstructured Data which was in the vendor presentation track. Although I didn’t learn much about the actual techniques and technologies that were used, I did at least gain some appreciation for the issues involved. The panel Real World Applications Panel: Machine Learning and Decision Support only had two panelists. Jonathan Seidman and Robert Lancaster from Orbitz described how they use learning for sort optimization, intelligent caching, and personalization/segmentation, and Alasdair Allan from the University of Exeter described the use of learning and multiagent systems to control networks telescopes at observatories around the world. The telescope control left me with a vaguely SkyNet ish feeling. Matthew Russell has written a book called Mining the Social Web. I grabbed his code off of github and it looked interesting, so I dropped into his talk Unleashing Twitter Data for Fun and Insight. He’s also written 21 Recipes for Mining Twitter, and the code for that is on github as well.

Day 3

Day 3 produced a reprieve on the keynote front. Despite the aforementioned horrible EnterpriseDB keynote, there were 3 very good talks. LinkedIn’s keynote on Innovating Data Teams was good. They presented some data science on the Strata attendees and described how they recruited and organized their data team. They did launch a product, LinkedIn Skills, but it was done in such a way as to show off the data science relevant aspects of the product.

Scott Yara from EMC did a keynote called Your Data Rules the World. This is how a sponsor keynote should be done. No EMC products were promoted, and Scott did a great job of demonstrating a future filled with data, right down to still and video footage of him being stopped for a traffic violation. The keynote provoked you to really thing about where all this is heading, and what some of the big issues are going to be. I know that EMC make storage and other products. But more than that, I know that they employ Product Management people who have been thinking deeply about a future that is swimming with data.

The final keynote was titled Can Big Data Fix Healthcare?. Carol McCall has been working on data oriented healthcare solutions for quite some time now, and her talk was inspirational and gave me some hope that improvements can happen.

Day 3 was the day of the Where’s the Money in Big Data? panel, where a bunch of venture capitalists talked about how they see the market and where it might be headed. It was also the day of two really good sessions. In Present Tense: The Challenges and Trade-offs in Building a Web-scale Real-time Analytics System, Ben Black described Fast-IP’s journey to build a web-scale real-time analytics system. It was an honest story of attempts and failures as well as the technical lessons that they learned after each attempt. This was the most detailed technical talk I attended, although the terms distributed lower dimensional cuboid and word-aligned bitmap index were tossed around, but not covered in detail. It’s worth noting that Fast-IP’s system and Twitter’s Analytics system, Rainbird, are both based, to varying degrees, on Cassandra.

I ended up spending an extra night in San Jose so that I could stay for Predicting the Future: Anticipating the World with Data, which was in the last session block of the conference. I think that it was worth it. This was a panel format, but each panelist was well prepared. Recorded Future is building a search engine that uses the past to predict the future. They didn’t give out much of their secret sauce, but they did say that they have built a temporally based index as opposed to a keyword based one. Unfortunately their system is domain specific, with finance and geopolitics being the initial domains. Palantir Technologies is trying to predict terrorist attacks. In the abstract, this means predicting in the face of an adaptive adversary, and in contexts like this, the key is to stop thinking in terms of machine learning and start thinking in terms of game theory. It seems like there’s a pile of interesting stuff in that last statement. Finally, Rion Snow from Twitter took us through a number of academic papers where people have successfully made predictions about box office revenue, the stock market, and the flu, just from analyzing information available via Twitter. I had seen almost all of the papers before, but it was nice to feel that I hadn’t missed any of the important results.

Talks I missed but had twitter buzz

You can’t go to every talk at a conference (nor should you, probably), but here are some talks that I missed, but which had a lot of buzz on Twitter. MAD Skills: A Magnetic, Agile and Deep Approach to Scalable Analytics – the hotness of this talk seemed related more to the DataWrangler tool (for cleansing data) than the MAD library (scalable analytics engine running inside Postgres) itself. Big Data, Lean Startup: Data Science on a Shoestring seemed like it had a lot of just good commonsense about running in a startup in addition to know how to do data science without doing overkill. Joseph Turian’s New Developments in Large Data Techniques looked like a great talk. His slides are available online, as well as the papers that he referenced. It seemed like the demos were the topic of excitement in Data Journalism: Applied Interfaces, given jointly by folks from ReadWriteWeb, The Guardian, and The New York Times. Rainbird is Twitter’s analytics system, which was described in Real-time Analytics at Twitter. Notable news on that one is that Twitter will be open sourcing Rainbird once the requisite version of Cassandra is released.

Evening activities

There were events both evenings of the show, which made for very long days. On Day 1 there was a showcase of various startup companies, and on Day 2, there was a “science fair”. In all honesty, the experience was pretty much the same both nights. Walk your way around some tables/pedestals, and talk to people who are working on stuff that you might think is cool. The highlights for me were:

Jeremie Miller’s Locker project (github) (written on Node.js)
Impure, a visual tool for building data visualizations – here are some Strata/Twitter visualizations built with Impure
DrawnToScale, a big data database being developed here in Seattle

Links

Here is a bunch of miscellaneous interesting links from the conference:

Data BootCamp slides
The book Elements of Statistical Learning (in PDF form, too)
The paper The Sensemaking Process and Leverage Points for Analyst Technology as Identified Through Cognitive Task Analysis, recommended by Mark Madsen during his keynote
The book The Fourth Paradigm: Data-Intensive Scientific Discovery (in PDF form, too), recommended by Werner Vogels during his keynote
The $3M Heritage Health Prize to develop an algorithm to predict and prevent unnecessary hospitalizations
Public domain government information
All data covered by the Guardian since 2009
40 Fascinating blogs for the Ultimate Statistics Geek
A List of Data Markets

Tweet Mining

Finally, no conference on data should be without it’s own Twitter exhaust. So I’ll leave you with some analysis and visualizations done on the tweets from Strata.