TITLE OF SESSION: Query-directed Data Mining using Python and Parallel Processing
NUMBER OF SESSION: _number_of_the_session_here_
PRESENTED BY: _names_of_the_presenters_
CONFERENCE: _name_of_conference_
DATE: _date_and_time_
LOCATION: _location_

-------------------------------------------------------------------------------
REAL-TIME NOTES: {If you've contributed, add your name, e-mail & URL at the bottom}

INTRODUCTION

Compete analyzes web data of consumer trends
  Tera-scale storage requirements
  Want to do ad-hoc research against large data sources

PBS: Portable Batch System
  Running 1.5-2.0 million jobs per year over archived data

Query-Directed Data Mining
  View everything as a database, even things that are not in a database
  SQL or something close to it
  Build language and/or runtime extensions to SQL
  Provide built-in functions to handle situations unique to our data
  Extensibility incorporated into the system from initial design to full realization

Why not just use Oracle/DB2?
  Too expensive for a company of this size
  Competitors have tried, and failed

Major components:
  SQL Language Processor
  Code Generator
  Query Decomposition & job authoring system

Parallelizing queries:
  select * from myTable where date >= 2005-01-01 and date <= 2005-01-31
  This can be rewritten as 31 queries of
    select ...
    where date=one day

Q&A

-------------------------------------------------------------------------------
REFERENCES: {as documents / sites are referenced add them below}

http://www.openpbs.org/

-------------------------------------------------------------------------------
QUOTES: {collect nice quotes from this session's speaker}

-------------------------------------------------------------------------------
CONTRIBUTORS: {add your name, e-mail address and URL below}

-------------------------------------------------------------------------------
E-MAIL BOUNCEBACK: {add your e-mail address separated by commas for easy mailing of this text}

-------------------------------------------------------------------------------
NOTES ON / KEY TO THIS TEMPLATE:
HEADLINES ... have to be CAPITALISED and stand alone in a line to be recognized
  This differentiates from the text that follows
A _variable_ that you can change will be surrounded by _underscores_
  Spaces in variables are also replaced with _under_scores_
  This allows people to select the whole _variable_ with a simple double-click
A {tool-tip} is lower case and surrounded by {curly brackets / parentheses}
  These supply helpful contextual information.
References should be added as [1] [2] and so forth.
An *emphasis* can be put on a word by adding *stars* around it

-------------------------------------------------------------------------------
DISCLAIMER: Copyright shared between all the participants unless otherwise stated...
Generic conference template copyright by Tom Coates, tom@plasticbag.org
Additions and Conference.mode by Dominik Wagner, dom@codingmonkeys.de
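
-------------------------------------------------------------------------------
EXAMPLE: QUERY DECOMPOSITION {illustrative sketch, not code shown in the session}

The date-range decomposition described in the notes (one month-long query rewritten as 31 single-day queries) might look roughly like the Python below. This is a minimal sketch under stated assumptions, not the presenters' implementation: split_by_day, run_query and run_parallel are hypothetical names, run_query is a stand-in that only echoes the query where a real system would submit a batch job (for example through PBS), and quoting the date literal is an assumption about the SQL dialect.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def split_by_day(table, start, end):
    """Return one single-day query per date in [start, end], inclusive."""
    days = (end - start).days + 1
    return [
        f"select * from {table} where date = '{(start + timedelta(days=i)).isoformat()}'"
        for i in range(days)
    ]


def run_query(sql):
    """Hypothetical stand-in for submitting one job; it only echoes the query."""
    return f"ran: {sql}"


def run_parallel(queries, workers=8):
    """Run the per-day queries concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_query, queries))


# The month-long range from the notes, split into independent daily queries.
queries = split_by_day("myTable", date(2005, 1, 1), date(2005, 1, 31))
results = run_parallel(queries)
```

Because each daily query touches a disjoint slice of the data, the jobs share no state and can be scheduled in any order; the per-day results only need to be concatenated afterwards.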