Query-directed Data Mining using Python and Parallel Processing.notes

Thursday, March 24, 2005

TITLE OF SESSION: Query-directed Data Mining using Python and Parallel Processing
NUMBER OF SESSION: _number_of_the_session_here_
PRESENTED BY: _names_of_the_presenters_

CONFERENCE: _name_of_conference_
DATE: _date_and_time_
LOCATION: _location_

-------------------------------------------------------------------------------
REAL-TIME NOTES:
{If you've contributed, add your name, e-mail & URL at the bottom}

INTRODUCTION

Compete analyzes web data on consumer trends
   Tera-scale storage requirements
   Want to do ad-hoc research against large data sources
   
PBS: Portable Batch System

Running 1.5-2.0 million jobs per year over archived data

Query-Directed Data Mining
    View everything as a database, even things that are not in a database
    SQL or something close to it
    Build language and/or runtime extensions to SQL
        Provide built-in functions to handle situations unique to our data
        Extensibility built into the system from initial design through full implementation
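
Editor's sketch (not from the talk): one way to picture "runtime extensions to SQL" is registering a Python function so that queries can call it by name. The speakers' own SQL processor was not shown in these notes, so the standard-library sqlite3 module is used here purely as a stand-in. The table `visits`, the function name `tld`, and the helper `top_level_domain` are all hypothetical.

```python
import sqlite3


def top_level_domain(host):
    """Hypothetical domain-specific helper: last label of a host name."""
    if host is None:
        return None
    return host.rsplit(".", 1)[-1].lower()


# In-memory database used only for illustration.
conn = sqlite3.connect(":memory:")

# Expose the Python function to SQL under the (assumed) name "tld".
conn.create_function("tld", 1, top_level_domain)

conn.execute("create table visits (host text, hits integer)")
conn.executemany(
    "insert into visits values (?, ?)",
    [("www.example.com", 3), ("python.org", 5), ("docs.python.org", 2)],
)

# The query calls the extension function like any built-in.
rows = conn.execute(
    "select tld(host) as t, sum(hits) from visits group by t order by t"
).fetchall()
print(rows)
```

The point is only that the query text stays SQL-like while data-specific logic lives in ordinary Python functions.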

Why not just use Oracle/DB2?
    Too expensive for a company of this size
    Competitors have tried and failed

Major components:
    SQL Language Processor
    Code Generator
    Query Decomposition & job authoring system
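
Editor's sketch (not from the talk): a job authoring step could turn each decomposed sub-query into a PBS batch script. The `#PBS` directives and `$PBS_O_WORKDIR` below are standard PBS, but the `run_query` command, the job naming scheme, and the default walltime are assumptions made for illustration; the notes do not describe the real system at this level.

```python
import shlex


def pbs_job_script(job_name, command, walltime="01:00:00"):
    """Build the text of a PBS batch script that runs one shell command."""
    lines = [
        "#!/bin/sh",
        "#PBS -N %s" % job_name,          # job name shown by the scheduler
        "#PBS -l walltime=%s" % walltime,  # requested wall-clock limit
        "#PBS -j oe",                      # merge stderr into stdout
        "cd $PBS_O_WORKDIR",               # directory the job was submitted from
        command,
    ]
    return "\n".join(lines) + "\n"


def author_jobs(sub_queries, runner="run_query"):
    """Return one job script per sub-query.

    `runner` is a hypothetical command-line program that executes a query.
    """
    return [
        pbs_job_script("q%03d" % i, "%s %s" % (runner, shlex.quote(q)))
        for i, q in enumerate(sub_queries)
    ]


scripts = author_jobs(["select 1", "select 2"])
print(scripts[0])
```

Each script would then be handed to the scheduler (for example with `qsub`), which is what lets many small queries run side by side.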

Parallelizing queries:
   select * from myTable where date >= 2005-01-01 and date <= 2005-01-31
   This can be rewritten as 31 independent queries, one per day: select ... where date = <one day>
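
Editor's sketch (not from the talk): the date-range rewrite above can be pictured as generating one single-day query per date and running them concurrently. The function names `split_by_day` and `run_parallel`, the quoted date literals, and the use of a thread pool are assumptions for illustration; the session ran its sub-queries as batch jobs rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def split_by_day(table, start, end):
    """Return one single-day query per date in the inclusive range."""
    if end < start:
        return []
    n_days = (end - start).days + 1
    return [
        "select * from %s where date = '%s'"
        % (table, (start + timedelta(days=i)).isoformat())
        for i in range(n_days)
    ]


def run_parallel(queries, run_query, workers=4):
    """Run each query with `run_query` (caller-supplied) and keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_query, queries))


queries = split_by_day("myTable", date(2005, 1, 1), date(2005, 1, 31))
print(len(queries))
print(queries[0])

# Stand-in executor: a real one would send the query to the data source.
results = run_parallel(queries, lambda q: q.rsplit("'", 2)[1])
```

Because each single-day query touches a disjoint slice of the data, the results can simply be concatenated in order to reproduce the original range query.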
   



Q&A


-------------------------------------------------------------------------------
REFERENCES: {as documents / sites are referenced add them below}
http://www.openpbs.org/

-------------------------------------------------------------------------------
QUOTES: {collect nice quotes from this session's speaker}

-------------------------------------------------------------------------------
CONTRIBUTORS: {add your name, e-mail address and URL below}



-------------------------------------------------------------------------------
E-MAIL BOUNCEBACK:
{add your e-mail address separated by commas for easy mailing of this text}



-------------------------------------------------------------------------------
NOTES ON / KEY TO THIS TEMPLATE:
HEADLINES
    ... have to be CAPITALISED and stand alone in a line to be recognized
    This differentiates from the text that follows
A _variable_ that you can change will be surrounded by _underscores_
    Spaces in variables are also replaced with _under_scores_
    This allows people to select the whole _variable_ with a simple double-click
A {tool-tip} is lower case and surrounded by {curly brackets / parentheses}
    These supply helpful contextual information.
References should be added as [1] [2] and so forth.
An *emphasis* can be put on a word by adding *stars* around it


-------------------------------------------------------------------------------
DISCLAIMER:
Copyright shared between all the participants unless otherwise stated...
Generic conference template copyright by Tom Coates, tom@plasticbag.org
Additions and Conference.mode by Dominik Wagner, dom@codingmonkeys.de