The Complete File System/ File System Virtualization Using Python.notes

Thursday, March 24, 2005

TITLE OF PAPER: The Compete File System: File System Virtualization Using Python
URL OF PRESENTATION: _URL_of_powerpoint_presentation_
PRESENTED BY: Christopher Gillett
REPRESENTING: Compete, Inc.

CONFERENCE: PyCon 2005
DATE: Thursday, March 24, 2005
LOCATION: GWU Cafritz Conference Center, Grand Ballroom

--------------------------------------------------------------------------
REAL-TIME NOTES / ANNOTATIONS OF THE PAPER:
{If you've contributed, add your name, e-mail & URL at the bottom}

Compete Inc.
    How to drive advertising, marketing, sales
    Analyzing lots of data
    Need to manage lots of files

CFS: Compete File System
    Application level (runs in user space)
    Uses MySQL with Python
        A few seconds per transaction
        But jobs run for hours, so it doesn't matter
        Still it may matter for real-time/fast applications

File System Virtualization
    Mapping multiple and potentially disparate file systems into a monolithic view such that applications, scripts, etc. need not know the physical location of the files that are being manipulated.
    Common directory structure for all participating file systems
    CFS uses a scheduler to select the "next" device
    Database handles logical file name mapping to physical filenames
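
The mapping idea above can be sketched in a few lines. This is an illustration only, not CFS code: sqlite3 stands in for MySQL so the example runs anywhere, and the table name, column names, function names, and mount points are all assumptions. Because every participating file system shares one directory structure, the sketch stores only the device root for each logical name and derives the physical path from it.

```python
import posixpath
import sqlite3

def create_catalog(conn):
    # Hypothetical schema: one row per logical file, naming the device root.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " logical TEXT PRIMARY KEY,"
        " device_root TEXT NOT NULL)"
    )

def register(conn, logical, device_root):
    # Record which physical file system holds this logical file.
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO files (logical, device_root) VALUES (?, ?)",
            (logical, device_root),
        )

def resolve(conn, logical):
    # Return the physical path for a logical name, or raise KeyError.
    row = conn.execute(
        "SELECT device_root FROM files WHERE logical = ?", (logical,)
    ).fetchone()
    if row is None:
        raise KeyError(logical)
    # Same directory structure on every device: root + logical path.
    return posixpath.join(row[0], logical.lstrip("/"))

conn = sqlite3.connect(":memory:")
create_catalog(conn)
register(conn, "/reports/2005/q1.dat", "/mnt/nfs03")
print(resolve(conn, "/reports/2005/q1.dat"))  # /mnt/nfs03/reports/2005/q1.dat
```

Applications only ever see the logical name; moving a file to another device would mean updating one row rather than every script that reads it.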

End User perspective
    Has to integrate cleanly with Unix scripts
        We solved this using the cfsopen command
        cfsopen maps logical filenames to physical filenames for use in scripts
    Must have a programmable API
        Compete Data Access Layer manages many different data sources
            Wrap CFS functionality behind CDAL
            Then users can use CDAL as they always do.
        A module maps virtual filenames to real filenames; all file
        opening and closing is done through this module
    Users must be able to catalog and delete files as needed
        cfsls and cfsrm commands
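
A rough picture of what the end-user operations above might look like from Python. The class TinyCFS and its methods are hypothetical stand-ins for the cfsopen, cfsls and cfsrm behaviour described in the talk, not the real CFS or CDAL interface; the catalog is a plain dictionary and new files always land on the first root, since device selection is the scheduler's job.

```python
import os
import tempfile

class TinyCFS:
    """Toy stand-in for a CFS-style access module (not the real CFS API)."""

    def __init__(self, roots):
        self.roots = list(roots)  # physical file system roots
        self.catalog = {}         # logical name -> physical path

    def physical_path(self, logical):
        # What a cfsopen-like command would hand back to a shell script.
        return self.catalog[logical]

    def open(self, logical, mode="r"):
        # Open a logical file; unknown names are created unless reading.
        if logical not in self.catalog:
            if mode.startswith("r"):
                raise FileNotFoundError(logical)
            path = os.path.join(self.roots[0], logical.lstrip("/"))
            os.makedirs(os.path.dirname(path), exist_ok=True)
            self.catalog[logical] = path
        return open(self.catalog[logical], mode)

    def ls(self, prefix="/"):
        # Catalog listing by logical name, like a cfsls-style command.
        return sorted(name for name in self.catalog if name.startswith(prefix))

    def rm(self, logical):
        # Drop the catalog entry and delete the physical file.
        os.remove(self.catalog.pop(logical))

root = tempfile.mkdtemp()
cfs = TinyCFS([root])
with cfs.open("/logs/a.txt", "w") as f:
    f.write("hello\n")
with cfs.open("/logs/a.txt") as f:
    content = f.read()
listing = cfs.ls("/logs")
path = cfs.physical_path("/logs/a.txt")
cfs.rm("/logs/a.txt")
```

Keeping all opens and closes behind one module is what lets the physical layout change without touching callers.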

Compete File System architecture
    Scheduler
        Which real filesystem to use next?
    Database manager
        Maps virtual filenames to real filenames
        Two of these, since replication wasn't available at the time
    Straightforward implementation
        Just map filenames
        Allow access to CFS file as "normally" as possible for end users
        Stateless model of sorts:
            No daemons to worry about
            Better portability for CFS
            Simple code
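
The notes say only that there were two database managers because replication was not available; they do not say how the two were kept in step. One plausible reading, sketched here purely as an assumption, is that each stateless CFS command writes every mapping to both databases and reads from whichever one answers. sqlite3 again stands in for MySQL, and all function and table names are invented for the example.

```python
import sqlite3

def open_catalog():
    # Each call creates an independent in-memory catalog copy.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (logical TEXT PRIMARY KEY, physical TEXT NOT NULL)"
    )
    return conn

def register_everywhere(catalogs, logical, physical):
    # Write the same mapping to every catalog copy.
    for conn in catalogs:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?)",
                (logical, physical),
            )

def resolve_anywhere(catalogs, logical):
    # Ask each copy in turn; the first one that answers wins.
    for conn in catalogs:
        try:
            row = conn.execute(
                "SELECT physical FROM files WHERE logical = ?", (logical,)
            ).fetchone()
        except sqlite3.Error:
            continue  # this copy is unavailable, try the next one
        if row is not None:
            return row[0]
    raise KeyError(logical)

primary, secondary = open_catalog(), open_catalog()
catalogs = [primary, secondary]
register_everywhere(catalogs, "/logs/a.txt", "/mnt/nfs01/logs/a.txt")
primary.close()  # simulate losing one database
print(resolve_anywhere(catalogs, "/logs/a.txt"))  # /mnt/nfs01/logs/a.txt
```

Since no process holds state between calls, there is no daemon to restart after a failure; the surviving database is the whole state.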

Scheduling and Load Balancing
    Version 1
        Multiple file systems on individual NFS servers
        Assume all file systems are about the same size
        Used a round robin approach
        Worked OK, but the mix of large and small files caused problems
    Version 1.1
        Best-fit scheduler: selects a device based on file size
        Problem with many small/temp files that grow after placement, so the
            files are not distributed very well
    Future
        Predictive Scheduler
            Score-boarding file system sizing
            Predicts sizes of new files based on average size in directory
            Allows applications to pre-allocate space as a hint to CFS
            Open file reaper tracks implicit file closes and updates stats
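
The scheduling ideas above can be compared side by side in a small sketch. The Device class, the function names, and the byte counts are invented for illustration, and "best fit" is taken in its textbook sense (the device with the least leftover space that still holds the file); the notes do not spell out the exact rule CFS used.

```python
import itertools
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    free_bytes: int

def round_robin(devices):
    # Version 1 style: hand out devices in a fixed rotation, ignoring size.
    return itertools.cycle(devices)

def best_fit(devices, size):
    # Version 1.1 style: the device with the least slack that still fits.
    candidates = [d for d in devices if d.free_bytes >= size]
    if not candidates:
        raise OSError("no device has room for %d bytes" % size)
    return min(candidates, key=lambda d: d.free_bytes - size)

def predicted_size(sizes_in_directory, default=0):
    # Predictive idea: guess a new file's size from its directory siblings.
    if not sizes_in_directory:
        return default
    return sum(sizes_in_directory) / len(sizes_in_directory)

devices = [Device("nfs01", 500), Device("nfs02", 120), Device("nfs03", 300)]

rotation = round_robin(devices)
names = [next(rotation).name for _ in range(4)]

choice = best_fit(devices, 100)             # a 100-byte file
estimate = predicted_size([100, 200, 300])  # mean of sibling sizes
followup = best_fit(devices, estimate)      # place using the prediction

print(names, choice.name, estimate, followup.name)
```

Best fit decides at creation time, so a temp file that starts at 100 bytes and grows lands on the tightest device; feeding the scheduler a predicted size instead of the initial size is one way to soften that.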

CFS Internal State
    Stateless processes
    State information stored in database
    Losing the CFS Database due to ___

Role of Python
    Easy to think about algorithms, functionality
    Other languages (C, C++) were considered for deployment
        but rejected for the "usual" reasons

Whining and ranting
    Python VM footprint and speed are good, but a compiled language could be better
    I/O in Python is a problem (dealing with files that are hundreds of GBs)

Results of Building and Deploying CFS
    Happy management
    Zero Worries
    Effective storage resource management
    Hundreds of thousands of files across multiple devices
    Multiple terabytes of data under coherent structure



--------------------------------------------------------------------------
CONTRIBUTORS: {add your name, e-mail address and URL below}
Linden Wright <lwright@mac.com>
Abhay Saxena <ark3@email.com>

--------------------------------------------------------------------------
NOTES ON / KEY TO THIS TEMPLATE:
A headline (like a field in a database) will be CAPITALISED
    This differentiates from the text that follows
A variable that you can change will be surrounded by _underscores_
    Spaces in variables are also replaced with under_scores
    This allows people to select the whole variable with a simple double-click
A tool-tip is lower case and surrounded by {curly brackets / parentheses}
    These supply helpful contextual information.

--------------------------------------------------------------------------
Copyright shared between all the participants unless otherwise stated...