TITLE OF PAPER: The Natural Language Toolkit
URL OF PRESENTATION: _URL_of_powerpoint_presentation_
PRESENTED BY: Ed Loper
REPRESENTING: _name_of_the_company_they_represent_
CONFERENCE: PyCON 2004
DATE: 20040326
LOCATION: _venue_and_room_in_venue_
--------------------------------------------------------------------------
REAL-TIME NOTES / ANNOTATIONS OF THE PAPER:
{If you've contributed, add your name, e-mail & URL at the bottom}

Toolkit used to teach students about NLP.
    UPenn (2001)
    Used at 8 other universities
    GPL

Try to reduce the amount of programming knowledge needed to work on NLP
    graduate-level NLP course

NLTK Uses
    course assignments
        use an existing module to explore an algorithm or perform an experiment
        combine modules to form a complete system
    class demonstrations
        tedious algorithms come to life with online demonstrations
        interactive demos allow live topic explorations
    advanced projects
        implement new algorithms
        add new functions
    since it takes the tedium out of showing and implementing the core functions of NLP, it really speeds up instruction

Design goals
    requirements
        ease of use, consistency, extensibility, documentation, simplicity, modularity
    non-requirements
        comprehensiveness, efficiency, cleverness

Why Use Python?
    shallow learning curve
    Python code is exceptionally readable
        "executable pseudocode"
    interpreted language
        interactive exploration
        immediate feedback
    extensive standard library
    light-weight o-o system
        useful when it's needed but doesn't get in the way

Design Overview
    flow control is organized around NLP tasks
        examples: tokenizing, tagging, parsing
    each task is defined by an interface
        implemented as a stub class with docstrings
    multiple implementations of each task
        different techniques and algorithms
    different tasks communicate with a token datatype

Pipelines and Blackboards
    pipeline model
        takes a sentence, which a tokenizer breaks into words and then outputs
        each step is a sequential transformation, with information lost between steps
    blackboard model
        a single place (the blackboard) to which each task adds its information
        nothing is removed or lost between the tasks
    advantages of blackboard
        easier to experiment
            tasks can be rearranged
            students can swap in new implementations that have different requirements
            no need to worry about "threading" info through the system
        easier to debug
            we don't throw anything away
        easier to understand
            we build a single unified picture

Tokens
    represent individual pieces of language
        e.g. documents, sentences, and words
    each token consists of a set of properties
        some typical properties: text, part of speech, etc.
    properties
        properties are not fixed or predetermined
            consenting adults, dynamic polymorphism
        properties are mutable
            but typically mutated monotonically, i.e. only add properties; don't delete or modify
        properties can contain/point to other tokens
            e.g. a sentence token's words property

Locations: unique identifiers for tokens
    how many words in this phrase: An African swallow or a European swallow.
        the choices were 3, 6, 7 and 8
    need to distinguish between an abstract piece of language and an occurrence
    create unique identifiers for tokens

Specialized Tokens
    use subclasses of Token to add specialized behaviour
        e.g. ParentedTreeToken adds...
            standard tree operations: height, leaves, etc.
            automatically maintained parent pointers
    all data is stored in properties

Task Interfaces
    each task is defined by an interface
        implemented as a stub base class with docstrings
        conventionally named with a trailing "I"
        used only for documentation
    all interfaces have the same basic form:
        an "action" method monotonically mutates a token

        class ParserI:
            def parse(self, token):
                """Parse token, adding the result to it as new properties."""

Variations on a Theme
    where appropriate, interfaces can define a set of extended action methods:
        action()        basic
        action_n()      outputs the n best solutions
        action_dist()   variant with the solution as a distribution across all solutions
        xaction()
        raw_action()

Building Algorithms
    demo shown: an algorithm for parsing simple nested loops
        generators are used to do the heavy lifting of building the required code to parse the nested loops

No time for Q&A
--------------------------------------------------------------------------
REFERENCES: {as documents / sites are referenced add them below}

PAPER: http://www.python.org/pycon/dc2004/papers/35/nltk/nltk.pdf
PROJECT: http://nltk.sourceforge.net
--------------------------------------------------------------------------
QUOTES:
--------------------------------------------------------------------------
CONTRIBUTORS: {add your name, e-mail address and URL below}
Ted Leung, twl@osafoundation.org, http://www.sauria.com/blog
Mike Taylor, bear@code-bear.com
--------------------------------------------------------------------------
E-MAIL BOUNCEBACK: {add your e-mail address separated by commas }
--------------------------------------------------------------------------
NOTES ON / KEY TO THIS TEMPLATE:
A headline (like a field in a database) will be CAPITALISED
    This differentiates from the text
    that follows
A variable that you can change will be surrounded by _underscores_
    Spaces in variables are also replaced with under_scores
    This allows people to select the whole variable with a simple double-click
A tool-tip is lower case and surrounded by {curly brackets / parentheses}
    These supply helpful contextual information.
--------------------------------------------------------------------------
Copyright shared between all the participants unless otherwise stated...