Python at Google.notes

Friday, March 25, 2005

TITLE OF PAPER: Python at Google
URL OF PRESENTATION:
--not available--
PRESENTED BY: Greg Stein
REPRESENTING: Google

CONFERENCE: PyCon 2005
DATE:
March 25, 2005
LOCATION:
Marvin Theater
--------------------------------------------------------------------------

REAL-TIME NOTES / ANNOTATIONS OF THE PAPER:
{If you've contributed, add your name, e-mail & URL at the bottom}

[ A new copy of the O'Reilly Python Success Stories booklet will be produced
    Contact Stephan Deibel @ pythonology.org ]


"Python has been an important part of Google since the beginning, and remains so as the system grows and evolves. Today dozens of Google engineers use Python, and we're looking for more people with skills in this language."
-- Peter Norvig, Director of Search Quality at Google

My background

    Python developer
        10 years
        Contributed to Python itself
        Authored a number of modules and applications
            ViewCVS
    
    Open Source Guy
        Contributed to numerous projects (including Python)
        Current chairman of the Apache Software Foundation
        ViewCVS, written entirely in Python
        Contributed to Subversion, Apache server

    "We consider Python to be our 'secret sauce'"
    --Paul Everitt, talking about Digital Creations, circa 1996
    This is a recognition of how Python can help a business.


My view of Python in the workplace

    Python at eShop
        1995 "What in the world is Python?"
        1996 "This is great stuff."
        (MS acquired eShop in '96)
        

    Python at Microsoft
        1996: "It's called what?"
        1997: "You actually shipped Python code?" (MerchantServer 1.0)
        1998: "Nice prototype. We'll rewrite it in the next version." And they
            did, in C++.



Python in the workplace (continued)
    Python at CollabNet
        2001: "No, we don't really use Python here." (they used Java)
        2003: "Definitely! Write that in Python"
        
        Python caught on here like a virus, moving from developer to developer.


    Python at Google
        2004 "Of *course* we use Python.  Why wouldn't we?"


Changing attitudes over time
    Small companies eventually "Got it" ahead of the curve
        Champion was needed

    Larger Companies follow Python's growth curve
        Supporting environment was needed

A number of factors made Python possible in larger organizations:
It is now possible. Here's why:
    Python had to grow for it to become "business acceptable"
        Large enough talent pool - "where are we going to be able to find these people?"
        Support services: Books, Consulting, World Wide Web
        Follow the trailblazers
    Python passed the tipping point years ago
        Not a problem to incorporate it into your business: lots of support,
            consulting

Business advantage
    "These are some of the reasons we use Python at Google"
        Highly adaptable
            Changing requirements
            - You need a language that is very flexible, so you can adapt your tools during development
            Changes in computing environment

        Rapid development
            For new and experienced developers

            The market moves very, very quickly; you want to be able to keep up with it. If it takes two years for you to respond to something that is needed today, you're behind the curve.

    Easy to maintain - most important point in Greg's view
        You can come back a year later, look at that code, and understand what
            is going on

Google's programming environment
    Primary Languages
        C++
        Java
        Python

        If you want to write a piece of something else, like Perl, you almost
            have to get special permission. (Exceptions in ops, but for actual
            product stuff, see above)

    
    Miscellaneous
        Some Perl used by Operations (others almost have to get permission to use Perl)
        PHP creeps in for internal webapps
        Saw Ruby sneaking around
        Small amount of C#

        In actual product stuff: C++, Java, Python


SWIG is your friend
    SWIG: Simplified Wrapper and Interface Generator
        www.swig.org
        Started by David Beazley

    Multi-language environment
        A lot of people at Google don't know Python and produce C++ code.
        SWIG pulls these "islands" together -- they have a lot of stuff lying
            around written in various languages. SWIG examines a C++ header file
            and auto-generates Python bindings.
            So all of our libraries - for parsing HTML, crawling HTTP and so
            on - are made available to Python using SWIG.
            Good for Google programmers who use C++ but don't know Python

        Very fast mechanism for integration

    Integrated into build system
        Makes it very easy for us to add a rule into our build system to just add a library into our Python dependency module

Where do we use it?
    Across our internal network
    Across a system lifecycle
    Live Services

Basic Network
    <diagram of development pushing through infrastructure to (1000) servers>

Some usage to support development
    Wrappers for Version control (Perforce) (JB note: Perforce can output
        marshalled Python objects -- very cool, extremely useful for
        scripting. Also see svn SWIG mention in Q&A)
        They improved branch management.
        Running unit tests on checkin
        People "earn" their ability to check in after they understand code
            guidelines, etc.
        Automatically enforce style guidelines
    Build System (itself written in Python)
    Packaging
        We've got giant bundles of code and giant bundles of data which need to
            be delivered up to the servers.
        Packaging system is built in Python
        Third generation of this system
        Ability to roll back a version
        We can keep iterating and moving forward because we're building all this stuff in Python
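
The Perforce note above mentions marshalled Python objects. As a minimal sketch of how such output could be consumed (assuming a stream of back-to-back marshalled dictionaries, which is what `p4 -G` is documented to emit), the loop below reads records until the stream ends. The helper name read_marshalled_records and the sample records are made up for illustration; this is not Google's wrapper code. With a real p4 process under Python 3, keys and values may arrive as bytes rather than str.

```python
import io
import marshal


def read_marshalled_records(stream):
    """Yield each marshalled object from a binary stream until it is exhausted."""
    while True:
        try:
            yield marshal.load(stream)
        except EOFError:
            # marshal.load raises EOFError when no further object is available.
            return


# Stand-in for the output of a command such as `p4 -G changes`.
# The field names and values here are hypothetical.
fake_output = io.BytesIO(
    marshal.dumps({"code": "stat", "change": "1201", "user": "alice"})
    + marshal.dumps({"code": "stat", "change": "1202", "user": "bob"})
)

records = list(read_marshalled_records(fake_output))
changes = [record["change"] for record in records]
```

Because every record is already a dictionary, a script can filter or aggregate the results directly instead of parsing text output.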


Some usage in the network infrastructure
    Binary/data pusher
        Figures out best way to send stuff from one place to
            another -- dev to data center, etc

        We're on the third/fourth generation of this, and keep increasing the scale of
            the problem. Python's making that possible - able to iterate quickly
    Package repository


Some usage on production servers
    Monitoring
        Is this thing still alive? Is it running? Does it think it's healthy? Is
            it seeing problems with the hard disk? Is the CPU temperature fine?
        All of this information is gathered with a little Python program running on the server, then collected by another Python program.

    Auto-restart
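
The gather-then-collect pattern described under Monitoring can be sketched roughly as follows. This is a hypothetical illustration only: the function names, the report fields, and the 5% free-disk threshold are all assumptions, and the talk gave no details of Google's actual monitoring programs.

```python
import os
import shutil
import time


def gather_health(path="/", min_free_fraction=0.05):
    """Server side: collect a few basic health readings for the local machine."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total if usage.total else 0.0
    return {
        "timestamp": time.time(),
        "pid": os.getpid(),
        "disk_free_fraction": free_fraction,
        # The threshold is an arbitrary value chosen for this sketch.
        "disk_ok": free_fraction >= min_free_fraction,
    }


def summarize(reports):
    """Collector side: list the hosts whose reports flag a problem."""
    return sorted(host for host, report in reports.items() if not report["disk_ok"])


local_report = gather_health()
```

A collector would call something like summarize on the reports it has pulled from each host, keyed by host name, and act on the hosts it returns.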

Complete the Lifecycle
    Log reporting
        We generate a "large" amount of log information
    Data is pulled back from the servers
    Analyzed using lots of Python tools
        Ad group needs to spot fraudulent clicks. This is a constant
            cat-and-mouse game with the script kiddies writing fraudulent ad
            clickers.

        Easy to alter the reports based on ever-changing needs
        Every time we find some way people are fraudulently clicking our ads, we
            patch that hole. It's a continuous process.
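
To illustrate why log reports that are "easy to alter" matter in a cat-and-mouse setting, here is a small sketch of a click report with an adjustable rule. The log format, the function name suspicious_sources, and the threshold rule are invented for this example; the talk did not describe how Google actually detects fraudulent clicks.

```python
from collections import Counter


def suspicious_sources(click_log, max_clicks_per_ad=3):
    """Return (source, ad) pairs that clicked the same ad more than the limit."""
    counts = Counter((source, ad) for source, ad in click_log)
    return sorted(pair for pair, n in counts.items() if n > max_clicks_per_ad)


# Hypothetical log entries: (source address, ad identifier).
sample_log = [
    ("10.0.0.1", "ad-7"),
    ("10.0.0.2", "ad-7"),
    ("10.0.0.1", "ad-7"),
    ("10.0.0.1", "ad-7"),
    ("10.0.0.1", "ad-7"),
    ("10.0.0.3", "ad-9"),
]

flagged = suspicious_sources(sample_log)
```

Tightening or loosening the rule is a one-argument change, which is the kind of quick iteration the notes attribute to Python.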


Python-based services
    Google Groups
        "Python Old-timers" David Jeske and Brandon Long (of eGroups and
            Neotonic/ClearSilver) are the leads on Groups.
        All built using Python code
        Highly pythonic
        They didn't use that giant mountain of C++ stuff

    code.google.com
        Stein and DiBona
    Others? We have so much going on...

How code.google.com was built (block diagram)
        /\ \/
    Front end Stuff
        /\ \/
    code.google.com
        SWIG
     Google Stuff

The funky front end stuff deals with denial of service attacks, reporting, blocking IPs known to be bad
    We get to take advantage because we've wrapped this
The HTTP server it's built on has all of the reporting and monitoring things on it - the "Google Stuff"


code.google.com
    goopy package - support for functional style programming
        Functional stuff to start with
        Place to put future modules

Closing
    We have a lot of Python code, covering a broad range of needs.
    Python has helped Google for many, many years.
    SWIG is underrated.
        I saw a little rant on Guido's blog (Guido shakes head) - it's kind of difficult to get your head wrapped around it but when you need access to some library of functionality from Python you don't need to go and build it yourself - you can use SWIG to wrap it automatically. This fits the Python ideal of smart reuse.
    We are now starting to open-source some of the pile.


Questions and Answers (a good 25 minutes for these)

Q: When are you going to open source the build system? (Guido)
A: I don't know.  If I recall, Greg has talked about it
   Chris DiBona: We're thinking of releasing some of our wrappers around
   Perforce first

Q: About SWIG, have you looked at the Boost::Python library?
A: I did see that come up recently; I don't think we use it a lot but it has
   been mentioned. I'll take a closer look at it.

Q: What about ctypes?
A: I saw that a while ago on a different project.  As far as I know we don't use
   it; SWIG works well with our build system.
Q: (elaborates on ctypes/SWIG differences) While SWIG will build a
   Python wrapper for a given C lib, ctypes will let you dynamically load up a C
   lib and call its functions.
A: calldll does something similar for the Windows environment
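
The ctypes approach described in this exchange, loading a C library at run time and calling into it without generating a wrapper, can be sketched as below. The sketch assumes a Unix-like system where the C runtime can be located; strlen is just a convenient function to call and was not mentioned in the talk.

```python
import ctypes
import ctypes.util

# Locate and load the C runtime at run time; no wrapper module is compiled.
# On POSIX systems, passing None (if the lookup fails) opens the symbols of
# the running process, which normally include the C library.
libc_path = ctypes.util.find_library("c")
libc = ctypes.CDLL(libc_path)

# Declare the signature of strlen so ctypes converts arguments correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"python")
```

The trade-off raised in the Q&A is visible here: nothing is built ahead of time, but the caller has to state each function's signature by hand, whereas SWIG derives bindings from the header file as part of the build.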

Q: Do you do anything in regard to network monitoring / SNMP with Python?
A: We do have a very large internal network, lots of traffic, the Ops guys do
   have monitors to watch the flow, have to schedule moving large (100 GB or 1
   TB-size) files.

Q: (Alex Martelli - who is starting at Google in three days) Back to the
   wrapping issue.  SWIG and ctypes will not help at all with C++ templates -
   Boost is better in this regard.  SWIG has been extended to support templates
   recently.
A: We do use some templates, but we normally try to avoid them and use SWIG. In
   that sense, SWIG works well for us. Some of the template stuff I'd like
   better access to, and I end up having to do some extra goo to get things
   working.

Q: What is missing from the Python ecosystem?
A: (Anna Ravenscroft, Alex's wife, yells "Alex") But we've solved that problem.

   Today they are mostly using Python 2.2, trying to figure out how to use
   Python 2.3 -- big upgrade problem

Q: How do you evangelize people who are happy with C++ and SQL and don't seem to
   want to try Python?
A: We make it easy to use any of the languages, and don't really force people to
   use a different language. The different applications are based on what the
   team understands best. We make it easy for all of these things to interact -
   if you have a server written in Java we have a custom RPC system that helps
   bridge the gap and communicate with other servers.

Q: How many software engineers roughly does Google employ (Steve Holden)?
A: I do know that the public employee count is over 3,000 employees as of
   December, but I don't know the break-out in terms of numbers of engineers.
   It's hundreds of engineers but I can't really say any more.
   Some of the apps written in Java (Blogger) can communicate with C++ using
   RPC, so not using Python is not a problem

Q: You must have masses of linguistic data (terabytes). How do you access that
   data so fast?
A: Yes. I don't know, I don't work in that area.  As far as speed, "we just
   throw servers at it."

Q: Within Google, is there anything for which Python is considered inappropriate?
A: Is there anything where Python is not appropriate? Well yeah, something like
   our indexing system where we scan the web pages and produce an index. Python
   is good, and fast, and IronPython is even faster, but it's not fast enough.
   We use C for that.

   For other things, it's based on the engineering team. We make it possible for
   the teams to use what language they like.

   Personally, I'd like to see more Python, so some of the things I've been
   doing have been working on enabling that.

Q: What kind of bug-tracking system do you use?
A: Bug tracking.  Our system is not that good.
   We have one, anybody in their right mind has one
   Bugzilla derivative
   MS has an awesome bug tracking system
   Even what I had at collab.net was better
   Google's looking at different options for fixing that system.

Q: I want to jump in with another comment on wrapping. I have a plotting library
   in C++ with heavy use of templates and I tried wrapping it in three different
   things (cxx, Boost, and SWIG). SWIG is actually pretty good now, SWIG
   template support is much better than it used to be. Boost makes things way,
   way too big.
A: Based on this feedback it seems like Boost is capable in certain environments
   and is definitely worth looking at. Need to evaluate before using.

Q: SWIG performance in real time environment?
A: It is a non-issue. However, I was challenged about this at MS: someone said
   "Python won't be fast enough!" I said, "how fast does it have to be?  1000
   pages per second?"  He couldn't say. So I said "then just don't worry about
   it unless it proves too slow."
   We did go ahead and rewrite some of the Python stuff into ActiveX COM objects
   and ASP and... it was slower (laughter and applause).
   Much time in Python is spent outside the interpreter loop; much time is
   spent, e.g., in the String object, which is written in C.

   [On code.google.com] There's still that Global Interpreter Lock in there, but
   I still saw some SERIOUS page performance on that thing. Don't be afraid of
   bringing Python into your projects. Your bottleneck will be the network
   bandwidth (some person on a 56kbps line), not Python

Q: Mentioned a number of languages used at Google. We use Python because it's
   terser (among other reasons). Can you speculate on lines of code in various
   languages at Google? (Do you even know total lines of code at Google?)
A: I have no idea. It's a LOT.
   Joke from audience: the code counter is still running!
   C++ is probably the majority, probably followed by Python.
   C++, Python, Java - gut feeling

Q: Five years from now, if people are right about Moore's law, more
   multiprocessor systems. What about the getting rid of the Global Interpreter
   Lock project that you did a few years ago?
A: Wow. Yeah, that was a few years ago.  Back in '96 I made a few patches to
   Python 1.4 to get rid of the GIL.  We used that at MS to make free threaded
   COM objects. We were getting a lot of lock contention. We had to protect
   different data structures - like in Python there are pools of frame objects
   which had to be protected (??). Things were blocking around those pools. For
   2 processors there was a bonus, but for 3 or 4 it was actually slower.

   Free threading - Python's thread state was one of the benefits from that set
   of patches. sys.exc_info was another.

   The Global Interpreter Lock hasn't actually been a problem.

Q: Every once in a while, you are going to introduce a bug into the system. How
   do you guys debug across the language boundaries?
A: We don't have any particular tools, or anything like that. Have libraries for
   logging. My favorite technique is adding print statements
   (applause/laughter). It would be wonderful if we had special tools but we
   don't.

   Some people ask what IDE they should use for cross language Java/Python
   development. Eclipse is quite good, but even that doesn't have any
   cross-language stuff.


Q: Do you have any current hobby projects that you are working on that you can
   talk about?
A: Stuff outside Google they can't tell me not to talk about.
    Subversion based wiki (subwiki)
        svn exposes its libraries to Python via SWIG
        You could build a new svn client or interact with a server from Python
        ViewCVS does this
        subwiki uses the svn repository to store the wiki pages
    Googly stuff - mostly code.google.com

Q: What does Google have to say about web application frameworks?
A: It's a tough one. Lot of stuff set up in C++. code.google.com was not built
   using an off-the-shelf framework; we used Google's custom HTTP server.

   GMail is not written in Python. I don't actually know if it's C++ or Java. (Chris DiBona: it's Java.)

Q: Followup - is there anything that Google can contribute (via open source) in the web framework arena?
A: Got a lot of stuff we've been talking about moving into the open source arena. Stuff tends to build on itself; trying to get it untangled. Stuff relies on Google-specific stuff, won't be interesting outside of Google.
   

Q: Tim O'Reilly talked about Google redefining applications.  In this view we're
   sort of moving away from Google 1.0.  When you upgrade, what sort of staging
   environment do you have?
A: We definitely have staging environments. It's one of the things built in to
   the systems I talked about for moving things out. The main web server -
   www.google.com - is a BIG chunk of code and data - because we have
   translations and stuff for everything. In any case, they're called canary
   servers (chuckles from crowd) - we put stuff on the canary servers and see if
   they're going to fall over. Also, because we get so much traffic we can turn
   a knob and expose something like 1% of our traffic to those servers. If they
   don't fall over, we expose some more.
   The "turning the knob" is a little command line tool written in Python.
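
The "knob" idea, exposing a small adjustable fraction of traffic to canary servers, can be sketched as below. This is a guess at one way to do it, not a description of Google's tool: hashing a request identifier into buckets, the function name goes_to_canary, and the request IDs are all assumptions made for the example.

```python
import hashlib


def goes_to_canary(request_id, canary_percent):
    """Deterministically route roughly canary_percent% of requests to canaries."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10000  # bucket in 0..9999
    return bucket < canary_percent * 100


# Hypothetical request identifiers, used only to measure the resulting share.
request_ids = [f"req-{n}" for n in range(10000)]
canary_share = sum(goes_to_canary(r, 1.0) for r in request_ids) / len(request_ids)
```

Hashing keeps the decision stable for a given request ID, so raising the percentage only adds traffic to the canaries rather than reshuffling which requests land there.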

Q: (Alex Martelli) Prompted by your mention of unwrapping pieces so they can be
   open source. It actually sounds like something that's a very good software
   engineering exercise, because it forces decoupling from your proprietary
   stuff. Even if we never open source the actual pieces, just having done the
   unwrapping seems like a big advantage.
A: It would be a big advantage if we were distributing code. For us, a 50 MB
   executable is not a problem, though you'd never try to push that to a client
   too often. While it would be an interesting engineering exercise and would
   improve the code it has not been a priority.

Chris DiBona followup: Opening your code tends to make it better, for example in our (?)malloc library we said it worked faster for these situations, and when we looked at it we found a bug in our code.


--------------------------------------------------------------------------
REFERENCES: {as documents / sites are referenced add them below}
http://www.swig.org
http://code.google.com

--------------------------------------------------------------------------
QUOTES:
"We don't do that at Microsoft; we ship C++ code"
"Python passed the tipping point years ago"
"[You can] read [Python] in 2 hours, program in it in 2 days, be productive for the company in 2 weeks."
"We use a LOT of SWIG"
"We've got quite a few servers..." (laughter)
"I've worked in large environments before, but nothing on the order of this"
"We have a lot of log data"
"Today we're using primarily Python 2.2 deployed on our servers, but we're trying to work out how to move to Python 2.3."
"Our bug tracking system is not that good"
"Pushing bits out to some guy on a 56k modem IS your bottleneck. Pulling records out of a database is your bottleneck. It's very rarely going to be Python."
"I think we probably have more Python code than we have Java" - a guess
"I think we probably have more Python than we do Java, because of all of those tools and things for supporting the environment, wrappers and all these things."
"Mr. Ascher.  That's Dr. Ascher, to you."
"My favourite debugging environment is PRINT."

--------------------------------------------------------------------------
CONTRIBUTORS: {add your name, e-mail address and URL below}
Ted Leung <twl@sauria.com> <http://www.sauria.com/blog>
Linden Wright <lwright@mac.com>
Erik Rose <corp@grinchcentral.com>

Andy Wright
Nicholas Riley <nriley@sabi.net> <http://njr.pycs.net/>
Simon Willison <cs1spw@bath.ac.uk> <http://simon.incutio.com/>
Jonathan Blocksom <blocksom@gollygee.com>
Abhay Saxena <ark3@email.com>

--------------------------------------------------------------------------
E-MAIL BOUNCEBACK: {add your e-mail address separated by commas }



--------------------------------------------------------------------------
NOTES ON / KEY TO THIS TEMPLATE:
A headline (like a field in a database) will be CAPITALISED
    This differentiates from the text that follows
A variable that you can change will be surrounded by _underscores_
    Spaces in variables are also replaced with under_scores
    This allows people to select the whole variable with a simple double-click
A tool-tip is lower case and surrounded by {curly brackets / parentheses}
    These supply helpful contextual information.

--------------------------------------------------------------------------
Copyright shared between all the participants unless otherwise stated...