Ted Leung on the air: Open Source, Java, Python, and ...
Wed, 23 Feb 2005
RSS Aggregators are the killer app

Scoble asked me to write this post, so here goes. I don't mean that RSS aggregators are the kind of killer app that sells a billion computers and creates new markets (there is that possibility, though). I mean the app that does so much that it consumes all available CPU, memory, network, and disk. Perhaps I really mean that they're the "killing my computer" app.

If I take out software development activities, the application that is pushing the limits of my hardware is my RSS aggregator. This is not in any way a slam on NetNewsWire, which is a very, very fine application. It's a reflection of the way that my relationship to the web has changed. I hardly use a standalone browser anymore -- mostly for searching or printing. I don't have time to go and visit all the web sites that have information that is useful to me. Fortunately, the aggregator takes care of that. Once the aggregator has the information, I want it to fold, spindle, and mutilate it. I'm at over 1000 feeds, and on an average day it's not uncommon to have 4000 new items flow through the aggregator. It takes 25 minutes (spread out over two sessions) just to pull the data down and process it -- and I have a very fast connection. NetNewsWire uses WebKit to render HTML inline -- a feature that makes it easy to cut through piles of entries, but one that is demanding of CPU and memory.

But that's just the basics. What happens when we start doing Bayesian stuff on 4000 items a day? Latent Semantic Indexing? Clustering? Reinforcement Learning? Oh, and I want to do all of those things on all the stuff that I ever pulled down, not just the new stuff. What happens if I want to build a "real-time" trend analyzer using RSS feed data as the input? The processor vendors should be licking their chops...
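To give a rough sense of the per-item work that "Bayesian stuff" implies, here is a minimal sketch of a two-class naive Bayes filter over item text. Everything in it is an assumption made for illustration: the `keep`/`skip` labels, the `NaiveBayes` class, and the tokenizer are hypothetical and are not part of NetNewsWire or any other aggregator.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Two-class multinomial naive Bayes with add-one smoothing.

    The labels "keep" and "skip" are hypothetical names for this sketch.
    """

    LABELS = ("keep", "skip")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Record one labelled item (for example, a title plus summary)."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        """Return the smoothed log prior plus log likelihood for a label."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        score = math.log(
            (self.doc_counts[label] + 1) / (total_docs + len(self.LABELS))
        )
        for word in tokenize(text):
            count = self.word_counts[label][word]
            # The extra +1 in the denominator leaves room for unseen words.
            score += math.log((count + 1) / (total_words + len(vocab) + 1))
        return score

    def classify(self, text):
        """Return the label with the higher log score."""
        return max(self.LABELS, key=lambda label: self.log_score(text, label))
```

Training is a cheap counting pass, but scoring touches every token of every item, so rerunning it over an entire archive rather than just the day's 4000 new items is where the cost would start to add up.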

[00:04] | [computers/internet/microcontent] | 11 Comments






The point is not to read 5000 or so blogs on a daily basis. The point is that once you have tools that are good at managing large amounts of data (through things like implicit/explicit scoring and dynamic filtering), the cost of adding yet another feed is insignificant. If the cost of adding new feeds is insignificant, then everyone will end up subscribed to large numbers of feeds within a couple of years.

People don't sit in a library and read every new book that is published as it comes in. Instead we trust that the things we want will be there if and when we need them. We can do this because we know that a library has mechanisms that will let us find the relevant content and ignore the irrelevant.

Tools like Aggrevator are meant to stop the frankly silly scenario wherein people try to read everything that gets published by their peer group as it comes out. People only feel this manic rush because their aggregators throw away entries after a set period of time.

Tools that retain everything (and somehow solve the painful scalability issues associated with that) can start to offer people useful ways of analysing, manipulating and managing that data.

For instance, I can take advantage of the power-law distribution of scores amongst those 5000 feeds and only read the top 20, because they are clearly my favourites at the moment. I could imagine ever more sophisticated features like cluster analysis, vector space search or contextual network graphs (but not Bayesian classification or latent semantic analysis, because they just don't scale well enough for desktop usage) being added over time.
Posted by ade at Wed Feb 23 17:35:01 2005
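
The top-20 idea in the comment above can be sketched in a few lines. This is only an illustration under stated assumptions: the score dictionary, the function names, and the Zipf-like synthetic scores in the usage note are all hypothetical, and it is not a description of how Aggrevator actually computes or stores scores.

```python
import heapq


def top_feeds(scores, n=20):
    """Return the n highest-scoring (feed, score) pairs, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


def share_of_total(scores, n=20):
    """Return the fraction of the total score held by the top n feeds."""
    total = sum(scores.values())
    if not total:
        return 0.0
    return sum(score for _, score in top_feeds(scores, n)) / total
```

With made-up scores that fall off as 1/rank across 5000 feeds, for example `{f"feed{r}": 1000 / r for r in range(1, 5001)}`, `share_of_total` reports that the top 20 feeds hold a bit under 40% of the total score. Real score distributions may be more or less skewed than that.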




I imagine you've checked out Findory's blog aggregator, which tries a collaborative filtering approach to finding interesting articles in the sea of blogs? It sounds like you're planning something along those lines for yourself?

(Personally, I've tried Findory a few times and haven't found that the articles it surfaces in response to my reading pattern are all that interesting to me. When I use BlogLines I find myself oversubscribing and watching the backlog pile up, which I'm not crazy about.)
Posted by Maarten at Fri Feb 25 12:26:05 2005

About

I work at the Open Source Applications Foundation (OSAF).
The opinions expressed here are entirely my own, not those of my employer.

This work is licensed under a Creative Commons License.
