Ted Leung on the air
Ted Leung on the air: Open Source, Java, Python, and ...
Tue, 13 Jan 2004
Atom, nuts, and bolts.
The debate over liberal parsing of Atom feeds is interesting. Of course, both sides have their points. The folks advocating liberal parsing believe that they are advocating for the users, in order to spare them the pain of bad feeds. The problem with this approach is that once you start accepting bad data, the data remains bad. This is particularly a problem when RSS/Atom data gets reformatted or repurposed. A great example of this is the RSS feed for Planet Lisp, which, as Mark pointed out in the comments, doesn't validate. One of the problems with that feed is that it has a bunch of non-conforming dates in it. Poor Xach is just aggregating that feed data -- he got those dates from someone upstream. If he's using the Planet code, he's using Mark's liberal parser to do part of the dirty work.

So in this case, the liberal parsing camp says: look, the feed is invalid, but the user doesn't care, so just read it. Ok, we can do that. But look at why the feed is bad. It's a cascade of people who said exactly that. The users don't care, but I bet Xach does, now that I passed Mark's feedback on to him. The users don't care about the pointy brackets, true. But they do care about the visible benefits of the pointy brackets. As more and more RSS data gets extracted, sliced, and diced, there are going to be more of these downstream accumulations of error. Once you start reusing or extracting from this data, the benefits of getting the pointy brackets right become visible.

This bit me in Chandler in the performance tests, because we use DBXML as part of the implementation of the item repository. So we had a bunch of bad feed data that was blowing up the store. So before I used Mark's feedparser to get all the data out of the feeds, I first ran the feed data through a real XML parser and then threw out all the invalid feeds.
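A minimal Python sketch of that kind of pre-filter follows. It assumes the feeds are already in memory as raw bytes, and it uses the standard library's ElementTree (expat-based) parser as a stand-in for whatever "real XML parser" was actually used. The function names are hypothetical. Note that this only checks well-formedness, not validity against any particular feed spec.

```python
import xml.etree.ElementTree as ET


def is_well_formed(feed_bytes):
    """Return True if the bytes parse as well-formed XML, False otherwise."""
    try:
        ET.fromstring(feed_bytes)
    except ET.ParseError:
        # A strict parser stops at the first error (unescaped "&",
        # mismatched tags, undefined entities, and so on).
        return False
    return True


def keep_well_formed(feeds):
    """Filter a {name: raw_bytes} mapping down to the well-formed feeds.

    Hypothetical helper: the mapping shape is an assumption made for
    this example, not something taken from Chandler.
    """
    return {name: data for name, data in feeds.items() if is_well_formed(data)}
```

Only the feeds that survive this filter would then be handed to the liberal parser for data extraction, so nothing malformed reaches the XML store.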

One way to look at XML compliance in specs is that it specifies the ability to pop things together. That is a benefit, and it is a user benefit, although the user that benefits may not be the user of standalone aggregators. Interoperability is important. If I go to the store for a 1/2 inch screw and they sell me a 5/8 inch screw because nuts "parse liberally", well, that's just not going to work.

Lots of folks are talking about this issue as if aggregators are the only users of RSS data, and I don't think that's going to be the case forever.

In the end, there's only one way to find out. Let's see what happens to the people who decide to parse strictly. If the liberal folks are right, then folks like Brent Simmons and Nick Bradbury will be out of business, because people will badmouth their strict aggregators and stop using/buying them. If not, a whole bunch of feeds will get cleaned up, and after a period of whining and tool building, the whole problem will be solved. If the authors of the major aggregators (and on my blog that's (roughly) NNW, SharpReader, FeedDemon, Radio Userland, feedreader, BottomFeeder, Straw, and RSSBandit) all decide to go strict, then I think that the feed producers can be brought into line pretty fast. If the feeds actually include good data about the feed producers, then strict aggregators could even auto-report bad feeds to the feed producer via email, along with a URL to the feedvalidator like the one Mark posted for Planet Lisp. That's aggregator side tooling.
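As a rough sketch of what such auto-reporting might look like, the Python below builds (but does not send) an email report when a feed fails a strict parse. Several things here are assumptions: the producer's address is passed in (for instance, cached from an earlier valid copy of the feed), the validator is assumed to accept the feed address as a `url` query parameter, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from email.message import EmailMessage
from urllib.parse import quote

# Assumption: the validator takes the feed address as a "url" query parameter.
VALIDATOR_BASE = "http://feedvalidator.org/check.cgi?url="


def validator_link(feed_url):
    """Build a link that asks the validator to check feed_url."""
    return VALIDATOR_BASE + quote(feed_url, safe="")


def build_bad_feed_report(feed_url, feed_bytes, producer_email):
    """Return an EmailMessage describing the parse error, or None if
    the feed is well-formed. Sending is left to the caller."""
    try:
        ET.fromstring(feed_bytes)
    except ET.ParseError as err:
        msg = EmailMessage()
        msg["To"] = producer_email
        msg["Subject"] = "Your feed is not well-formed XML"
        msg.set_content(
            f"Feed: {feed_url}\n"
            f"Parser error: {err}\n"
            f"Details: {validator_link(feed_url)}\n"
        )
        return msg
    return None
```

A real aggregator would also want to rate-limit these reports so that one broken feed with many subscribers does not flood its producer.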

We also need author side tooling. My feed breaks periodically because I write entries by typing HTML into a very dumb CGI script. That's one reason I want to get something like Ecto or the NNW weblog editor working. And if those folks do their job properly, I'll never generate an invalid feed again.

So I say let's do an experiment. Brent and Nick are willing to bet their livelihood on it -- the rest of us are just talking.

[21:47] | [computers/internet/microcontent] | 4 Comments

You can't really pit liberal aggregators against strict aggregators.  If a few popular aggregators act liberally, there will be a significant number of invalid feeds out there.  If you disagree, take a look at HTML.

If, on the other hand, all the popular aggregators acted strictly, how many invalid feeds would arise?  I'm betting next to none - and any that inadvertently became invalid would be fixed very quickly.

Mark's example of being stranded with a publishing tool and invalid Trackback isn't very relevant, in my opinion.  In a strict-only or strict-dominant world, letting bad data into the feed would be a showstopping bug, not a bug that is likely to be let into a production system.  And if a vendor did let a bug like that into their production system, they'd think twice about making their whole admin section rely upon rendering the feed content in future.  Validating content before allowing it into the system is relatively simple.  Dealing with loads of feeds that are malformed in different ways is not.

Postel's Law is not something to be thrown away without very good reason.  But look at how many workarounds have to be put into an HTML parser to deal with the amount of crud people are putting into the web today.  If aggregators remain strict from the beginning, the amount of crud will remain at a controllable level.  If they are liberal, it's liable to spiral out of control.

Postel's Law works both ways - if the feed consumers have reason to believe the feed producers won't hold up their end of the bargain (being strict in what they produce), they should be able to take preventative measures (by increasing the cost of being lax).
Posted by Jim Dabell at Wed Jan 14 15:24:57 2004


I work at the Open Source Applications Foundation (OSAF).
The opinions expressed here are entirely my own, not those of my employer.

This work is licensed under a Creative Commons License.
