Ted Leung on the air

Ted Leung on the air: Open Source, Java, Python, and ...

Thu, 30 Jun 2005

WebServicesSummit.com has posted an excerpt of the Encryption and XML Security chapter from my book.

[00:24] | [computers/programming/xml] | # | TB | F | G | 0 Comments |

Fri, 28 May 2004

Today I wanted to manipulate some of the data stored in my Amazon wishlist. Thanks to this post by Kevin Donahue, it turned out to be easy.

[23:28] | [computers/programming/xml] | # | TB | F | G | 6 Comments |

Tue, 23 Dec 2003

Broke 100,000

Julie just popped in to tell me that my sales rank on Amazon is 89,461. That's a significant improvement from 700,000+. Of course, the distribution of sales on Amazon is likely to be a power-law-shaped curve, so the sledding only gets tougher uphill.

It's an odd experience shipping physical products. I finished writing the book in September, and there have been little things to do here and there, but it's been mostly out of my mind since then because I've got so many other things going on. I've gotten so used to finishing something and having it go into use that it just feels weird to have this delay before the books go out. Even now it doesn't quite seem real, because I haven't seen a copy of the physical book yet. I hope that my copies are on the way.

I had a similar feeling when I was working on the Newton at Apple. Talk about lag time. We had to finish our software much earlier than I was used to, because that software was going into ROM, and there was lead time for that, coupled with the vacation schedule of our production partner in Asia. Working on something physical, like a piece of electronics or a book, definitely has a different feel from just tagging a bunch of files in CVS, jarring them up, and pronouncing it done. Of course, not all software is done that way, but more and more software is getting done that way.

[00:05] | [computers/programming/xml] | # | TB | F | G | 0 Comments |

Thu, 18 Dec 2003

The book is real!

My book, "Professional XML Development with Apache Tools", has hit Amazon. So now you can go buy it... You'll probably get your copy before I get mine.

[23:38] | [computers/programming/xml] | # | TB | F | G | 0 Comments |

Fri, 31 Oct 2003

What is integrated XML support anyway?

Kimbro Staken is wishing for XML support baked directly into a programming language. His criteria are:

Seamless XML support. Never having to explicitly parse an XML document
XPath as a native language construct
Dynamic conversion between text and parsed representations of the XML.
XPath manipulation for XML modifications

While I like the goal, I wonder about the implementation. I'd rather have XML elements and attributes be fully expressible. That is, I'd like to have XML element literals, not just XML element variables (especially since variables have no type in Python; it's the values that do). I also wonder about the efficiency of using XPath for updates. The syntax that is presented for XPath modification implies that you have to execute the query before you can update the document -- there's nothing like a cursor to help you keep your place.
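To make the query-then-update concern concrete, here is a minimal sketch in Python using the standard library's ElementTree, which supports only a limited subset of XPath. The wishlist document and its element names are invented for illustration, and this is not the syntax Kimbro proposes; it only shows the two-step shape of the operation.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, made up for this example.
doc = ET.fromstring(
    "<wishlist>"
    "<book price='30'><title>XML</title></book>"
    "<book price='45'><title>Python</title></book>"
    "</wishlist>"
)

# Step 1: execute the path query to locate the nodes to change.
matches = doc.findall("./book[@price='45']")

# Step 2: update whatever the query returned. Nothing here acts as a
# cursor: a later update means running another query from the root.
for book in matches:
    book.set("price", "40")

prices = [b.get("price") for b in doc.findall("./book")]
```

Every update pays for a fresh traversal, which is the efficiency worry with XPath-driven modification.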

This topic is fresh on my mind, as I've just finished reading a series of papers on this very topic:

Martin Kempa and Volker Linnemann's PlanX '02 paper, On XML Objects, describes XOBE, a system implemented as a Java preprocessor, which aims to

eliminate the distinction between the string representation and the object representation of XML documents.

Their approach generates classes for each of the elements in an XML grammar (DTD or Schema) and allows for object literals that look syntactically like XML. XOBE also allows XPath expressions for querying the resulting object hierarchies.

Erik Meijer and Wolfram Schulte's OOPSLA 2003 submission: Unifying Tables, Objects, and Documents takes a different approach. Meijer and Schulte show how to extend C# (it could just as easily be Java) to deal with relational and XML data. They set forth a number of design principles for their experimental language, but two of the most important are:

Denotable values should be (easily) expressible
Expressible values should be denotable

This is at the heart of not having to parse and/or re-serialize.

C# is extended to support streams of various lengths (this can be done easily in languages which have generators, like Python). It is also extended to allow tuples (heterogeneous structures of optionally labelled variables of fixed length). Python has unlabelled tuples built in. The combination of streams and tuples is used to model relational data. The last supporting extension is union types.
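As a rough illustration of the Python side of that comparison, the sketch below models a relational "table" as a generator (the stream) of `namedtuple` rows (labelled, fixed-length tuples). The `Book` schema and its rows are hypothetical, and this is plain Python, not the C# extension from the paper.

```python
from collections import namedtuple
from itertools import islice

# A labelled tuple: fixed length, heterogeneous fields (hypothetical schema).
Book = namedtuple("Book", ["title", "author", "price"])

def books():
    """A stream of Book rows; the generator produces them one at a time."""
    yield Book("First Book", "A. Author", 40)
    yield Book("Second Book", "B. Author", 25)
    yield Book("Third Book", "A. Author", 55)

# Streams are lazy: only the rows actually requested are produced.
first_two = list(islice(books(), 2))
titles = [row.title for row in first_two]

# A select-from-where over the stream is just a generator expression.
cheap_by_a = [row.title for row in books()
              if row.author == "A. Author" and row.price < 50]
```

Because the consumer only sees an iterator, the same query code would work whether the rows come from a list in memory or are read lazily from a file.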

Streams, tuples, and unions are then used to model a large part of the XML Schema (XSD) type system. This means that the programmer can declare classes that correspond to XSD types. C# is also extended with XML literals, which can be stored in instances of the appropriate class.

In the area of querying, Meijer and Schulte have departed from XPath syntax. Instead, the mechanisms for looping / querying are taken from functional programming: lifting, apply-to-all, and folds. This provides a nice mechanism that works for streams (relational data) and XML data. These mechanisms support accessing fields/members of objects to provide path expression / XPath-like navigation. The authors also provide wildcard, transitive, and type-based member access. Wildcard access returns all the members of a type (in declaration order), transitive access allows you to find a member that is transitively reachable from some other member. Type-based access allows you to restrict the type of the member being (transitively) searched for (the restriction is reminiscent of XPath axis notation).
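The flavor of these mechanisms can be approximated in ordinary Python. The sketch below is only an analogy under stated assumptions: `lift` and `transitive` are hypothetical helper names, the `Order`/`Item` classes are invented, and none of this is Meijer and Schulte's actual syntax or semantics.

```python
from functools import reduce

class Item:
    def __init__(self, name, price):
        self.name = name
        self.price = price

class Order:
    def __init__(self, customer, items):
        self.customer = customer
        self.items = items

def lift(stream, member):
    """Member access lifted over a stream: yields x.member for each x."""
    return (getattr(x, member) for x in stream)

def transitive(obj, member):
    """Yield every `member` value reachable from obj via attributes and lists."""
    if isinstance(obj, (list, tuple)):
        for x in obj:
            yield from transitive(x, member)
    elif hasattr(obj, "__dict__"):
        for name, value in vars(obj).items():
            if name == member:
                yield value
            yield from transitive(value, member)

orders = [
    Order("ann", [Item("pen", 2), Item("ink", 5)]),
    Order("bob", [Item("pad", 3)]),
]

customers = list(lift(orders, "customer"))      # lifted member access
prices = list(transitive(orders, "price"))      # transitive member access
total = reduce(lambda acc, p: acc + p, prices, 0)  # a fold over the stream
# Type-based restriction: keep only the reachable `name` members that are strings.
names = [v for v in transitive(orders, "name") if isinstance(v, str)]
```

The same helpers work on any nested objects, which is the appeal over a query language tied to XML alone.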

In addition to functional style querying, this extension of C# can also support a SQL like select-from-where clause. The cool thing that they point out is that it doesn't matter where the queried data is: it can be in memory or on disk. This is the beauty of the stream abstraction.

The same authors (plus one), Erik Meijer, Wolfram Schulte, and Gavin Bierman, have a paper at XML 2003, Programming with Circles, Triangles and Rectangles, which focuses on the XML aspects of the language described in the previous paper. The experimental language is now being called Xen. This paper goes into a lot more detail on the mismatch between the XML data model and object data models. There's less on streams and tuples (and the type rules), a little more on lifting, and examples of handling the XQuery use cases in Xen.

Ned Batchelder points out that the version linked above is encoded in MSIE-only HTML. Bierman has made a friendlier version available (at least it renders in Mozilla).

It seems clear that everyone agrees that denotable values should be easily expressible and that expressible values should be denotable. What's not as clear is what the query model should be for a language that has XML support baked in. XPath is an obvious choice, but then you only get the functionality for XML data, and I'd like to have the same kind of query capability for any hierarchical data. That's what I like about the Xen approach. Of course, it is possible to make XPath work over objects, à la the ObjectXPathNavigator or JXPath...

I'm glad to see this work being published / discussed. It seems to me that Xen is most likely to make an appearance in commercial products, since Erik Meijer works in the WebData group at Microsoft, which is a product, not a research group. If you want to stay in the statically typed curly brace language world, it looks like Microsoft is kicking the tires in all the right places. If Xen becomes C#2006, then Java will be sucking wind (at least in my book).

[15:06] | [computers/programming/xml] | # | TB | F | G | 4 Comments |

Wed, 29 Oct 2003

The Essence of XML

Today I read Simeon and Wadler's POPL 2003 paper, The Essence of XML, which should really be titled "The Essence of XML Schema". The paper points out that XML is not a good data representation because it isn't self-describing or round-trippable:

It is not always self-describing, since the internal format corresponding to an external XML description depends crucially on the XML Schema that is used for validation (for instance, to tell whether data is an integer or a string). And it is not always round-tripping, since some pathological Schemas lack this property (for instance, if there is a type union of integers and strings)

They are assuming a strongly typed world, of course. The key observation in their work is that using a named typing model instead of the more commonly used structural typing model leads to a model where proving theorems about validation and erasure is very easy. The model in the paper only covers a subset of XML Schema, but the remaining features are accounted for in the formal semantics for XQuery and XPath 2.0.

For practitioners, the bottom line is that the subset of Schema described by the paper is roundtrippable (convert internal value to external and back -- output the value as XML and then parse/validate) except for simple types that are lists or unions. For the case of reverse-roundtripping (parse/validate then output), the only problems arise when a base type has multiple representations for the same value, like leading zeros or number bases.
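Both failure cases are easy to reproduce with a toy model. The sketch below is Python, not the paper's formalism: `parse_int`, `parse_union`, and `serialize` are hypothetical stand-ins for validation and erasure, and trying the integer branch of the union first is an assumption made for the example.

```python
def parse_int(text):
    """Toy validation against an integer type: external text -> internal value."""
    return int(text)

def parse_union(text):
    """Toy validation against union(integer, string); integer is tried first."""
    try:
        return int(text)
    except ValueError:
        return text

def serialize(value):
    """Toy erasure: write the internal value back out as text."""
    return str(value)

# Reverse round-trip (parse/validate, then output): the leading zeros
# are lost because "007" and "7" denote the same integer.
reverse_trip = serialize(parse_int("007"))

# Round-trip (output, then parse/validate) under a union type: the
# internal value is the *string* "42", but it comes back as an integer.
original = "42"
restored = parse_union(serialize(original))
round_trips = (restored == original) and (type(restored) is type(original))
```

Here `reverse_trip` is `"7"` rather than `"007"`, and `round_trips` is `False` because the string came back as an `int`.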

[23:34] | [computers/programming/xml] | # | TB | F | G | 0 Comments |

Fri, 03 Oct 2003

Mark Nottingham throws down the XML editor gauntlet

All I can say is "I'm with Mark", except maybe for the part about Emacs...

[00:44] | [computers/programming/xml] | # | TB | F | G | 0 Comments |

Fri, 19 Sep 2003

Push me pull you

Diego's done a little benchmarking of Crimson (the parser in the JDK) and the StAX RI. StAX comes out way ahead, but there are still some things to wonder about: Can StAX validate? Crimson can, and is likely taking a hit for it. If he had compared it to Xerces-J, I'd have to ask whether StAX can validate against XML Schema. One interesting comparison would be against NekoPull.

I'm excited about StAX, but having worked on Xerces-J, I also know that there are lots of factors that could be influencing the numbers. The thing that is clear is that the code is a lot simpler. The performance stuff is just repeated application of solid engineering.

[12:39] | [computers/programming/xml] | # | TB | F | G | 0 Comments |

Thu, 18 Sep 2003

StAX + XMLBeans == Java XML APIs

Elliotte Rusty Harold has an XML.com article on StAX. It's a good introduction to this style of API.

He didn't like the use of integer type codes, preferring to have objects. I view StAX as a replacement for SAX, which is basically the bottom of the stack for XML processing APIs. You can use SAX or StAX to build trees or beans or whatever you like. So any performance penalty that you impose on StAX is something you've imposed on everybody above you in the food chain, and there's no way to get that performance back if you need it. I think that they did the right thing here. The performance of reflection is never going to match integer comparison. Also, Andy Clark, the author of NekoPull, is a member of the expert group as well.

As noted, the state management is much easier than SAX -- it allows you to write a more recursive-descent style of object construction. One of the problems with SAX is that it's very hard to use it to create objects in a way that allows you to reuse the objects and the SAX handler for building those objects. StAX makes that go away.
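The recursive-descent style is easier to see in code. The sketch below is an analogy in Python rather than StAX itself: it pulls events from the standard library's `ElementTree.iterparse`, and the `Order`/`Item` classes, the `read_*` function names, and the sample document are all hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Item:
    sku: str
    qty: int

@dataclass
class Order:
    id: str
    items: list = field(default_factory=list)

def read_item(events, start):
    """Called right after <item> was pulled; consumes events up to </item>."""
    for event, elem in events:
        if event == "end" and elem.tag == "item":
            return Item(sku=start.get("sku"), qty=int(start.get("qty")))

def read_order(events, start):
    """Called right after <order> was pulled; delegates each <item>."""
    order = Order(id=start.get("id"))
    for event, elem in events:
        if event == "start" and elem.tag == "item":
            order.items.append(read_item(events, elem))
        elif event == "end" and elem.tag == "order":
            return order

def read_document(source):
    # One shared iterator: each reader pulls exactly the events it needs.
    events = iter(ET.iterparse(source, events=("start", "end")))
    for event, elem in events:
        if event == "start" and elem.tag == "order":
            return read_order(events, elem)

xml = b'<order id="A1"><item sku="x" qty="2"/><item sku="y" qty="1"/></order>'
order = read_document(io.BytesIO(xml))
```

Each reader's position in the call stack is the parse state, so `read_item` can be reused wherever an `<item>` may appear without a handler-wide state machine.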

In many cases what you want is an easy-to-manipulate representation of the XML data. This is very frequently a Java Bean. So what you really want is an API that takes an XML stream and gives you back the right Java Bean. And that's what XMLBeans is all about.

So in my view of the Java XML API world, there's StAX at the bottom, and on top of that you have XMLBeans. If you need to do weird document editing tree stuff, then there's StAX at the bottom, and on top of that there's the DOM or XOM or JDOM or whatever. But for most applications that I've worked on, the combination of StAX and XMLBeans would cover it. If you look at the XMLBeans Roadmap, you'll see that one of the goals is to support StAX (JSR 173), which would mean one stop shopping, er, downloading. My hope is that as XMLBeans evolves we'll be able to have a simple to use library.

[01:06] | [computers/programming/xml] | # | TB | F | G | 2 Comments |

Thu, 11 Sep 2003

XML Tools from all over

Uche Ogbuji's XML.com article on The State of the Python-XML Art, 2003 is a good guide for people unfamiliar with Python's XML support, like me. I found some good pointers in there.
Micah Dubinko also has an article on XML.com, discussing Ten XForms Engines. I'd really like to see a Firebird plugin for this, and one that could generate an SVG / Flash .swf file given an XForms document.
From XMLhack comes news of nXML, James Clark's GNU Emacs mode for XML. nXML can do validation based on Relax NG schemas.

[01:03] | [computers/programming/xml] | # | TB | F | G | 0 Comments |

Tue, 09 Sep 2003

XML Phones

This CNET article is ostensibly about Cisco's color VoIP phone, but tucked in at the end is a bit of interesting information:

Like most VoIP phones, the 7970G will also perform Internet functions such as searching a database for work orders. It does so by running programs written in an emerging software language called XML (Extensible Markup Language) that's used to share data over networks.
Also Tuesday, Cisco said it will extend XML capabilities to its 7905G and 7912G VoIP phones by the end of the year. The phones are available now, selling for $135 and $165 respectively.

Does this have possibilities?

[01:32] | [computers/programming/xml] | # | TB | F | G | 2 Comments |

Mon, 01 Sep 2003

Lantern illuminates XPath

Lantern is a tool for visualizing the results of XPath queries. This looks more functional than XPath Explorer, which is what I've been using.

[00:34] | [computers/programming/xml] | # | TB | F | G | 0 Comments |

Tue, 19 Aug 2003

XML omnibus: demystification, RelaxNG, Postel's law

You know we're in trouble when we need an XML Acronym Demystifier. Things are really getting out of hand.
Tim Bray is extolling RelaxNG after his experiences writing a schema for Pie/Echo/Atom/Whatever. I agree that RelaxNG is technically superior to XML Schema. I've all but given up on RelaxNG supplanting XML Schema as the majority schema language for XML. I suppose that's mostly because I'm looking at it through the lens of Xerces and the disinterest we've had from RelaxNG knowledgeable folks in the past.
Something needs to happen for RelaxNG to pick up momentum. Right now, it's just another one of those acronyms that we need the demystifier for. Until the RelaxNG community starts showing up to projects and helping to build RelaxNG support, RelaxNG will remain obscure. A few years ago I tried to use the leverage that we had with Xerces-J to try and advance RelaxNG, but no one was interested enough to help us. If the RelaxNG folks don't care enough to get RelaxNG support into the most popular processors in all the major languages, then they can't expect the rest of us to care.
I partially disagree with Aaron Swartz and Mark Pilgrim (and probably a host of others) on Postel's law. All the examples that Aaron gives are related to documents that users might process (this includes RSS feeds, where I've been bitten by his reason #1). I agree that in those situations, it's desirable for the parser to be Postel compliant (i.e. lenient). But in situations where the XML is generated for machines, by machines, I think that the XML spec writers were correct. Let them eat fatalErrors, because there's a bug in a program somewhere that needs to get fixed.
One of the motivations for the Xerces Native Interface was to allow us to build families of parsers by replacing pieces of the framework. So you could implement a strict XML parser (which is what XMLDocumentScannerImpl is) or you could implement a Postel XML parser. Only problem is that nobody cares enough to implement a Postel XML parser.
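For a feel of what "eating fatalErrors" means in practice, here is a small Python sketch; it uses the standard library's ElementTree rather than Xerces, and `parse_strict` and `parse_lenient` are hypothetical names. Note that the lenient function only skips a bad document; it does not repair it the way a real Postel parser would.

```python
import xml.etree.ElementTree as ET

def parse_strict(text):
    """Draconian handling: any well-formedness error propagates to the caller."""
    return ET.fromstring(text)

def parse_lenient(text, fallback=None):
    """Stand-in for leniency: swallow the error and return a fallback value."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return fallback

# Malformed on purpose: <title> is never closed.
broken = "<feed><title>unclosed</feed>"

try:
    parse_strict(broken)
    strict_failed = False
except ET.ParseError:
    strict_failed = True

lenient_result = parse_lenient(broken)
```

For machine-to-machine XML, the strict failure is the useful signal: it points at a bug in whatever produced the document.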

[01:33] | [computers/programming/xml] | # | TB | F | G | 0 Comments |

Sat, 16 Aug 2003

My current project unveiled...

I've been hinting at my current project here and there. Today my agent told me that I am clear to talk about it, so since a picture is worth a thousand words:

Professional XML Development with Apache Tools : Xerces, Xalan, FOP, Cocoon, Axis, Xindice

The goal of this book is to help people use the unique features of various XML related projects at the ASF, mostly those that were a part of xml.apache.org at the beginning of the year. I'm pretty well along, but if there's something that you always wanted to know about some Apache XML tool, drop me a note, comment, or trackback, and I'll see what I can do.

Thus far, writing a book has turned out to be a bigger experience than I thought, but so far I'm enjoying it. I was really able to identify with Erik Hatcher's post today.

[00:49] | [computers/programming/xml] | # | TB | F | G | 3 Comments |

Fri, 04 Jul 2003

XMLBeans going open source

Steven's already posted his opinion on XMLBeans moving to Apache. He's made himself an ASF sponsor of the proposal to bring XMLBeans to the ASF. Steven's post focused on the political end of the proposal. Most of his reasoning would be subsumed by my repeated postings here, so I won't repeat myself.

I'm also interested in seeing XMLBeans go open source, even if it ultimately doesn't end up at the ASF. The reason is simple. Most of the APIs that we have for dealing with XML suck. I think that we can do a lot better, and XMLBeans looks pretty good. The open source community needs to start thinking about targeted areas for improvement. The whole XML API area is ripe for this. Having XMLBeans go open source is much better than having it go to the JCP. I'm planning to try and help shepherd this through the Apache incubator.

[22:31] | [computers/programming/xml] | # | TB | F | G | 1 Comments |

Mon, 30 Jun 2003

TagSoup, NekoHTML

Norman Walsh is touting John Cowan's TagSoup as a solution for HTML escaping. Andy Clark's NekoHTML is also a solution in this space.

[12:32] | [computers/programming/xml] | # | TB | F | G | 0 Comments |

Sun, 22 Jun 2003

How far we've come

If you want your head to spin, check out this picture of the state of XML.

[01:16] | [computers/programming/xml] | # | TB | F | G | 0 Comments |

Sat, 07 Jun 2003

What does equivalence mean for XML?

Dare is back and writing like a madman. There's one aspect of comparing XML that I'm not sure he quite picked up, though:

If I get arbitrary XML as input to my method and want to compare whether the XML fragment is equivalent to another XML fragment there isn't an easy way to do this without converting both in-memory representations to their string representations (if they aren't strings already) and doing a string comparison. Hmmmm...

String comparison is not nearly enough to determine equivalence. Aside from the namespace prefix mapping problem mentioned by Harry, there are problems related to order of attributes, whitespace between elements, and empty elements, to name a few. The Canonical XML recommendation has a good list of the issues. You have to decide what equivalence means when you are talking about XML derived data. If you want equivalence to mean "the objects that have been serialized as these two XML documents are equal in C#/Java", then you have more work to do at the "string comparison" level.
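A quick way to see the problem is to compare two documents that differ only in attribute order, inter-element whitespace, and empty-element syntax. The sketch below uses Python's `xml.etree.ElementTree.canonicalize` (available from Python 3.8, implementing C14N 2.0); the sample documents are invented. The `strip_text=True` option is itself a decision about what equivalence means, since Canonical XML otherwise treats that whitespace as significant.

```python
import xml.etree.ElementTree as ET

# Two serializations of what most applications would call the same data.
a = '<order id="1" status="open"><item/></order>'
b = '<order status="open" id="1">\n  <item></item>\n</order>'

# Naive approach: compare the strings directly.
same_text = (a == b)

# Canonicalize first: attributes are sorted and empty elements are
# written as start/end pairs; strip_text drops the indentation.
canon_a = ET.canonicalize(a, strip_text=True)
canon_b = ET.canonicalize(b, strip_text=True)
same_canonical = (canon_a == canon_b)
```

Here `same_text` is `False` while `same_canonical` is `True`. Differing namespace prefixes would need a further option (`rewrite_prefixes=True`), and object-level equality in C#/Java is a separate question again.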

[14:15] | [computers/programming/xml] | # | TB | F | G | 2 Comments |