[Adium-devl] O'Reilly XML blog article: Parsing XML… backwards?

Wed Mar 14 13:31:08 UTC 2007

On Mar 14, 2007, at 1:48 AM, Peter Hosey wrote:

> Found this in my referer logs:
>
> http://www.oreillynet.com/xml/blog/2007/03/parsing_xml_backwards.html
>
> It's an article about LMX and the various ways it's a Bad Idea.  
> Some are better than others, but anyway, the article is definitely  
> worth a read. Also, I have a comment in there.
>
> He makes a good suggestion:
>
>> You can write multiple well-formed XML documents to a single file,  
>> following each one by a binary trailer that gives the size of the  
>> last chunk of XML. Then it is trivial for code to jump backwards  
>> through the file, grabbing a little document each time and passing  
>> it to a real XML parser.
>
> This is an interesting idea. It would, essentially, be an archive  
> of mini-XML-documents (which I suppose would be a bit like  
> Colloquy's envelope element), which we could easily seek in reverse.
>

Putting binary at the end of an XML file?  What is the advantage of
this?  The only one I can find is that you read a constant number of
bytes at the end of the file to learn how many more to read to get
the last element.
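
For reference, the trailer scheme quoted above can be sketched
roughly as follows. This is only an illustration in Python: the
4-byte big-endian trailer and the names append_document and
documents_in_reverse are my assumptions, not anything the article or
Adium specifies.

```python
import io
import struct
import xml.etree.ElementTree as ET

# Assumed trailer layout: one unsigned 32-bit big-endian length.
TRAILER = struct.Struct(">I")


def append_document(stream, xml_bytes):
    """Write one small XML document followed by its length trailer."""
    stream.write(xml_bytes)
    stream.write(TRAILER.pack(len(xml_bytes)))


def documents_in_reverse(stream):
    """Yield parsed documents from the last one written to the first."""
    pos = stream.seek(0, io.SEEK_END)
    while pos > 0:
        # The trailer sits immediately before the current position.
        stream.seek(pos - TRAILER.size)
        (size,) = TRAILER.unpack(stream.read(TRAILER.size))
        # Step back over the trailer and the document it describes.
        pos -= TRAILER.size + size
        stream.seek(pos)
        yield ET.fromstring(stream.read(size))
```

Note that the file as a whole is not something a normal XML parser
will accept; only the individual chunks are.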

If you wanted to do something along these lines, you could simply
put a single self-closing element, with a known name, as the last
child of the root element, with a single attribute holding a
zero-padded hex number.  Then the end of the file is just:
<lastelement size="000000a0" /></root>
The file is then still well-formed XML, you know the name of the root
element from parsing the beginning of the file, and you know exactly
how much to read at the end of the file, give or take a few
whitespace characters.  Given that this modification should work, why
not put the element at the beginning of the root element instead?
Then you can simply use a normal XML parser to read the value, and
because the size is zero-padded hex, it can be changed to any 32-bit
number without moving any of the characters that follow.  So you
would have:

<?xml....
<chat xmlns.....
<lastelement size="000000a0" />
....

Isn't this a far better idea?  Since the only information is within  
the attribute, any parser which doesn't know what to do with the  
element simply throws it out.  Even an HTML parser will ignore it.
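
A minimal sketch of that idea in Python, assuming the marker is
written exactly as in the example (lowercase hex, eight digits, one
space before "/>") and sits within the first 4096 bytes.  The
function names are made up for illustration.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches the fixed-width marker exactly as written in the example.
SIZE_ATTR = re.compile(rb'<lastelement size="([0-9a-f]{8})" />')


def set_last_size(stream, size):
    """Overwrite the eight hex digits in place; no other byte moves."""
    if not 0 <= size <= 0xFFFFFFFF:
        raise ValueError("size must fit in 32 bits")
    stream.seek(0)
    head = stream.read(4096)  # assumption: marker is near the start
    match = SIZE_ATTR.search(head)
    if match is None:
        raise ValueError("no <lastelement> marker found")
    stream.seek(match.start(1))
    stream.write(b"%08x" % size)


def get_last_size(stream):
    """Read the value back with an ordinary XML parser."""
    stream.seek(0)
    root = ET.parse(stream).getroot()
    for child in root:
        # Ignore any namespace prefix ElementTree adds to the tag.
        if child.tag.rsplit("}", 1)[-1] == "lastelement":
            return int(child.get("size"), 16)
    raise ValueError("no <lastelement> child in the root element")
```

Because the field is fixed width, updating it never changes the
length of the file or the offset of anything after it.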

Just don’t put binary in an XML document! (à la his final statement).

> The downside is that it wouldn't work well with most existing XML  
> tools—we couldn't simply slurp a log file and pass it to  
> NSXMLParser, WebKit, or anything else, without preprocessing it to  
> remove those size markers. OTOH, it wouldn't be terribly hard to  
> write such a preprocessor. XSLT could do the job.
>

I think my solution above is better, since then there is no need for
a preprocessor.  Plus, if you did need to strip the markers, writing
an XSLT stylesheet to do the job is much easier.  Besides, XSLT
processors expect their input to be well-formed XML, which a file
with binary trailers in it is not.

> The other downside is that we already have ULF and LMX; this would  
> be yet another log format, whose main reason for existence would be  
> the fact that LMX won't work 100% of the time with XML from the  
> sort of people who name their elements “hello--”.
>

Downside number 2:  He suggests using this to know the size of the
last element or elements.  We want to use it to read the last N
messages, but N is *NOT* known at the time the file is written.  Say
the user's pref is to see only the last message when the log file is
written, and they change it to 10 after the log is closed.  When they
next open a chat with that contact, they still see only the last
message.  Not terribly detrimental, but still something we would
have to give up.

> I'm inclined to stay with ULF, but I wanted to bounce it off you guys.

I say stick with ULF.  Encoding and DOCTYPE are trivial to solve by
simply reading the beginning of the document first.  Comments can be
an issue, but one that is solvable if we really cared (although we
would lose constant-time parsing).  Also, the case he mentioned is
not going to be an issue, since no ULF element names end with "--".
Processing instructions should not exist in ULF, so that issue
wouldn't affect us.  Documents which are not well-formed will have
issues everywhere.

Lastly, we aren't using LMX to append, but this is still a
non-problem.  Read the close of the root element, write your new data
over it, and then close it again.  This is what AIXMLAppender is
doing (minus the read part).
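
Roughly, in Python (a sketch of the idea only, not AIXMLAppender's
actual code; it assumes the file ends with exactly the closing tag
plus a newline, and append_inside_root is a made-up name):

```python
import io


def append_inside_root(stream, new_xml, root_name):
    """Overwrite the root's closing tag with new data, then re-close."""
    closing = b"</" + root_name + b">\n"
    end = stream.seek(0, io.SEEK_END)
    start = end - len(closing)
    if start < 0:
        raise ValueError("file is shorter than the closing tag")
    # The "read" step: check that the file really ends with the tag.
    stream.seek(start)
    if stream.read(len(closing)) != closing:
        raise ValueError("file does not end with the expected closing tag")
    # Write the new data over the old closing tag, then close again.
    stream.seek(start)
    stream.write(new_xml)
    stream.write(closing)
```

Skipping the read step saves one seek and one read, at the cost of
trusting that the last bytes of the file are what you expect.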

Also, there is a lot to be said for the usefulness of a library which  
can parse nearly all XML documents in reverse order.  As long as its  
limitations are well documented, it has uses.

> ________________________________
> \ Peter Hosey / prh at boredzo.org
> PGP public key ID: C6550423 (since 2007-01-01)

- Graham