[Adium-devl] O'Reilly XML blog article: Parsing XML… backwards?
Graham Booker
adium at cod3r.com
Wed Mar 14 13:31:08 UTC 2007
On Mar 14, 2007, at 1:48 AM, Peter Hosey wrote:
> Found this in my referer logs:
>
> http://www.oreillynet.com/xml/blog/2007/03/parsing_xml_backwards.html
>
> It's an article about LMX and the various ways it's a Bad Idea.
> Some are better than others, but anyway, the article is definitely
> worth a read. Also, I have a comment in there.
>
> He makes a good suggestion:
>
>> You can write multiple well-formed XML documents to a single file,
>> following each one by a binary trailer that gives the size of the
>> last chunk of XML. Then it is trivial for code to jump backwards
>> through the file, grabbing a little document each time and passing
>> it to a real XML parser.
>
> This is an interesting idea. It would, essentially, be an archive
> of mini-XML-documents (which I suppose would be a bit like
> Colloquy's envelope element), which we could easily seek in reverse.
>
Putting binary at the end of an XML file? What is the advantage of
this? You have a constant number of characters at the end of the
file to read in order to know how many to read to get the last
element? I can't find any other.
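For reference, the trailer scheme from the article could look roughly like the sketch below. The 4-byte big-endian length and the helper names `append_document` and `iter_documents_reversed` are assumptions made for illustration; the article does not specify a trailer layout.

```python
import io
import struct
import xml.etree.ElementTree as ET

# Hypothetical trailer layout: one 4-byte big-endian length per document.
TRAILER = struct.Struct(">I")

def append_document(stream, xml_bytes):
    """Write one well-formed XML document followed by its length trailer."""
    stream.seek(0, io.SEEK_END)
    stream.write(xml_bytes)
    stream.write(TRAILER.pack(len(xml_bytes)))

def iter_documents_reversed(stream):
    """Yield parsed documents from last to first by hopping over trailers."""
    pos = stream.seek(0, io.SEEK_END)
    while pos > 0:
        # The trailer sits directly before the current position.
        stream.seek(pos - TRAILER.size)
        (size,) = TRAILER.unpack(stream.read(TRAILER.size))
        start = pos - TRAILER.size - size
        stream.seek(start)
        # Each chunk is a complete document, so a real parser can take it.
        yield ET.fromstring(stream.read(size))
        pos = start

log = io.BytesIO()
for text in ("first", "second", "third"):
    append_document(log, f"<message>{text}</message>".encode("utf-8"))

latest_first = [doc.text for doc in iter_documents_reversed(log)]
```

Note that the file as a whole is no longer something an XML parser will accept, which is the objection here.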
If you wanted to do something along these lines, you could simply put
a single self-closing element, with a known name, as the last child
of the root element, carrying a single attribute that is a zero-padded
hex number. Then the end of the file is just:
<lastelement size="000000a0" /></root>
Then it is still well-formed XML, you know the name of the root element
by parsing the beginning of the file, and you know exactly how much to
read at the end of the file, give or take a few whitespace characters.
Now, given that this modification should work, why not put the element
at the beginning of the root element instead? Then you can simply use a
normal XML parser to read the value, and since the size is zero-padded
hex, it can be modified to contain any 32-bit number without changing
any of the characters that follow. So, you would have:
<?xml....
<chat xmlns.....
<lastelement size="000000a0" />
....
Isn't this a far better idea? Since the only information is within
the attribute, any parser which doesn't know what to do with the
element simply throws it out. Even an HTML parser will ignore it.
Just don’t put binary in an XML document! (à la his final statement).
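A minimal sketch of that in-place update, assuming the element is named `lastelement` as above, that it sits within the first 4096 bytes, and that the root has no default namespace (the real `<chat xmlns...>` root would need a namespace-qualified lookup). The helper names `set_last_size` and `get_last_size` are made up for this example.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches the fixed-width hex value inside the (assumed) size element.
SIZE_ATTR = re.compile(rb'<lastelement size="([0-9a-f]{8})"')

def set_last_size(stream, size):
    """Overwrite the eight hex digits in place; no later byte moves."""
    if not 0 <= size <= 0xFFFFFFFF:
        raise ValueError("size must fit in 32 bits")
    stream.seek(0)
    head = stream.read(4096)  # assumption: the element is near the start
    match = SIZE_ATTR.search(head)
    if match is None:
        raise ValueError("no <lastelement> found near the start of the log")
    stream.seek(match.start(1))
    stream.write(b"%08x" % size)

def get_last_size(stream):
    """Read the value back with an ordinary XML parser."""
    stream.seek(0)
    root = ET.parse(stream).getroot()
    return int(root.find("lastelement").get("size"), 16)

log = io.BytesIO(
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<chat>\n"
    b'<lastelement size="000000a0" />\n'
    b"<message>hello</message>\n"
    b"</chat>\n"
)
length_before = len(log.getvalue())
size_before = get_last_size(log)
set_last_size(log, 0x1F4)
size_after = get_last_size(log)
```

Because the attribute is always eight characters wide, the file length is the same before and after the write, and the document stays parseable throughout.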
> The downside is that it wouldn't work well with most existing XML
> tools—we couldn't simply slurp a log file and pass it to
> NSXMLParser, WebKit, or anything else, without preprocessing it to
> remove those size markers. OTOH, it wouldn't be terribly hard to
> write such a preprocessor. XSLT could do the job.
>
I think my solution above is better, because then there is no need for
preprocessing. Plus, if you did need to eliminate the size elements,
writing an XSLT stylesheet to do the job is much easier. Besides, XSLT
processors expect the input to be well-formed XML.
> The other downside is that we already have ULF and LMX; this would
> be yet another log format, whose main reason for existence would be
> the fact that LMX won't work 100% of the time with XML from the
> sort of people who name their elements “hello--”.
>
Downside number 2: He suggests using this to know the size of the
last element or elements. We wish to use this to read the last N
messages, but what about the problem that N is *NOT* known at the
time the file is written? Say the user wants to see the last 1
message when writing the log file, then changes the pref to 10 after
the log is closed. Then when they open the chat to that user again,
they only see the last message. Not terribly detrimental, but still,
something we have to give up.
> I'm inclined to stay with ULF, but I wanted to bounce it off you guys.
I say stick with ULF. Encoding and DOCTYPE are trivial to solve by
simply reading the beginning of the document first. Comments can be
an issue, but one that is solvable if we really cared (although we
would lose constant-time parsing). Also, the case he mentioned is not
going to be an issue, since no element names end with "--". Processing
instructions should not exist in ULF, so that issue wouldn't affect
us. Documents which are not well-formed will have issues everywhere.
Lastly, we aren't using LMX to append, but this is still a
non-problem. Read the close of the root element, write your new data
over it, and then close it again. This is what AIXMLAppender is doing
(minus the read part).
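The append step described above might be sketched like this. It is not AIXMLAppender's code; the function name `append_to_log`, the `chat` root name, and the 16-byte allowance for trailing whitespace are assumptions for illustration, and unlike AIXMLAppender this version does include the read.

```python
import io
import xml.etree.ElementTree as ET

def append_to_log(stream, element_bytes, root_name=b"chat"):
    """Overwrite the root's closing tag with new data, then close it again."""
    closing = b"</" + root_name + b">"
    end = stream.seek(0, io.SEEK_END)
    # Read only the tail: the closing tag plus some trailing whitespace.
    tail_len = min(end, len(closing) + 16)
    stream.seek(end - tail_len)
    tail = stream.read(tail_len)
    offset = tail.rfind(closing)
    if offset == -1:
        raise ValueError("closing tag of the root element not found at end")
    # Write the new element where the closing tag started, then re-close.
    stream.seek(end - tail_len + offset)
    stream.write(element_bytes + b"\n" + closing + b"\n")
    stream.truncate()

log = io.BytesIO(b"<chat>\n<message>one</message>\n</chat>\n")
append_to_log(log, b"<message>two</message>")
append_to_log(log, b"<message>three</message>")

texts = [message.text for message in ET.fromstring(log.getvalue())]
```

The cost of each append depends only on the size of the new element and the closing tag, not on the size of the log.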
Also, there is a lot to be said for the usefulness of a library which
can parse nearly all XML documents in reverse order. As long as its
limitations are well documented, it has uses.
> ________________________________
> \ Peter Hosey / prh at boredzo.org
> PGP public key ID: C6550423 (since 2007-01-01)
- Graham