Tom Morris

16 November 2009

A pungent mix of programming, philosophy, pedanticism, procrastination, perplexity, peripheral political polemic, and platters of preposterousness.

When parsing HTML using regex is okay

There’s been a lot of fuss over on Stack Overflow, and consequently on Metafilter and on Jeff Atwood’s Twitter, about people parsing HTML with regular expressions, along with the advice to never do that and tales of how Cthulhu will eat your soul.

In general, “never parse HTML with regular expressions” is good advice.

But sometimes it isn’t. I’ll give an example of a case where regexes were the right tool. You may find that it’s applicable to you.

A while back, I had over 2 GB of HTML to parse - 77,000 files. Every file had exactly the same structure. I only wanted to extract two pieces of data from each file - the contents of the h1 element and the contents of a div with the class of ‘author’ or something similar.

I wrote some Ruby code to parse each page using Nokogiri or Hpricot or whatever was then the preferred HTML parsing library. But this was slow. It was taking about 4 or 5 seconds to parse each file. In general, that’s pretty fast, but when you’ve got 77,000 to do, that’s not so good. At about 4.5 seconds each, 77,000 files take roughly 96 hours. That means four days.

I rewrote the code in Java so that it would open each file with a BufferedReader, call readLine to go through the file line by line, use the String startsWith method to see if each line was one of the right ones, then use regexes to extract the stuff I was interested in. I compiled and ran this code: it went from four days to about ten minutes. Which is fine, because I made a goof-up in the code that I only discovered after running it - if I had only discovered that goof-up four days later, I would have been a lot more angry than I was discovering it after ten minutes.
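The original Java code isn't reproduced here, so what follows is only a rough sketch of the approach, not the code I actually ran. It assumes each h1 and author div sits on a single line that begins (after indentation) with its opening tag, and that the author div is written exactly as `<div class="author">`. The class name, method names and tab-separated output format are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeadingAuthorExtractor {
    // Assumed markup: each element is on one line, and that line starts with its opening tag.
    private static final String H1_PREFIX = "<h1";
    private static final String AUTHOR_PREFIX = "<div class=\"author\"";

    // Compile the patterns once; they are reused for every file.
    private static final Pattern H1 = Pattern.compile("<h1[^>]*>(.*?)</h1>");
    private static final Pattern AUTHOR =
            Pattern.compile("<div class=\"author\"[^>]*>(.*?)</div>");

    /** Returns {title, author}; an entry is null if no matching line was found. */
    static String[] extract(BufferedReader reader) {
        String title = null;
        String author = null;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                // The cheap startsWith check runs first; the regex only runs on candidate lines.
                if (title == null && trimmed.startsWith(H1_PREFIX)) {
                    title = firstGroup(H1, trimmed);
                } else if (author == null && trimmed.startsWith(AUTHOR_PREFIX)) {
                    author = firstGroup(AUTHOR, trimmed);
                }
                if (title != null && author != null) {
                    break; // both fields found, so skip the rest of the file
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String[] { title, author };
    }

    private static String firstGroup(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Usage: pass file names as arguments; prints one tab-separated row per file.
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(name))) {
                String[] fields = extract(reader);
                System.out.println(name + "\t" + fields[0] + "\t" + fields[1]);
            }
        }
    }
}
```

Nothing here builds a tree or allocates an object per element, which is where the speed comes from. It's also why this only works when every file has the same known layout: change the attribute order or split a tag across two lines and it silently returns null.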

I’ve told this story to people, and there seem to be two possible reactions. There is the “OMG Ruby is so slow, I knew that not learning it and sticking with Java was sensible” reaction, and there’s the sensible reaction - I could have rewritten it in Ruby and gotten the same performance benefits by using IO rather than the XML/HTML parsing library - it just happens that I know the Java IO library better than I know the Ruby IO library. Part of what was probably taking the time in Ruby was the fact that I was constructing a large number of objects extremely quickly, and Ruby’s GC is notoriously painful in a non-generational way compared to the JVM’s generational GC.

The key thing is whether or not you are working with files that are all structured in a broadly similar way. If you’ve got 77,000 files that are all very similar and you know exactly what you want from them, sometimes for performance, parsing them as a bag of lines and strings is much more sensible than parsing them into a DOM. These very limited circumstances really provide the exception that proves the rule. If you don’t have a very good reason to be parsing XML or HTML using regexes rather than using an XML or HTML parsing library, you shouldn’t be doing so. (The same is true with RDF: use the right level of abstraction - unless you are logged into the swig IRC room all day every day and know the RDF specs like the back of your hand, you should be using an RDF library not an XML library to parse RDF documents.)

Faithful swell numbers and power by counting pension collectors and stamp buyers as religious

From the Torygraph: “The villagers of Kinoulton in Nottinghamshire have breathed new life into their church by introducing into it a cafe and post office.”

Great. This means that if you want to post a letter, collect your pension or benefits, buy some stamps or renew your road tax, you are now counted as a church-goer, swelling the influence of the church. Imagine the outrage if you had to go into a mosque or synagogue to do these things. Spending time in other people’s religious buildings erodes my epistemic credibility!

It’s all part of the dizzyingly anti-secular crapness of British society. See also “Faith groups to be key policy advisers”: “Mr Denham argued that Christians and Muslims can contribute significant insights on key issues, such as the economy, parenting and tackling climate change.”

Denham always struck me as being a bit soft in the ‘ed, even before he took the inherently soft-headed role of “communities minister”. I mean, fans of funk music, Java programmers, philosophy graduate students, Twitter users, people with beards and residents of Sussex can probably also contribute significant insights on the economy, parenting and climate change, but we don’t give them a special committee in government (much to my disappointment, as I’d be happy to serve on any of them). Christians and Muslims (and every person of every faith or none) already have a role in government and decision making: they can vote, they can lobby, they can assemble freely with their fellow citizens, they can organise themselves into pressure groups and political parties for the purpose of lobbying.


If Apple implements this on the Macintosh or iPod without opt-out, I will sell my Macintosh as soon as is practical, and sell my iPod. Really, there isn’t much that’s keeping me on OS X anymore. TextMate is nice, but I can get by with Vim (I use MacVim more than I use TextMate, but also use GVim and Vim on the console). So long as the machine I’m using has a browser, git, the JVM and the run-times of the programming languages I use, I can be productive. The one thing I have yet to find is a decent alternative to the iTunes and iPod workflow. This is something I really wish someone would figure out - produce something compelling like the iPod synchronisation model rather than something really weak like OpenOffice and you, dear Linux community, shall have my laptop and desktop OS. (I know, yes, I should go and write some code for Banshee or amaroK or whatever. Patch or GTFO as they say.)
