Character encoding hell
I’ve found something new to loathe - PHP character encoding issues. I use the DOM functions in PHP to manipulate the XML that stores the data on this blog, only PHP seems incapable of doing anything even vaguely sane with the data. utf8_encode was the first thing I tried, and it doesn’t really help. The problem is that I occasionally copy quotes from websites that contain things like em dashes and en dashes. When I post them to the server, I get all sorts of problems with character encoding - usually ending in the XML library not outputting anything and deleting whatever I’d posted that day. I would then have to load up the RSS feed, convert the data back out into plain text with HTML and repost it. What a performance.
I read messy78’s post on character encoding, but since I’m using the DOM rather than the XML parser, that didn’t seem relevant - and I looked at Character Encoding Issues. After a bit of testing, it looks like the iconv library solves the problem. Currently, I run the following:
htmlentities(iconv("", "ISO-8859-1//TRANSLIT", $string))
I hope this will solve the problem. But PHP hasn’t exactly made it easy. It has only shown me why all software and all formats need to support full Unicode, now. Repeat after me: Unicode now. Of course, everybody’s favourite RDF format, Notation 3, requires Unicode.
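For anyone hitting the same thing, another route is to keep the text in UTF-8 all the way through instead of squeezing it down to ISO-8859-1. What follows is only a rough sketch under some loud assumptions: the pasted quotes arrive either as valid UTF-8 or as Windows-1252 (where em dashes and en dashes live in the 0x80-0x9F range, which would also explain why utf8_encode, which assumes ISO-8859-1, doesn't help), the iconv build knows the CP1252 name, and PHP is version 7 or later. The to_utf8 helper and the post element name are made up for illustration and are not the code this blog runs.

```php
<?php
// Hypothetical helper, not part of PHP or of this blog's code:
// return $text as UTF-8, converting from $fallback if it is not already valid UTF-8.
function to_utf8(string $text, string $fallback = 'CP1252'): string
{
    // preg_match() with the /u modifier fails on invalid UTF-8,
    // so an empty pattern works as a cheap validity check.
    if (preg_match('//u', $text) === 1) {
        return $text;
    }

    // Assumption: anything that is not valid UTF-8 was pasted as Windows-1252.
    $converted = iconv($fallback, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Could not convert text from $fallback");
    }
    return $converted;
}

// "\x97" is the Windows-1252 byte for an em dash.
$quote = "Unicode \x97 now";

// Declare the document as UTF-8 so the DOM serialises it in that encoding.
$doc = new DOMDocument('1.0', 'UTF-8');
$post = $doc->createElement('post'); // 'post' is a made-up element name
// createTextNode() escapes markup characters when the document is saved.
$post->appendChild($doc->createTextNode(to_utf8($quote)));
$doc->appendChild($post);

echo $doc->saveXML();
```

Because the text node is escaped by the DOM when the document is saved, this version should not need htmlentities, and nothing gets transliterated away.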