Wednesday, January 22, 2020

PHP, XML, and character encodings

The library where I work subscribes to an online service that keeps track of the library's ongoing public events. We link to it from our website so that patrons can discover the various programs the library has to offer.

I wrote a WordPress plugin that posts a listing of these events on the library's events page using the RSS feed the service provides. Today I noticed that a few of the event titles had empty square boxes in them. This usually points to a character encoding problem: the software displaying the text does not recognize a particular letter or symbol and shows a placeholder box instead.


To speed up page loads, my plugin caches the RSS file it downloads so that it does not have to query the event service every time someone loads the events page on the website. I opened the cached RSS file to see what might be causing the problem. The event in question had curly quotes (also called smart quotes) in its title. In my experience, curly quotes frequently cause problems when people use them on websites, so I wasn't surprised to see this.
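The caching idea can be sketched roughly as follows. This is not the plugin's actual code: the function name cached_feed, the cache file path, and the $fetch callback are all hypothetical, and a real WordPress plugin would more likely use the transients API and wp_remote_get for this.

```php
<?php
/**
 * Return the feed from a cache file if it is fresh enough,
 * otherwise fetch it again and refresh the cache.
 * (Hypothetical helper, for illustration only.)
 */
function cached_feed(string $cacheFile, int $maxAgeSeconds, callable $fetch): string
{
    // Make sure we see the file's current state, not a cached stat result.
    clearstatcache();

    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAgeSeconds) {
        return file_get_contents($cacheFile);
    }

    // Cache is missing or stale: ask the event service (via the callback).
    $xml = $fetch();
    file_put_contents($cacheFile, $xml, LOCK_EX);

    return $xml;
}
```

The point where the fresh XML is written to the cache is also a convenient place to clean it up, since every later page load reads the cached copy.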


(table from computerhope.org)

You can't type a curly quote on the keyboard. At least, not directly. In my experience, they usually appear when someone copies and pastes something from Microsoft Word because Word will automatically insert them into your document as you're typing to make your document look more aesthetically pleasing.

It turned out that the problem was with the RSS data's character encoding. An RSS file is just XML, and an XML file normally begins with an XML declaration that can include an "encoding" attribute:

<?xml version="1.0" encoding="UTF-8" ?>

This tells the program that parses the XML what kind of character set the XML data uses so that all of its content will remain intact after being parsed. If this attribute does not reflect the character set that was actually used to create the XML file, the data may not be parsed correctly, and you may end up with "empty boxes" like the ones I was getting.
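Here is a small sketch of what a mismatch looks like at the byte level. It assumes PHP's mbstring extension is available, and ISO-8859-1 is used purely as an example of a wrong label, not necessarily the one my feed had.

```php
<?php
// Requires the mbstring extension.

// The UTF-8 bytes for a left curly quote (U+201C) are E2 80 9C.
$curly = "\xE2\x80\x9C";

// Simulate a parser that trusts a wrong label (ISO-8859-1, chosen only
// for illustration) and treats each byte as its own character.
$misread = mb_convert_encoding($curly, 'UTF-8', 'ISO-8859-1');

echo mb_strlen($curly, 'UTF-8'), "\n";   // 1 character when read as UTF-8
echo mb_strlen($misread, 'UTF-8'), "\n"; // 3 characters when misread
echo bin2hex($misread), "\n";            // c3a2c280c29c
```

Two of the three misread characters are invisible control characters, which is one way a single curly quote can turn into junk or empty boxes on the page.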

When I changed this encoding attribute to "UTF-8" (a widely used character encoding that supports many different languages), the empty boxes went away, and the curly quotes correctly appeared.


To prevent this from happening in the future, I modified my WordPress plugin to change the RSS feed's declared character encoding right before the feed is saved to the cache. A simple str_replace call on the declaration seemed to do the trick. I thought I might have to use the mb_convert_encoding function to do a thorough conversion of the entire file, but this did not appear to be necessary, presumably because the feed's text was already UTF-8 and only the declaration was wrong.
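A slightly more defensive version of that fix might look like the sketch below. The function name force_utf8_declaration is made up for this post, and the code makes two assumptions: the mbstring extension is loaded, and whatever encoding the feed declares is a name mbstring recognizes. It relabels the declaration and only falls back to mb_convert_encoding when the bytes are not already valid UTF-8.

```php
<?php
/**
 * Make sure an XML string is UTF-8 and says so in its declaration.
 * (Hypothetical helper; a sketch, not the plugin's actual code.)
 */
function force_utf8_declaration(string $xml): string
{
    // Matches the start of the XML declaration (after an optional UTF-8 BOM)
    // up to and including the quoted encoding value.
    $pattern = '/^((?:\xEF\xBB\xBF)?<\?xml[^>]*?\bencoding\s*=\s*)(["\'])([^"\']*)\2/i';

    if (!preg_match($pattern, $xml, $m)) {
        return $xml; // No encoding declared; leave the feed untouched.
    }

    // If the bytes are not valid UTF-8, convert them from the declared encoding.
    if (!mb_check_encoding($xml, 'UTF-8')) {
        $xml = mb_convert_encoding($xml, 'UTF-8', $m[3]);
    }

    // Rewrite only the declared encoding, keeping the original quote style.
    return preg_replace($pattern, '${1}${2}UTF-8${2}', $xml, 1);
}

// A feed whose text is really UTF-8 but is labeled as something else.
$feed = '<?xml version="1.0" encoding="ISO-8859-1" ?>' . "<title>\xE2\x80\x9CHi\xE2\x80\x9D</title>";
echo force_utf8_declaration($feed), "\n";
```

In the mislabeled case this does the same thing as the str_replace call: the text is left alone and only the label changes.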
