Wednesday, January 22, 2020

PHP, XML, and character encodings

The library where I work subscribes to an online service that keeps track of the library's ongoing public events. We link to it from our website so that patrons can discover the various programs the library has to offer.

I wrote a WordPress plugin that posts a listing of these events on the library's events page using the RSS feed the service provides. I just noticed today that a few of the event titles had empty, square boxes in them. When you see this, it usually means there is a character encoding problem, which means that it does not recognize a particular letter or symbol.


To make page loads more performant, my plugin caches the RSS file it downloads so that it does not have to query the event service every time someone loads the events page on the website. I opened the cached RSS file to see what might be causing the problem. The event in question had curly quotes (also called smart quotes) in its title. My experience has been that curly quotes frequently cause problems when people try to use them on websites, so I wasn't surprised to see this.


(table from computerhope.org)

You can't type a curly quote on the keyboard. At least, not directly. In my experience, they usually appear when someone copies and pastes something from Microsoft Word because Word will automatically insert them into your document as you're typing to make your document look more aesthetically pleasing.

It turned out that the problem was with the RSS data's character encoding. An RSS file is just XML, and every XML file has an "encoding" attribute at the top.

<?xml version="1.0" encoding="UTF-8" ?>

This tells the program that parses the XML what kind of character set the XML data uses so that all of its content will remain intact after being parsed. If this attribute does not reflect the character set that was actually used to create the XML file, the data may not be parsed correctly, and you may end up with "empty boxes" like the ones I was getting.

When I changed this encoding attribute to "UTF-8" (a widely used character encoding that supports many different languages), the empty boxes went away, and the curly quotes correctly appeared.


To prevent this from happening in the future, I modified my WordPress plugin to change the RSS feed's character encoding right before it is saved to the cache. Doing a simple str_replace function call seemed to do the trick. I thought I might have to use the mb_convert_encoding function to do a thorough conversion of the entire file, but this did not appear to be necessary.

Monday, January 20, 2020

Date modified timestamp problems on a Linux SMB share

I discovered an annoying issue recently related to our switch from a Windows-based file server to a Linux-based one.

The computer lab in the library where I work uses an Access database to record registration information of students who sign up for the adult computer classes that the lab runs. Several years ago, I wrote a Java Swing application that sits on top of this database to make the registration process faster and easier. I've gotten many complements about it, and it's been working well over the years with few hiccups.

The database file is hosted on a dedicated server that I administer, which the lab uses for file storage. This means that multiple people could be accessing the database on different computers at the same time. The app must consider the possibility of another user on another machine updating the database while the app is open. To do this, I designed the app to monitor the database file for changes. When a change occurs, it reloads the information from the database so that the user always sees the most recent information on the screen.

About a year ago, I started researching a replacement for our file server. It was over 5 years old and out of warranty, and its version of Windows (Windows Server 2008) was fast approaching end-of-life. Because we mainly used it for file storage, I recommended we get a Linux-based NAS device because they are cheaper than a full-blown Windows server.

The two major competitors in the NAS space are QNAP and Synology. I decided to go with QNAP because they offered a model that came with more RAM and an HDMI port (most NAS devices do not come with any video ports). My reasoning was that the HDMI port would allow me to administer the device just like a normal server if the network ever went down.

The library purchased it a few months ago, and I deployed it last week. It has been working well so far, but one hiccup I've encountered is that, ever since I deployed it, the class registration app I wrote has been slow to detect updates to the database file. Like, really slow. About 10 seconds slow.

I ran some tests and found that when my app updated the database, the "date modified" timestamp of the database file was slow to update. This was why my app was taking so long to detect any changes to it. But when I wrote a test program that just saved some content to a text file on the file server, the text file's timestamp would update immediately. Why, then, was the Access file behaving differently?

I thought that maybe the open-source library my app uses to interface with the Access database, called "Jackcess", was doing something different under the hood. I started a thread on their support forum and learned that they use Java's "RandomAccessFile" class to write to the database. This class allows you to update small sections of a file individually instead of re-writing an entire file from scratch, which is how most computer programs deal with updating files.

So, it looks like the culprit here is the new Linux-based NAS server. It seems to handle these kind of file updates differently than the old Windows server did. The simplest workaround I can think of would be to create some kind of dummy file and monitor that for changes instead of the database. The app would then update the dummy file whenever it makes a change to the database, which would cause the dummy file's "date modified" timestamp to update, signaling to other computers on the network that the database was just changed. That should hopefully get things back to normal!