Sunday, April 8, 2012

Extending the DateFormat class

I'm writing an SMTP server, and one of the things you have to do when writing one is understand how dates in an email message are formatted. These rules are defined in RFC-5322, a document which specifies the format of Internet email messages. RFC ("Request for Comments") documents are published by a standards organization known as the Internet Engineering Task Force (IETF). RFCs form what is essentially the "Bible" of the Internet: they lay down the rules for how many fundamental Internet technologies work, including email, TCP, HTTP, and FTP.

The rules pertaining to dates are defined in two sections of RFC-5322. Section 3.3 (page 14) contains the current specification. This is what should be used when creating and sending new emails. Section 4.3 (page 33), on the other hand, describes the old syntax, which is now obsolete. New messages must not use it, but an SMTP server must still be able to interpret it in order to remain backwards compatible with older software.

To parse these dates in Java, I thought at first that I could just use a single SimpleDateFormat object. But the rules are too complex for that: parts of the date string are optional, and a SimpleDateFormat pattern cannot express an optional field. So I wrote my own implementation of the DateFormat class. The advantage of extending DateFormat is that my code plugs nicely into the Java Date API, so I can call the parse() and format() methods just like I would with SimpleDateFormat.
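
To show why one pattern is not enough, here is a minimal sketch. The class name SinglePatternDemo and the helper canParse() are made up for this illustration and are not part of EmailDateFormat. It tries the full pattern against three date strings, two of which leave out an optional part.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

class SinglePatternDemo {
  /** Hypothetical helper: returns true if the pattern can parse the text. */
  static boolean canParse(String pattern, String text) {
    //Locale.US is an assumption made here so that "Sun" and "Apr" are recognized on any machine
    SimpleDateFormat df = new SimpleDateFormat(pattern, Locale.US);
    try {
      df.parse(text);
      return true;
    } catch (ParseException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String pattern = "EEE, d MMM yyyy HH:mm:ss Z";
    System.out.println(canParse(pattern, "Sun, 8 Apr 2012 10:25:01 -0400")); //true
    System.out.println(canParse(pattern, "Sun, 8 Apr 2012 10:25 -0400")); //false (no seconds)
    System.out.println(canParse(pattern, "8 Apr 2012 10:25:01 -0400")); //false (no day of the week)
  }
}
```

The second and third strings are valid according to the RFC, but the single pattern rejects both of them.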

import java.text.*;
import java.util.*;
import java.util.regex.*;

public class EmailDateFormat extends DateFormat {
  /**
   * The preferred format.
   * Locale.US is used because the day and month names in email dates are always English.
   */
  private final DateFormat longForm = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z", Locale.US);

  /**
   * Day of the week is optional.
   * @see RFC-5322 p.50
   */
  private final DateFormat withoutDotw = new SimpleDateFormat("d MMM yyyy HH:mm:ss Z", Locale.US);

  /**
   * Seconds and day of the week are optional.
   * @see RFC-5322 p.49,50
   */
  private final DateFormat withoutDotwSeconds = new SimpleDateFormat("d MMM yyyy HH:mm Z", Locale.US);

  /**
   * Seconds are optional.
   * @see RFC-5322 p.49
   */
  private final DateFormat withoutSeconds = new SimpleDateFormat("EEE, d MMM yyyy HH:mm Z", Locale.US);

  /**
   * Determines if a date string has the day of the week.
   */
  private final Pattern dotwRegex = Pattern.compile("^[a-z]+,", Pattern.CASE_INSENSITIVE);

  /**
   * Determines if a date string has seconds.
   */
  private final Pattern secondsRegex = Pattern.compile("\\d{1,2}:\\d{2}:\\d{2}");

  /**
   * Used for fixing obsolete two-digit years.
   * @see RFC-5322, p.50
   */
  private final Pattern twoDigitYearRegex = Pattern.compile("(\\d{1,2} [a-z]{3}) (\\d{2}) ", Pattern.CASE_INSENSITIVE);

  @Override
  public StringBuffer format(Date date, StringBuffer toAppendTo, FieldPosition fieldPosition) {
    return longForm.format(date, toAppendTo, fieldPosition);
  }

  @Override
  public Date parse(String source, ParsePosition pos) {
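    //fix two-digit year
    //see RFC-5322, section 4.3: years 00-49 mean 20xx, years 50-99 mean 19xx
    Matcher m = twoDigitYearRegex.matcher(source);
    if (m.find()) {
      String century = (Integer.parseInt(m.group(2)) < 50) ? "20" : "19";
      source = m.replaceFirst("$1 " + century + "$2 ");
    }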

    //remove extra whitespace
    //see RFC-5322, p.51
    source = source.replaceAll("\\s{2,}", " "); //collapse runs of multiple whitespace chars into a single space
    source = source.replaceAll(" ,", ","); //remove any spaces before the comma that comes after the day of the week
    source = source.replaceAll("\\s*:\\s*", ":"); //remove whitespace around the colons in the time

    //is the day of the week included?
    m = dotwRegex.matcher(source);
    boolean dotw = m.find();

    //are seconds included?
    m = secondsRegex.matcher(source);
    boolean seconds = m.find();

    if (dotw && seconds) {
      return longForm.parse(source, pos);
    } else if (dotw) {
      return withoutSeconds.parse(source, pos);
    } else if (seconds) {
      return withoutDotw.parse(source, pos);
    } else {
      return withoutDotwSeconds.parse(source, pos);
    }
  }
}

The parse() method of my EmailDateFormat class is designed to handle both the current syntax and the obsolete syntax of date strings. It does two things. First, it sanitizes the date string, removing unnecessary whitespace and converting two-digit years (which are now obsolete) to four-digit years. Second, it determines which of the valid formats the date string adheres to and then parses it using the matching SimpleDateFormat object. Four separate SimpleDateFormat objects are needed because the "day of the week" and "seconds" parts of the date string are each optional, which gives four combinations, and there's no way to define specific date fields as "optional" in the SimpleDateFormat class.
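
The whitespace part of the sanitizing step can be tried out on its own. This is only a sketch: the class name WhitespaceDemo and the method collapseWhitespace() are hypothetical, and it leaves out the two-digit year handling on purpose. It applies the same three replacements that parse() uses.

```java
class WhitespaceDemo {
  /** Hypothetical helper that mirrors the whitespace cleanup done in parse(). */
  static String collapseWhitespace(String source) {
    source = source.replaceAll("\\s{2,}", " "); //collapse runs of whitespace into one space
    source = source.replaceAll(" ,", ","); //remove the space before the comma
    source = source.replaceAll("\\s*:\\s*", ":"); //remove whitespace around the colons
    return source;
  }

  public static void main(String[] args) {
    String messy = "Sun , 8   Apr 2012   10 :   25  -0400";
    System.out.println(collapseWhitespace(messy)); //prints "Sun, 8 Apr 2012 10:25 -0400"
  }
}
```

After the cleanup, the string is in a shape that one of the four SimpleDateFormat patterns can parse.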

The format() method of the EmailDateFormat class is designed so that it will always create a date string that adheres to the most up-to-date standards.
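
Here is a rough, standalone sketch of what that output looks like. The class name LongFormDemo and the method formatLongForm() are made up for this example, and passing the time zone in explicitly is an assumption of the sketch (the real class relies on the JVM's default time zone).

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class LongFormDemo {
  /** Hypothetical helper: formats a date using the preferred (long) form. */
  static String formatLongForm(Date date, TimeZone zone) {
    SimpleDateFormat df = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z", Locale.US);
    df.setTimeZone(zone); //controls which offset appears at the end of the string
    return df.format(date);
  }

  public static void main(String[] args) {
    //14:25:01 GMT on April 8, 2012
    Calendar c = Calendar.getInstance(TimeZone.getTimeZone("GMT"));
    c.clear();
    c.set(2012, Calendar.APRIL, 8, 14, 25, 1);

    TimeZone eastern = TimeZone.getTimeZone("America/New_York");
    System.out.println(formatLongForm(c.getTime(), eastern)); //prints "Sun, 8 Apr 2012 10:25:01 -0400"
  }
}
```

The day of the week and the seconds are always written out, even though both are optional when parsing.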

Because this class is complex and only loosely coupled to the rest of the application, it really lends itself to unit testing. So I wrote a unit test that feeds it date strings in various formats and confirms that it parses them correctly. The unit test also makes sure that the format() method creates a date string that uses the most up-to-date syntax.

import static org.junit.Assert.*;
import java.util.*;
import org.junit.*;

public class EmailDateFormatTest {
  /**
   * Builds a Date from field values given in GMT.
   * For example, 10:25 -0400 is 14:25 GMT.
   */
  private static Date gmt(int year, int month, int day, int hour, int minute, int second) {
    Calendar c = Calendar.getInstance(TimeZone.getTimeZone("GMT"));
    c.clear(); //also zeroes out the milliseconds
    c.set(year, month, day, hour, minute, second);
    return c.getTime();
  }

  @Test
  public void parse() throws Exception {
    EmailDateFormat df = new EmailDateFormat();
    Date expected, actual;

    //+ day of the week
    //- seconds
    expected = gmt(2012, Calendar.APRIL, 8, 14, 25, 0);
    actual = df.parse("Sun, 8 Apr 2012 10:25 -0400");
    assertEquals(expected, actual);

    //+ day of the week
    //+ seconds
    expected = gmt(2012, Calendar.APRIL, 8, 14, 25, 1);
    actual = df.parse("Sun, 8 Apr 2012 10:25:01 -0400");
    assertEquals(expected, actual);

    //- day of the week
    //- seconds
    expected = gmt(2012, Calendar.APRIL, 8, 14, 25, 0);
    actual = df.parse("8 Apr 2012 10:25 -0400");
    assertEquals(expected, actual);

    //- day of the week
    //+ seconds
    expected = gmt(2012, Calendar.APRIL, 8, 14, 25, 1);
    actual = df.parse("8 Apr 2012 10:25:01 -0400");
    assertEquals(expected, actual);

    //single-digit date
    expected = gmt(2012, Calendar.APRIL, 8, 14, 25, 1);
    actual = df.parse("Sun, 8 Apr 2012 10:25:01 -0400");
    assertEquals(expected, actual);

    //two-digit date
    expected = gmt(2012, Calendar.APRIL, 10, 14, 25, 0);
    actual = df.parse("Tue, 10 Apr 2012 10:25 -0400");
    assertEquals(expected, actual);

    //obsolete timezone format (see RFC-5322, p.50)
    expected = gmt(2012, Calendar.APRIL, 8, 14, 25, 1);
    actual = df.parse("Sun, 8 Apr 2012 10:25:01 EDT");
    assertEquals(expected, actual);

    //obsolete year format (see RFC-5322, p.50)
    expected = gmt(1999, Calendar.APRIL, 8, 14, 25, 1);
    actual = df.parse("8 Apr 99 10:25:01 EDT");
    assertEquals(expected, actual);

    //with extra whitespace
    expected = gmt(2012, Calendar.APRIL, 8, 14, 25, 0);
    actual = df.parse("Sun , 8   Apr 2012   10 :   25  -0400");
    assertEquals(expected, actual);
  }

  @Test
  public void format() throws Exception {
    EmailDateFormat df = new EmailDateFormat();

    //the long format should always be used
    //note: the "-0400" in the expected strings assumes that the JVM's default time zone is US Eastern

    //single-digit date
    Date input = gmt(2012, Calendar.APRIL, 8, 14, 25, 1);
    String expected = "Sun, 8 Apr 2012 10:25:01 -0400";
    String actual = df.format(input);
    assertEquals(expected, actual);

    //two-digit date
    input = gmt(2012, Calendar.APRIL, 10, 14, 25, 1);
    expected = "Tue, 10 Apr 2012 10:25:01 -0400";
    actual = df.format(input);
    assertEquals(expected, actual);
  }
}

Anyway, I was just proud of this, so I thought I'd share.
