For the Intervals API, we’re wrestling with issues surrounding data input validation. This recently became interesting when the matter of date validation came up. Ordinarily, Intervals allows many different date formats, depending on the locale that the customer is using (for example, Intervals may expect the date format ‘mm/dd/yyyy’ for US customers and ‘dd.mm.yy’ for a customer in Austria).
For our API developers, we wanted to use a common, universal format, one that would be easily compatible with our application and database layers. For that we selected ISO 8601, which is great in terms of widespread use, but not so great in terms of how complicated its specifications are.
Generally, ISO 8601 looks something like ‘2009-05-20’ for dates and ‘2009-05-20 12:30:30’ for date/time combinations. These two examples encompass 98% of the user input we’re likely to encounter. But we wanted to make sure that if we told developers they could use ISO 8601 dates, our system would support it. Unfortunately, there’s not a lot of code out there for the validation of ISO 8601 dates (especially regular expressions), and most of the stuff that is out there doesn’t encompass the entirety of the ISO 8601 spec.
Starting off, here are some dates that the validator should match (all these are valid ISO 8601 dates to the best of my knowledge):
2009-12T12:34
2009
2009-05-19
2009-05-19
20090519
2009123
2009-05
2009-123
2009-222
2009-001
2009-W01-1
2009-W51-1
2009-W511
2009-W33
2009W511
2009-05-19
2009-05-19 00:00
2009-05-19 14
2009-05-19 14:31
2009-05-19 14:39:22
2009-05-19T14:39Z
2009-W21-2
2009-W21-2T01:22
2009-139
2009-05-19 14:39:22-06:00
2009-05-19 14:39:22+0600
2009-05-19 14:39:22-01
20090621T0545Z
2007-04-06T00:00
2007-04-05T24:00
2010-02-18T16:23:48.5
2010-02-18T16:23:48,444
2010-02-18T16:23:48,3-06:00
2010-02-18T16:23.4
2010-02-18T16:23,25
2010-02-18T16:23.33+0600
2010-02-18T16.23334444
2010-02-18T16,2283
2009-05-19 143922.500
2009-05-19 1439,55
And here are some of the strings that the validator should not match (i.e., should reject):
200905
2009367
2009-
2007-04-05T24:50
2009-000
2009-M511
2009M511
2009-05-19T14a39r
2009-05-19T14:3924
2009-0519
2009-05-1914:39
2009-05-19 14:
2009-05-19r14:39
2009-05-19 14a39a22
200912-01
2009-05-19 14:39:22+06a00
2009-05-19 146922.500
2010-02-18T16.5:23.35:48
2010-02-18T16:23.35:48
2010-02-18T16:23.35:48.45
2009-05-19 14.5.44
2010-02-18T16:23.33.600
2010-02-18T16,25:23:48,444
The code we came up with was the following:
^([\+-]?\d{4}(?!\d{2}\b))((-?)((0[1-9]|1[0-2])(\3([12]\d|0[1-9]|3[01]))?|W([0-4]\d|5[0-2])(-?[1-7])?|(00[1-9]|0[1-9]\d|[12]\d{2}|3([0-5]\d|6[1-6])))([T\s]((([01]\d|2[0-3])((:?)[0-5]\d)?|24\:?00)([\.,]\d+(?!:))?)?(\17[0-5]\d([\.,]\d+)?)?([zZ]|([\+-])([01]\d|2[0-3]):?([0-5]\d)?)?)?)?$
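To try the pattern against the lists above, here is a minimal sketch in Python. Python's re module is an assumption here (the post doesn't say which regex engine sits behind the API), and is_iso_8601 is a hypothetical helper name. The pattern itself is copied unchanged, only split across lines for readability.

```python
import re

# The pattern from the post, copied verbatim; adjacent raw strings are
# concatenated by Python into one pattern.
ISO_8601 = re.compile(
    r"^([\+-]?\d{4}(?!\d{2}\b))((-?)((0[1-9]|1[0-2])(\3([12]\d|0[1-9]|3[01]))?"
    r"|W([0-4]\d|5[0-2])(-?[1-7])?|(00[1-9]|0[1-9]\d|[12]\d{2}|3([0-5]\d|6[1-6])))"
    r"([T\s]((([01]\d|2[0-3])((:?)[0-5]\d)?|24\:?00)([\.,]\d+(?!:))?)?"
    r"(\17[0-5]\d([\.,]\d+)?)?([zZ]|([\+-])([01]\d|2[0-3]):?([0-5]\d)?)?)?)?$"
)


def is_iso_8601(value):
    """Return True if value matches the pattern (hypothetical helper)."""
    return ISO_8601.match(value) is not None


# A few samples taken from the two lists in the post.
valid = ["2009-05-19", "2009-W21-2T01:22", "20090621T0545Z", "2010-02-18T16:23,25"]
invalid = ["200905", "2009-0519", "2007-04-05T24:50", "2009-05-19 14:"]

print(all(is_iso_8601(s) for s in valid))        # True
print(not any(is_iso_8601(s) for s in invalid))  # True
```

Note that the pattern only checks syntax, so a caller that needs real calendar dates would still have to validate the extracted values separately.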
I guess I should add the caveat that this code doesn’t support the time interval or duration parts of the ISO 8601 spec. It also only supports dates and date/times, not standalone times, since right now we don’t have to deal with time input (for the Intervals API, all time is input in decimal format, rather than ISO 8601). But it should support everything else. Please let me know if this works for you or doesn’t, or if you can fine-tune it.
Stumbled across your monster regexp for ISO 8601 validation a while ago, and found it useful in a project of mine. BTW, if you or others need to grab the individual parts of the date, here are the regexp matches you need to pay attention to:
1. Year
5. Month
7. Day
8. Week Number
9. Weekday
10. Ordinal date
15. Hours
16. Minutes (prefixed by “:”, use last two digits)
19. Seconds (prefixed by “:”, use last two digits)
21. Timezone, “Z” or offset
23. Hours Offset
24. Minutes Offset
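The group numbers above can be put to work roughly like this. This is a sketch in Python (an assumption, as are the names PARTS and split_iso_8601). One deviation from the list: group 19 can also carry a fractional part, such as ":48.5", so the sketch strips the optional colon and takes the first two digits rather than the last two.

```python
import re

# The pattern from the post, copied verbatim.
ISO_8601 = re.compile(
    r"^([\+-]?\d{4}(?!\d{2}\b))((-?)((0[1-9]|1[0-2])(\3([12]\d|0[1-9]|3[01]))?"
    r"|W([0-4]\d|5[0-2])(-?[1-7])?|(00[1-9]|0[1-9]\d|[12]\d{2}|3([0-5]\d|6[1-6])))"
    r"([T\s]((([01]\d|2[0-3])((:?)[0-5]\d)?|24\:?00)([\.,]\d+(?!:))?)?"
    r"(\17[0-5]\d([\.,]\d+)?)?([zZ]|([\+-])([01]\d|2[0-3]):?([0-5]\d)?)?)?)?$"
)

# Group numbers as listed in the comment above (names are hypothetical).
PARTS = {
    "year": 1, "month": 5, "day": 7, "week": 8, "weekday": 9, "ordinal": 10,
    "hours": 15, "minutes": 16, "seconds": 19, "timezone": 21,
    "offset_hours": 23, "offset_minutes": 24,
}


def split_iso_8601(value):
    """Return a dict of date/time parts, or None if value does not match."""
    match = ISO_8601.match(value)
    if match is None:
        return None
    parts = {name: match.group(number) for name, number in PARTS.items()}
    # Groups 16 and 19 may start with ":"; group 19 may end with a fraction.
    for name in ("minutes", "seconds"):
        if parts[name] is not None:
            parts[name] = parts[name].lstrip(":")[:2]
    # Group 9 may start with "-", as in "2009-W21-2".
    if parts["weekday"] is not None:
        parts["weekday"] = parts["weekday"].lstrip("-")
    # Note: for "24:00" the hour is matched outside group 15, so "hours" is None.
    return parts


parts = split_iso_8601("2009-05-19 14:39:22-06:00")
print(parts["year"], parts["month"], parts["day"])         # 2009 05 19
print(parts["hours"], parts["minutes"], parts["seconds"])  # 14 39 22
print(parts["timezone"])                                   # -06:00
```

Parts that are absent from the input come back as None, so callers can tell a calendar date from a week date or an ordinal date by checking which keys are set.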
The regexp incorrectly validates dates in the format YYYYMM, the first one in the list of dates to be rejected
(tested in GNU Octave 4.2.2, which uses PCRE to my knowledge).
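For comparison, the part of the pattern meant to reject YYYYMM is the negative lookahead right after the year. Here is a small sketch that isolates it in Python's re module (an assumption; the report above used Octave, so results there may differ depending on the engine or on how the pattern string is escaped). YEAR_PREFIX and year_prefix_matches are hypothetical names.

```python
import re

# Only the year portion of the full pattern, so the lookahead can be
# studied on its own: four digits that must not be followed by exactly
# two more digits and then a word boundary.
YEAR_PREFIX = re.compile(r"^([\+-]?\d{4}(?!\d{2}\b))")


def year_prefix_matches(value):
    """Return True if the year prefix (including its lookahead) matches."""
    return YEAR_PREFIX.match(value) is not None


print(year_prefix_matches("200905"))    # False: "05" is followed by the end of the string
print(year_prefix_matches("20090519"))  # True: "05" is followed by another digit
print(year_prefix_matches("200912-01")) # False: "12" is followed by "-", a word boundary
```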
Hi,
thanks for putting this regex together. May I ask you why it accepts a lower case z as timezone indicator?
kind regards,
Torsten.
I’m not sure, but I think the lower case z is there in case the time data was not formatted correctly, for example if someone accidentally used a lower case z instead of upper case. So it’s there to catch human error: a lower case z may not be to spec, but it can be assumed it was meant to be a timezone indicator.
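If stricter behaviour is wanted, the timezone part of the pattern can be tightened by replacing [zZ] with Z. The sketch below shows only that fragment, anchored on its own for illustration, in Python (an assumption; LENIENT_TZ and STRICT_TZ are hypothetical names).

```python
import re

# Timezone fragment as it appears in the full pattern (accepts "z" and "Z").
LENIENT_TZ = re.compile(r"^(?:[zZ]|([\+-])([01]\d|2[0-3]):?([0-5]\d)?)$")

# Same fragment with only the upper case designator allowed.
STRICT_TZ = re.compile(r"^(?:Z|([\+-])([01]\d|2[0-3]):?([0-5]\d)?)$")

for candidate in ["Z", "z", "-06:00", "+0600", "-01"]:
    print(candidate,
          LENIENT_TZ.match(candidate) is not None,
          STRICT_TZ.match(candidate) is not None)
```

Only the lower case z differs between the two: the lenient fragment accepts it and the strict one rejects it, while numeric offsets behave the same in both.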