On one of our sites we were running into a problem when we tried to pass HTML content from a database through an email obfuscation function to prevent spiders from scraping our clients’ email addresses. We quickly discovered that some of the longer pages were showing up completely blank. The preg_replace() call we were using to run the obfuscation code on email addresses was returning NULL. After some hunting I found the answer.
PHP 5.2 introduced, with little fanfare, a Perl Compatible Regular Expressions (PCRE) setting called pcre.backtrack_limit, which, for the first time, capped the number of backtracks a regular expression could perform before it stops and reports an error. Unfortunately, when PCRE hits this limit, PHP doesn’t raise a notice, warning, or error. All preg_replace() does is return NULL, something that the preg family of functions typically never does. There were a lot of entries on the PHP.net site reporting this behavior as a bug, and sites that are regex heavy (like wiki sites) scrambled to figure out WTF was going on.
The only way to actually determine that this type of PCRE error took place in your code is to call preg_last_error() right after you’ve run your regex and compare the result against PREG_BACKTRACK_LIMIT_ERROR. Before PHP 5.2, backtrack errors were handled much more gracefully (if they were even triggered), by returning the original string that was passed to the regex function.
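As a rough illustration of that check, here is a minimal sketch. The pattern and subject are borrowed from the PHP manual’s preg_last_error() example, not from our site’s code, and the limit is deliberately lowered (an assumption made only so the failure is easy to reproduce on any setup).

```php
<?php
// Lower the limit on purpose so the demo fails quickly and predictably.
ini_set('pcre.backtrack_limit', '1000');

// Pattern and subject from the PHP manual's preg_last_error() example;
// the nested quantifiers force a huge amount of backtracking.
$pattern = '/(?:\D+|<\d+>)*[!?]/';
$subject = 'foobar foobar foobar';

$result = preg_replace($pattern, '', $subject);

if ($result === null) {
    // preg_replace() gave up silently, so ask PCRE what went wrong.
    if (preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
        echo "Backtrack limit was exhausted\n";
    } else {
        echo 'Other PCRE error, code ' . preg_last_error() . "\n";
    }
}
```

Note the strict comparison with null: an empty string is a perfectly valid result of a replacement and should not be mistaken for a failure.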
The way around this backtrack limit, if you’re running regex on large pages (or really long strings), is to increase pcre.backtrack_limit in your php.ini settings. I increased ours from 100,000 to 1,000,000. Of course, you still run the risk of producing an error on really, really long strings, and that’s why a second step you should take is to add better error handling any place where you might run a PCRE function on a really long string. Should an error be produced, it’s up to you how to handle it, whether that be returning the original string, or breaking your string up into smaller pieces and running them separately.
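Both steps might look something like the sketch below. The helper name safe_preg_replace() is made up for this example, and falling back to the original string is just one of the options mentioned above. The limit is raised with ini_set() here; the equivalent php.ini line would be pcre.backtrack_limit = 1000000.

```php
<?php
// Step 1: raise the limit at runtime (same effect as editing php.ini).
ini_set('pcre.backtrack_limit', '1000000');

/**
 * Step 2 (hypothetical helper): run preg_replace() and, if PCRE reports
 * an error by returning NULL, log it and hand back the untouched input
 * instead of a blank page.
 */
function safe_preg_replace($pattern, $replacement, $subject)
{
    $result = preg_replace($pattern, $replacement, $subject);

    if ($result === null) {
        error_log('preg_replace() failed, PCRE error code ' . preg_last_error());
        return $subject;
    }

    return $result;
}

// Usage: a trivial replacement on a short string succeeds as usual.
$html = 'Contact jane@example.com for details.';
echo safe_preg_replace('/@/', ' [at] ', $html), "\n";
```

Returning the original string means an unobfuscated address could slip through on a failure, so whether that trade-off is acceptable depends on the page.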
Ultimately the best thing you can do (and should always do) is optimize your regex as much as possible, and for some people that just means knowing when to use regex and when a simple str_replace() will suffice.
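To make that last point concrete, here is a small sketch with a made-up sample string and a deliberately naive obfuscation (swapping "@" for its HTML entity), not the function we actually use. When the thing being replaced is a fixed string, str_replace() gives the same result without touching the regex engine at all, so there is no backtrack limit to hit.

```php
<?php
$html = 'Write to sales@example.com or support@example.com.';

// Literal substitution: no regex engine involved.
$viaStr = str_replace('@', '&#64;', $html);

// The regex version produces the same output here, but goes through PCRE.
$viaRegex = preg_replace('/@/', '&#64;', $html);

var_dump($viaStr === $viaRegex); // bool(true)
```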