
Insert space after semi-colon, unless it’s part of an HTML entity

I’m trying to insert a space after each semi-colon, unless the semi-colon is part of an HTML entity. The examples here are short, but my strings can be quite long, with several semi-colons (or none).

Coca&#8209;Cola => Coca&#8209;Cola  (&#8209; is a non-breaking hyphen)
Beverage;Food;Music => Beverage; Food; Music

I found the following regular expression that does the trick for short strings:

<?php
$a[] = 'Coca&#8209;Cola';
$a[] = 'Beverage;Food;Music';
$regexp = '/(?:&#?\w+;|[^;])+/';
foreach ($a as $str) {
    echo ltrim(preg_replace($regexp, ' $0', $str)).'<br>';
}
?>

However, if the string is somewhat large, the preg_replace above actually crashes my Apache server (The connection to the server was reset while the page was loading.) Add the following to the sample code above:

$a[] = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. '.
   'In blandit metus arcu. Fusce eu orci nulla, in interdum risus. '.
   'Maecenas ut velit turpis, eu pretium libero. Integer molestie '.
   'faucibus magna sagittis posuere. Morbi volutpat luctus turpis, '.
   'in pretium augue pellentesque quis. Cras tempor, sem suscipit '.
   'dapibus lacinia, dolor sapien ultrices est, eget laoreet nibh '.
   'ligula at massa. Cum sociis natoque penatibus et magnis dis '.
   'parturient montes, nascetur ridiculus mus. Phasellus nulla '.
   'dolor, placerat non sem. Proin tempor tempus erat, facilisis '.
   'euismod lectus pharetra vel. Etiam faucibus, lectus a '.
   'scelerisque dignissim, odio turpis commodo massa, vitae '.
   'tincidunt ante sapien non neque. Proin eleifend, lacus et '.
   'luctus pellentesque;odio felis.';

The code above (with the large string) crashes Apache but works if I run PHP on the command line.

Elsewhere in my program I use preg_replace on much larger strings without problem, so I’m guessing something in the regular expression overwhelms PHP/Apache.

So, is there a way to ‘fix’ the regex so it works on Apache with large strings or is there another, safer, way to do this?

I’m using PHP 5.2.17 with Apache 2.0.64 on Windows XP SP3, if it’s any help. (Unfortunately, upgrading either PHP or Apache is not an option for now.)


Answer

I would suggest this match expression:

\b(?<!&)(?<!&#)\w+;

…which matches a series of characters (letters, numbers, and underscore) which is not preceded by an ampersand (or an ampersand followed by a hash symbol) but which is followed by a semicolon.

It breaks down as follows:

\b          # assert that this is a word boundary
(?<!        # look behind and assert that you cannot match
 &          # an ampersand
)           # end lookbehind
(?<!        # look behind and assert that you cannot match
 &#         # an ampersand followed by a hash symbol
)           # end lookbehind
\w+         # match one or more word characters
;           # match a semicolon

Replace each match with the string '$0 ' (the whole match followed by a space).
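As a minimal sketch of how the pattern and replacement above could be wired together in PHP, the helper below applies them to the two sample strings from the question. The function name add_space_after_semicolons is hypothetical (it is not part of the original answer), and the hash is written as \# so the pattern also works if the x modifier is ever added. The code sticks to syntax available in PHP 5.2.

```php
<?php
// Hypothetical helper for illustration; the name is not from the original answer.
// Appends a space after every "word;" that is not the tail of an HTML entity.
function add_space_after_semicolons($str)
{
    // \b        word boundary
    // (?<!&)    not directly preceded by "&"   (named entity such as &amp;)
    // (?<!&\#)  not directly preceded by "&#"  (numeric entity such as &#8209;)
    // \w+;      one or more word characters followed by a semicolon
    $pattern = '/\b(?<!&)(?<!&\#)\w+;/';

    // '$0 ' re-emits the whole match and adds one space after it.
    return preg_replace($pattern, '$0 ', $str);
}

$samples = array(
    'Coca&#8209;Cola',
    'Beverage;Food;Music',
);
foreach ($samples as $sample) {
    echo add_space_after_semicolons($sample), "\n";
}
```

Because the pattern needs at least one word character directly before the semicolon, a semicolon that follows punctuation (for example ");") is left untouched by this sketch.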

let me know if this doesn’t work for you

Of course, you could also use [a-zA-Z0-9] instead of \w to avoid matching an underscore, but I don’t think that would ever give you any trouble.

Also, you might need to escape the hash symbol as well (because it starts a comment when the pattern is compiled with the x modifier), like so:

\b(?<!&)(?<!&\#)\w+;

EDIT Not sure, but I’m guessing that putting the word boundary at the beginning is going to make it a bit more efficient (and thus less likely to crash your server), so I changed that in the expressions and the break-down…

EDIT 2 … and a bit more info on why your expression might be making your server crash: catastrophic backtracking. I’m not sure it applies here, but it’s good info nonetheless.
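If you want to find out whether PCRE is hitting one of its limits rather than guess, one option is to check preg_last_error() after the call. The sketch below is only an illustration: safe_preg_replace is a hypothetical name, and the wrapper can only report failures that PCRE detects itself. It cannot catch a hard crash of the server process. The PHP manual notes that a high pcre.recursion_limit can exhaust the process stack, so lowering that setting (for example with ini_set) may turn a crash into a reportable error, but that is an assumption about your setup that I have not tested.

```php
<?php
// Hypothetical wrapper for illustration: runs preg_replace() and reports
// why it failed if PCRE gave up, instead of silently returning NULL.
function safe_preg_replace($pattern, $replacement, $subject)
{
    $result = preg_replace($pattern, $replacement, $subject);

    if ($result === null) {
        // preg_replace() returns NULL on error; preg_last_error() says why.
        switch (preg_last_error()) {
            case PREG_BACKTRACK_LIMIT_ERROR:
                $reason = 'backtrack limit exhausted';
                break;
            case PREG_RECURSION_LIMIT_ERROR:
                $reason = 'recursion limit exhausted';
                break;
            default:
                $reason = 'other PCRE error';
        }
        trigger_error('preg_replace failed: ' . $reason, E_USER_WARNING);

        // Fall back to the unmodified input so the page still renders.
        return $subject;
    }

    return $result;
}

echo safe_preg_replace('/\b(?<!&)(?<!&\#)\w+;/', '$0 ', 'Beverage;Food;Music'), "\n";
```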

FINAL EDIT if you are looking to only add a space after a semicolon if there is not already whitespace after it (i.e. add one in the case of pellentesque;odio but not in the case of pellentesque; odio), then add an additional lookahead at the end, which will prevent extra unnecessary spaces being added:

\b(?<!&)(?<!&\#)\w+;(?!\s)
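A short sketch of the lookahead variant in use, again with a hypothetical helper name (add_missing_space_after_semicolons) and PHP 5.2-compatible syntax. It shows that a semicolon already followed by whitespace is left alone, so running the replacement twice should not stack up spaces.

```php
<?php
// Hypothetical helper for illustration; the name is not from the original answer.
// Same pattern as before, plus (?!\s) so that a semicolon which is already
// followed by whitespace does not get a second space.
function add_missing_space_after_semicolons($str)
{
    $pattern = '/\b(?<!&)(?<!&\#)\w+;(?!\s)/';
    return preg_replace($pattern, '$0 ', $str);
}

// Both calls print the same line: "pellentesque; odio felis."
echo add_missing_space_after_semicolons('pellentesque;odio felis.'), "\n";
echo add_missing_space_after_semicolons('pellentesque; odio felis.'), "\n";
```

Note that (?!\s) also succeeds at the very end of the string, so a trailing semicolon still receives a space; wrapping the result in rtrim() is one way to drop it if that matters.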
User contributions licensed under: CC BY-SA