
RegEx (preg_match_all in PHP) to capture series of tags up to the first alphanumeric character

The problem here is the conflict between numbers and alphanumeric characters in the problem description.

Given the text:

<0><1><2><3><4><5><6><7><8><9><10><11><12><13><14><15><16><17><18>The next 11 keys can change the SWING from OFF (50%) to <19><20><21><22><23><24><25>80<26><27><28><29><30><31><32>% during arpeggiator or sequencer operation.<33><34>

I need to extract the following four groups:

<0><1><2><3><4><5><6><7><8><9><10><11><12><13><14><15><16><17><18>
<19><20><21><22><23><24><25>
<26><27><28><29><30><31><32>
<33><34>

Reason: we want to display this in a much more user-friendly way as…

[1]The next 11 keys can change the SWING from OFF (50%) to [2]80[3]% during arpeggiator or sequencer operation.[4]

Current code:

$pattern = '<[\d<>' . REGSTART . REGEND . REGSTARTSQ . REGENDSQ . '{}]+>';
$numberofsupertags = preg_match_all('/(' . $pattern . ')/', $source, $superchunks);
echo '<pre>';
print_r($superchunks);
echo '</pre><br>';

(REGSTART/REGEND/REGSTARTSQ/REGENDSQ refer to other possible pairs of symbols, like 【】 or 〖〗 etc.)

gives three groups:

<0><1><2><3><4><5><6><7><8><9><10><11><12><13><14><15><16><17><18>
<19><20><21><22><23><24><25>80<26><27><28><29><30><31><32>
<33><34>

As you can see, the RegEx fails to take into account sequences of only numbers between tags.

I’ve tried lots of things:

$pattern = '([<|' . REGSTART . REGSTARTSQ . '|{]\d+?[>|' . REGEND . REGENDSQ . '|}])+';
$pattern = '<[\d<>' . REGSTART . REGEND . REGSTARTSQ . REGENDSQ . '{}]+[>(?=\d)|>]';

…but to no avail.

What is the correct solution and where do I go wrong? This looks really simple, but apparently it isn’t.


Answer

You can use

(?:<(?:{\d+}|【\d+】|〖\d+〗|\d+)>)+

See the regex demo. Details:

  • (?: – start of a non-capturing group:
    • < – a < char
    • (?:{\d+}|【\d+】|〖\d+〗|\d+) – one of the alternatives: { + one or more digits + }, 【 + one or more digits + 】, 〖 + one or more digits + 〗, or one or more digits
    • > – a > char
  • )+ – end of the group, repeated one or more times.
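
Since the question asks for the four groups themselves, here is a minimal sketch of the same pattern used with preg_match_all. The variable names are only illustrative, and the file is assumed to be saved as UTF-8 so that the u flag can handle the 【】 and 〖〗 characters.

```php
<?php
// Sample text from the question.
$source = '<0><1><2><3><4><5><6><7><8><9><10><11><12><13><14><15><16><17><18>The next 11 keys can change the SWING from OFF (50%) to <19><20><21><22><23><24><25>80<26><27><28><29><30><31><32>% during arpeggiator or sequencer operation.<33><34>';

// The pattern from this answer, with ~ as delimiter and the u (UTF-8) flag.
$pattern = '~(?:<(?:{\d+}|【\d+】|〖\d+〗|\d+)>)+~u';

// $matches[0] holds every full match, i.e. every uninterrupted run of tags.
$count = preg_match_all($pattern, $source, $matches);

echo $count, " groups\n";
print_r($matches[0]);
```

The bare 80 is not wrapped in < and >, so it cannot be part of a run and splits <19>…<25> from <26>…<32>.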

See the PHP demo:

$source = '<0><1><2><3><4><5><6><7><8><9><10><11><12><13><14><15><16><17><18>The next 11 keys can change the SWING from OFF (50%) to <19><20><21><22><23><24><25>80<26><27><28><29><30><31><32>% during arpeggiator or sequencer operation.<33><34>';

$cnt = 0;
echo preg_replace_callback('~(?:<(?:{\d+}|【\d+】|〖\d+〗|\d+)>)+~u', function($m) use (&$cnt) {
    return '['. ++$cnt .']';
}, $source);
// => [1]The next 11 keys can change the SWING from OFF (50%) to [2]80[3]% during arpeggiator or sequencer operation.[4]
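
As for where the original pattern goes wrong: its character class lists \d next to < and >, so once a match has started, plain digits such as 80 are consumed just like the tag characters. The comparison below is a rough sketch on a shortened sample string (not the full text from the question), and it leaves out the REGSTART/REGEND constants, whose values are not shown in the question.

```php
<?php
// Shortened sample in the same shape as the question's text (assumption).
$sample = '<1><2>80<3><4>% text<5>';

// The question's pattern with the extra bracket constants left out (assumption).
$old = '~<[\d<>{}]+>~';
// The pattern from this answer.
$new = '~(?:<(?:{\d+}|【\d+】|〖\d+〗|\d+)>)+~u';

preg_match_all($old, $sample, $oldMatches);
preg_match_all($new, $sample, $newMatches);

print_r($oldMatches[0]); // the 80 is swallowed into the first match
print_r($newMatches[0]); // the 80 separates two matches
```

With the old pattern the first match is <1><2>80<3><4>. With the new one, every repetition has to be a complete tag, so the run stops at <2> and a new one starts at <3>.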
User contributions licensed under: CC BY-SA