How would I match all “quote blocks” in plaintext e-mail in PHP PCRE?

I’m trying to match all the quotes in the following example e-mail message:

> Don't forget to buy eggiweggs on the way home.

I shall not.

> Also remember to brush your shoes.

Will do.

> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.

I see what you mean. They sure make a mess.

That means I want to match these three strings:

> Don't forget to buy eggiweggs on the way home.

And:

> Also remember to brush your shoes.

And:

> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.

I don’t understand how I can do this, since if I use the s flag to span multiple lines, which is required for this, I cannot refer to ^ and $ to mean “beginning of line” and “end of line” — instead, they mean “beginning of string” and “end of string”.

So if I do: #^(> .+?)$#us, it will match everything after/with the first quote.

And if I do: #^(> .+?)$#um, it will match only the first quote’s first line and nothing else.

This is frustrating. I really have no idea how to solve it. I’ve searched online before asking and found zero even remotely relevant pages as usual.

Answer

With preg_match_all:

preg_match_all('~^> .*(?:\R> .*)*~m', $txt, $matches);
$result = $matches[0];

(where \R is an alias for several newline sequences)
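
As a rough check, here is a minimal sketch that runs the pattern above against the message from the question. The variable names and the assumption that lines are joined with "\n" are mine, not part of the original answer.

```php
<?php
// The sample message from the question, joined with "\n" (an assumption
// about the line endings; real mail may use "\r\n").
$txt = implode("\n", [
    "> Don't forget to buy eggiweggs on the way home.",
    "",
    "I shall not.",
    "",
    "> Also remember to brush your shoes.",
    "",
    "Will do.",
    "",
    "> > > And clean up after the pigs.",
    "> > But I have no pigs.",
    "> Yes, you do. Your kids.",
    "",
    "I see what you mean. They sure make a mess.",
]);

// With the m flag, ^ matches at the start of every line. Each match is a
// line starting with "> " plus any directly following lines that also
// start with "> ".
preg_match_all('~^> .*(?:\R> .*)*~m', $txt, $matches);
$blocks = $matches[0];

foreach ($blocks as $i => $block) {
    echo "Block ", $i + 1, ":\n", $block, "\n\n";
}
```

The s flag is not needed here: the pattern crosses line boundaries explicitly through \R, while . keeps stopping at the end of each line.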


With preg_split:

$result = preg_split('~^(?!> ).*\R?~m', $txt, -1, PREG_SPLIT_NO_EMPTY);

This splits the string on each line that doesn't start with "> ". To trim the newline at the end of each block, you can start the pattern with an optional \R?, giving ~\R?^(?!> ).*\R?~m, or use ~(?:\R?^(?!> ).*)+\R?~m to consume several consecutive non-quote lines at a time.
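
To make the difference between the two split patterns visible, here is a small sketch on a shortened, hypothetical message (the sample text and variable names are assumptions for illustration).

```php
<?php
// Shortened sample with "\n" line endings (assumed).
$txt = implode("\n", [
    "> first quote",
    "",
    "A reply.",
    "",
    "> > nested quote",
    "> second line of it",
    "",
    "Another reply.",
]);

// Basic pattern: removes every non-quote line. Each remaining block keeps
// the newline that ended its last quoted line.
$raw = preg_split('~^(?!> ).*\R?~m', $txt, -1, PREG_SPLIT_NO_EMPTY);

// Variant with a leading \R? and grouped lines: also consumes the newline
// before the first non-quote line, so the blocks come out trimmed.
$trimmed = preg_split('~(?:\R?^(?!> ).*)+\R?~m', $txt, -1, PREG_SPLIT_NO_EMPTY);

var_dump($raw, $trimmed);
```

If only the basic pattern is used, applying rtrim() to each element afterwards would be another way to drop the trailing newline.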


About \R:
\R is by default an alias for (?>\r\n|\n|\x0b|\f|\r|\x85) (any 8-bit, non-UTF-8 character sequence for a newline). In UTF-8 mode, with the u modifier or with (*UTF8)(*BSR_UNICODE) at the start of the pattern, two other characters outside the ASCII range are added to the list: the line separator (U+2028) and the paragraph separator (U+2029).
It's handy when you don't know which newline sequence is used in the string, but it is slower than writing the exact newline sequence if you know it. You can restrict \R to (?>\r\n|\n|\r) with the directive (*BSR_ANYCRLF) at the start of the pattern.
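
A short sketch of the effect of (*BSR_ANYCRLF), assuming a PHP build whose PCRE uses the default \R behaviour described above. The input string is a made-up example mixing several newline sequences.

```php
<?php
// Hypothetical input: \r\n, \n, \r and a vertical tab (\x0b) as separators.
$s = "a\r\nb\nc\rd\x0be";

// Default \R: all four separators count as a newline sequence,
// and \r\n is consumed as a single unit.
$any = preg_split('~\R~', $s);

// (*BSR_ANYCRLF) restricts \R to \r\n, \n and \r, so \x0b is left alone.
$crlf = preg_split('~(*BSR_ANYCRLF)\R~', $s);

var_dump($any, $crlf);
```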

User contributions licensed under: CC BY-SA