I’m trying to match all the quotes in the following example e-mail message:
> Don't forget to buy eggiweggs on the way home.
I shall not.
> Also remember to brush your shoes.
Will do.
> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.
I see what you mean. They sure make a mess.
That means I want to match these three strings:
> Don't forget to buy eggiweggs on the way home.
And:
> Also remember to brush your shoes.
And:
> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.
I don’t understand how I can do this, since if I use the s flag to span multiple lines, which is required for this, I cannot refer to ^ and $ to mean “beginning of line” and “end of line”; instead, they mean “beginning of string” and “end of string”.
So if I do: #^(> .+?)$#us, it will match everything after/with the first quote.
And if I do: #^(> .+?)$#um, it will match only the first quote’s first line and nothing else.
This is frustrating. I really have no idea how to solve it. I’ve searched online before asking and found zero even remotely relevant pages as usual.
Answer
With preg_match_all:
preg_match_all('~^> .*(?:\R> .*)*~m', $txt, $matches);
$result = $matches[0];
(where \R is an alias for several newline sequences)
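As a rough illustration, here is that pattern run against the message from the question. The exact line breaks in `$txt` are an assumption (one sentence group per line, `\n` line endings), since only the flattened text is available.

```php
<?php
// Sample message from the question; the line breaks are assumed.
$txt = "> Don't forget to buy eggiweggs on the way home.\n"
     . "I shall not.\n"
     . "> Also remember to brush your shoes.\n"
     . "Will do.\n"
     . "> > > And clean up after the pigs.\n"
     . "> > But I have no pigs.\n"
     . "> Yes, you do. Your kids.\n"
     . "I see what you mean. They sure make a mess.\n";

// ^> .*          first quoted line of a block (m flag: ^ matches at each line start)
// (?:\R> .*)*    any following lines that also start with "> "
preg_match_all('~^> .*(?:\R> .*)*~m', $txt, $matches);
$result = $matches[0];

// Print each block, separated by a marker line.
foreach ($result as $block) {
    echo $block, "\n---\n";
}
```

Because `.` does not match a newline without the s flag, each `.*` stops at the end of its line, and the block only continues when the next line starts with `> `.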
With preg_split:
$result = preg_split('~^(?!> ).*\R?~m', $txt, -1, PREG_SPLIT_NO_EMPTY);
that splits the string on each line that doesn’t start with "> ".
To trim the newline at the end of each block, you can start this pattern with an optional \R?, which gives ~\R?^(?!> ).*\R?~m, or write it as ~(?:\R?^(?!> ).*)+\R?~m to grab several non-quote lines at a time.
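The three split patterns can be compared side by side. This is only a sketch: the sample text and its `\n` line breaks are assumed, and the variable names `$withNewlines`, `$trimmed`, and `$grouped` are made up for the example.

```php
<?php
// Sample message from the question; the line breaks are assumed.
$txt = "> Don't forget to buy eggiweggs on the way home.\n"
     . "I shall not.\n"
     . "> Also remember to brush your shoes.\n"
     . "Will do.\n"
     . "> > > And clean up after the pigs.\n"
     . "> > But I have no pigs.\n"
     . "> Yes, you do. Your kids.\n"
     . "I see what you mean. They sure make a mess.\n";

// Basic form: every non-quote line (plus its newline) is a delimiter,
// so each quote block keeps its own trailing newline.
$withNewlines = preg_split('~^(?!> ).*\R?~m', $txt, -1, PREG_SPLIT_NO_EMPTY);

// Leading \R? also swallows the newline that ends the preceding quote block.
$trimmed = preg_split('~\R?^(?!> ).*\R?~m', $txt, -1, PREG_SPLIT_NO_EMPTY);

// Grouped form: a run of consecutive non-quote lines is one delimiter.
$grouped = preg_split('~(?:\R?^(?!> ).*)+\R?~m', $txt, -1, PREG_SPLIT_NO_EMPTY);

var_dump($withNewlines, $trimmed, $grouped);
```

On this input the trimmed and grouped forms return the same three blocks; the grouped form only differs in how many delimiter matches the engine has to make when several reply lines follow each other.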
About \R:
\R is by default an alias for (?>\r\n|\n|\x0b|\f|\r|\x85) (any 8-bit, non-UTF-8 newline sequence). In UTF-8 mode, with the u modifier or with (*UTF8)(*BSR_UNICODE) at the start of the pattern, two other characters outside of the ASCII range are added to the list: the line separator (U+2028) and the paragraph separator (U+2029).
It’s handy when you don’t know which newline sequence is used in the string, but slower than writing the exact newline sequence if you know it. You can restrict \R to (?>\r\n|\n|\r) with the directive (*BSR_ANYCRLF) at the start of the pattern.
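A small sketch of how \R behaves under these settings. The test strings are made up for illustration, and (*BSR_UNICODE) is written out explicitly so the results do not depend on how the PCRE library was built.

```php
<?php
// \R treats CRLF as a single newline and also accepts a lone LF or CR.
$parts = preg_split('~\R~', "a\r\nb\nc\rd");

// A vertical tab (\x0b) counts as a newline for the full \R set...
$vtFull = preg_match('~(*BSR_UNICODE)\R~', "x\x0by");

// ...but not once \R is restricted to CR, LF, and CRLF.
$vtRestricted = preg_match('~(*BSR_ANYCRLF)\R~', "x\x0by");

// In UTF-8 mode (u modifier), the line separator U+2028 is matched too.
$lineSeparator = preg_match('~(*BSR_UNICODE)\R~u', "a\u{2028}b");

var_dump($parts, $vtFull, $vtRestricted, $lineSeparator);
```

If the input is known to use only `\n`, writing `\n` directly avoids the extra alternatives altogether, as noted above.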