Add domain to src attribute value if a relative path

I have a text variable that contains multiple img tags whose src values are either relative or absolute paths. If a src attribute starts with http or https, it should be left alone; if it starts with / or with something like abc/, a base URL should be prepended.

This is what I tried:

<?php
$html = <<<HTML
<img src="docs/relative/url/img.jpg" />
<img src="/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
HTML;

$base = 'https://example.com/';

$pattern = "/<img src=\"[^http|https]([^\"]*)\"/";
$replace = "<img src=\"" . $base . "\${1}\"";
echo $text = preg_replace($pattern, $replace, $html);

My output is:

<img src="https://example.com/ocs/relative/url/img.jpg" />
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />

Issue: the result is almost correct, but when the src attribute starts with something like docs/, its first letter is cut off (see the first img src in the output).

Output I needed is:

<img src="https://example.com/docs/relative/url/img.jpg" /><!--check this and compare with current result, you will get the difference -->
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />

Could anyone help me fix this?


Answer

The following pattern will seek src attributes that do not start with http or https. Then for relative paths that begin with a forward slash, the leading slash will be removed before prepending the $base string to the src value.

Code: (Demo)

$base = 'https://example.com/';
echo preg_replace('~ src="(?!http)\K/?~', $base, $html);

Output:

<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />

Breakdown:

~           #starting pattern delimiter
 src="      #match space, s, r, c, =, then "
(?!http)    #only continue matching if not https or http
\K          #forget any previously matched characters so they are not destroyed by the replacement string
/?          #optionally match a forward slash
~           #ending pattern delimiter
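
One caveat about the lookahead in the breakdown: (?!http) also skips relative paths that merely begin with the letters http, such as httpdocs/img.jpg. The following is a minimal sketch of a stricter variant, assuming that only real http:// and https:// URLs should be left alone. The function name prependBase and the httpdocs sample line are hypothetical and were not part of the question. Protocol-relative URLs such as //cdn.example.com/img.jpg are not handled by this sketch.

```php
<?php
// Hypothetical helper for illustration; not part of the original answer.
function prependBase(string $html, string $base): string
{
    // (?!https?://) rejects only src values that begin with http:// or https://
    // \K keeps the matched ` src="` text out of the replacement
    // /? consumes an optional leading slash so the base's trailing slash is not doubled
    return preg_replace('~ src="(?!https?://)\K/?~', $base, $html);
}

$html = <<<HTML
<img src="docs/relative/url/img.jpg" />
<img src="/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="httpdocs/img.jpg" />
HTML;

$result = prependBase($html, 'https://example.com/');
echo $result;
```

With this input, the first two tags and the httpdocs tag receive the base URL, while the https:// tag is left untouched.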

As for your pattern, /<img src="[^http|https]([^"]*)"/:

  1. [^http|https] actually means “match a single character that is not from this list: |, h, t, p, and s”. It could be simplified to [^|hpst] because the order of the characters in a negated character class is irrelevant and duplicating characters is meaningless. So [^...] is not how you say “the string does not start with something or somethingelse”. Because that class consumes exactly one character, the d of docs/ is matched and discarded, which is why the first letter goes missing.
  2. Capturing all remaining characters in a substring until the next double quote with the intent to use it again in the replacement is unnecessary. This is why I use \K to pinpoint where $base should be injected instead of ([^"]*).
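
To make the first point concrete, here is a small sketch that runs the question's pattern against single tags and inspects what ends up in capture group 1. The variable names and the static/img.jpg sample are assumptions made for illustration only.

```php
<?php
// The pattern from the question, written with single quotes to avoid escaping.
$pattern = '/<img src="[^http|https]([^"]*)"/';

// The character class consumes the "d", so group 1 starts at "ocs/...".
preg_match($pattern, '<img src="docs/relative/url/img.jpg" />', $m);
$first = $m[1];

// For a root-relative path the consumed character is "/", which hides the bug.
preg_match($pattern, '<img src="/docs/relative/url/img.jpg" />', $m);
$second = $m[1];

// A relative path starting with h, t, p, s, or | is not matched at all.
$third = preg_match($pattern, '<img src="static/img.jpg" />');

echo $first, "\n", $second, "\n", $third, "\n";
```

The last case shows a second, quieter failure: a path such as static/img.jpg would never receive the base URL with the original pattern.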

Furthermore, I always recommend the stability of a DOM parser when dealing with a valid HTML document. You can use DOMDocument with XPath to target the qualifying elements and modify the src attributes without regex.

Code: (Demo)

$dom = new DOMDocument; 
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//img[not(starts-with(@src, 'http'))]") as $node) {
    $node->setAttribute('src', $base . ltrim($node->getAttribute('src'), '/'));
}
echo $dom->saveHTML();

A related answer: https://stackoverflow.com/a/48837947/2943403

User contributions licensed under: CC BY-SA