
Getting the text value of the last children with PHP DOM-XML takes too long

This post is related to an earlier one: Increase performance of PHP DOM-XML. Currently takes too long time. It may help to read that post before this one.

I have an array that contains 7000+ values:

$arrayIds = [
    'A001',
    ...,
    'A7500'
];

This foreach loop gets the text value inside the <mrk> tags in a given XML file:

$dom = new DOMDocument;
$dom->load('myxml.xml');

$xp = new DOMXPath($dom);

$data = [];

foreach ($arrayIds as $arrayId) {
    $expression = "//unit[@person-name='$arrayId']/seg-source/mrk";
    $col = $xp->query($expression);

    if ($col && $col->length) {
        foreach ($col as $node) {
            $data[] = $node->nodeValue;
        }
    }
}

It takes approximately 45 seconds. I can’t wait any longer than 5 seconds

What is the fastest way to achieve this?

Segment of the XML file:

<unit person-name="A695" id="PTU-300" xml:space="preserve">
    <source xml:lang="en">This is Michael's speaking</source>
    <seg-source><mrk mid="0" mtype="seg">This is Michael's speaking</mrk></seg-source>
    <target xml:lang="id"><mrk mid="0" mtype="seg">This is Michael's speaking</mrk></target>
</unit>
<unit person-name="A001" id="PTU-4" xml:space="preserve">
    <source xml:lang="en">Related tutorials</source>
    <seg-source><mrk mid="0" mtype="seg">Related tutorials</mrk></seg-source>
    <target xml:lang="id"><mrk mid="0" mtype="seg">Related tutorials</mrk></target>
</unit>
...
<unit>
...
</unit>

Anyway, I’m doing this on an M1 Mac


Answer

There are a couple of things you can do here to speed up your processing. First, you are currently running an XPath query against the entire document for each ID you are looking for, so the total work grows with the number of IDs multiplied by the size of the document. It would be more efficient to loop through the document once and test the person-name attribute of each unit element to see if it is in your list of IDs to extract data for. That change alone will give you a decent speedup.
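The single-pass idea can be sketched while still using DOM. This is a minimal sketch, not a drop-in replacement: the inline XML and its <root> wrapper are assumptions standing in for myxml.xml, the third unit (B999) is made up to show filtering, and $wanted is a hypothetical name for the flipped ID list.

```php
<?php

// Hypothetical sample standing in for myxml.xml; the <root> wrapper is assumed
$xml = <<<'XML'
<root>
<unit person-name="A695" id="PTU-300" xml:space="preserve">
    <source xml:lang="en">This is Michael's speaking</source>
    <seg-source><mrk mid="0" mtype="seg">This is Michael's speaking</mrk></seg-source>
    <target xml:lang="id"><mrk mid="0" mtype="seg">This is Michael's speaking</mrk></target>
</unit>
<unit person-name="A001" id="PTU-4" xml:space="preserve">
    <source xml:lang="en">Related tutorials</source>
    <seg-source><mrk mid="0" mtype="seg">Related tutorials</mrk></seg-source>
    <target xml:lang="id"><mrk mid="0" mtype="seg">Related tutorials</mrk></target>
</unit>
<unit person-name="B999" id="PTU-5" xml:space="preserve">
    <source xml:lang="en">Not requested</source>
    <seg-source><mrk mid="0" mtype="seg">Not requested</mrk></seg-source>
</unit>
</root>
XML;

$arrayIds = ['A001', 'A695'];

// Flip the list so that each membership test is a hash lookup, not a scan
$wanted = array_flip($arrayIds);

$dom = new DOMDocument;
$dom->loadXML($xml);
$xp = new DOMXPath($dom);

$data = [];

// One pass over every unit instead of one whole-document query per ID
foreach ($xp->query('//unit') as $unit) {
    if (!isset($wanted[$unit->getAttribute('person-name')])) {
        continue;
    }

    // Relative query, evaluated only inside the current unit
    foreach ($xp->query('seg-source/mrk', $unit) as $mrk) {
        $data[] = $mrk->nodeValue;
    }
}

print_r($data);
```

Note that this collects values in document order, whereas the original loop groups them in the order of $arrayIds. If that grouping matters, the values can be bucketed per ID and merged afterwards.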

However, at that point XPath is not really doing much for you, so you might as well use XMLReader to parse the document efficiently without having to load the whole thing into memory. The code is more complex, and therefore more error-prone and harder to understand, but if you need to process large XML documents efficiently, you need to use a streaming parser.

The speed difference between looping mechanisms in PHP is insignificant compared to the difference you could see between your current XPath approach and using a streaming parser.

<?php

// Instantiate XML parser and open our file
$xmlReader = new XMLReader();
$xmlReader->open('test.xml');

// Array of person-name values we want to extract data for
$arrayIds = ['A001', 'A695'];

/*
 * Buffer for seg-source/mrk values
 * We want a sub array for each ID so we can sort the output by ID
 */
$buffer = [];
foreach($arrayIds as $currId)
{
    $buffer[$currId] = [];
}

/*
 * Flag to indicate whether or not the parser is in a unit that has
 * a person-name that we are looking for
 */
$validUnit = false;

/*
 * Flag indicating whether or not the parser is in a seg-source element.
 * Since both seg-source and target elements contain mrk elements, we need to
 * know when we are in a seg-source
 */
$inSegSource = false;

/*
 * We need to keep track of which person we are currently working with
 * so that we can populate the buffer
 */
$curPersonName = null;

// Parse the document
while ($xmlReader->read())
{
    // If we are at an opening element...
    if ($xmlReader->nodeType == XMLReader::ELEMENT)
    {
        switch($xmlReader->localName)
        {
            case 'unit':
                // Pull the person-name
                $curPersonName = $xmlReader->getAttribute('person-name');

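                /*
                 * If the value is in our array of IDs, set the validUnit flag to true;
                 * if not, set the flag to false. $buffer is keyed by ID, so isset()
                 * is a constant-time lookup, unlike in_array(), which scans the list
                 */
                $validUnit = ($curPersonName !== null && isset($buffer[$curPersonName]));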
                break;
            case 'seg-source':
                // If we are opening a seg-source element, set the flag to true
                $inSegSource = true;
                break;
            case 'mrk':
                /*
                 * If we are in a valid unit AND inside a seg-source element,
                 * extract the element value and add it to the buffer
                 */
                if($validUnit && $inSegSource)
                {
                    $buffer[$curPersonName][] = $xmlReader->readString();
                }
                break;
        }
    }
    // If we are at a closing element...
    elseif($xmlReader->nodeType == XMLReader::END_ELEMENT)
    {
        switch($xmlReader->localName)
        {
            case 'seg-source':
                // If we are closing a seg-source, set the flag to false
                $inSegSource = false;
                break;
        }
    }
}

$output = [];
foreach($buffer as $currId=>$currData)
{
    $output = array_merge($output, $currData);
}

print_r($output);
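
To check whether a given approach fits the 5-second budget mentioned in the question, each variant can be timed the same way. The following is a minimal sketch: timeIt() is a hypothetical helper, not part of PHP, and the closure is a stand-in workload to be swapped for the XPath loop or the XMLReader loop. It relies on hrtime(), which is available from PHP 7.3.

```php
<?php

/**
 * Hypothetical helper: run $fn once and return [result, elapsed seconds].
 */
function timeIt(callable $fn): array
{
    $start = hrtime(true);                      // monotonic clock, in nanoseconds
    $result = $fn();
    $elapsed = (hrtime(true) - $start) / 1e9;   // convert to seconds

    return [$result, $elapsed];
}

// Stand-in workload; replace the closure body with the code being measured
[$result, $seconds] = timeIt(function () {
    return array_sum(range(1, 1000));
});

printf("%.4f s\n", $seconds);
```

Timing both versions against the same input file, several times each, gives a fairer comparison than a single run.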
User contributions licensed under: CC BY-SA