
How to increase performance when creating lots of relations in Neo4j

I am working on a crawler to analyze the internal link structure of websites, using a Neo4j graph database in combination with spatie/crawler.

The idea goes like this:

Whenever a URL is crawled, all links are extracted from the DOM. For each link, a node is created and a foundOn->target relation is added.

// UrlCrawledListener.php

public function handle($event) 
{
//...
    // Extract all links on the page
    $linksOnPage = collect((new DomCrawlerService())->extractLinksFromHtml($event->getResponse()->getBody(), $event->getUrl()));
    // For all links, create nodes and add relation
    $linksOnPage->each(fn(Link $link) => $neo4jService->link($link, $event->getUrl()));
//...
}

// Neo4JService.php

public function link(Link $link, UriInterface $foundOnUrl): void
{
    $targetUrl = new Uri($link->getUri());

    if (!$this->doesNodeExist($targetUrl)) {
        $this->createNode($targetUrl);
    }

    if (!$this->doesNodeExist($foundOnUrl)) {
        $this->createNode($foundOnUrl);
    }

    // When this method call is disabled, the crawler is A LOT faster
    $this->createRelation($foundOnUrl, $targetUrl);
}

// ...

protected function createNode(UriInterface $uri): void
{
    // Todo: Add attributes
    $this->runStatement(
        'USE ' . $this->getDB() . ' CREATE (n:URL {url: $url, url_hash: $hash})',
        [
            'url'  => $uri->__toString(),
            'hash' => CrawlUrl::getUrlHash($uri),
        ]
    );
}

// ...

protected function createRelation(UriInterface $from, UriInterface $to): void
{
    $this->runStatement(
        '
             USE ' . $this->getDB() . '
             MATCH (a:URL), (b:URL)
             WHERE a.url_hash = $fromURL AND b.url_hash = $toURL
             CREATE (a)-[rel:Link]->(b)
             ',
        [
            'fromURL' => CrawlUrl::getUrlHash($from),
            'toURL'   => CrawlUrl::getUrlHash($to),
        ]
    );
}


What I tried:

I tried adding an index to the nodes to improve performance on the MATCH query, but that did not have a noticeable impact:

$this->runStatement('USE ' . $this->getDB() . ' CREATE INDEX url_hash_index FOR (n:URL) ON (n.url_hash)');

I thought about creating them in a single query instead of looping over all found links, but I could only find documentation on how to create multiple nodes – nothing on how to create multiple relations.

I also considered storing everything in another store first and then bulk-importing that store into Neo4j. However, the documentation on CSV imports uses exactly the same logic to create the relations, so that would not help either:

// create relationships
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MATCH (e:Employee {employeeId: row.employeeId})
MATCH (c:Company {companyId: row.Company})
MERGE (e)-[:WORKS_FOR]->(c)

I already shifted the ->doesNodeExist() logic to SQL, as it caused the same problem. The MATCH Cypher query seems to be very slow overall. But I can’t imagine that matching against a few hundred nodes can be that slow compared to a SQL database.

Do you have any suggestions on how to either improve the algorithm itself or the neo4j database structure or cypher query to improve performance?


Answer

You should replace:

    $targetUrl = new Uri($link->getUri());

    if (!$this->doesNodeExist($targetUrl)) {
        $this->createNode($targetUrl);
    }

    if (!$this->doesNodeExist($foundOnUrl)) {
        $this->createNode($foundOnUrl);
    }

    $this->createRelation($foundOnUrl, $targetUrl);

with something like:

$this->runStatement('
    USE ' . $this->getDB() . '
    MERGE (a:URL {url_hash: $fromURL})
    MERGE (b:URL {url_hash: $toURL})
    CREATE (a)-[rel:Link]->(b)',
    [
        'fromURL' => CrawlUrl::getUrlHash($from),
        'toURL'   => CrawlUrl::getUrlHash($to),
    ]
);
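Note that CREATE will add a new Link relationship every time the same (from, to) pair is processed – for example, when the same link appears on several crawled pages. If duplicate edges are not desired, the relationship can be merged as well, by changing the last line of the query above to:

    MERGE (a)-[rel:Link]->(b)

MERGE on the relationship is slightly more expensive than CREATE, so keep CREATE if duplicate edges are acceptable (or even useful, e.g. to count how often a link occurs).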

Make sure you define an index (or a uniqueness constraint) on (:URL {url_hash}) before running the program.
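For example, a uniqueness constraint (which also creates a backing index) could be added once at startup – a sketch, assuming Neo4j 4.4+ syntax and the runStatement() wrapper from the question; the constraint name url_hash_unique is arbitrary:

$this->runStatement('USE ' . $this->getDB() . ' CREATE CONSTRAINT url_hash_unique IF NOT EXISTS FOR (n:URL) REQUIRE n.url_hash IS UNIQUE');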

If it is still too slow, then you’ll indeed need to insert relationships in batches, with one query per batch (tweak the above query with UNWIND). You will need to experiment with various batch sizes to determine the best compromise between import time and memory consumption (smaller batches => less memory => longer overall import time).
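A batched version could look roughly like this – a sketch only, reusing $linksOnPage, CrawlUrl::getUrlHash(), and the runStatement() wrapper from the question, and assuming all links of one crawled page go into a single $rows parameter:

// Build one row per link: the hash pair for the relationship.
$rows = $linksOnPage->map(fn(Link $link) => [
    'from' => CrawlUrl::getUrlHash($event->getUrl()),
    'to'   => CrawlUrl::getUrlHash(new Uri($link->getUri())),
])->all();

// One round trip for the whole page instead of one per link.
$this->runStatement('
    USE ' . $this->getDB() . '
    UNWIND $rows AS row
    MERGE (a:URL {url_hash: row.from})
    MERGE (b:URL {url_hash: row.to})
    CREATE (a)-[rel:Link]->(b)',
    ['rows' => $rows]
);

For pages with very many links, $rows can additionally be chunked (e.g. 1000 rows per statement) to bound memory use per query.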

Side note: labels use PascalCase, not UPPER_CASE, so URL should be written as Url (although the line is blurry when it comes to acronyms).

User contributions licensed under: CC BY-SA