I am working on a crawler to analyze the internal link structure of websites, using a Neo4j graph database in combination with Spatie's crawler package.
The idea goes like this:
Whenever a URL is crawled, all links are extracted from the DOM. For every link, a node is created and a relation foundOn->target is added.
```php
// UrlCrawledListener.php
public function handle($event)
{
    // ...
    // Extract all links on the page
    $linksOnPage = collect(
        (new DomCrawlerService())->extractLinksFromHtml(
            $event->getResponse()->getBody(),
            $event->getUrl()
        )
    );

    // For all links, create nodes and add relations
    $linksOnPage->each(fn (Link $link) => $neo4jService->link($link, $event->getUrl()));
    // ...
}
```
```php
// Neo4JService.php
public function link(Link $link, UriInterface $foundOnUrl): void
{
    $targetUrl = new Uri($link->getUri());

    if (!$this->doesNodeExist($targetUrl)) {
        $this->createNode($targetUrl);
    }

    if (!$this->doesNodeExist($foundOnUrl)) {
        $this->createNode($foundOnUrl);
    }

    // When this method call is disabled, the crawler is A LOT faster
    $this->createRelation($foundOnUrl, $targetUrl);
}

// ...

protected function createNode(UriInterface $uri): void
{
    // Todo: Add attributes
    $this->runStatement(
        'USE ' . $this->getDB() . ' CREATE (n:URL {url: $url, url_hash: $hash})',
        [
            'url' => $uri->__toString(),
            'hash' => CrawlUrl::getUrlHash($uri),
        ]
    );
}

// ...

protected function createRelation(UriInterface $from, UriInterface $to): void
{
    $this->runStatement(
        'USE ' . $this->getDB() . '
         MATCH (a:URL), (b:URL)
         WHERE a.url_hash = $fromURL AND b.url_hash = $toURL
         CREATE (a)-[rel:Link]->(b)',
        [
            'fromURL' => CrawlUrl::getUrlHash($from),
            'toURL' => CrawlUrl::getUrlHash($to),
        ]
    );
}
```
What I tried:
I tried adding an index to the nodes to improve performance on the MATCH
query, but that did not have a noticeable impact:
```php
$this->runStatement('USE ' . $this->getDB() . ' CREATE INDEX url_hash_index FOR (n:URL) ON (n.url_hash)');
```
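One way to check whether such an index is actually picked up is to prefix the matching query with `PROFILE` in the Neo4j Browser: if the execution plan shows a `NodeByLabelScan` instead of a `NodeIndexSeek`, the index is not being used (for example, because it had not finished populating when the query ran).

```cypher
// Run in the Neo4j Browser with concrete hash values substituted in.
// Look for NodeIndexSeek (index used) vs NodeByLabelScan (index ignored).
PROFILE
MATCH (a:URL), (b:URL)
WHERE a.url_hash = $fromURL AND b.url_hash = $toURL
RETURN a, b
```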
I thought about creating them in a single query instead of looping over all found links, but I could only find documentation on how to create multiple nodes – nothing on how to create multiple relations.
I also considered storing everything in another store first and then bulk-importing that store into Neo4j. However, the documentation on CSV imports uses exactly the same logic to create the relations, so that would not help either:
```cypher
// create relationships
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MATCH (e:Employee {employeeId: row.employeeId})
MATCH (c:Company {companyId: row.Company})
MERGE (e)-[:WORKS_FOR]->(c)
```
I already shifted the `->doesNodeExist()` logic to SQL, as it caused the same problem. The `MATCH` clause seems to be very slow overall. But I can't imagine that matching against a few hundred nodes is really that slow compared to a SQL database.
Do you have any suggestions on how to either improve the algorithm itself or the neo4j database structure or cypher query to improve performance?
Answer
You should replace:
```php
$targetUrl = new Uri($link->getUri());

if (!$this->doesNodeExist($targetUrl)) {
    $this->createNode($targetUrl);
}

if (!$this->doesNodeExist($foundOnUrl)) {
    $this->createNode($foundOnUrl);
}

$this->createRelation($foundOnUrl, $targetUrl);
```
with something like:
```php
$this->runStatement(
    'USE ' . $this->getDB() . '
     MERGE (a:URL {url_hash: $fromURL})
     MERGE (b:URL {url_hash: $toURL})
     CREATE (a)-[rel:Link]->(b)',
    [
        'fromURL' => CrawlUrl::getUrlHash($foundOnUrl),
        'toURL'   => CrawlUrl::getUrlHash($targetUrl),
    ]
);
```
Make sure you define an index (or a uniqueness constraint) on (:URL {url_hash})
before running the program.
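For instance, a uniqueness constraint (which also creates a backing index that `MERGE` and `MATCH` can use) could be set up once before crawling – a minimal sketch, assuming the same `runStatement()` helper from the question:

```php
// Run once at startup: a uniqueness constraint on url_hash also
// creates a backing index. IF NOT EXISTS makes it safe to re-run.
$this->runStatement(
    'USE ' . $this->getDB() . '
     CREATE CONSTRAINT url_hash_unique IF NOT EXISTS
     FOR (n:URL) REQUIRE n.url_hash IS UNIQUE'
);
```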
If it is still too slow, then you'll indeed need to insert relationships in batches, with one query per batch (tweak the above query with `UNWIND`). You will need to experiment with various batch sizes to determine the best compromise between import time and memory consumption (small batches => less memory => longer overall import time).
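A batched version could look roughly like this – a sketch, not a drop-in implementation: it assumes the same `runStatement()` helper, and a hypothetical `$batch` array of hash pairs that the caller collects (e.g. 1000 links at a time) before flushing:

```php
// Hypothetical sketch: insert one batch of relationships per query.
// $batch is assumed to be an array of ['from' => ..., 'to' => ...] pairs
// of url_hash values, accumulated by the caller and flushed periodically.
protected function createRelationsInBatch(array $batch): void
{
    $this->runStatement(
        'USE ' . $this->getDB() . '
         UNWIND $rows AS row
         MERGE (a:URL {url_hash: row.from})
         MERGE (b:URL {url_hash: row.to})
         CREATE (a)-[:Link]->(b)',
        ['rows' => $batch]
    );
}
```

`UNWIND` turns the list parameter into one row per pair, so the whole batch is processed in a single query and a single transaction instead of one round-trip per link.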
Side note: labels conventionally use PascalCase, not UPPER_CASE, so `URL` should be written as `Url` (although the line is blurry when it comes to acronyms).