
Finding duplicate column values in a CSV

I’m importing a CSV that has 3 columns; one of these columns could contain duplicate values.

I have 2 things to check:

1. The field 'NAME' is not null and is a string
2. The field 'ID' is unique

So far, I’m parsing the CSV file once and checking condition 1 (NAME is valid); if that check fails, the code simply breaks out of the while loop and stops.

I guess the question is: how would I check that ID is unique?

I have fields like the following:

NAME,  ID,
Bob,   1,
Tom,   2,
James, 1,
Terry, 3,
Joe,   4,

This would output something like 'Duplicate ID on line 3'

Thanks

P.S. The real CSV file has more columns and can have around 100,000 records. I have simplified it here to focus on the duplicate column/field problem.

Thanks


Answer

I assumed a certain type of design and stripped out the CSV part, but the idea remains the same:

<?php
  /* Build an array of 100,000 rows. (Be careful: holding this many rows in memory may cause memory issues, which you won't have with a CSV read line by line.) */
  $arr = [];
  for ($i = 0; $i < 100000; $i++) {
    $arr[] = [rand(0, 1000000), 'Hey'];
  }

  /* Remember every ID seen so far as an array key, so each check is a key lookup */
  $ids = [];
  foreach ($arr as $line => $couple) {
    // isset() avoids an "undefined array key" warning for IDs not seen yet
    if (isset($ids[$couple[0]])) {
      // $line is the zero-based array index here
      echo "Id " . $couple[0] . " on line " . $line . " already used<br />";
    } else {
      $ids[$couple[0]] = true;
    }
  }
?>

100,000 rows isn’t that many, so this approach will be enough. (It ran in 3 seconds on my machine.)
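To apply the same idea to the actual file, here is a minimal sketch that reads the CSV line by line with fgetcsv. It rests on several assumptions that are mine and not from the question: the first row is a header, NAME is the first column and ID the second, a NAME counts as valid when it is non-empty after trimming, and line numbers count data rows only (so the sample in the question reports line 3). The function name findCsvProblems is hypothetical.

```php
<?php
// Hypothetical helper: returns a list of problem messages for the CSV at $path.
// Assumes a header row, NAME in column 0 and ID in column 1.
function findCsvProblems(string $path): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return ["Cannot open $path"];
    }

    $problems = [];
    $seenIds = [];                          // IDs seen so far, stored as array keys
    fgetcsv($handle, 0, ',', '"', '\\');    // skip the header row
    $line = 0;                              // data-row counter, header excluded

    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $line++;
        if ($row === [null]) {
            continue;                       // fgetcsv returns [null] for a blank line
        }
        $name = trim($row[0] ?? '');
        $id   = trim($row[1] ?? '');

        // Assumption: a valid NAME is simply a non-empty string
        if ($name === '') {
            $problems[] = "Invalid NAME on line $line";
        }
        // Key lookup: has this ID already been seen?
        if (isset($seenIds[$id])) {
            $problems[] = "Duplicate ID on line $line";
        } else {
            $seenIds[$id] = true;
        }
    }

    fclose($handle);
    return $problems;
}
```

Only the set of IDs is kept in memory, not the rows themselves, so this should stay comfortable at around 100,000 records. If you would rather stop at the first problem, as your current loop does, return from inside the loop.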

EDIT: As pointed out, in_array is less efficient than a key lookup. I’ve updated my code accordingly.
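To illustrate the difference mentioned in the EDIT, here is a small side-by-side sketch. Both function names are hypothetical and the input is simply a flat list of IDs. in_array scans the list of seen IDs on every call, so the total work grows roughly quadratically with the number of rows, while isset on an array key is a hash lookup that stays close to constant time per row.

```php
<?php
// Both hypothetical helpers return the zero-based indexes of repeated IDs.

// Linear scan: in_array() walks $seen for every row.
function duplicatesWithInArray(array $ids): array
{
    $seen = [];
    $dupes = [];
    foreach ($ids as $i => $id) {
        if (in_array($id, $seen, true)) {
            $dupes[] = $i;
        } else {
            $seen[] = $id;
        }
    }
    return $dupes;
}

// Key lookup: each ID is stored as an array key and checked with isset().
function duplicatesWithKeys(array $ids): array
{
    $seen = [];
    $dupes = [];
    foreach ($ids as $i => $id) {
        if (isset($seen[$id])) {
            $dupes[] = $i;
        } else {
            $seen[$id] = true;
        }
    }
    return $dupes;
}
```

Both return the same indexes for the same input; only the cost per row differs, which starts to matter as the file grows.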

User contributions licensed under: CC BY-SA