
How to Remove Hidden Characters in PHP

I have the following piece of code, which reads text files from a directory. I use a list of stopwords; after removing the stopwords from each file, I print the remaining words along with their positions, but an extra blank entry appears wherever a stopword was in the document.

For example, a file which reads like,

Department of Computer Science // A document

After removing the stopword ‘of’ from the document, looping through the remaining words produces the following output:

Department(0) (1) Computer(2) Science(3) //output

But the blank entry should not be there.

Here is the code:

<?php
$directory = "archive/";
$dir = opendir($directory);
while (($file = readdir($dir)) !== false) {
  $filename = $directory . $file;
  $type = filetype($filename);
  if ($type == 'file') {
    $contents = file_get_contents($filename);
    $texts = preg_replace('/\s+/', ' ',  $contents);
    $texts = preg_replace('/[^A-Za-z0-9\-\n ]/', '', $texts);
    $text = explode(" ", $texts);
    $text = array_map('strtolower', $text);
    $stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or", " ");
    $text = (array_diff($text,$stopwords));
    echo "<br><br>";
    $total_count = count($text);
    $b = -1;
   foreach ($text as $a=>$v)
   {
     $b++;
     echo $text[$b]. "(" .$b. ")" ." ";
   } 
 } 
}
closedir($dir); 
?>


Answer

I'm not 100% sure about the final output with the string positions, but I'm assuming you are placing those there for reference only. This test code, which uses a regex with preg_replace, seems to work well.

header('Content-Type: text/plain; charset=utf-8');

// Set test content array.
$contents_array = array();
$contents_array[] = "Department of Computer Science // A document";
$contents_array[] = "Department of Economics // A document";

// Set the stopwords.
$stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or");

// Set a regex based on the stopwords.
$regex = '/(' . implode('\b|', $stopwords) . '\b)/i';

foreach ($contents_array as $contents) {

  // Remove the stopwords.
  $contents = preg_replace($regex, '', $contents);

  // Clear out the extra whitespace; anything 2 spaces or more in a row.
  $contents = preg_replace('/\s{2,}/', ' ', $contents);

  // Echo contents.
  echo $contents . "\n";

}

The output is cleaned up & formatted like this:

Department Computer Science // document

Department Economics // document
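
A pattern that puts a word boundary only after each stopword can also clip the tail of a longer word, for example the final "a" in "data". The following is a minimal sketch of a stricter variant, not part of the original answer: it adds a boundary on both sides and escapes each stopword with preg_quote. The helper names build_stopword_regex and remove_stopwords are hypothetical, and the short stopword list is only for illustration.

```php
<?php
// Hypothetical helper: builds one case-insensitive pattern that matches
// whole stopwords only (word boundary on both sides).
function build_stopword_regex(array $stopwords): string
{
    // preg_quote() escapes regex metacharacters and the '/' delimiter.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stopwords);

    return '/\b(?:' . implode('|', $quoted) . ')\b/i';
}

// Hypothetical helper: strips stopwords, then tidies the whitespace.
function remove_stopwords(string $contents, string $regex): string
{
    $contents = preg_replace($regex, '', $contents);

    // Collapse the whitespace runs left behind and trim both ends.
    return trim(preg_replace('/\s+/', ' ', $contents));
}

$regex = build_stopword_regex(array("a", "of", "the"));

echo remove_stopwords("Department of Computer Science", $regex) . "\n";
echo remove_stopwords("The data of a department", $regex) . "\n";
```

With the boundary on both sides, "data" is left intact while the standalone "a" is removed.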

So to integrate it into your code, you should do this. Note how I moved $stopwords & $regex outside of the while loop since it makes no sense to reset those values on each while loop iteration. Set it once outside of the loop & let the stuff in the loop just be focused on what you need there in the loop:

<?php
$directory = "archive/";
$dir = opendir($directory);

// Set the stopwords.
$stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or");

// Set a regex based on the stopwords.
$regex = '/(' . implode('\b|', $stopwords) . '\b)/i';

while (($file = readdir($dir)) !== false) {
  $filename = $directory . $file;
  $type = filetype($filename);
  if ($type == 'file') {

    // Get the contents of the filename.
    $contents = file_get_contents($filename);

    // Remove the stopwords.
    $contents = preg_replace($regex, '', $contents);

    // Clear out the extra whitespace; anything 2 spaces or more in a row.
    $contents = preg_replace('/\s{2,}/', ' ', $contents);

    // Echo contents.
    echo $contents;

 } 
}
closedir($dir); 
?>
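
If the position labels from the question are still wanted, one option is to keep the tokenizing approach and re-index the array after filtering. array_diff() preserves the original keys, so the removed stopwords leave gaps in the index, and reading the array by a running counter hits those gaps. A minimal sketch, assuming lowercase output is acceptable; the helper name index_words is hypothetical and the short stopword list is only for illustration.

```php
<?php
// Hypothetical helper: returns the non-stopword tokens of $contents,
// lowercased and re-indexed from 0.
function index_words(string $contents, array $stopwords): array
{
    // Keep only letters, digits, hyphens and whitespace, then lowercase.
    $clean = strtolower(preg_replace('/[^A-Za-z0-9\-\s]/', '', $contents));

    // Split on whitespace runs; PREG_SPLIT_NO_EMPTY drops empty tokens.
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);

    // array_diff() keeps the original keys, so array_values() closes the gaps.
    return array_values(array_diff($words, $stopwords));
}

$stopwords = array("a", "of", "the");

foreach (index_words("Department of Computer Science", $stopwords) as $position => $word) {
    echo $word . "(" . $position . ") ";
}
// prints: department(0) computer(1) science(2)
```

Iterating with the key from foreach, rather than a separate counter, keeps each printed position in step with the word it belongs to.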
User contributions licensed under: CC BY-SA