How to Remove Hidden Characters in PHP

Question

I have the following piece of code, which reads text files from a directory. I use a list of stopwords, and after removing the stopwords from the files, when the words of these files are listed along with their positions, extra blank characters appear where the stopwords used to be in the document. For example, a f…

Accepted Answer

I'm not 100% sure what the final output with the string positions should look like, but I'm assuming you only print the positions for reference. This test code, which uses a regex with preg_replace, seems to work well.

    header('Content-Type: text/plain; charset=utf-8');

    // Set test content array.
    $contents_array = array();
    $contents_array[] = "Department of Computer Science // A document";
    $contents_array[] = "Department of Economics // A document";

    // Set the stopwords.
    $stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it", "i", "is", "its", "of", "on", "that", "the", "to", "was", "were", "will", "with", "or");

    // Set a regex based on the stopwords. The \b on both sides makes it
    // match whole words only, so the "a" in "data" is left alone.
    $regex = '/\b(?:' . implode('|', $stopwords) . ')\b/i';

    foreach ($contents_array as $contents) {
      // Remove the stopwords.
      $contents = preg_replace($regex, '', $contents);
      // Clear out the extra whitespace; anything 2 spaces or more in a row.
      $contents = preg_replace('/\s{2,}/', ' ', $contents);
      // Echo contents.
      echo $contents . "\n";
    }

The output is cleaned up and formatted like this:

    Department Computer Science // document
    Department Economics // document

To integrate it into your code, do the following. Note that I moved $stopwords and $regex outside of the while loop, since it makes no sense to reset those values on each iteration. Set them once before the loop and let the loop body focus only on the per-file work:

    <?php
    $directory = "archive/";
    $dir = opendir($directory);

    // Set the stopwords.
    $stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it", "i", "is", "its", "of", "on", "that", "the", "to", "was", "were", "will", "with", "or");

    // Set a regex based on the stopwords. The \b on both sides makes it
    // match whole words only.
    $regex = '/\b(?:' . implode('|', $stopwords) . ')\b/i';

    while (($file = readdir($dir)) !== false) {
      $filename = $directory . $file;
      $type = filetype($filename);
      if ($type == 'file') {
        // Get the contents of the filename.
        $contents = file_get_contents($filename);
        // Remove the stopwords.
        $contents = preg_replace($regex, '', $contents);
        // Clear out the extra whitespace; anything 2 spaces or more in a row.
        $contents = preg_replace('/\s{2,}/', ' ', $contents);
        // Echo contents.
        echo $contents;
      }
    }
    closedir($dir);
    ?>
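
Since the question is about listing words together with their positions, here is a minimal sketch of how the cleanup in this answer could be wrapped into reusable functions. The names remove_stopwords and words_with_positions are hypothetical and not part of the original code, and the positions are assumed to be simple 0-based word indexes. Splitting with PREG_SPLIT_NO_EMPTY is one way to make sure that no blank entries remain where a stopword was removed.

```php
<?php
// Hypothetical helper (not from the original answer): strips stopwords
// from a string and collapses the whitespace they leave behind.
function remove_stopwords(string $text, array $stopwords): string
{
    // preg_quote() escapes regex metacharacters in case a stopword has any.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stopwords);

    // \b on both sides makes the pattern match whole words only.
    $regex = '/\b(?:' . implode('|', $quoted) . ')\b/i';
    $text = preg_replace($regex, '', $text);

    // Collapse every run of whitespace into one space and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Hypothetical helper: splits text into words; the array keys serve as
// 0-based positions. PREG_SPLIT_NO_EMPTY drops empty pieces.
function words_with_positions(string $text): array
{
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$clean = remove_stopwords("Department of Computer Science", array("a", "of"));
foreach (words_with_positions($clean) as $position => $word) {
    echo $position . ' ' . $word . "\n";
}
```

With this input the loop prints "0 Department", "1 Computer" and "2 Science", each on its own line, with no empty entry where "of" used to be.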

Answer