I have the following piece of code, which reads text files from a directory. I use a list of stopwords, and after removing the stopwords, when I print the words of each file along with their positions, extra blank entries appear where the stopwords were in the document.
For example, take a file that reads:
Department of Computer Science // A document
After removing the stopword 'of' from the document, looping through the remaining words produces the following output:
Department(0) (1) Computer(2) Science(3) //output
But the blank entry at position 1 should not be there.
Here is the code:
<?php
$directory = "archive/";
$dir = opendir($directory);
while (($file = readdir($dir)) !== false) {
    $filename = $directory . $file;
    $type = filetype($filename);
    if ($type == 'file') {
        $contents = file_get_contents($filename);
        $texts = preg_replace('/\s+/', ' ', $contents);
        $texts = preg_replace('/[^A-Za-z0-9\-\n ]/', '', $texts);
        $text = explode(" ", $texts);
        $text = array_map('strtolower', $text);
        $stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or", " ");
        $text = (array_diff($text,$stopwords));
        echo "<br><br>";
        $total_count = count($text);
        $b = -1;
        foreach ($text as $a=>$v) {
            $b++;
            echo $text[$b]. "(" .$b. ")" ." ";
        }
    }
}
closedir($dir);
?>
Answer
I'm not 100% sure about the string positions in the final output, so I'm assuming you print them for reference only. This test code, which uses a regex with preg_replace, seems to work well.
header('Content-Type: text/plain; charset=utf-8');

// Set test content array.
$contents_array = array();
$contents_array[] = "Department of Computer Science // A document";
$contents_array[] = "Department of Economics // A document";

// Set the stopwords.
$stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or");

// Set a regex based on the stopwords. The \b on both sides of the
// alternation makes it match whole words only, so the "a" at the end
// of a word like "data" is left alone.
$regex = '/\b(?:' . implode('|', $stopwords) . ')\b/i';

foreach ($contents_array as $contents) {
    // Remove the stopwords.
    $contents = preg_replace($regex, '', $contents);
    // Clear out the extra whitespace; anything 2 spaces or more in a row.
    $contents = preg_replace('/\s{2,}/', ' ', $contents);
    // Echo contents.
    echo $contents . "\n";
}
The output is cleaned up & formatted like this:
Department Computer Science // document
Department Economics // document
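As a side note on how the regex is put together, here is a minimal sketch that wraps the step in a function. The name build_stopword_regex and the sample sentence are made up for illustration. The sketch assumes you want whole-word matches only, so it puts \b on both sides of the alternation, and it runs each word through preg_quote in case the list ever contains regex metacharacters.

```php
<?php
// Hypothetical helper (not from the original code): build one
// case-insensitive regex that matches any stopword as a whole word.
function build_stopword_regex(array $stopwords): string
{
    // preg_quote() escapes regex metacharacters and the "/" delimiter.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stopwords);

    // \b on both sides, so "a" does not match inside "Data".
    return '/\b(?:' . implode('|', $quoted) . ')\b/i';
}

// Shortened stopword list, for the example only.
$regex = build_stopword_regex(array("a", "of", "the"));

$cleaned = preg_replace($regex, '', "Data of the Department");
// Collapse the gaps left behind and trim the ends.
$cleaned = trim(preg_replace('/\s{2,}/', ' ', $cleaned));

echo $cleaned . "\n";
// Prints: Data Department
```

With a word boundary only after each stopword, the trailing "a" of "Data" would also be removed, which is why the sketch anchors both sides.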
So to integrate it into your code, you should do this. Note how I moved $stopwords and $regex outside of the while loop, since it makes no sense to reset those values on each iteration. Set them once outside of the loop and let the code inside the loop focus only on what needs to happen per file:
<?php
$directory = "archive/";
$dir = opendir($directory);

// Set the stopwords.
$stopwords = array("a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he", "in", "it","i","is", "its", "of", "on", "that", "the", "to","was", "were", "will", "with", "or");

// Set a regex based on the stopwords. The \b on both sides of the
// alternation makes it match whole words only.
$regex = '/\b(?:' . implode('|', $stopwords) . ')\b/i';

while (($file = readdir($dir)) !== false) {
    $filename = $directory . $file;
    $type = filetype($filename);
    if ($type == 'file') {
        // Get the contents of the filename.
        $contents = file_get_contents($filename);
        // Remove the stopwords.
        $contents = preg_replace($regex, '', $contents);
        // Clear out the extra whitespace; anything 2 spaces or more in a row.
        $contents = preg_replace('/\s{2,}/', ' ', $contents);
        // Echo contents.
        echo $contents;
    }
}
closedir($dir);
?>
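If the word positions do matter, the blank entries in the question's output come from array_diff, which keeps the original array keys. After 'of' is removed, the keys are 0, 2, 3, so reading index 1 finds nothing. Below is a minimal sketch of one way around that, assuming the positions should be renumbered after the stopwords are removed. The function name remove_stopwords_with_positions is hypothetical, and the stopword list is shortened for the example.

```php
<?php
// Hypothetical helper (not from the original code): tokenize a string,
// drop the stopwords, and renumber the remaining words from 0.
function remove_stopwords_with_positions(string $contents, array $stopwords): array
{
    // Collapse whitespace, then strip everything except letters,
    // digits, hyphens and spaces.
    $clean = preg_replace('/\s+/', ' ', $contents);
    $clean = preg_replace('/[^A-Za-z0-9\- ]/', '', $clean);

    // Split on spaces and discard empty tokens.
    $words = preg_split('/ +/', strtolower($clean), -1, PREG_SPLIT_NO_EMPTY);

    // array_diff() preserves keys (0, 2, 3, ...);
    // array_values() renumbers them as 0, 1, 2, ...
    return array_values(array_diff($words, $stopwords));
}

// Shortened stopword list, for the example only.
$stopwords = array("a", "an", "and", "of", "the");

$words = remove_stopwords_with_positions("Department of Computer Science", $stopwords);

foreach ($words as $position => $word) {
    echo $word . "(" . $position . ") ";
}
echo "\n";
// Prints: department(0) computer(1) science(2)
```

If you would rather keep each word's position from the original document, skip array_values and print the key from the foreach ($text as $a => $v) loop instead of a separate counter.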