I am working with a Naive Bayesian classifier in PHP (http://www.xhtml.net/php/PHPNaiveBayesianFilter).
It has a list of words that can be ignored while training the system. Those words are not saved into the database and are therefore not used for classification. I would like to improve the system as much as I can, so I was wondering whether there is any rule or list of typical words to ignore for this kind of system.
I am currently ignoring words such as “to”, “and”, “the”, “for”, “since”, “which”, “what”, “who”, and some typical verbs such as “be”, “was”, “were”, “been”, and so on.
Answer
You would be dealing with a lot of words, mostly adjectives and conjunctions, and maybe verbs.
It's a very long list that you need to save as a text file or import into your database. I suggest you just search for existing lists and download them directly.
Here are some links:
http://www.momswhothink.com/reading/list-of-verbs.html
http://grammar.yourdictionary.com/parts-of-speech/conjunctions/conjunctions.html
http://www.smart-words.org/transition-words.html
http://www.momswhothink.com/reading/list-of-adjectives.html
The more words you have in the list, the better your system works.
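In case it helps, here is a rough sketch of the idea: keep the ignore list as one word per line, load it into a set, and drop those words before the text reaches the classifier. It is written in Python only for illustration and is not the PHPNaiveBayesianFilter API; the names `load_stop_words`, `tokenize` and the file name `stopwords.txt` are made up for this example.

```python
import re

def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines (one word per line)."""
    return {line.strip().lower() for line in lines if line.strip()}

def tokenize(text, stop_words):
    """Lowercase the text, split it into words, and drop the ignored ones."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in stop_words]

# Hypothetical list contents. With a real file you would do something like:
#   with open("stopwords.txt") as f:
#       stop_words = load_stop_words(f)
stop_words = load_stop_words(["to", "and", "the", "for", "be", "was", "were", "been"])

tokens = tokenize("The offer was sent to you and the team", stop_words)
print(tokens)  # ['offer', 'sent', 'you', 'team']
```

Only the remaining tokens would then be passed on for training or classification. Using a set keeps the lookup fast even when the list grows to thousands of words.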
Thanks 🙂