16 July 2009

Tags: java jlangdetect nlp

Hi,

Several people asked for it, so I took some time to create a Google Code project for JLangdetect. You’ll find it here:

I have not had much time to improve it, so feel free to contribute. My ideas for future directions include:

  • remove "irrelevant" portions of text from corpora to reduce the size of the n-gram trees

  • add the ability to restrict detection to a subset of the configured languages (useful if you know that your text is either in English or French, but your detector is configured with more languages)

  • add pre-filters to both the learning and detection algorithms to address problems like case (if the corpus is a large, well-written text but the tested string is an uppercase title, detection will likely be wrong)

  • improve detection thanks to pluggable add-ons like lexicon recognition, …
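
To make the language-subset and pre-filter ideas a bit more concrete, here is a minimal Java sketch. It is not part of JLangdetect and does not use its API: the class `DetectionSketch`, the methods `normalizeCase` and `bestAmong`, and the score map are all hypothetical names made up for illustration. It also assumes that a detector produces one score per language and that a higher score means a better match.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names come from JLangdetect.
class DetectionSketch {

    // Pre-filter idea: lowercase the text so that an uppercase title is
    // compared against the corpus in the same case. The same filter would
    // have to be applied to the corpus at learning time.
    static String normalizeCase(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    // Language-subset idea: ignore scores for languages outside `allowed`
    // and return the best remaining one (assumes higher score = better match).
    // Returns null when no allowed language has a score.
    static String bestAmong(Map<String, Double> scores, Set<String> allowed) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (allowed.contains(entry.getKey()) && entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up scores standing in for whatever a real detector would compute.
        Map<String, Double> scores = new HashMap<>();
        scores.put("de", 0.9);
        scores.put("en", 0.7);
        scores.put("fr", 0.4);

        Set<String> allowed = new HashSet<>(Arrays.asList("en", "fr"));

        System.out.println(normalizeCase("BREAKING NEWS"));
        System.out.println(bestAmong(scores, allowed));
    }
}
```

With these made-up scores, German would win if every language were considered, but restricting the candidates to English and French picks English instead.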

Take a look at my posts related to JLangdetect for more details.