Cédric Champeau's blog: JLangdetect available at google code

16 July 2009

Hi,

Several persons asked for it, so I took some time to create a google code project for JLangdetect. You’ll find it here :

I had not much time to improve it, so feel free to contribute. My ideas for future directions include :

remove ``irrevelant'' portions of texts from corpora to reduce the size of the n-gram trees
add ability to limit the tested text to a subset of test languages (useful if you know that your text is either in english or french, but your detector is configured with more languages)
add pre-filters to both learning and detection algorithms in order to address problems like case (if the corpus is a large well written text but the tested string is an uppercase title, then detection will likely be wrong)
improve detection thanks to pluggable add-ons like lexicon recognition, …

Take a look at my posts related to JLangdetect for more details.