String lang = detector.detectLang(text, false);
21 September 2008
Tags: detection java jlangdetect language nlp
Check out the latest version of the project
Hi,
Today I’m introducing a simple yet already working full Java language identifier. In natural language processing, there are tons of cases where you need to identify the language of a document first. While you’ll find most implementations in Perl, Java seems to be lacking in this domain.
JLangDetect is a pure Java implementation of a language detector. It provides a toolkit for training language recognition, and a simple implementation of a detector.
To identify the language of a text, just pass in a string to the sensor. It will respond a language code :
String lang = detector.detectLang(text, false);
A detector is created this way :
LangDetector detector = new LangDetector();
detector.register("fr", frenchTree);
detector.register("en", englishTree);
JLangDetect is based on n-gram tokenization. Basically, texts are tokenized with different token sizes. For example, given the text ``cat'', n-gram tokenization for 1 to 3 token sizes will produce the following tokens :
c
a
t
ca
at
cat
The idea is to tokenize a large set of documents in a given language and record token statistics. When you need to identify a language, then you’ll tokenize it the same way, and you’ll be able to score the input string against several token stats.
For now, JLangDetect stores token statistics in a memory-based structure called a gram tree. A gram tree will record the number of times a given n-gram is found in the document library (called corpus).
There are ways to obtain a more compact representation, that will be likely found in future releases, but this one works like a charm. I’ve made it Serializable so that storing/reading from file system pre-compiled gram trees is easy.
For my own tests, I’ve searched a corpus which would satisfy the best conditions for the JLangDetect algorithm. It includes :
the fact that the corpus size should be rather the same for every language
if not, the characters used in the corpus should be rather different (for example, it’s easy to discriminate between french and russian, because they use different characters, however, if you want to differenciate french and english, you need more accurate n-gram statistics).
A ``parallel corpus'' is a good candidate for that : it’s a corpus for which the text is simply the same, translated in different languages. I’ve made my test using the European Parliament Proceedings Parallel Corpus. It covers the proceedings of the european parliament from 1996 to 2006, which is up to 40 million words per language.
Translations exists for the following languages :
danish (da)
german (de)
greek (el)
english (en)
spanish (es)
finnish (fi)
french (fr)
italian (it)
dutch (nl)
portuguese (pt)
swedish (sv)
You’ll find a precompiled version of this corpus at the bottom of this page, if you wish to use it with JLangDetect. On my computer with a quad-core processor (Q9450), it took less than 10 minutes to process the whole corpus, I’ve optimized the importer for multi-core systems.
JLangDetect, with the previous corpus, does really well with both short and long texts. However, it will do better with longer texts (if anyone can tell me in which language is written the word ``chocolate''…).
Here’s a simple output of JLangDetect. OK means language has been detected properly. Error means not.
langof("un texte en français") = fr : OK langof("a text in english") = en : OK langof("un texto en español") = es : OK langof("un texte un peu plus long en français") = fr : OK langof("a text a little longer in english") = en : OK langof("a little longer text in english") = en : OK langof("un texto un poco mas longo en español") = es : OK langof("J'aime les bisounours !") = fr : OK langof("Bienvenue à Montmartre !") = fr : OK langof("Welcome to London !") = en : OK langof("un piccolo testo in italiano") = it : OK langof("een kleine Nederlandse tekst") = nl : OK langof("Matching sur des lexiques") = fr : OK langof("Matching on lexicons") = en : OK langof("Une première optimisation consiste à ne tester que les sous-chaînes de taille compatibles avec le lexique.") = fr : OK langof("A otimização é a primeira prova de que não sub-canais compatível com o tamanho do léxico.") = pt : OK langof("Ensimmäinen optimointi ei pidä testata, että osa-kanavien kanssa koko sanakirja.") = fi : OK langof("chocolate") = es : Error langof("some chocolate") = it : Error langof("eating some chocolate") = en : OK
Update Chocolate is an error as I tagged it as english in my test case, but one could expect it to be spanish. It’s here to demonstrate the limits of the system for very short texts. As for ``longo'', long time I’ve not written spanish. Bisounours is a french joke ;) Feel free to comment ;)
JLangDetect is licensed under Apache 2.0.
Binary : jlangdetect-0.1.jar
Source: jlangdetect-0.1-sources.jar
Javadoc: jlangdetect-0.1-javadoc.jar
Europarl pre-compiled corpus: ngrams-europarl.zip