Tuesday, June 21, 2011

Language Identification System: How to recognize languages other than English

When you are trying to understand humans, the first step is to be able to hear or read them, and the second is to be able to identify the language they use.
You understand how difficult that is the first time you find yourself in a place where nobody speaks your language.

What's the problem?
When you have to process and understand a text with text engineering tools, the first step is to identify the language so that you can use the right set of data and tools. You won't use the same tokenizer and processing tools for English as for Hebrew.
The problem gets more complicated if you are trying to tell apart languages that look alike but are different, like Italian and Spanish or, even worse, Catalan and Portuguese. For example, "presunto" is written exactly the same in Spanish and Portuguese but means completely different things in the two languages.

But how is it done? The classical approach is based on statistical methods: the text is compared against n-gram profiles built from a training set. This dictionary-based solution is quite good if you are comparing a "long" sentence (more than 16 characters), if the text uses only one language (for example, not mixing French, Spanish and Italian), and if you can spend seconds on every piece of text.
But our problem is that we can spend only a few milliseconds, we usually have fewer than 16 characters, and people tend to mix languages. I can't understand why, but people are bilingual, even polyglots. Can you believe it?
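
To make the classical approach concrete, here is a minimal sketch of a character n-gram detector using the rank-order ("out-of-place") distance between profiles. It is not the code used in our experiment: the two training sentences, the profile size and all function names are assumptions chosen only for illustration, and a real system would train on far more text.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Return the most frequent character n-grams of `text`, most frequent first."""
    # Normalise whitespace and pad so that word boundaries become part of the n-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))

# Hypothetical, tiny training texts; real profiles need much larger corpora.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog runs away from the fox",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego el perro huye del zorro",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}

def detect_ngram(text):
    """Return the language whose profile is closest to the profile of `text`."""
    doc = ngram_profile(text)
    return min(PROFILES, key=lambda lang: out_of_place(doc, PROFILES[lang]))
```

A text of fewer than 16 characters yields only a handful of trigrams, so its profile carries very little ranking information. That is one way to see why this family of methods struggles with the short inputs we care about.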

Our Experiment
This is why we started to research other solutions that could be quicker and could handle multilingualism.
We tried one solution based only on stopwords, one based on stems, and a mixed solution. We tried these in order to handle more languages at a time, to reduce the complexity of the dictionary, and to be able to test 10-20 languages in about 10-20 ms.
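
As an illustration of the stopword idea, the sketch below counts how many tokens of a phrase appear in a small stopword set per language. The word lists, the tokenizer and the function names are assumptions made for this example, not the dictionaries we actually used.

```python
import re

# Hypothetical, deliberately tiny stopword lists; real lists are much longer.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it", "you", "for"},
    "es": {"el", "la", "los", "y", "es", "de", "que", "en", "un", "por"},
    "fr": {"le", "la", "les", "et", "est", "de", "que", "un", "une", "pour"},
}

def tokenize(text):
    """Lowercase the text and keep runs of letters only (Unicode aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def stopword_scores(text):
    """Count, for every language, how many tokens are stopwords of that language."""
    tokens = tokenize(text)
    return {lang: sum(1 for t in tokens if t in words)
            for lang, words in STOPWORDS.items()}

def detect_stopwords(text):
    """Return the best scoring language, or None when no stopword was seen."""
    scores = stopword_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Because the lookup is one set membership test per token, the cost grows with the number of tokens rather than with the size of the dictionary. Returning the full score table from `stopword_scores` also makes it possible to notice when a phrase scores in two languages at once.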
To validate the algorithms and their speed we defined two sets of 1,000 phrases each, one with short phrases (8-20 characters) and another with long phrases (50-160 characters), in 12 different languages.
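
A validation run of this kind can be sketched as follows: each labelled phrase is put in the short (FC) or long (FL) bucket by its length, and accuracy and mean time per phrase are reported per bucket. The sample phrases, the `always_english` stand-in detector and the function names are hypothetical; the actual test sets are not reproduced here.

```python
import time

def bucket(phrase):
    """Classify a phrase as short (FC, 8-20 chars) or long (FL, 50-160 chars)."""
    n = len(phrase)
    if 8 <= n <= 20:
        return "FC"
    if 50 <= n <= 160:
        return "FL"
    return None  # outside both ranges, not part of either set

def evaluate(detector, labelled_phrases):
    """Return accuracy and mean latency in milliseconds for each bucket."""
    stats = {}
    for phrase, expected in labelled_phrases:
        name = bucket(phrase)
        if name is None:
            continue
        start = time.perf_counter()
        guess = detector(phrase)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        entry = stats.setdefault(name, {"n": 0, "correct": 0, "total_ms": 0.0})
        entry["n"] += 1
        entry["correct"] += int(guess == expected)
        entry["total_ms"] += elapsed_ms
    return {name: {"accuracy": e["correct"] / e["n"],
                   "mean_ms": e["total_ms"] / e["n"]}
            for name, e in stats.items()}

def always_english(phrase):
    """Stand-in detector used only to exercise the harness."""
    return "en"

# Hypothetical labelled sample: (phrase, expected language code).
SAMPLE = [
    ("good morning", "en"),
    ("buenos días", "es"),
    ("this is a much longer english sentence used only as an example", "en"),
]

if __name__ == "__main__":
    for name, result in evaluate(always_english, SAMPLE).items():
        print(name, result)
```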
We compared our algorithm against the Google language detection API. The Google one is not bad: it is dictionary-based and quick, but it can be discontinued at any time, and then you lose control of your developments.
For the stemming part we chose Snowball, by Martin Porter, because it is very quick. One of the problems with stemming is that it does not apply to all the languages on earth.

And here are the results:


... and these are the times:

           Snowball   StopWords   Google   Hybrid
Time       4-5 ms     1-2 ms      110 ms   5 ms


Hybrid mixes the Snowball and stopword language detectors.
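
This post does not describe how the two detectors are combined, so the following is only one plausible sketch: each detector returns a score per language and the hybrid takes a weighted sum. Everything here is an assumption for illustration, including the weights, the word lists and in particular `toy_stem`, a crude suffix stripper that stands in for Snowball and is not the Snowball algorithm.

```python
# Hypothetical data, kept tiny on purpose.
STOPWORDS = {"en": {"the", "is", "and"}, "es": {"el", "es", "y"}}
KNOWN_STEMS = {"en": {"walk", "jump", "read"}, "es": {"camin", "salt", "le"}}
SUFFIXES = {"en": ("ing", "ed", "s"), "es": ("ando", "ar", "o", "a")}  # longest first

def toy_stem(word, lang):
    """Strip the first matching suffix; a stand-in for a real Snowball stemmer."""
    for suffix in SUFFIXES[lang]:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word

def stopword_scorer(text):
    """Score each language by the number of its stopwords found in the text."""
    tokens = text.lower().split()
    return {lang: sum(t in words for t in tokens) for lang, words in STOPWORDS.items()}

def stem_scorer(text):
    """Score each language by the number of tokens whose stem is known for it."""
    tokens = text.lower().split()
    return {lang: sum(toy_stem(t, lang) in stems for t in tokens)
            for lang, stems in KNOWN_STEMS.items()}

def hybrid_detect(text, scorers=(stopword_scorer, stem_scorer), weights=(1.0, 1.0)):
    """Weighted sum of per-language scores; None when nothing matched."""
    totals = {}
    for scorer, weight in zip(scorers, weights):
        for lang, score in scorer(text).items():
            totals[lang] = totals.get(lang, 0.0) + weight * score
    if not totals or max(totals.values()) <= 0:
        return None
    return max(totals, key=totals.get)
```

In this sketch the stem scorer can still decide a phrase that contains no stopwords at all, such as a single verb, which is the kind of short input where a stopword-only detector has nothing to count.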


FL = "Frases largas" ("Long sentences")
FC = "Frases cortas" ("Short sentences")
en = English
es = Spanish
eu = Basque

Conclusion:
We can detect the language with the same quality as Google in short sentences, and we can work 20 to 30 times faster than Google through its REST API.
We can detect languages not only in complete documents or corpora, but also in paragraphs and even sentences.
We can quickly change our tools and adapt to different languages, solving the problems related to multilingualism.
If you want to share your questions, knowledge or experiences about language detection systems, don't hesitate to contact me.

References:
http://www.fi.muni.cz/~xrehurek/cicling09_final.pdf
Acknowledgement:

Javier Jorge did a great job preparing the materials and the analysis behind this article.

1 comment:

  1. Hi, interesting article. What does "hybrid" mean in this context? Would you be able to share the source code of your experiment? I might want to implement this as an alternative language identification method for the Apache Tika project.
