Posts

Showing posts from June 2011

Language Identification System: How to recognize other languages than English

When you are trying to understand humans, the first step is to be able to hear or read them, and the second is to be able to identify the language they use. You realize how difficult this is the first time you are in a place where nobody speaks your language.

What's the problem? When you have to process and understand a text with Text Engineering tools, the first step is to identify the language so that you can use the right set of data and tools. You won't use the same tokenizer and processing tools for English as for Hebrew. The problem gets more complicated when you are trying to recognize languages that look alike but are different, like Italian and Spanish or, even worse, Catalan and Portuguese. For example, "presunto" is written exactly the same in Spanish and Portuguese but means completely different things: "alleged" in Spanish and "ham" in Portuguese.

But how is it done? The classical approach is based on statistical methods and on comparing the text with a training se