
jExSLI

jExSLI is a simple text language identifier. It can be used, for example, to find out in which language a text input to your application was written. It is written in Java (compatible with any application written in Java 1.5 or later) and is distributed as a single jar file.

The initial list of languages contains the 20 most commonly used languages and can easily be extended.

The tool applies a very simple text categorization approach based on the similarity of documents represented as vectors of terms with their tf*idf values. To exploit this idea, each language has to be represented as such a vector, and for this we use the most frequent words of the language and their frequencies. According to our evaluation, this is a reasonable approach that performs well not only for long texts but also for phrases longer than 5 words.
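The idea described above can be illustrated with a minimal sketch. It is not the jExSLI implementation: the class name `TfIdfLanguageSketch`, the toy word frequencies, the smoothed idf formula and the tokenization are all assumptions made for illustration, and the code uses `Map.of`, so it needs Java 9 or later.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of tf*idf + cosine similarity; not jExSLI's actual code.
class TfIdfLanguageSketch {

    // Toy language profiles: word -> relative frequency (made-up values).
    static final Map<String, Map<String, Double>> PROFILES = Map.of(
        "English", Map.of("the", 0.06, "of", 0.03, "and", 0.03, "is", 0.01, "la", 0.0001),
        "French", Map.of("la", 0.03, "de", 0.04, "est", 0.01, "et", 0.02, "vie", 0.001),
        "Spanish", Map.of("la", 0.03, "de", 0.05, "es", 0.01, "y", 0.03, "vida", 0.001));

    // Smoothed idf (an assumption): words shared by many languages weigh less,
    // words unknown to every profile weigh nothing.
    static double idf(String word) {
        long df = PROFILES.values().stream().filter(p -> p.containsKey(word)).count();
        return df == 0 ? 0.0 : Math.log(1.0 + (double) PROFILES.size() / df);
    }

    // Turns term frequencies into tf*idf weights.
    static Map<String, Double> weigh(Map<String, Double> tf) {
        Map<String, Double> vector = new HashMap<>();
        tf.forEach((word, freq) -> vector.put(word, freq * idf(word)));
        return vector;
    }

    // Cosine similarity between two sparse vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        for (double x : b.values()) {
            normB += x * x;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the most similar language, or null if nothing matches at all.
    static String identify(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1.0, Double::sum);
            }
        }
        Map<String, Double> query = weigh(tf);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Double>> e : PROFILES.entrySet()) {
            double score = cosine(query, weigh(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(identify("la vie est belle"));
        System.out.println(identify("the meaning of life is the question"));
    }
}
```

With these toy profiles, the word "la" occurs in all three languages and therefore contributes little, while "vie" and "est" occur only in the French profile and dominate the score.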

Alternative software includes TextCat (the best-known one, a Perl library covering 69 languages) and lc4j (a Java library), both based on n-gram document classification. These tools are more sophisticated, while jExSLI lets the user do language identification in the simplest possible way.

To use language identification inside your Java code, add jExSLI.jar as an external jar library to your build path, create an instance of the class LanguageIdentifier and call its method identify(String text) to identify the language of your text. The method returns the name of a language from the list of available languages, or null if the language is not recognized.

    import org.fbk.cit.hlt.langidentifier.LanguageIdentifier;
    .............
    LanguageIdentifier languageIdentifier = new LanguageIdentifier();
    System.out.println(languageIdentifier.identify("c'est la vie"));

Additionally, you can tune the language identifier parameters to get a more robust classification when the input may be in an unknown language. This is done by setting a threshold and removing features that occur often in different languages. To enable it, use the LanguageIdentifier(boolean setDefaultThreshold) constructor with the parameter set to true.
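The two measures mentioned above can be pictured with a small sketch. It only illustrates the general technique and is not what the constructor does internally: the class `UnknownLanguageSketch`, both method names, the threshold value and the "more than N languages" rule are hypothetical. It uses `Map.of` and `Set.of`, so it needs Java 9 or later.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of thresholding and shared-feature removal.
class UnknownLanguageSketch {

    // Drops every word that occurs in more than maxLanguages profiles,
    // since such words say little about which language a text is in.
    static Map<String, Set<String>> dropSharedWords(Map<String, Set<String>> profiles,
                                                    int maxLanguages) {
        Map<String, Integer> languagesPerWord = new HashMap<>();
        for (Set<String> words : profiles.values()) {
            for (String word : words) {
                languagesPerWord.merge(word, 1, Integer::sum);
            }
        }
        Map<String, Set<String>> filtered = new TreeMap<>();
        profiles.forEach((language, words) -> {
            Set<String> kept = new TreeSet<>();
            for (String word : words) {
                if (languagesPerWord.get(word) <= maxLanguages) {
                    kept.add(word);
                }
            }
            filtered.put(language, kept);
        });
        return filtered;
    }

    // Returns the best-scoring language, or null when even the best score
    // stays below the threshold (the input is treated as unknown).
    static String pick(Map<String, Double> scores, double threshold) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return bestScore >= threshold ? best : null;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> profiles = Map.of(
            "Italian", Set.of("la", "che"),
            "Spanish", Set.of("la", "que"));
        System.out.println(dropSharedWords(profiles, 1));
        System.out.println(pick(Map.of("English", 0.4, "French", 0.1), 0.2));
        System.out.println(pick(Map.of("English", 0.05), 0.2));
    }
}
```

The trade-off is the usual one: a higher threshold rejects more texts in unsupported languages, but also more short inputs in supported ones.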

The list of languages that the tool can recognize is specified in the file languages.txt, which is included in the jar file. To add a new language, first add it to this file under the name you want to use afterwards. The second step is to create a file that contains the most frequent words of this language and their frequencies. The file should be named LanguageNameFreq.txt and placed in the directory freqs/ of the jar file.

Use

  • $ jar uf jExSLI.jar languages.txt to add the modified list of languages to the jar file, and
  • $ jar uf jExSLI.jar freqs/LanguageNameFreq.txt to add the frequency file.

The frequency file can be created for any common language from Wikipedia pages using the LanguageIdentifierUtils class and its methods createTopPagesFromEnglish and createMostFreqWords. Note that these methods require an additional HTML parser library (specifically htmlparser.jar) and use an internet connection to load the wiki pages.

From the command line you can do it as follows:

  1. Run $ java -cp jExSLI.jar eu.fbk.hlt.LanguageIdentifierUtils -1 languageWikiAbbr englishTopPagesFile to create a list of wiki pages for the target language from a list of pages of the English wiki. The latter is included in the jar file or can be taken from some other source. The first parameter is the abbreviation used in the wiki page address, e.g., 'en' for English, 'de' for German, etc.
  2. Run $ java -cp jExSLI.jar eu.fbk.hlt.LanguageIdentifierUtils -2 languageWikiAbbr topPagesFile to produce a frequency file for the specified language. The file 'topPagesFile' should contain the list of wiki pages for the given language (created in the previous step), and 'languageWikiAbbr' is the language abbreviation as before. The output file is called 'languageWikiAbbrFreq.txt' and should be included in the jar file as discussed above (note that the file must be renamed to match the language name in the list).
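To give a rough idea of what a frequency file captures, the sketch below counts the most frequent words of a text held in memory. It does not reproduce createMostFreqWords: the class `WordFrequencySketch`, the tokenization and the tab-separated "word, count" output are assumptions, so check an existing file in freqs/ for the format the tool actually expects.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: counts word occurrences and keeps the most frequent ones.
class WordFrequencySketch {

    // Returns up to `limit` words ordered by descending count, ties broken alphabetically.
    static List<Map.Entry<String, Integer>> mostFrequentWords(String text, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        // Lower-case the text and split on anything that is not a letter.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
            .limit(limit)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String sample = "la vie est belle, la vie";
        // Output format (word, tab, count) is an assumption, not jExSLI's format.
        for (Map.Entry<String, Integer> e : mostFrequentWords(sample, 3)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In practice the input would be the concatenated text of many pages, not a single sentence, so that the counts reflect the language and not one document.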

The jar file containing all the necessary functionality can be downloaded here. The default set of languages is English, Italian, Spanish and German. To specify your own set of languages available to the application, change the languages.txt file (see the instructions above).

The full list of currently supported languages is: Arabic, Catalan, Chinese, Czech, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Japanese, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Ukrainian.

The frequency models for these languages are available for separate download here. The list of top Wikipedia pages for these languages (around 1000 pages) can also be useful.

jExSLI is licensed under Apache License 2.0.

It was developed as part of a summer internship project at the FBK HLT group by Kristina Gulordava.
