polyGLOB

polyGLOB: ID for Human Languages (Programmatically)

Frequent visitors will recognize that I have an interest in human languages, including an interest in how to manipulate them in software. bLaTheR is an example of a program that works with human languages, and it claims to be able to blather in any language you give it.

Part of manipulating a language successfully is being able to identify it, and then to apply the appropriate set of grammar rules to the text. I present polyGLOB, a code module to do just that. To get the source:

Download the polyGLOB tarball. (30Kbytes)

What can it do?

The version of 16 December 2011 distinguishes English, French, Italian, Spanish, Portuguese, and German.

How well does it work?

It returns a confidence factor in the range of 0 to 1, with the values toward the high end indicating more confidence. The confidence factor is determined by both the closeness of fit to the source language, and the length of the text; longer texts create more certainty, although one thousand letters is more than enough.
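To make the interplay between fit and length concrete, here is a minimal sketch of one way the two could be combined. It is not polyGLOB's actual formula: the function name confidenceSketch, the exponential length weighting, and the saturation constant of 200 letters are all assumptions chosen only so that the result stays between 0 and 1 and levels off well before one thousand letters.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical illustration only; NOT the formula used inside polyGLOB.
// fit:        closeness of the text to a language profile, scaled to [0, 1].
// letters:    number of letters that were examined.
// saturation: assumed constant that sets how quickly extra length stops mattering.
double confidenceSketch(double fit, std::size_t letters, double saturation = 200.0) {
    // Clamp the fit so the result can never leave the range [0, 1].
    if (fit < 0.0) fit = 0.0;
    if (fit > 1.0) fit = 1.0;

    // Weight that rises from 0 (empty text) toward 1 (long text).
    const double lengthWeight =
        1.0 - std::exp(-static_cast<double>(letters) / saturation);

    return fit * lengthWeight;
}
```

With these assumed numbers, a text of one thousand letters already carries more than 99% of the weight it could ever get, so any further length changes the result very little.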

Do you plan to add more languages?

I will add a few more languages after appropriate testing. New versions of the source will be identified by the tarball's tooltip. I have thought seriously about adding Latin as a test case, but then I would have to choose among Classical Latin, Renaissance Latin, and (dare we leave it out?) Pig Latin. I have not given up on the idea.

What do I need?

A C++ compiler, such as the GNU compiler, that is up-to-scratch enough to support some of the new ISO C++11 features.

What is in the tarball?

I thought about doing this in the style of Dr. D. Richard Hipp and having one grand source file, but that kind of delivery does not work as well for C++, particularly if one wants to provide interfaces to other languages. So, this is a typical bundle of C++ source, together with an external function that should allow the basic functionality to be obtained from any other language that supports a C-callable interface.

  • Two .h files
    1. chardecoder.h contains the declarations for converting character sets.
    2. polyglob.h contains the declarations for the functions that identify the language.
  • Two .cpp files
    1. chardecoder.cpp contains the implementation of the char-set flattener. There are many uses for this collection of functions, which is why I put it in a separate file.
    2. polyglob.cpp contains the implementation of the language id functions, along with the externally callable function.
  • A makefile that builds a static library with the default name polyglob.a.1.

How do I build this?

  1. Put the tarball in an empty directory.
  2. Type tar -zxvf polyglob.tgz
  3. Type make.

You should be left with a file named polyglob.a.1 and a symbolic link named polyglob.a. (And your source code, of course.)

I am not a C++ programmer. What do I do?

The external function is named polyGLOB. Catchy, eh? It is invoked like this in C:

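char ISOlanguageCode[3];  /* two letters plus the trailing zero byte */
char * textToAnalyze =
  "Now is the winter of our discontent, made glorious summer ....";
/* The array name already decays to a char *, so no & is needed. */
polyGLOB(textToAnalyze, ISOlanguageCode);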

or in PHP, something like this:

$ISOlanguageCode = "";
$textToAnalyze =
  "I met a traveler from an antique land who said, 'Two vast ...";
// The binding is assumed to declare its second parameter by reference;
// a call-time & is no longer accepted as of PHP 5.4.
polyGLOB($textToAnalyze, $ISOlanguageCode);

After the call, ISOlanguageCode will contain something like en or fr, followed by the trailing zero byte so that the caller may safely treat it as a C-style string.
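
For callers who would rather get the result back as a string than manage the three-byte buffer themselves, a small wrapper along the following lines may help. This is only a sketch: the IdentifyFn signature is an assumption inferred from the calling example above (the real declaration lives in polyglob.h), and identifyLanguage is a hypothetical helper, not something shipped in the tarball.

```cpp
#include <string>

// ASSUMED signature, inferred from the calling example; check polyglob.h
// for the real declaration before relying on it.
using IdentifyFn = void (*)(const char *text, char *isoCode);

// Hypothetical helper, not part of polyGLOB: calls an identifier function
// and returns the two-letter code as a std::string.
std::string identifyLanguage(IdentifyFn identify, const std::string &text) {
    // Two letters plus the trailing zero byte, zeroed in advance.
    char code[3] = {'\0', '\0', '\0'};

    identify(text.c_str(), code);

    // Defensive: guarantee termination even if the callee wrote no zero byte.
    code[2] = '\0';
    return std::string(code);
}
```

If the real prototype matches the assumed one, the call would look like identifyLanguage(polyGLOB, text) once polyglob.a is linked in. Taking the function as a parameter also lets the helper be tried out with a stand-in before the library is built.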