Frequent visitors will recognize that I have an interest in human languages, including an interest in how to manipulate them in software. bLaTheR is an example of a program that works with human languages, and it claims to be able to blather in any language you give it.
Part of manipulating a language successfully is being able to identify it, and then to apply the appropriate set of grammar rules to the text. I present polyGLOB, a code module to do just that. To get the source:
Download the polyGLOB tarball. (30 KB)
The version of 16 December 2011 distinguishes English, French, Italian, Spanish, Portuguese, and German.
It returns a confidence factor in the range of 0 to 1, with the values toward the high end indicating more confidence. The confidence factor is determined by both the closeness of fit to the source language, and the length of the text; longer texts create more certainty, although one thousand letters is more than enough.
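To make the idea of combining fit and length concrete, here is a minimal sketch in C. It is not polyGLOB's actual formula: the function name confidence, the linear ramp, and the saturation constant are all assumptions chosen for illustration. The only thing taken from the description above is that both factors matter and that length stops adding certainty by about one thousand letters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical saturation point: beyond this many letters, extra text
   no longer raises the confidence. Chosen to match the "one thousand
   letters is more than enough" remark; not taken from polyGLOB. */
#define SATURATION_LETTERS 1000

/* fit:     closeness of fit to the best-matching language, in [0, 1]
   letters: number of letters that were analyzed
   Returns a value in [0, 1]. Illustrative only. */
double confidence(double fit, size_t letters)
{
    assert(fit >= 0.0 && fit <= 1.0);

    /* Short texts are weighted down linearly; long texts get full weight. */
    double lengthWeight = letters >= SATURATION_LETTERS
        ? 1.0
        : (double)letters / SATURATION_LETTERS;

    return fit * lengthWeight;
}
```

With this sketch, a good fit on a short phrase scores lower than the same fit on a long paragraph, and anything past the saturation point scores the same.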
I will add a few more languages after appropriate testing. New versions of the source will be identified by the tarball's tooltip. I have thought seriously about adding Latin as a test case, but there is the choice of Classical Latin, Renaissance Latin, and dare we leave out Pig Latin? I have not given up the idea.
I thought about doing this in the style of Dr. D. Richard Hipp and having one grand source file, but that kind of delivery does not work as well for C++, particularly if one wants to provide interfaces to other languages. So, this is a typical bundle of C++ source, together with an external function that should allow the basic functionality to be obtained from any other language that supports a C-callable interface.
The external function is named polyGLOB. Catchy, eh? It is invoked like this in C:
    char ISOlanguageCode[3];  /* room for a two-letter code such as en, plus the trailing zero byte */
    char * textToAnalyze = "Now is the winter of our discontent, made glorious summer ....";
    polyGLOB(textToAnalyze, ISOlanguageCode);
or in PHP, something like this:
    $ISOlanguageCode = "";
    $textToAnalyze = "I met a traveler from an antique land who said, 'Two vast ...";
    // The binding must declare its second parameter by reference;
    // call-time &$var was removed in PHP 5.4.
    polyGLOB($textToAnalyze, $ISOlanguageCode);
After the call, ISOlanguageCode will contain something like en or fr, followed by the trailing zero byte so that the caller may safely treat it as a C-style string.
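Once the code comes back, a caller will often want a readable name for it. The following is a rough sketch in C, assuming the codes are the two-letter ISO 639-1 codes for the six languages listed earlier. The helper languageName is hypothetical and is not part of the polyGLOB tarball.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: maps a two-letter code to an English language name.
   The table assumes ISO 639-1 codes for the six languages named above. */
const char *languageName(const char *isoCode)
{
    static const struct { const char *code; const char *name; } table[] = {
        { "en", "English"    },
        { "fr", "French"     },
        { "it", "Italian"    },
        { "es", "Spanish"    },
        { "pt", "Portuguese" },
        { "de", "German"     },
    };

    assert(isoCode != NULL);

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(isoCode, table[i].code) == 0)
            return table[i].name;
    }
    return NULL;  /* code not in the table */
}
```

Because the result buffer is a C-style string, it can be handed straight to languageName; a NULL return means the code is not one the table knows about.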