Bilingual Text Classification 0.1
http://mloss.org/software/view/247/

The proliferation of multilingual documentation is a common phenomenon in many official institutions and private companies. In many cases this textual information needs to be categorised by hand, a time-consuming and arduous task.

This software package implements a series of statistical models for bilingual text classification trained with the EM algorithm. Data samples must be provided as triples (c, x, y), where c is the class label, x is a text in a source language and y is its corresponding translation in a target language. Classification is performed according to Bayes' rule:

    c* = argmax_c p(c|x,y) = argmax_c p(x,y|c)·p(c)

where p(x,y|c) is a class-conditional bilingual probability, assumed to be generated by a mixture model in which each component t combines a translation model and a language model:

    p(x,y|c) = sum_t p(x,y,t|c) = sum_t p(x|y,t,c)·p(y|t,c)·p(t|c)

Depending on the assumptions made about the translation model p(x|y,t,c) and the language model p(y|t,c), different instantiations of bilingual text classifiers are obtained. Two general approaches can be devised.

The first approach models each language independently, under a naive cross-lingual independence assumption. Its implementation, "1g1gmc", represents each language with a unigram model:

    p(x|y,t,c) := p(x|t,c) := prod_i p(x_i|t,c)
    p(y|t,c)   := prod_j p(y_j|t,c)

The second approach is a natural evolution of the first that takes word correlations across languages into account by combining the well-known IBM translation models with an n-gram model. Among these models, the following were implemented in this software package:

1gM1mc) Unigram-M1 Mixture Model (M1 is also known as IBM Model 1)

    p(x|y,t,c) := prod_j sum_i p(x_j,a_j|y_i,t,c)
    p(y|t,c)   := prod_j p(y_j|t,c)

1gM2mc) Unigram-M2 Mixture Model (M2 is also known as IBM Model 2)

    p(x|y,t,c) := prod_j sum_i p(i|j,|y|,t,c)·p(x_j,a_j|y_i,t,c)
    p(y|t,c)   := prod_j p(y_j|t,c)

where |y| is the number of words in y.

2gM2mc) Bigram-M2 Mixture Model

    p(x|y,t,c) := prod_j sum_i p(i|j,|y|,t,c)·p(x_j,a_j|y_i,t,c)
    p(y|t,c)   := prod_j p(y_j|y_{j-1},t,c)

The implementations m1g1gmc and m1gM1mc are straightforward extensions to multi-label bilingual text classification. In this case, given a fixed number k of class labels to be assigned to each text, the k most probable classes according to p(c|x,y) are returned.
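For concreteness, the following minimal Python sketch (not code from the package, whose actual interface is not described here) shows how the 1g1gmc decision rule can be evaluated in log space, together with the k-most-probable-classes extension used for multi-label classification. The parameter layout (class_prior, mix_weight, src_unigram, trg_unigram) and the probability floor for unseen words are illustrative assumptions:

    import math

    def log_joint(x_words, y_words, c, params):
        """log p(x,y|c) under the unigram-unigram mixture (1g1gmc)."""
        log_terms = []
        for t in range(params["n_components"]):
            lp = math.log(params["mix_weight"][c][t])             # log p(t|c)
            for w in x_words:                                     # log p(x|t,c)
                lp += math.log(params["src_unigram"][c][t].get(w, 1e-10))
            for w in y_words:                                     # log p(y|t,c)
                lp += math.log(params["trg_unigram"][c][t].get(w, 1e-10))
            log_terms.append(lp)
        m = max(log_terms)                                        # log-sum-exp over t
        return m + math.log(sum(math.exp(lp - m) for lp in log_terms))

    def classify(x_words, y_words, params, k=1):
        """Bayes' rule: rank classes by log p(c) + log p(x,y|c); k > 1 gives
        the multi-label variant returning the k most probable classes."""
        scores = {c: math.log(params["class_prior"][c])
                     + log_joint(x_words, y_words, c, params)
                  for c in params["class_prior"]}
        return sorted(scores, key=scores.get, reverse=True)[:k]

For the IBM-model variants, the unigram factor over x would be replaced by the corresponding sum over alignments given above.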
At each iteration of the EM algorithm, the single-label classifiers output, from left to right:

- Information on the number of mixture components and the iteration number.

- The parameter values, depending on the model.

- The log-likelihood, the variation in log-likelihood, the classification error rate (in percentage) and the variation in error rate for the training, validation (when implemented) and test sets.

The multi-label classifiers, in addition to the information on the number of mixture components, the iteration and the parameter values, output from left to right:

- The log-likelihood and the variation in log-likelihood for the training set.

- Precision and recall on the test set when the number of class labels requested from the classifier is (see the evaluation sketch at the end of this description):

  1) Only one class.

  2) The average number of class labels per text in the training set.

  3) Twice the average number of class labels.

  4) The exact number of class labels for that text (oracle mode).

Authors: Jorge Civera, Alfons Juan
Posted: Fri, 09 Apr 2010 15:13:08 -0000
Comments: http://mloss.org/software/rss/comments/247
Tags: multilabel classification, natural language processing, naive bayes, em, mixture models, icml2010, bilingual text classification, crosslingual text classification, text classification, alignment model
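As a closing illustration of the evaluation protocol above (again an assumption-laden sketch rather than the package's actual code), the function below computes macro-averaged precision and recall for the four label-request settings, reusing the hypothetical classify() function from the earlier sketch:

    from statistics import mean

    def precision_recall(test_set, params, train_avg_labels, mode, classify):
        """test_set: list of (x_words, y_words, true_labels) triples."""
        precisions, recalls = [], []
        for x_words, y_words, true_labels in test_set:
            if mode == "one":           # 1) only one class
                k = 1
            elif mode == "average":     # 2) average number of labels per training text
                k = round(train_avg_labels)
            elif mode == "twice":       # 3) twice that average
                k = round(2 * train_avg_labels)
            else:                       # 4) oracle: exact number of labels of this text
                k = len(true_labels)
            predicted = set(classify(x_words, y_words, params, k=k))
            hits = len(predicted & set(true_labels))
            precisions.append(hits / max(len(predicted), 1))
            recalls.append(hits / max(len(true_labels), 1))
        return mean(precisions), mean(recalls)

Whether the package micro- or macro-averages these figures is not stated in the description; the macro average used here is only one plausible choice.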