Bilingual Text Classification 0.1

The proliferation of multilingual documentation is a common phenomenon in many official institutions and private companies. In many cases this textual information must be categorised by hand, a time-consuming and arduous task.

This software package implements a series of statistical models for bilingual text classification trained with the EM algorithm. In this context, data samples must be provided in the format cxy, where c is the class label, x is a text in a source language and y is its corresponding translation in a target language. Classification is performed according to Bayes' rule:

    c* = argmax_c p(c|x,y) = argmax_c p(x,y|c) · p(c)

where p(x,y|c) is a class-conditional bilingual probability assumed to be generated by a t-component mixture model combining a translation model and a language model:

    p(x,y|c) = sum_t p(x,y,t|c) = sum_t p(x|y,t,c) · p(y|t,c) · p(t|c)

Depending on the assumptions made about the translation model p(x|y,t,c) and the language model p(y|t,c), different instantiations of bilingual text classifiers are obtained. Two general approaches can be devised.

The first approach models each language independently, under a naive crosslingual-independence assumption. Its implementation, "1g1gmc", represents each language as a unigram model:

    p(x|y,t,c) := p(x|t,c) := prod_i p(x_i|t,c)
    p(y|t,c)   := prod_j p(y_j|t,c)

The second approach is a natural evolution of the first that takes word correlation across languages into account, combining the well-known IBM translation models with an n-gram model. The following models are implemented in this software package:

1gM1mc) Unigram-M1 mixture model (M1 is also known as IBM Model 1):

    p(x|y,t,c) := prod_j sum_i p(x_j, a_j | y_i, t, c)
    p(y|t,c)   := prod_j p(y_j|t,c)

1gM2mc) Unigram-M2 mixture model (M2 is also known as IBM Model 2):

    p(x|y,t,c) := prod_j sum_i p(i | j, |y|, t, c) · p(x_j, a_j | y_i, t, c)
    p(y|t,c)   := prod_j p(y_j|t,c)

where |y| is the number of words in y.

2gM2mc) Bigram-M2 mixture model:

    p(x|y,t,c) := prod_j sum_i p(i | j, |y|, t, c) · p(x_j, a_j | y_i, t, c)
    p(y|t,c)   := prod_j p(y_j | y_{j-1}, t, c)

The implementations m1g1gmc and m1gM1mc are straightforward extensions to multi-label bilingual text classification. In this case, given a fixed number k of class labels to be assigned to each text, the k most probable classes according to p(c|x,y) are returned.
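To make the classification rule concrete, the following is a minimal sketch, not the package's actual code, of how a 1g1gmc-style classifier scores a sentence pair: p(x,y|c) is a mixture over components t of two unigram models, and the k most probable classes under p(c|x,y) are returned (k = 1 in the single-label case). The parameter tables p_class, p_comp, p_word_x and p_word_y are hypothetical stand-ins for the EM-trained estimates.

    import math

    def log_joint(x_words, y_words, c, p_class, p_comp, p_word_x, p_word_y, floor=1e-12):
        """log p(c) + log p(x,y|c) for a unigram-unigram (1g1gmc-style) mixture."""
        per_component = []
        for t, p_t in p_comp[c].items():                               # p(t|c)
            lp = math.log(p_t)
            lp += sum(math.log(p_word_x[(c, t)].get(w, floor)) for w in x_words)  # prod_i p(x_i|t,c)
            lp += sum(math.log(p_word_y[(c, t)].get(w, floor)) for w in y_words)  # prod_j p(y_j|t,c)
            per_component.append(lp)
        m = max(per_component)                                         # log-sum-exp over components
        return math.log(p_class[c]) + m + math.log(sum(math.exp(v - m) for v in per_component))

    def classify(x_words, y_words, model, k=1):
        """Bayes rule: return the k most probable class labels under p(c|x,y)."""
        ranked = sorted(
            model["classes"],
            key=lambda c: log_joint(x_words, y_words, c, model["p_class"], model["p_comp"],
                                    model["p_word_x"], model["p_word_y"]),
            reverse=True)
        return ranked[:k]

    # Hypothetical usage once `trained_model` holds EM-estimated tables:
    #   classify(["la", "casa"], ["the", "house"], trained_model, k=1)

The M1- and M2-based variants differ only in how log p(x,y|c) is computed inside each mixture component, replacing the source-language unigram term with the corresponding translation-model sum over alignments.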
Single-label classifiers output, for each iteration of the EM algorithm and from left to right:

- Information about the number of mixture components and the iteration.
- Values of the parameters, depending on the model.
- Log-likelihood, variation in log-likelihood, percentage error rate and variation in percentage error rate for the training, validation (when implemented) and test sets.

Multi-label classifiers, in addition to the information regarding the number of mixture components, iteration and parameter values, output from left to right:

- Log-likelihood and variation in log-likelihood for the training set.
- Precision and recall for the test set, where the number of class labels requested from the classifier is (a sketch of this evaluation is given at the end of this description):

    1) only one class;
    2) the average number of class labels per text in the training set;
    3) twice the average number of class labels;
    4) the exact number of class labels for that text (oracle mode).

Authors: Jorge Civera, Alfons Juan
Released: Fri, 09 Apr 2010
Keywords: classification, natural language processing, naive bayes, em, mixture models, icml2010, bilingual text classification, crosslingual text classification, text classification, alignment model
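For reference, here is a minimal sketch of the multi-label precision/recall evaluation described above. The helper names, and the use of micro-averaging over the test set, are assumptions made for illustration; the package may aggregate these figures differently. classify_top_k(x, y, k) stands for a trained multi-label classifier returning the k most probable labels, and each test-set entry is a triple (x, y, reference_labels).

    def precision_recall(predicted, reference):
        """Micro-averaged precision and recall over lists of predicted/reference label sets.
        (Micro-averaging is an assumption; the package's aggregation is not documented here.)"""
        correct = sum(len(set(p) & set(r)) for p, r in zip(predicted, reference))
        n_pred = sum(len(set(p)) for p in predicted)
        n_ref = sum(len(set(r)) for r in reference)
        return correct / n_pred, correct / n_ref

    def evaluate(classify_top_k, test_set, avg_train_labels):
        """Run the four label-count settings listed above; classify_top_k(x, y, k) is hypothetical."""
        settings = {
            "one label":         lambda ref: 1,
            "train average":     lambda ref: round(avg_train_labels),
            "twice the average": lambda ref: 2 * round(avg_train_labels),
            "oracle":            lambda ref: len(ref),
        }
        results = {}
        for name, choose_k in settings.items():
            predictions = [classify_top_k(x, y, choose_k(ref)) for x, y, ref in test_set]
            references = [ref for _, _, ref in test_set]
            results[name] = precision_recall(predictions, references)
        return results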