
 Description:
The proliferation of multilingual documentation is a common phenomenon in many official institutions and private companies. In many cases, this textual information needs to be categorised by hand, a time-consuming and arduous task.
This software package implements a series of statistical models for bilingual text classification trained by the EM algorithm. In this context, data samples must be provided as triplets (c, x, y), where c is the class label, x is a text in a source language and y is its corresponding translation in a target language. Classification is performed according to Bayes' rule:
c* = argmax_c p(c|x,y) = argmax_c p(x,y|c)·p(c)
where p(x,y|c) is a class-conditional bilingual probability that is assumed to be generated from a t-component mixture model combining a translation model and a language model:
p(x,y|c) = sum_t p(x,y,t|c) = sum_t p(x|y,t,c)·p(y|t,c)·p(t|c)
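As a sketch, the decision rule and mixture sum above can be written as follows; `log_prior` and `log_components` are hypothetical callables standing in for the trained parameters p(c) and p(x,y,t|c), not this package's actual interface:

```python
import math

def log_sum_exp(vals):
    # Numerically stable log(sum(exp(v) for v in vals))
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def classify(x, y, classes, log_prior, log_components):
    """c* = argmax_c p(c) * sum_t p(x,y,t|c).

    log_prior(c) -> log p(c); log_components(x, y, c) -> a list with
    log p(x,y,t|c) for each mixture component t (hypothetical callables
    supplied by the caller)."""
    return max(classes,
               key=lambda c: log_prior(c) + log_sum_exp(log_components(x, y, c)))
```

Working in log space avoids underflow when the per-word probabilities are multiplied over long texts.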
Depending on the assumptions that we make about the translation model p(x|y,t,c) and the language model p(y|t,c), we obtain different instantiations of bilingual text classifiers. However, two general approaches can be devised.
The first approach models each language independently, making a naive cross-lingual independence assumption. Its corresponding implementation is "1g1g-mc", which represents each language model as a unigram model:
p(x|y,t,c) := p(x|t,c) := prod_i p(x_i|t,c)    p(y|t,c) := prod_j p(y_j|t,c)
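Under this cross-lingual independence assumption, the contribution of one mixture component factorises over the words of each text; a minimal sketch, where `unig_x` and `unig_y` are hypothetical toy unigram tables p(w|t,c) for a fixed component t and class c:

```python
import math

def log_p_pair(x_words, y_words, unig_x, unig_y):
    """log p(x|t,c) + log p(y|t,c) for one mixture component under the
    1g1g-mc assumption: each language is an independent unigram model.
    unig_x / unig_y map words to hypothetical probabilities p(w|t,c)."""
    return (sum(math.log(unig_x[w]) for w in x_words)
            + sum(math.log(unig_y[w]) for w in y_words))
```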
The second approach is a natural evolution of the former, taking into account word correlation across languages. Here we combine the well-known IBM translation models with an n-gram model. Among these models, the following were implemented in this software package:
1gM1-mc) Unigram-M1 Mixture Model (M1 is also known as IBM Model 1)
p(x|y,t,c) := prod_j sum_i p(x_j|y_i,t,c)    p(y|t,c) := prod_j p(y_j|t,c)
1gM2-mc) Unigram-M2 Mixture Model (M2 is also known as IBM Model 2)
p(x|y,t,c) := prod_j sum_i p(i|j,|y|,t,c)·p(x_j|y_i,t,c)    p(y|t,c) := prod_j p(y_j|t,c)
where |y| is the number of words in y.
2gM2-mc) Bigram-M2 Mixture Model
p(x|y,t,c) := prod_j sum_i p(i|j,|y|,t,c)·p(x_j|y_i,t,c)    p(y|t,c) := prod_j p(y_j|y_{j-1},t,c)
The implementations m1g1g-mc and m1gM1-mc are straightforward extensions to multi-label bilingual text classification. In this case, given a fixed number k of class labels to be assigned to each text, the k most probable classes according to p(c|x,y) are returned.
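Returning the k most probable classes can be sketched as below, assuming a hypothetical dict mapping each class label to its log posterior log p(c|x,y):

```python
def top_k_classes(log_posterior, k):
    """Return the k most probable class labels (multi-label mode),
    given a dict: class label -> log p(c|x,y)."""
    return sorted(log_posterior, key=log_posterior.get, reverse=True)[:k]
```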
Single-label classifiers output, for each iteration of the EM algorithm and from left to right:
Information about the number of mixture components and iteration.
Values of the parameters depending on the model.
Log-likelihood, variation in log-likelihood, error rate (%) and variation in error rate (%) for the training, validation (when implemented) and test sets.
Multi-label classifiers, in addition to the information regarding the number of mixture components, iteration and parameter values, output from left to right:
Log-likelihood and variation in log-likelihood for the training set.
Precision and recall for the test set, where the number of class labels requested from the classifier is:
1) Only one class.
2) The average number of class labels per text in the training set.
3) Twice the average number of class labels.
4) The exact number of class labels for that text (oracle mode).
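Precision and recall for one text can be computed as sketched below (a hypothetical helper, not part of the package's output code): precision is the fraction of returned labels that are correct, and recall is the fraction of reference labels that were returned.

```python
def precision_recall(predicted, reference):
    """Precision and recall of a predicted label set against the
    reference label set of a single text."""
    hits = len(set(predicted) & set(reference))
    return hits / len(predicted), hits / len(reference)
```

In oracle mode (case 4 above), the number of predicted labels equals the number of reference labels, so precision and recall coincide.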
 Changes to previous version:
Initial Announcement on mloss.org.
 BibTeX Entry: Download
 Corresponding Paper BibTeX Entry: Download
 Supported Operating Systems: Agnostic
 Data Formats: Txt
 Tags: Multilabel Classification, Natural Language Processing, Naive Bayes, Em, Mixture Models, Icml2010, Bilingual Text Classification, Crosslingual Text Classification, Text Classification, Alignment Model
 Archive: download here