Implementation of the DMV and CCM Parsers 0.2.0 (mloss.org, Tue, 24 Sep 2013)<html><p>====== Implementation of the DMV+CCM Parser ====== </p> <p>===== Introduction ===== </p> <p>This package includes implementations of the CCM, DMV and DMV+CCM parsers from Klein and Manning (2004), and code for testing them with the WSJ, Negra and Cast3LB corpora (English, German and Spanish respectively). A detailed description of the parsers can be found in Klein (2005). </p> <p>This work was done as part of my PhD in Computer Science at FaMAF, Universidad Nacional de Cordoba, Argentina, under the supervision of Gabriel Infante-Lopez, with a research fellowship from CONICET. </p> <p>All the software is distributed under the GNU GPL v3 license. </p> <p>===== About Version 0.2.0 ===== </p> <p>This version aims to reproduce some of the results from Klein and Manning (2004). The implemented DMV and DMV+CCM models are the versions with the one-side-first constraint. We have not yet been able to reproduce the results for these models without the one-side-first constraint. </p> <p>The following table shows the performance of the models on the WSJ10 corpus (CCM induces constituency only, so it has no Dir/Undir dependency scores): </p> <p>^ WSJ10 ^ UP ^ UR ^ UF1 ^ Dir ^ Undir ^
| DMV | 58.3 | 74.1 | 65.3 | 49.0 | 65.5 |
| CCM | 64.3 | 81.6 | 71.9 | | |
| DMV+CCM | 67.9 | 86.2 | 75.9 | 47.0 | 64.5 | </p> <p>DMV reaches the given values at the 20th iteration. CCM converges to the given values from the 40th iteration onward. DMV+CCM's performance starts to decrease at the 9th training iteration, after reaching a peak of UF1 = 76.1% and Dir = 47.2%. The results in the table for DMV+CCM were, somewhat arbitrarily, taken from the 10th iteration. 
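</p> <p>The UF1 column is the harmonic mean of UP and UR, so the table can be checked directly (agreement is up to rounding, since the reported precision and recall are themselves rounded): </p>

```python
# UF1 is the harmonic mean of unlabeled precision (UP) and recall (UR).
def uf1(up, ur):
    return 2 * up * ur / (up + ur)

print(uf1(58.3, 74.1))  # DMV: ~65.3
print(uf1(64.3, 81.6))  # CCM: ~71.9
print(uf1(67.9, 86.2))  # DMV+CCM: ~76.0 (the table reports 75.9)
```
<p>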
</p> <p>===== Dependencies ===== </p> <pre><code>* nltk:
* lq-nlp-commons:
</code></pre><p>===== Installation and Configuration ===== </p> <p>Extract lq-nlp-commons.tar.gz and lq-dmvccm.tar.gz into a folder and add the new folders to the PYTHONPATH. For instance: </p> <pre><code># tar -zxvf lq-nlp-commons.tar.gz
# tar -zxvf lq-dmvccm.tar.gz
# export PYTHONPATH=`pwd`/lq-nlp-commons-0.1.0/:`pwd`/lq-dmvccm-0.1.0/
</code></pre><p>===== Usage ===== </p> <p>==== Quickstart ==== </p> <p>To train (with 10 iterations) and test the CCM parser with the WSJ10, do the following (replace 'my_wsj_path' with the path to the WSJ corpus): </p> <pre><code># python
&gt;&gt;&gt; import wsj10
&gt;&gt;&gt; tb = wsj10.WSJ10(basedir='my_wsj_path')
&gt;&gt;&gt; from dmvccm import ccm
&gt;&gt;&gt; m = ccm.CCM(tb)
&gt;&gt;&gt; m.train(10)
&gt;&gt;&gt; m.test()
</code></pre><p>To parse a sentence with the resulting parser, do something like this: </p> <pre><code>&gt;&gt;&gt; s = 'DT NNP NN VBD DT VBZ DT JJ NN'.split()
&gt;&gt;&gt; (b, p) = m.parse(s)
&gt;&gt;&gt; b.brackets
set([(6, 9), (1, 3), (4, 9), (4, 6), (3, 9), (0, 3), (7, 9)])
&gt;&gt;&gt; t = b.treefy(s)
&gt;&gt;&gt; t.draw()
</code></pre><p>==== Parser Instantiation, Training and Testing ==== </p> <p>The WSJ10 corpus is used by default to train the parsers. It is automatically extracted from the WSJ corpus, which you must place in a folder named wsj_comb (or edit lq-nlp-commons/, or read below). 
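</p> <p>WSJ10 is the subset of WSJ sentences of length at most ten after punctuation is removed (Klein and Manning, 2004). As a standalone sketch of that filter (the tag set and function below are illustrative assumptions, not lq-nlp-commons' actual extraction code): </p>

```python
# Illustrative sketch of the WSJ10 length filter: keep a sentence if it
# has at most 10 tokens once punctuation tags are dropped. The PUNCT set
# here is a simplified assumption, not the package's actual tag list.
PUNCT = {'.', ',', ':', '``', "''", '-LRB-', '-RRB-'}

def is_wsj10(tags, n=10):
    return len([t for t in tags if t not in PUNCT]) <= n

print(is_wsj10('DT NNP NN VBD DT VBZ DT JJ NN .'.split()))  # True (9 tokens)
print(is_wsj10(['NN'] * 11))                                # False
```
<p>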
</p> <p>For instance, to train (with 10 iterations) and test the CCM parser with the WSJ10, do the following: </p> <pre><code># python
&gt;&gt;&gt; from dmvccm import ccm
&gt;&gt;&gt; m = ccm.CCM()
&gt;&gt;&gt; m.train(10)
&gt;&gt;&gt; m.test()
</code></pre><p>To specify the treebanks explicitly: </p> <pre><code>&gt;&gt;&gt; import wsj10, negra10, cast3lb10
&gt;&gt;&gt; tb1 = wsj10.WSJ10()
&gt;&gt;&gt; tb2 = negra10.Negra10()
&gt;&gt;&gt; tb3 = cast3lb10.Cast3LB10()
&gt;&gt;&gt; m1 = ccm.CCM(tb1)
&gt;&gt;&gt; m2 = ccm.CCM(tb2)
&gt;&gt;&gt; m3 = ccm.CCM(tb3)
</code></pre><p>The Negra corpus must be in a file negra-corpus/negra-corpus2.penn (in Penn format), and the Cast3LB corpus in a folder named 3lb-cast. </p> <p>To use alternative locations for the treebanks, use the parameter basedir when creating the object. For instance: </p> <pre><code>&gt;&gt;&gt; import wsj10
&gt;&gt;&gt; tb = wsj10.WSJ10(basedir='my_wsj_path')
</code></pre><p>(similarly for Negra and Cast3LB) </p> <p>When loaded for the first time, the extracted corpora are saved into files to avoid having to process the entire treebanks again. The files are saved in the NLTK data path ([0]), usually $HOME/nltk_data, to be loaded in future instantiations of the treebanks. </p> <p>The DMV and DMV+CCM parsers are in the classes dmv.DMV and dmvccm.DMVCCM. The current implementation of DMV+CCM has the one-side-first constraint. They are used the same way as CCM. For instance: </p> <pre><code>&gt;&gt;&gt; from dmvccm import dmv, dmvccm
&gt;&gt;&gt; m1 = dmv.DMV()
&gt;&gt;&gt; m1.train(10)
&gt;&gt;&gt; m1.test()
&gt;&gt;&gt; m2 = dmvccm.DMVCCM()
&gt;&gt;&gt; m2.train(10)
&gt;&gt;&gt; m2.test()
</code></pre><p>Note that training DMV and DMV+CCM is much slower than training CCM. A single training step can take more than 20 minutes. </p> <p>==== Parser Usage ==== </p> <p>Once you have a trained instance of CCM, DMV or DMV+CCM, you can parse sentences with the parse() method. 
You may give a list of words or an instance of the sentence.Sentence class from lq-nlp-commons. The parse() method returns a pair (b, p), where b is an instance of bracketing.Bracketing and p is the probability of the bracketing. You can get the set of brackets from b.brackets or convert the bracketing to a tree with b.treefy(). </p> <p>For instance: </p> <pre><code>&gt;&gt;&gt; from dmvccm import ccm
&gt;&gt;&gt; m = ccm.CCM()
&gt;&gt;&gt; m.train(2)
&gt;&gt;&gt; s = 'DT NNP NN VBD DT VBZ DT JJ NN'.split()
&gt;&gt;&gt; (b, p) = m.parse(s)
&gt;&gt;&gt; b.brackets
set([(6, 9), (1, 3), (4, 9), (4, 6), (3, 9), (0, 3), (7, 9)])
&gt;&gt;&gt; t = b.treefy(s)
&gt;&gt;&gt; t.draw()
</code></pre><p>In the case of DMV and DMV+CCM, you can also use the method dep_parse() to get the dependency structure parsed by these models (Klein, 2005). </p> <p>For instance: </p> <pre><code>&gt;&gt;&gt; from dmvccm import dmv
&gt;&gt;&gt; m = dmv.DMV()
&gt;&gt;&gt; m.train(2)
&gt;&gt;&gt; s = 'DT NNP NN VBD DT VBZ DT JJ NN'.split()
&gt;&gt;&gt; (t, p) = m.dep_parse(s)
&gt;&gt;&gt; t.draw()
</code></pre><p>==== References ==== </p> <p>Luque, F. M. (2011). Una Implementación del Modelo DMV+CCM para Parsing No Supervisado. 2do Workshop Argentino en Procesamiento de Lenguaje Natural, Córdoba, Argentina. </p> <p>Klein, D. and Manning, C. D. (2004). Corpus-based induction of syntactic structure: Models of dependency and constituency. In ACL, pages 478-485. </p> <p>Klein, D. (2005). The Unsupervised Learning of Natural Language Structure. PhD thesis, Stanford University. </p></html>Franco M. Luque. Tags: language processing, parsing, unsupervised learning