Project details for gensim

gensim 0.8.6

by Radim - December 9, 2012, 13:15:16 CET

Description:

Gensim - Python Framework for Vector Space Modelling

Gensim is a Python library for Vector Space Modelling with very large corpora. The target audience is the Natural Language Processing (NLP) community.

Features

  • All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM)
  • Intuitive interfaces:
      • easy to plug in your own input corpus/datastream (trivial streaming API; see the sketch after this list)
      • easy to extend with other Vector Space algorithms (trivial transformation API)
  • Efficient implementations of popular algorithms, such as online Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections
  • Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers
  • Extensive documentation and tutorials
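
As a sketch of the streaming API mentioned above: the following hypothetical corpus class reads one document per line from a text file and yields bag-of-words vectors lazily, and is then wrapped in a TF-IDF transformation. The class name MyCorpus, the path '/path/to/texts.txt' and the one-document-per-line format are placeholders; corpora.Dictionary, doc2bow and models.TfidfModel are gensim's own API.

>>> from gensim import corpora, models
>>>
>>> # any iterable that yields one sparse bag-of-words vector per document can serve
>>> # as a corpus, so documents are read lazily and never held in RAM all at once
>>> class MyCorpus(object):
...     def __init__(self, path, dictionary):
...         self.path = path
...         self.dictionary = dictionary
...     def __iter__(self):
...         for line in open(self.path):
...             yield self.dictionary.doc2bow(line.lower().split())
...
>>> # build the word<->id mapping from the same file (one document per line)
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('/path/to/texts.txt'))
>>> corpus = MyCorpus('/path/to/texts.txt', dictionary)
>>>
>>> # transformations wrap the streamed corpus lazily, e.g. TF-IDF weighting
>>> tfidf = models.TfidfModel(corpus)
>>> corpus_tfidf = tfidf[corpus]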

Reference example

>>> from gensim import corpora, models, similarities
>>> 
>>> # load corpus iterator from a Matrix Market file on disk
>>> corpus = corpora.MmCorpus('/path/to/corpus.mm')
>>> 
>>> # initialize a transformation (Latent Semantic Indexing with 200 latent dimensions)
>>> lsi = models.LsiModel(corpus, num_topics=200)
>>> 
>>> # convert the same corpus to latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[corpus])
>>> 
>>> # perform similarity query of another vector in LSI space against the whole corpus
>>> sims = index[query]
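
The query in the last line is assumed to be a vector already mapped into the same LSI space. A minimal sketch of how such a vector could be built, assuming the dictionary used to create corpus.mm was saved alongside it (the file name '/path/to/corpus.dict' and the sample document are placeholders; Dictionary.load, doc2bow and the lsi[...] transformation are gensim's own API):

>>> # load the dictionary that was used to build the corpus (hypothetical file name)
>>> dictionary = corpora.Dictionary.load('/path/to/corpus.dict')
>>>
>>> # convert a new document to bag-of-words, then map it into the same LSI space
>>> doc = "human computer interaction"
>>> query = lsi[dictionary.doc2bow(doc.lower().split())]

The resulting query vector can then be passed to the index exactly as in the last line of the example above.
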
Changes to previous version:
  • added the "hashing trick" (by Homer Strong)
  • support for adding target classes in SVMlight format (by Corrado Monti)
  • fixed problems with the global lemmatizer object when running in parallel on Windows
  • parallelization of Wikipedia processing + added a script version that lemmatizes the input documents
  • added a class method to initialize Dictionary from an existing corpus (by Marko Burjek)
Supported Operating Systems: Platform Independent
Data Formats: Agnostic
Tags: Latent Semantic Analysis, Latent Dirichlet Allocation, SVD, Random Projections, TF-IDF

Other available revisions

0.8.6 (December 9, 2012, 13:15:16)
  • added the "hashing trick" (by Homer Strong)
  • support for adding target classes in SVMlight format (by Corrado Monti)
  • fixed problems with the global lemmatizer object when running in parallel on Windows
  • parallelization of Wikipedia processing + added a script version that lemmatizes the input documents
  • added a class method to initialize Dictionary from an existing corpus (by Marko Burjek)

0.8.5 (July 22, 2012, 23:42:28)
  • numerous fixes to performance and stability
  • faster document similarity queries
  • document similarity server
  • full change set here

0.8.0 (June 21, 2011, 01:20:53)
  • faster document similarity queries
  • more optimizations to Latent Dirichlet Allocation (online LDA) and Latent Semantic Analysis (single-pass online SVD) (Wikipedia experiments)
  • full change set here

0.7.8 (March 29, 2011, 09:06:23)
  • optimizations to Latent Dirichlet Allocation (online LDA): 100 topics on the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now take 11h on a laptop
  • optimizations to Latent Semantic Analysis (single-pass SVD): 400 factors on the English Wikipedia take 5.25h
  • distributed LDA and LSA over a cluster of machines
  • moved code to GitHub, opened a discussion group on Google Groups

0.7.7 (February 14, 2011, 05:26:25)
  • optimizations to Latent Dirichlet Allocation (online LDA): 100 topics on the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now take 11h on a laptop
  • optimizations to Latent Semantic Analysis (single-pass SVD): 400 factors on the English Wikipedia take 5.25h
  • distributed LDA and LSA over a cluster of machines

0.7.5 (November 3, 2010, 16:58:21)
  • optimizations to the single-pass SVD algorithm: 400 factors on the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now take 5.25h on a standard laptop
  • experiments comparing the one-pass algorithm with Halko et al.'s fast stochastic multi-pass SVD

0.7.3 (September 7, 2010, 17:23:28)
  • added out-of-core stochastic SVD: getting the top 400 factors from the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now takes 2.5h on a standard laptop

0.7.1 (August 28, 2010, 07:34:56)
  • improved Latent Semantic Analysis (incremental SVD) performance: factorizing the English Wikipedia (3.1M documents, 400 factors) now takes 14h even in serial mode (i.e., on a single computer)
  • several minor optimizations and bug fixes

0.7.0 (August 28, 2010, 05:58:31)
  • improved Latent Semantic Analysis (incremental SVD) performance: factorizing the English Wikipedia (3.1M documents) now takes 14h even in serial mode (i.e., on a single computer)
  • several minor optimizations and bug fixes
