- Description:
Gensim - Python Framework for Vector Space Modelling
Gensim is a Python library for Vector Space Modelling with very large corpora. The target audience is the Natural Language Processing (NLP) community.
Features
- All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM)
- Intuitive interfaces:
  - easy to plug in your own input corpus/data stream (trivial streaming API; see the sketch after this list)
  - easy to extend with other Vector Space algorithms (trivial transformation API)
- Efficient implementations of popular algorithms, such as online Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections
- Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers
- Extensive documentation and tutorials
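A minimal sketch of the streaming corpus idea behind the list above: any object whose __iter__ yields sparse bag-of-words vectors can be passed to gensim as a corpus, so the full data set never has to fit in RAM. The file layout (one document per line), the path and the tokenization below are illustrative assumptions, not part of gensim.

from gensim import corpora

class StreamedCorpus(object):
    """Yields one document at a time, so memory use stays flat regardless of corpus size."""
    def __init__(self, path, dictionary):
        self.path = path              # assumed: a plain-text file with one document per line
        self.dictionary = dictionary  # maps tokens to integer ids

    def __iter__(self):
        for line in open(self.path):
            # each document becomes a sparse list of (token_id, count) pairs
            yield self.dictionary.doc2bow(line.lower().split())

# one streaming pass over the file to build the token -> id mapping (path is illustrative)
dictionary = corpora.Dictionary(line.lower().split() for line in open('/path/to/docs.txt'))
corpus = StreamedCorpus('/path/to/docs.txt', dictionary)

Such a corpus can then be handed directly to any gensim model constructor, e.g. models.TfidfModel(corpus) or the LsiModel call in the reference example below.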
Reference example
>>> from gensim import corpora, models, similarities
>>>
>>> # load corpus iterator from a Matrix Market file on disk
>>> corpus = corpora.MmCorpus('/path/to/corpus.mm')
>>>
>>> # initialize a transformation (Latent Semantic Indexing with 200 latent dimensions)
>>> lsi = models.LsiModel(corpus, num_topics=200)
>>>
>>> # convert the same corpus to latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[corpus])
>>>
>>> # perform similarity query of another vector in LSI space against the whole corpus
>>> sims = index[query]
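The `query` vector in the example above is left undefined. A minimal sketch of one way it could be built, assuming a Dictionary compatible with corpus.mm has been saved alongside it (the dictionary path and the query text are illustrative assumptions):

>>> # load the token -> id mapping used to build corpus.mm (illustrative path)
>>> dictionary = corpora.Dictionary.load('/path/to/corpus.dict')
>>>
>>> # turn a free-text query into a sparse bag-of-words vector, then fold it into LSI space
>>> bow = dictionary.doc2bow("human computer interaction".lower().split())
>>> query = lsi[bow]
>>>
>>> # cosine similarity of the query against every indexed document
>>> sims = index[query]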
- Changes to previous version:
- added the "hashing trick" (by Homer Strong)
- support for adding target classes in SVMlight format (by Corrado Monti)
- fixed problems with global lemmatizer object when running in parallel on Windows
- parallelization of Wikipedia processing + added script version that lemmatizes the input documents
- added class method to initialize Dictionary from an existing corpus (by Marko Burjek)
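A minimal sketch of the hashing trick, assuming the HashDictionary class added in this release exposes doc2bow like the regular Dictionary (the id_range value and the sample document are illustrative assumptions):

>>> from gensim.corpora import HashDictionary
>>>
>>> # tokens are mapped to ids by hashing, so no growing token -> id table has to be kept in RAM
>>> hash_dict = HashDictionary(id_range=2**16)   # illustrative: 65536 hash buckets
>>> bow = hash_dict.doc2bow("the quick brown fox jumps over the lazy dog".split())
>>> # bow is the usual sparse list of (token_id, count) pairs; hash collisions are possible but rare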
- BibTeX Entry: Download
- Corresponding Paper BibTeX Entry: Download
- Supported Operating Systems: Platform Independent
- Data Formats: Agnostic
- Tags: Latent Semantic Analysis, Latent Dirichlet Allocation, SVD, Random Projections, TF-IDF
- Archive: download here
Other available revisions
Version 0.8.6 (December 9, 2012, 13:15:16)
- added the "hashing trick" (by Homer Strong)
- support for adding target classes in SVMlight format (by Corrado Monti)
- fixed problems with global lemmatizer object when running in parallel on Windows
- parallelization of Wikipedia processing + added script version that lemmatizes the input documents
- added class method to initialize Dictionary from an existing corpus (by Marko Burjek)

Version 0.8.5 (July 22, 2012, 23:42:28)
- numerous fixes to performance and stability
- faster document similarity queries
- document similarity server
- full change set here

Version 0.8.0 (June 21, 2011, 01:20:53)
- faster document similarity queries
- more optimizations to Latent Dirichlet Allocation (online LDA) and Latent Semantic Analysis (single-pass online SVD) (Wikipedia experiments)
- full change set here

Version 0.7.8 (March 29, 2011, 09:06:23)
- optimizations to Latent Dirichlet Allocation (online LDA): 100 topics on the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now take 11h on a laptop
- optimizations to Latent Semantic Analysis (single-pass SVD): 400 factors on the English Wikipedia take 5.25h
- distributed LDA and LSA over a cluster of machines
- moved code to GitHub, opened discussion group on Google Groups

Version 0.7.7 (February 14, 2011, 05:26:25)
- optimizations to Latent Dirichlet Allocation (online LDA): 100 topics on the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now take 11h on a laptop
- optimizations to Latent Semantic Analysis (single-pass SVD): 400 factors on the English Wikipedia take 5.25h
- distributed LDA and LSA over a cluster of machines

Version 0.7.5 (November 3, 2010, 16:58:21)
- optimizations to the single-pass SVD algorithm: 400 factors on the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now take 5.25h on a standard laptop
- experiments comparing the one-pass algorithm with Halko et al.'s fast stochastic multi-pass SVD

Version 0.7.3 (September 7, 2010, 17:23:28)
- added out-of-core stochastic SVD: getting the top 400 factors from the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now takes 2.5h on a standard laptop

Version 0.7.1 (August 28, 2010, 07:34:56)
- improved Latent Semantic Analysis (incremental SVD) performance: factorizing the English Wikipedia (3.1M documents, 400 factors) now takes 14h even in serial mode (i.e., on a single computer)
- several minor optimizations and bug fixes

Version 0.7.0 (August 28, 2010, 05:58:31)
- improved Latent Semantic Analysis (incremental SVD) performance: factorizing the English Wikipedia (3.1M documents) now takes 14h even in serial mode (i.e., on a single computer)
- several minor optimizations and bug fixes
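The changelog above repeatedly mentions the online and distributed modes of the LDA and LSA implementations. A minimal sketch of how those modes are typically switched on, assuming corpus and id2word objects are already available; the parameter values are illustrative, not the settings used in the Wikipedia experiments, and distributed mode additionally requires gensim's Pyro-based workers and dispatcher to be running on the cluster:

>>> from gensim import models
>>>
>>> # online LDA: the model is updated chunk by chunk, so the corpus can be streamed from disk
>>> lda = models.LdaModel(corpus, id2word=id2word, num_topics=100,
...                       chunksize=10000, update_every=1, passes=1)
>>>
>>> # distributed LSA: the factorization is farmed out to worker processes across the cluster
>>> lsi = models.LsiModel(corpus, id2word=id2word, num_topics=400, distributed=True)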