- Description:
Gensim - Python Framework for Vector Space Modelling
Gensim is a Python library for Vector Space Modelling with very large corpora. Its target audience is the Natural Language Processing (NLP) community.
Features
All algorithms are memory-independent with respect to corpus size (they can process input larger than RAM).
Intuitive interfaces:
  easy to plug in your own input corpus or data stream (trivial streaming API)
  easy to extend with other Vector Space algorithms (trivial transformation API)
Efficient implementations of popular algorithms, such as online Latent Semantic Analysis, Latent Dirichlet Allocation and Random Projections.
Distributed computing: Latent Semantic Analysis and Latent Dirichlet Allocation can run on a cluster of computers.
Extensive documentation and tutorials.
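The streaming idea behind the first two features can be sketched in plain Python: a corpus is any iterable that yields one sparse document vector (a list of `(feature_id, weight)` pairs) at a time, so the whole collection never has to sit in RAM. The class name `StreamedCorpus`, the one-document-per-line file layout and the `vocabulary` mapping below are assumptions made for illustration only, not part of the library.

```python
import os
import tempfile


class StreamedCorpus:
    """Hypothetical corpus: yields one sparse bag-of-words vector per line of a text file."""

    def __init__(self, path, vocabulary):
        self.path = path
        self.vocabulary = vocabulary  # assumed token -> integer id mapping

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:  # only one document is held in memory at a time
                counts = {}
                for token in line.lower().split():
                    token_id = self.vocabulary.get(token)
                    if token_id is not None:  # tokens outside the vocabulary are skipped
                        counts[token_id] = counts.get(token_id, 0) + 1
                yield sorted(counts.items())


def demo():
    """Write a tiny two-document file, stream it back, and return the vectors."""
    vocabulary = {"human": 0, "computer": 1, "interface": 2}
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("human computer interface\ncomputer computer survey\n")
        path = tmp.name
    try:
        return list(StreamedCorpus(path, vocabulary))
    finally:
        os.remove(path)
```

Because the object only implements `__iter__`, it can be iterated several times, and each pass re-reads the file rather than caching documents.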
Reference example
>>> from gensim import corpora, models, similarities
>>>
>>> # load corpus iterator from a Matrix Market file on disk
>>> corpus = corpora.MmCorpus('/path/to/corpus.mm')
>>>
>>> # initialize a transformation (Latent Semantic Indexing with 200 latent dimensions)
>>> lsi = models.LsiModel(corpus, numTopics=200)
>>>
>>> # convert the same corpus to latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[corpus])
>>>
>>> # perform similarity query of another vector in LSI space against the whole corpus
>>> sims = index[query]
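To make the final query step more concrete, the following is a minimal pure-Python sketch of a cosine-similarity query of one sparse vector against every document in a small collection. It illustrates the idea only and is not the library's implementation; the helper names `to_dense`, `cosine` and `similarity_query` are hypothetical.

```python
import math


def to_dense(sparse_vector, num_dimensions):
    """Expand a list of (dimension, weight) pairs into a dense list of floats."""
    dense = [0.0] * num_dimensions
    for dimension, weight in sparse_vector:
        dense[dimension] = weight
    return dense


def cosine(a, b):
    """Cosine of the angle between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_query(query, documents, num_dimensions):
    """Return one similarity score per document, in corpus order."""
    dense_query = to_dense(query, num_dimensions)
    return [cosine(dense_query, to_dense(doc, num_dimensions)) for doc in documents]


# Toy two-dimensional "latent space" with three documents (made-up values).
documents = [[(0, 1.0)], [(1, 2.0)], [(0, 1.0), (1, 1.0)]]
sims = similarity_query([(0, 3.0)], documents, num_dimensions=2)
```

Here the first document points in the same direction as the query, the second is orthogonal to it, and the third lies halfway between, so the scores decrease in that order of relatedness.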
- Changes to previous version:
- optimizations to Latent Dirichlet Allocation (online LDA): 100 topics on the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now take 11h on a laptop.
- optimizations to Latent Semantic Analysis (single pass SVD): 400 factors on the English Wikipedia take 5.25h.
- distributed LDA and LSA over a cluster of machines.
- moved code to GitHub, opened a discussion group on Google Groups
- Supported Operating Systems: Platform Independent
- Data Formats: Agnostic
- Tags: Latent Semantic Analysis, Latent Dirichlet Allocation, Svd, Random Projections, Tfidf