Project details for gensim

gensim 0.7.3

by Radim - September 7, 2010, 17:23:28 CET


Gensim - Python Framework for Vector Space Modelling

Gensim is a Python library for Vector Space Modelling with very large corpora. The target audience is the Natural Language Processing (NLP) community.


  • all algorithms use memory that is independent of the corpus size (they can process input larger than RAM)
  • simple interfaces (think Java... then think again):
    • easy to plug in your own input corpus/datastream (simple streaming API)
    • easy to extend with other Vector Space algorithms (simple transformation API)
  • efficient streaming implementations of popular algorithms, such as online Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections
  • can run Latent Semantic Analysis on a cluster of computers (distributed computing)
  • extensive documentation and tutorials
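
To illustrate the streaming idea from the list above, here is a minimal sketch of a corpus that yields one document at a time, so the whole collection never has to sit in RAM. It assumes the usual gensim convention that a corpus is any iterable of sparse vectors, each a list of (feature_id, value) pairs. The class name StreamingCorpus, the toy vocabulary and the sample documents are hypothetical and are not part of the library.

```python
from collections import Counter


class StreamingCorpus:
    """Hypothetical corpus: yields one sparse bag-of-words vector per document."""

    def __init__(self, lines, vocabulary):
        self.lines = lines            # any re-iterable source of text, one document per item
        self.vocabulary = vocabulary  # dict mapping token -> integer feature id

    def __iter__(self):
        # Only the current document is held in memory at any time.
        for line in self.lines:
            counts = Counter(
                self.vocabulary[token]
                for token in line.lower().split()
                if token in self.vocabulary
            )
            # Sparse vector: (feature_id, count) pairs sorted by feature id.
            yield sorted(counts.items())


# Toy data, assumed purely for illustration.
vocabulary = {"human": 0, "computer": 1, "interface": 2}
documents = ["Human computer interface", "computer computer survey"]

corpus = StreamingCorpus(documents, vocabulary)
vectors = list(corpus)
```

In real use the `lines` argument would more likely be an object that re-opens a file on each pass, since a model may need to iterate over the corpus more than once.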

Reference example

>>> from gensim import corpora, models, similarities
>>> # load corpus iterator from a Matrix Market file on disk
>>> corpus = corpora.MmCorpus('/path/to/')
>>> # initialize a transformation (Latent Semantic Indexing with 200 latent dimensions)
>>> lsi = models.LsiModel(corpus, numTopics=200)
>>> # convert the same corpus to latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[corpus])
>>> # perform similarity query of another vector in LSI space against the whole corpus
>>> sims = index[query]
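
The query in the last step returns one similarity score per indexed document. As a rough sketch of what such a query computes, the snippet below scores a query against a small list of sparse vectors using cosine similarity. It is a plain-Python illustration under that assumption, not the library's implementation; the function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (feature_id, value) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(value * b.get(feature_id, 0.0) for feature_id, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def query_similarities(query, indexed_vectors):
    """Score the query against every indexed document, in index order."""
    return [cosine_similarity(query, doc) for doc in indexed_vectors]


# Toy two-dimensional "latent space" vectors, assumed for illustration.
indexed = [
    [(0, 1.0), (1, 0.0)],
    [(0, 0.0), (1, 2.0)],
    [(0, 3.0), (1, 3.0)],
]
sims = query_similarities([(0, 1.0)], indexed)
```

A dense in-memory index can do the same work with a single matrix-vector product over normalized rows, which is why the whole indexed corpus has to fit in RAM in that case.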

Changes to previous version:
  • added out-of-core stochastic SVD: getting the top 400 factors from the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now takes 2.5h on a standard laptop.
Supported Operating Systems: Platform Independent
Data Formats: Agnostic
Tags: Latent Semantic Analysis, Latent Dirichlet Allocation, SVD, Random Projections, TF-IDF