gensim 0.8.5

by Radim - July 22, 2012, 23:42:28 CET

Gensim - Python Framework for Vector Space Modelling

Gensim is a Python library for Vector Space Modelling with very large corpora. The target audience is the Natural Language Processing (NLP) community.


  • All algorithms are memory-independent w.r.t. the corpus size (they can process input larger than RAM)

  • Intuitive interfaces:

      • easy to plug in your own input corpus/datastream (trivial streaming API)

      • easy to extend with other Vector Space algorithms (trivial transformation API)

  • Efficient implementations of popular algorithms, such as online Latent Semantic Analysis, Latent Dirichlet Allocation and Random Projections

  • Distributed computing: Latent Semantic Analysis and Latent Dirichlet Allocation can run on a cluster of computers

  • Extensive documentation and tutorials
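
The streaming and transformation interfaces listed above can be illustrated with a minimal sketch in plain Python. It assumes the usual gensim convention that a corpus is any iterable yielding one sparse vector per document, as a list of (feature_id, value) pairs, and that a transformation is applied with bracket syntax. The names StreamedCorpus, ScaleTransformation and token2id are hypothetical and are not part of gensim.

```python
from collections import Counter


class StreamedCorpus:
    """Yield one sparse bag-of-words vector per document, one at a time."""

    def __init__(self, lines, token2id):
        self.lines = lines        # any re-iterable source of documents (assumption)
        self.token2id = token2id  # hypothetical vocabulary: token -> integer id

    def __iter__(self):
        for line in self.lines:
            counts = Counter(
                self.token2id[tok]
                for tok in line.lower().split()
                if tok in self.token2id
            )
            # sparse vector: sorted list of (feature_id, value) pairs
            yield sorted(counts.items())


class ScaleTransformation:
    """Toy transformation: multiply every value by a constant factor."""

    def __init__(self, factor):
        self.factor = factor

    def __getitem__(self, corpus):
        # transform lazily, document by document; nothing is held in memory
        return ([(fid, val * self.factor) for fid, val in doc] for doc in corpus)


token2id = {"human": 0, "computer": 1, "interface": 2}
corpus = StreamedCorpus(["Human computer interface", "computer computer"], token2id)

vectors = list(corpus)                           # materialized here only for display
scaled = list(ScaleTransformation(0.5)[corpus])  # same bracket syntax as lsi[corpus]
```

Because only one document is in memory at a time, the same pattern works when the lines come from a file far larger than RAM. This toy transformation accepts only a whole corpus, which is a simplification made for the sketch.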

Reference example

>>> from gensim import corpora, models, similarities
>>> # load corpus iterator from a Matrix Market file on disk
>>> corpus = corpora.MmCorpus('/path/to/')
>>> # initialize a transformation (Latent Semantic Indexing with 200 latent dimensions)
>>> lsi = models.LsiModel(corpus, num_topics=200)
>>> # convert the same corpus to latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[corpus])
>>> # perform similarity query of another vector in LSI space against the whole corpus
>>> sims = index[query]
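
What the final query returns can be sketched conceptually in plain Python: one cosine similarity per indexed document. This is a toy illustration under the assumption of small dense vectors, not gensim's implementation. DenseCosineIndex and unit are hypothetical names.

```python
import math


def unit(vec):
    """Scale a dense vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class DenseCosineIndex:
    """Toy index: stores normalized document vectors, answers queries via brackets."""

    def __init__(self, vectors):
        self.rows = [unit(v) for v in vectors]

    def __getitem__(self, query):
        q = unit(query)
        # dot product of unit vectors equals cosine similarity
        return [sum(a * b for a, b in zip(row, q)) for row in self.rows]


index = DenseCosineIndex([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sims = index[[2.0, 0.0]]  # one similarity score per indexed document
```

Normalizing the documents once at indexing time means each query costs only one dot product per document.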

Changes to previous version:

  • numerous fixes to performance and stability
  • faster document similarity queries
  • document similarity server
  • full change set here
