Project details for gensim

gensim 0.7.8

by Radim - March 29, 2011, 09:06:23 CET

Description:

Gensim - Python Framework for Vector Space Modelling

Gensim is a Python library for Vector Space Modelling with very large corpora. The target audience is the Natural Language Processing (NLP) community.

Features

  • All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM)

  • Intuitive interfaces

      • easy to plug in your own input corpus/datastream (trivial streaming API; see the sketch after this list)

      • easy to extend with other Vector Space algorithms (trivial transformation API)

  • Efficient implementations of popular algorithms, such as online Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections

  • Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers

  • Extensive documentation and tutorials
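
The streaming API mentioned above is just an iteration protocol: any object whose __iter__ method yields sparse bag-of-words vectors, one document at a time, can serve as a corpus, so the collection never has to fit in RAM. A minimal sketch, assuming a hypothetical file mycorpus.txt with one whitespace-tokenized document per line (the file name and the tokenization are illustrative, not part of gensim):

>>> from gensim import corpora
>>>
>>> # build the word<->id mapping, streaming the (hypothetical) file one line at a time
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
>>>
>>> class MyCorpus(object):
>>>     """One document per line on disk; yield each as a sparse bag-of-words vector."""
>>>     def __iter__(self):
>>>         for line in open('mycorpus.txt'):
>>>             yield dictionary.doc2bow(line.lower().split())
>>>
>>> corpus = MyCorpus()  # documents are read lazily, never all loaded at once

Any gensim transformation (TF-IDF, LSI, LDA, ...) can iterate over such a corpus directly, which is what makes the algorithms memory-independent of the corpus size.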

Reference example

>>> from gensim import corpora, models, similarities
>>> 
>>> # load corpus iterator from a Matrix Market file on disk
>>> corpus = corpora.MmCorpus('/path/to/corpus.mm')
>>> 
>>> # initialize a transformation (Latent Semantic Indexing with 200 latent dimensions)
>>> lsi = models.LsiModel(corpus, numTopics=200)
>>> 
>>> # convert the same corpus to latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[corpus])
>>> 
>>> # perform a similarity query of another vector in LSI space against the whole corpus
>>> sims = index[query]
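
Here query is simply another document mapped into the same 200-dimensional LSI space. A minimal sketch of how such a vector could be built, assuming a saved dictionary at a placeholder path and an illustrative query sentence (neither appears in the listing above):

>>> # load the word<->id mapping that was used to build the corpus (placeholder path)
>>> dictionary = corpora.Dictionary.load('/path/to/corpus.dict')
>>>
>>> # convert a raw query document into bag-of-words, then into LSI space
>>> doc = "human computer interaction"
>>> query = lsi[dictionary.doc2bow(doc.lower().split())]
>>>
>>> sims = index[query]  # cosine similarity of the query against every document in the corpus
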
Changes to previous version:
  • optimizations to Latent Dirichlet Allocation (online LDA): 100 topics on the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now take 11h on a laptop.
  • optimizations to Latent Semantic Analysis (single-pass SVD): 400 factors on the English Wikipedia take 5.25h.
  • distributed LDA and LSA over a cluster of machines.
  • moved code to GitHub, opened a discussion group on Google Groups
BibTeX Entry: Download
Corresponding Paper BibTeX Entry: Download
URL: Project Homepage
Supported Operating Systems: Platform Independent
Data Formats: Agnostic
Tags: Latent Semantic Analysis, Latent Dirichlet Allocation, SVD, Random Projections, TF-IDF
Archive: download here

Other available revisions

0.8.6 (December 9, 2012, 13:15:16)
  • added the "hashing trick" (by Homer Strong)
  • support for adding target classes in SVMlight format (by Corrado Monti)
  • fixed problems with the global lemmatizer object when running in parallel on Windows
  • parallelization of Wikipedia processing + added a script version that lemmatizes the input documents
  • added a class method to initialize Dictionary from an existing corpus (by Marko Burjek)

0.8.5 (July 22, 2012, 23:42:28)
  • numerous fixes to performance and stability
  • faster document similarity queries
  • document similarity server
  • full change set here

0.8.0 (June 21, 2011, 01:20:53)
  • faster document similarity queries
  • more optimizations to Latent Dirichlet Allocation (online LDA) and Latent Semantic Analysis (single-pass online SVD) (Wikipedia experiments)
  • full change set here

0.7.8 (March 29, 2011, 09:06:23)
  • optimizations to Latent Dirichlet Allocation (online LDA): 100 topics on the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now take 11h on a laptop.
  • optimizations to Latent Semantic Analysis (single-pass SVD): 400 factors on the English Wikipedia take 5.25h.
  • distributed LDA and LSA over a cluster of machines.
  • moved code to GitHub, opened a discussion group on Google Groups

0.7.7 (February 14, 2011, 05:26:25)
  • optimizations to Latent Dirichlet Allocation (online LDA): 100 topics on the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now take 11h on a laptop.
  • optimizations to Latent Semantic Analysis (single-pass SVD): 400 factors on the English Wikipedia take 5.25h.
  • distributed LDA and LSA over a cluster of machines.

0.7.5 (November 3, 2010, 16:58:21)
  • optimizations to the single-pass SVD algorithm: 400 factors on the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now take 5.25h on a standard laptop.
  • experiments comparing the one-pass algorithm with Halko et al.'s fast stochastic multi-pass SVD.

0.7.3 (September 7, 2010, 17:23:28)
  • added out-of-core stochastic SVD: getting the top 400 factors from the English Wikipedia (3.2M documents, 100K features, 0.5G non-zeros) now takes 2.5h on a standard laptop.

0.7.1 (August 28, 2010, 07:34:56)
  • improved Latent Semantic Analysis (incremental SVD) performance: factorizing the English Wikipedia (3.1M documents, 400 factors) now takes 14h even in serial mode (i.e., on a single computer)
  • several minor optimizations and bug fixes

0.7.0 (August 28, 2010, 05:58:31)
  • improved Latent Semantic Analysis (incremental SVD) performance: factorizing the English Wikipedia (3.1M documents) now takes 14h even in serial mode (i.e., on a single computer)
  • several minor optimizations and bug fixes
