- Description:
Gensim - Python Framework for Vector Space Modelling
Gensim is a Python library for Vector Space Modelling with very large corpora. The target audience is the Natural Language Processing (NLP) community.
Features
- all algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM)
- simple interfaces (think java... then think again):
  - easy to plug in your own input corpus/datastream (simple streaming API)
  - easy to extend with other Vector Space algorithms (simple transformation API)
- efficient streaming implementations of popular algorithms, such as online Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections
- can run Latent Semantic Analysis on a cluster of computers (distributed computing)
- extensive documentation and tutorials
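The streaming idea behind the first two features can be sketched without Gensim itself. The sketch below assumes a corpus is simply an iterable that yields one document at a time as a sparse list of (word_id, count) pairs, so only a single document is held in memory. The names `StreamedCorpus`, `open_stream` and `word2id` are hypothetical and chosen for illustration; they are not part of the library.

```python
import io
from collections import Counter


class StreamedCorpus:
    """Yield one sparse bag-of-words vector per document, one document per line."""

    def __init__(self, open_stream, word2id):
        # open_stream is a zero-argument callable returning a fresh iterator
        # over lines, so the corpus can be traversed more than once.
        self.open_stream = open_stream
        # word2id maps each known word to an integer id (hypothetical vocabulary).
        self.word2id = word2id

    def __iter__(self):
        for line in self.open_stream():
            counts = Counter(
                self.word2id[word]
                for word in line.lower().split()
                if word in self.word2id
            )
            # A document is a list of (word_id, count) pairs sorted by word_id.
            yield sorted(counts.items())


# Toy data standing in for a file on disk; a real stream would open the file lazily.
text = "human computer interface\ncomputer system human system\n"
word2id = {"human": 0, "computer": 1, "interface": 2, "system": 3}

corpus = StreamedCorpus(lambda: io.StringIO(text), word2id)
vectors = list(corpus)
```

Because `__iter__` reopens the stream on every pass, an algorithm that needs several passes over the data can iterate the same object repeatedly without loading the corpus into RAM.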
Reference example
>>> from gensim import corpora, models, similarities
>>>
>>> # load corpus iterator from a Matrix Market file on disk
>>> corpus = corpora.MmCorpus('/path/to/corpus.mm')
>>>
>>> # initialize a transformation (Latent Semantic Indexing with 200 latent dimensions)
>>> lsi = models.LsiModel(corpus, numTopics=200)
>>>
>>> # convert the same corpus to latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[corpus])
>>>
>>> # perform similarity query of another vector in LSI space against the whole corpus
>>> sims = index[query]
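To make the final similarity query in the reference example more concrete, here is a minimal pure-Python sketch of scoring one query vector against every indexed document. It assumes the index ranks documents by cosine similarity over dense vectors; the helpers `cosine` and `query_index` are hypothetical illustrations and not Gensim functions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Treat an all-zero vector as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def query_index(docs, query):
    """Return one similarity score per indexed document, in corpus order."""
    return [cosine(doc, query) for doc in docs]


# Three toy documents in a 2-dimensional latent space (made-up values).
docs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
sims = query_index(docs, [1.0, 0.0])
```

The result has one score per document, so sorting the scores in descending order gives the documents most similar to the query.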
- Changes to previous version:
- improved Latent Semantic Analysis (incremental SVD) performance: factorizing the English Wikipedia (3.1 million documents, 400 factors) now takes 14 hours even in serial mode (i.e., on a single computer)
- several minor optimizations and bug fixes
- Supported Operating Systems: Platform Independent
- Data Formats: Agnostic
- Tags: Latent Semantic Analysis, Latent Dirichlet Allocation, SVD, Random Projections, TF-IDF