Project details for hca

hca 0.5

by wbuntine - June 4, 2014, 04:08:12 CET


Non-parametric topic models implemented using efficient Gibbs sampling on multi-core. Experiments are reported in the KDD 2014 paper, with early theory in the ECML-PKDD 2011 paper cited.

Coded in C with no other dependencies. Multi-core support uses modern C++11 atomic operations. It avoids Chinese restaurant processes and stick breaking, so it is fast: non-parametric methods run 1-3 times slower than regular LDA with Gibbs sampling, with a marginal increase in memory. Input can be in LdaC format, docword format, or various Matlab-style formats.
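As an illustration of the LdaC input format mentioned above, here is a minimal Python sketch of a reader. It assumes the usual LdaC layout, where each line is one document written as `N id:count id:count ...` and `N` is the number of distinct terms in that document. The function names `parse_ldac_line` and `parse_ldac` are hypothetical and are not part of hca.

```python
from typing import Dict, List


def parse_ldac_line(line: str) -> Dict[int, int]:
    """Parse one LdaC document line of the form 'N id:count id:count ...'."""
    fields = line.split()
    if not fields:
        raise ValueError("empty document line")
    n_unique = int(fields[0])  # declared number of distinct terms
    counts: Dict[int, int] = {}
    for field in fields[1:]:
        term, count = field.split(":")
        counts[int(term)] = int(count)
    # The leading number should match the number of distinct terms listed.
    if len(counts) != n_unique:
        raise ValueError(f"header says {n_unique} terms, found {len(counts)}")
    return counts


def parse_ldac(text: str) -> List[Dict[int, int]]:
    """Parse LdaC-formatted text, one document per non-blank line."""
    return [parse_ldac_line(ln) for ln in text.splitlines() if ln.strip()]


# Two small documents: term id -> count within the document.
docs = parse_ldac("3 0:2 4:1 7:3\n2 1:1 4:5\n")
```

Each document becomes a mapping from term index to count, which is enough to build the word-token arrays a Gibbs sampler would iterate over.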

Implements HDP-LDA à la Teh, Jordan, Beal and Blei (2006), HPYP-LDA, and symmetric-symmetric, symmetric-asymmetric, asymmetric-symmetric, and asymmetric-asymmetric priors à la Wallach, Mimno and McCallum (2009), with Pitman-Yor or Dirichlet processes. Burstiness modelling à la Doyle and Elkan (2009) can be combined with any of the models above for even better performance. Hyper-parameters can be fully fitted or set initially.

Estimation of various vectors (document and topic vectors). Diagnostics, control, restarts, test likelihood via document completion. Coherence calculations on results using PMI and normalised PMI. PMI and NPMI data available on request.
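To make the coherence measures concrete, here is a minimal Python sketch of PMI and normalised PMI (NPMI) for a topic's top words. It uses the standard definitions PMI(a, b) = log(p(a, b) / (p(a) p(b))) and NPMI = PMI / (-log p(a, b)). The assumptions are loud ones: probabilities are estimated from plain document-level co-occurrence in a small reference corpus, and the function names are hypothetical. How hca itself collects co-occurrence statistics is not stated here, so this is not a description of its implementation.

```python
import math
from itertools import combinations
from typing import Iterable, List, Set


def pmi(p_a: float, p_b: float, p_ab: float) -> float:
    """Pointwise mutual information; requires p_ab > 0."""
    return math.log(p_ab / (p_a * p_b))


def npmi(p_a: float, p_b: float, p_ab: float) -> float:
    """Normalised PMI in [-1, 1]."""
    if p_ab == 0.0:
        return -1.0  # the pair never co-occurs: the lower limit of NPMI
    if p_ab == 1.0:
        return 0.0   # convention chosen for this sketch: PMI is 0 here
    return pmi(p_a, p_b, p_ab) / -math.log(p_ab)


def npmi_coherence(top_words: List[str], documents: Iterable[Set[str]]) -> float:
    """Average NPMI over all pairs of a topic's top words."""
    docs = [set(d) for d in documents]
    if not docs or len(top_words) < 2:
        raise ValueError("need at least one document and two words")

    def prob(*words: str) -> float:
        # Fraction of documents containing all the given words.
        return sum(1 for d in docs if all(w in d for w in words)) / len(docs)

    scores = [npmi(prob(a), prob(b), prob(a, b))
              for a, b in combinations(top_words, 2)]
    return sum(scores) / len(scores)


corpus = [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}]
score_ab = npmi_coherence(["a", "b"], corpus)
score_abc = npmi_coherence(["a", "b", "c"], corpus)
```

Words that always appear together score close to 1, while words that never share a document score -1, so a topic mixing unrelated words is pulled down.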

See for some data sets (and older versions).

Changes to previous version:

Implemented multi-core support using atomic operations. Improved the manual. Various extensions and bug fixes. Note that the "-B" flag now takes a different argument, so watch it!

Supported Operating Systems: Linux, Mac OS X, Windows under Cygwin
Data Formats: ASCII
Tags: Topic Modeling, Nonparametric Bayes, Multi Core

Other available revisions

Version Changelog Date

Corrected the new normalised Gamma model for topics so it works with multi-core. Improvements to documentation. Added an asymptotic version of the generalised Stirling numbers so it no longer fails when they run out of bounds on bigger data.

April 26, 2016, 15:35:03

Corrections to diagnostics, documentation and topic report. Installed a new normalised Gamma model for topics. Added a tag cloud generator.

April 1, 2016, 05:07:17

Corrections to diagnostics and topic report. Correction to estimating alpha. Now estimating beta sometimes (when estimating phi).

September 10, 2014, 03:33:54

Modified command line -A and -B formats. Overhaul of diagnostics. Described changes in manual. Bug fixes: multi-core was crashing with a huge number of topics, and beta sampling wasn't working with -B when giving a number and fitting beta; both are now fixed.

August 6, 2014, 14:24:57

Implemented multi-core support using atomic operations. Improved the manual. Various extensions and bug fixes. Note that the "-B" flag now takes a different argument, so watch it!

June 4, 2014, 04:08:12

Added example on using burstiness.

November 29, 2013, 03:16:11

Added example on using burstiness.

November 25, 2013, 05:15:27


Wray Buntine (on June 24, 2014, 06:21:54)

Noticed that in this update, hyper-parameter fitting of "beta" when using -B doesn't update the parameter. I'll have a new version out shortly to fix this, along with a few other improvements.

Wray Buntine (on June 24, 2014, 06:29:59)

Get more details about the theory from the KDD 2014 paper. Will be presenting in New York!

Wray Buntine (on August 22, 2014, 23:19:31)

Tip for the speed freaks - diminishing returns after 10-16 cores due to memory thrashing. We keep it to 8 cores.

Also, am carefully studying Aaron Li's brilliant KDD 2014 paper to see about transferring his speedups into hca.
