-
- Description:
This is a tool for retrieving nearest neighbors and clustering of large categorical data sets represented in transactional form. The clustering is achieved via a locality-sensitive hashing of categorical datasets for speed and scalability. The locality-sensitive hashing method implemented is described in the video lectures under www.mmds.org (Chapter 3). Information needed for LSH, such as shingles/tokens, MinHash signatures, band hashes to buckets are stored in several database tables. Information needed for clustering purposes, such as the most significant pairwise object similarities and density-based similarities are also stored in tables.
These are the tools the software comes with: * ReadInput.py supports updating the database with a CSV text file of new objects in transactional (market-basket) format. This tool converts an object to its shingle strings and from there to unsigned integer tokens (to save space) using the CRC32 checksum computation. Then the tokens get converted to signature vectors using MinHashing - specifically, the murmurhash_32 hashing function, which is a part of scipy. The signature vectors then get grouped into bands and each band gets hashed to a bucket. All arrays are stored as numpy vactors. For the constants used, see the file Constants.py. * RetrieveNN.py supports fast retrieval of the most similar objects to a query item, based on the Jaccard similarity Coefficient. * Cluster.py performs the clustering and outputs the clusters in GraphML format, such that they can be input into a graph visualization tool. The edges are weighted by Jaccard similarity Coefficient.
An early version of the fast database-based retrieval of nearest neighbors and clustering in large categorical datasets was published in: Bill Andreopoulos et al. Efficient Layered Density-based Clustering of Categorical Data. Elsevier Journal of Biomedical Informatics, 2009.
The Plants dataset from the UCI Machine Learning repository is included for testing. This is not a very large dataset as it has just 34,781 objects, since it is meant for testing only. It is in transactional (market-basket) format, which means this tool can easily be applied on other categorical or text datasets. https://archive.ics.uci.edu/ml/datasets/Plants
- Changes to previous version:
Initial Announcement on mloss.org.
- BibTeX Entry: Download
- Supported Operating Systems: Linux, Mac Os X
- Data Formats: Csv
- Tags: Clustering, Hashing, Categorical, Data, Locality, Sensitive
- Archive: download here
Comments
No one has posted any comments yet. Perhaps you'd like to be the first?
Leave a comment
You must be logged in to post comments.