- Description:
Harry is a small tool for comparing strings and measuring their similarity. The tool supports several common distance and kernel functions for strings as well as some exotic similarity measures. The focus of Harry lies on implicit similarity measures, that is, comparison functions that do not give rise to an explicit vector space. Examples of such similarity measures are the Levenshtein distance and the Jaro-Winkler distance.
Harry is implemented using OpenMP, such that the computation for a set of strings scales linearly with the number of available CPU cores. Moreover, efficient implementations of several similarity measures, effective caching of similarity values and low-overhead locking further speed up the computation.
Harry complements the tool Sally, which embeds strings in a vector space and allows computing vectorial similarity measures, such as the cosine distance and the bag-of-words kernel.
A tutorial is available here: http://www.mlsec.org/harry/tutorial.html
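To illustrate what an implicit similarity measure looks like, the following is a minimal Python sketch of the Levenshtein distance mentioned above. It compares two strings directly, without first mapping them to vectors. This is the textbook dynamic-programming formulation, not Harry's own implementation; the function name `levenshtein` is chosen here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

The sketch keeps only two rows of the dynamic-programming table, so memory use grows with the length of the shorter string rather than with the product of both lengths.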
- Changes to previous version:
This release fixes the incorrect implementation of the bag distance.
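For readers unfamiliar with the bag distance, here is a minimal Python sketch of its usual textbook definition: each string is treated as a multiset (bag) of characters, and the distance is the larger of the two multiset differences. This is an assumption about the general definition and is not taken from Harry's source code; the function name `bag_distance` is hypothetical.

```python
from collections import Counter


def bag_distance(a: str, b: str) -> int:
    """Bag distance: max(|A - B|, |B - A|) over character multisets."""
    bag_a, bag_b = Counter(a), Counter(b)
    # Counter subtraction keeps only positive counts (multiset difference).
    only_a = sum((bag_a - bag_b).values())
    only_b = sum((bag_b - bag_a).values())
    return max(only_a, only_b)


print(bag_distance("kitten", "sitting"))
```

Because the bag distance ignores character order, it is cheap to compute and never exceeds the Levenshtein distance of the same two strings.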
- BibTeX Entry: Download
- Supported Operating Systems: Linux, Unix, POSIX, Mac OS X
- Data Formats: Svmlight, Binary, Txt
- Tags: Sequence Analysis, String Kernels, Similarity Measures, String Distances
- Archive: download here
Other available revisions
Version (Date): Changelog
- 0.4.2 (April 16, 2016, 10:50:38): This release fixes the incorrect implementation of the bag distance.
- 0.4.1 (January 3, 2016, 14:56:55): Minor bug fixes for libarchive code.
- 0.4.0 (March 30, 2015, 14:03:12): The new release supports measuring string similarity at the granularity of bytes, bits and tokens. A Python interface has been added. Several minor bugs have been fixed.
- 0.3.2 (November 19, 2014, 20:24:21): Several minor bug fixes.
- 0.3.1 (October 22, 2014, 13:00:57): This release features several runtime improvements. Moreover, support for Soundex transformations and output modules for Matlab and JSON have been added. The distribution package also contains a new tutorial with examples.
- 0.3 (July 30, 2014, 16:15:26): This new release implements 21 similarity measures for strings (Option -M). It supports splitting the computation of large similarity matrices into blocks and thus allows comparing large sets of strings (Option -s as well as -x and -y). The command-line interface has been improved and several minor bugs have been fixed.
- 0.2 (May 3, 2014, 18:48:38): This release adds support for the Optimal Sequence Alignment distance (OSA) and fixes several minor bugs.
- 0.1 (December 28, 2013, 12:34:47): Initial Announcement on mloss.org.