Efficient Diversity Computation of Large Datasets

We propose two efficient algorithms for exploring topic diversity in large document corpora such as user-generated content on the social web, bibliographic data, or other web repositories. Analyzing diversity gives insight into knowledge evolution, trends, periodicities, and topic heterogeneity in such collections. Computing diversity statistics requires averaging the similarity over all pairs of objects, which is computationally prohibitive for large corpora. The proposed algorithms avoid the quadratic cost of computing the average pairwise similarity and yield constant-time (depending on dataset properties) or linear-time approximations with probabilistic guarantees.
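To make the cost argument concrete, here is a minimal sketch that contrasts the exact quadratic average with a generic random-pair sampling estimate. It is not this artifact's API and not the authors' algorithms: the class `DiversitySketch`, its method names, and the choice of cosine similarity on non-negative vectors (so that values lie in [0, 1]) are all assumptions made for illustration. The sample size comes from the standard Hoeffding bound.

```java
import java.util.Random;

// Hypothetical illustration only; not part of the artifact described above.
final class DiversitySketch {

    // Cosine similarity of two dense vectors of equal length; 0 if either is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    // Exact average over all n(n-1)/2 unordered pairs: quadratic in n.
    static double exactAverageSimilarity(double[][] docs) {
        int n = docs.length;
        double sum = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                sum += cosine(docs[i], docs[j]);
            }
        }
        return sum / (n * (n - 1) / 2.0);
    }

    // Estimate from `samples` pairs (i, j), i != j, drawn uniformly at random.
    // The cost depends on `samples`, not on the number of documents.
    static double sampledAverageSimilarity(double[][] docs, int samples, Random rng) {
        int n = docs.length;
        double sum = 0;
        for (int s = 0; s < samples; s++) {
            int i = rng.nextInt(n);
            int j = rng.nextInt(n - 1);
            if (j >= i) {
                j++; // skip index i so that j is uniform over the other documents
            }
            sum += cosine(docs[i], docs[j]);
        }
        return sum / samples;
    }

    // Hoeffding bound for similarities in [0, 1]: this many samples give
    // |estimate - true mean| <= eps with probability at least 1 - delta.
    static int samplesFor(double eps, double delta) {
        return (int) Math.ceil(Math.log(2.0 / delta) / (2.0 * eps * eps));
    }
}
```

In this sketch the number of sampled pairs depends only on the chosen error `eps` and failure probability `delta`, not on the corpus size, which is one simple way an average pairwise similarity can be approximated without visiting every pair.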

Last release: 2 years ago
First release: 2 years ago
Packaging: jar
Repository: Maven Central (search.maven.org)

