AI summary
The authors study clustering-based Approximate Nearest Neighbor Search (ANNS), a popular method that groups data points into partitions and searches only a few of them to find close matches quickly. They point out that there has been no good way to tell in advance whether this method will work well for a specific dataset, a property they call "searchability." To address this, they introduce two new measures: clustering-NSM, which evaluates how good a given clustering is, and point-NSM, which assesses how naturally clusterable the dataset itself is. Together, these measures help predict whether clustering-based ANNS will be accurate on a dataset. Both use only information about which points are nearest neighbors, not the exact distances between them, so the approach applies to different distance functions, including inner product.
Approximate Nearest Neighbor Search, Clustering, High-dimensional Data, Euclidean Space, Clustering Quality, Clusterability, Inner Product, Nearest Neighbors, Distance Functions
Authors
Thomas Vecchiato, Sebastian Bruch
Abstract
Clustering-based Approximate Nearest Neighbor Search (ANNS) organizes a set of points into partitions, and searches only a few of them to find the nearest neighbors of a query. Despite its popularity, there are virtually no analytical tools to determine the suitability of clustering-based ANNS for a given dataset -- what we call "searchability." To address that gap, we present two measures for flat clusterings of high-dimensional points in Euclidean space. The first is the Clustering-Neighborhood Stability Measure (clustering-NSM), an internal measure of clustering quality -- a function of a clustering of a dataset -- that we show to be predictive of ANNS accuracy. The second, the Point-Neighborhood Stability Measure (point-NSM), is a measure of clusterability -- a function of the dataset itself -- that is predictive of clustering-NSM. Together, the two allow us to determine whether a dataset is searchable by clustering-based ANNS given only the data points. Importantly, both are functions of nearest-neighbor relationships between points, not distances, making them applicable to various distance functions, including inner product.
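To make the setting concrete, the following is a minimal sketch of clustering-based ANNS as the abstract describes it: partition the points, then search only the few partitions closest to the query. It is an illustration only, not the authors' implementation, and it does not compute clustering-NSM or point-NSM, whose definitions are not given here. The use of plain k-means with squared Euclidean distance, and all names (`kmeans`, `search`, `exact_search`, `recall_at_k`, `n_probe`), are assumptions made for this example.

```python
import numpy as np


def kmeans(points, n_clusters, n_iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice of clustering algorithm)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from distinct randomly chosen data points.
    init = rng.choice(len(points), size=n_clusters, replace=False)
    centroids = points[init].copy()
    for _ in range(n_iters):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = points[assign == c]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[c] = members.mean(axis=0)
    # Final assignment against the final centroids.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, dists.argmin(axis=1)


def search(query, points, centroids, assign, k=10, n_probe=2):
    """Scan only the n_probe partitions whose centroids are closest to the query."""
    centroid_dists = ((centroids - query) ** 2).sum(axis=1)
    probed = np.argsort(centroid_dists)[:n_probe]
    candidates = np.flatnonzero(np.isin(assign, probed))
    cand_dists = ((points[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(cand_dists)[:k]]


def exact_search(query, points, k=10):
    """Brute-force nearest neighbors, used as ground truth."""
    dists = ((points - query) ** 2).sum(axis=1)
    return np.argsort(dists)[:k]


def recall_at_k(queries, points, centroids, assign, k=10, n_probe=2):
    """Fraction of true top-k neighbors that the partitioned search retrieves."""
    hits = 0
    for q in queries:
        approx = search(q, points, centroids, assign, k=k, n_probe=n_probe)
        exact = exact_search(q, points, k=k)
        hits += len(np.intersect1d(approx, exact))
    return hits / (k * len(queries))
```

In this sketch, `recall_at_k` plays the role of the "ANNS accuracy" that the measures are meant to predict: raising `n_probe` trades speed for recall, and probing every partition reduces to exact search.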