Vector Databases Made Easy

Author: Pat Lasserre

 


 

I was talking with Dmitry Kan, Principal AI Scientist at Silo AI, and he mentioned that one of the big benefits of our Elasticsearch and OpenSearch k-NN plugins is that they allow users to easily add a production-grade vector similarity search database to their search pipeline.

Leveraging Your Current Elasticsearch and OpenSearch Installation

Dmitry said that a lot of search teams already use Elasticsearch and AWS, so it would be easier for them to use our Elasticsearch and OpenSearch k-NN plugins with their current installations than to learn new software for one of the available vector databases. Additionally, as we mentioned in a previous post, users said that it would be simpler for them to use an approximate nearest neighbor (ANN) solution directly in Elasticsearch rather than trying to integrate and productionize one of the popular open-source vector search libraries in their search platform.

Dmitry mentioned that while there are a few vector database options, such as Milvus, Pinecone, Vespa, and Weaviate, those options require a user to learn a new software platform. Many search teams are thin on resources, so having to learn new software in order to add vector similarity search to their application might not be their top choice.

While popular open-source vector search libraries, like NMSLIB or Faiss, provide good results for nearest neighbor benchmarks, they aren’t easy to productionize. They are just libraries, not full vector similarity search systems, so a user would still need to build a distributed system around them that scales. They would also have to manage the indexing of those distributed systems, which isn’t easy.
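To make the "build a distributed system around them" point concrete, here is a minimal sketch of the scatter-gather step such a system needs: each shard returns its own top-k, and a coordinator merges the partial lists. It is only an illustration. The function names are hypothetical, the shards live in one process, and an exact NumPy search stands in for a per-node library index such as NMSLIB or Faiss.

```python
import heapq
import numpy as np

def search_shard(shard_vectors, shard_ids, query, k):
    """Exact top-k by cosine similarity inside one shard.

    Stand-in for a per-node vector library index (assumption for illustration).
    """
    norms = np.linalg.norm(shard_vectors, axis=1) * np.linalg.norm(query)
    scores = shard_vectors @ query / norms
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), shard_ids[i]) for i in top]

def search_all_shards(shards, query, k):
    """Send the query to every shard, then merge the partial top-k lists."""
    partial = []
    for vectors, ids in shards:
        partial.extend(search_shard(vectors, ids, query, k))
    # Keep the k best (score, id) pairs across all shards.
    return heapq.nlargest(k, partial)

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 16))
ids = list(range(1000))
# Split the data into 4 hypothetical shards.
shards = [(vectors[i::4], ids[i::4]) for i in range(4)]
query = rng.normal(size=16)

results = search_all_shards(shards, query, k=5)
print([doc_id for _, doc_id in results])
```

Even this toy version leaves out everything that makes the real problem hard: replication, shard rebalancing, index updates, and failure handling. Those are the parts a platform like Elasticsearch already provides.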

Also, building the graphs and indexes for graph-based libraries like NMSLIB is slow and memory-intensive. It is computationally expensive and takes a lot of memory to store each node’s neighbors. This limits the scalability of those libraries.

On the other end of this ease-of-use spectrum are our Elasticsearch and OpenSearch k-NN plugins. Installing the plugins is easy, and they allow for vector similarity search to be run as simply as any standard Elasticsearch query. The Elasticsearch plugin provides similarity search results in the standard Elasticsearch format, so a user doesn’t have to learn new software. They simply need to install the plugins, which use the existing query interfaces. This allows for a simple-to-use, ready-for-production vector similarity search solution that helps preserve end users’ engineering resources.

The GSI Elasticsearch k-NN plugin also addresses the previously mentioned scaling and indexing issues seen with many of the open-source vector search libraries because it leverages two of Elasticsearch’s key strengths — horizontal scaling and index management. Elasticsearch is a distributed system that easily scales horizontally, and since the plugin uses the core Elasticsearch dense_vector field type and index mapping, there is no need to reindex documents.
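For readers who have not used the dense_vector field type, the sketch below shows what a standard Elasticsearch index mapping with such a field looks like, written as a Python dictionary. The field names (`title`, `embedding`) and the 4-dimensional vector are assumptions chosen for illustration, and nothing here is specific to the GSI plugin, whose own settings are not shown.

```python
import json

VECTOR_DIMS = 4  # hypothetical; real embeddings usually have hundreds of dimensions

# Standard Elasticsearch index body with a core dense_vector field.
index_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": VECTOR_DIMS},
        }
    }
}

# A document stores its vector as a plain list of floats in that field.
document = {"title": "red running shoe", "embedding": [0.12, -0.48, 0.33, 0.90]}
assert len(document["embedding"]) == VECTOR_DIMS

print(json.dumps(index_body, indent=2))
```

Because the vectors sit in an ordinary Elasticsearch field, the usual sharding and index management apply to them just as they do to any other document field.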

Extending Elasticsearch and OpenSearch

The k-NN plugins also extend the functionality of Elasticsearch and OpenSearch by addressing some of their limitations.

As was noted in a previous post, Elasticsearch doesn’t support approximate nearest neighbor (ANN) search. Instead, it does an exhaustive “match_all” search, using cosine similarity to score the query against every item in the list. Cosine similarity is relatively computationally expensive, which is why a restrictive query is typically run first to limit the number of vectors that are scored. This restricts the Elastic nearest neighbor search to very small datasets.
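The exhaustive search described above is usually expressed as a script_score query. The helper below is a rough sketch that builds such a request body in Python. The helper name, the `embedding` field, and the `brand` filter are hypothetical, and depending on the Elasticsearch version the field may need to be referenced as `doc['embedding']` inside the script.

```python
def exact_cosine_query(query_vector, filter_query=None, size=10):
    """Build a brute-force cosine-similarity request body.

    With no filter, every document is scored (match_all). Passing a
    restrictive filter_query shrinks the set of vectors that get scored.
    """
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": filter_query or {"match_all": {}},
                "script": {
                    # +1.0 keeps scores non-negative, as Elasticsearch requires.
                    "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }

# Scores every document in the index.
exhaustive = exact_cosine_query([0.1, 0.2, 0.3, 0.4])

# Scores only documents of one (hypothetical) brand.
restricted = exact_cosine_query(
    [0.1, 0.2, 0.3, 0.4],
    filter_query={"term": {"brand": "acme"}},
)
```

The cost of the first form grows linearly with the number of documents, which is why it only works well once the inner query has already cut the candidate set down.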

The GSI Elasticsearch k-NN plugin fills this gap in the Elastic offering by providing an approximate nearest neighbor solution that scales to billions of vectors. This opens up nearest neighbor search using Elasticsearch to many more applications and use cases since it is not limited to small databases like the base Elastic offering.

The GSI OpenSearch k-NN plugin also addresses one of the key limitations of OpenSearch, namely its lack of pre-filter support for nearest neighbor vector search. Pre-filtering on metadata is a common requirement in search applications. For example, product metadata such as item description, item title, category, and brand is often used as a pre-filter to a search query.

The developers of OpenSearch (previously Open Distro for Elasticsearch) state: “Because the graphs are constructed during indexing, it is not possible to apply a filter on an index and then use this search method. All filters will be applied on the results produced by the approximate nearest neighbor search.” This means that they only support post-filtering of the approximate nearest neighbor results.

Post-filtering, however, is problematic because it has a high likelihood of returning far fewer results than the intended k-nearest neighbors. For example, if a user wanted to return the 10 nearest neighbors (k=10) to an image in a visual search application, after applying a filter, such as price or color, perhaps only 2 of those 10 nearest neighbors remain. In fact, it could lead to zero results being returned. This leads to an unsatisfying user experience since very few, or no, relevant results are returned for a particular search query. To try to increase the probability of returning relevant results, you could increase the number of nearest neighbors returned (k), but this comes with a tradeoff of increased latency. Also, depending on how restrictive the filter is, this still might lead to few, or no, relevant results being surfaced.
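A small simulation makes the shortfall easy to see. The sketch below uses synthetic data, with an exact NumPy search standing in for the approximate one, and a made-up `colors` attribute in which only about 2% of items are red. All names and numbers are assumptions for illustration.

```python
import numpy as np

def top_k_ids(vectors, query, k):
    """Ids of the k rows most similar to the query (exact search, stands in for ANN)."""
    scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    return [int(i) for i in np.argsort(-scores)[:k]]

def post_filter_search(vectors, colors, query, k, wanted_color):
    """Run k-NN first, then drop the hits that fail the metadata filter."""
    return [i for i in top_k_ids(vectors, query, k) if colors[i] == wanted_color]

rng = np.random.default_rng(42)
vectors = rng.normal(size=(10_000, 32))
colors = rng.choice(
    ["red", "blue", "green", "black", "white"],
    size=10_000,
    p=[0.02, 0.28, 0.30, 0.20, 0.20],
)
query = rng.normal(size=32)

# Raising k recovers more matches, but every extra neighbor costs latency.
for k in (10, 50, 200):
    hits = post_filter_search(vectors, colors, query, k, "red")
    print(f"k={k:>3}: {len(hits)} results survive the filter")
```

With a filter this selective, most of the retrieved neighbors are thrown away, so the number of surviving results is usually well below the requested k.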

GSI’s OpenSearch k-NN plugin doesn’t have this restriction — it allows for pre-filtering of documents prior to running an approximate k-NN vector similarity search.
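For contrast, here is a minimal sketch of the pre-filtering idea in general: apply the metadata filter first, then rank only the documents that passed it. This is not GSI’s implementation, which the post does not describe. It uses exact search over synthetic data, and the function name and `colors` attribute are hypothetical.

```python
import numpy as np

def pre_filter_search(vectors, colors, query, k, wanted_color):
    """Apply the metadata filter first, then rank only the surviving rows."""
    candidates = np.flatnonzero(colors == wanted_color)
    subset = vectors[candidates]
    scores = subset @ query / (np.linalg.norm(subset, axis=1) * np.linalg.norm(query))
    best = np.argsort(-scores)[:k]
    # Map positions in the filtered subset back to original row ids.
    return [int(candidates[i]) for i in best]

rng = np.random.default_rng(7)
vectors = rng.normal(size=(10_000, 32))
colors = rng.choice(np.array(["red", "blue", "green"]), size=10_000, p=[0.02, 0.49, 0.49])
query = rng.normal(size=32)

hits = pre_filter_search(vectors, colors, query, k=10, wanted_color="red")
print(len(hits), "results, all matching the filter")
```

As long as at least k documents pass the filter, the search returns a full set of k results, each of which already satisfies the filter.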

Conclusion

Today, vector similarity search users have many options, including open-source vector search libraries, vector database platforms, and search and analytics platforms like Elasticsearch and OpenSearch.

The open-source vector search libraries require a user to handle the complex tasks needed to productionize them: for example, building a distributed system that scales horizontally and managing the indexing of those distributed systems.

The vector database platforms require a user to learn a new software platform, which isn’t an attractive proposition for a resource-constrained team.

Elasticsearch and OpenSearch have many users who have already invested the time and resources to build a robust, production-grade search application around those platforms. GSI’s Elasticsearch and OpenSearch k-NN plugins allow those users to leverage that investment by giving them an easy way to add a production-grade vector similarity search database to those platforms while helping them preserve their engineering resources.

©2024 GSI Technology, Inc. All Rights Reserved