In the ever-evolving landscape of information retrieval, large language models (LLMs) and chatbots have taken center stage, ushering in a new era of functionality. At the heart of these advancements lies the pivotal concept of vectors and vector search.
Evolution of Search: From Keywords to Vectors
In the not-so-distant past, traditional search technologies such as Lucene (and the Solr and Elasticsearch engines built on it) relied primarily on keyword search. Content was broken into discrete documents, which were then tokenized to create a "bag of words." While effective to an extent, this approach failed to capture the context, meaning, and semantics of words, leading to vocabulary mismatch. Using keywords alone, it is very hard to distinguish between two texts like these:
Hannibal Lecter loved eating, long walks and his friends
Hannibal Lecter loved eating his friends and long walks
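A minimal sketch makes the problem concrete: tokenize both sentences into bags of words (here with a naive lowercase-and-split tokenizer, just for illustration) and the two bags come out identical, even though the meanings are very different.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Naive tokenizer: lowercase, strip commas, split on whitespace, count words."""
    return Counter(text.lower().replace(",", "").split())

a = bag_of_words("Hannibal Lecter loved eating, long walks and his friends")
b = bag_of_words("Hannibal Lecter loved eating his friends and long walks")

print(a == b)  # True: identical bags of words, despite very different meanings
```

Word order carried all the meaning here, and the bag-of-words representation throws it away.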
Vector search, powered by machine learning and natural language processing, represents a paradigm shift in information retrieval. This transformative approach captures the meaning and context of unstructured data, converting it into numeric representations. Leveraging approximate nearest neighbor (ANN) algorithms, vector search excels in semantic searches, delivering faster and more relevant results compared to traditional keyword searches.
Why Vector Search Matters: Imagine searching for something without knowing its name but having a clear description or understanding of its purpose. Vector search overcomes this challenge by enabling users to search based on meaning, not just keywords. By incorporating vector embeddings that extend beyond text to include videos, images, and audio, this approach enhances the search experience, offering a broader and more accurate retrieval of information.
How Vector Search Works: Unlike traditional methods that rely on keywords and word frequency, vector search engines operate in an embedding space, using distances to represent similarity. This innovative approach allows users to find related data by searching for the nearest neighbors of a query, transcending the limitations of conventional search methods.
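The idea can be sketched with a brute-force nearest-neighbor search. The embeddings below are made-up 4-dimensional vectors (a real system would produce them with an embedding model and far more dimensions), but the mechanics are the same: rank documents by their distance to the query vector.

```python
import numpy as np

# Hypothetical embeddings for a tiny corpus; in practice these would
# come from an embedding model, not be hand-written.
corpus = {
    "doc_a": np.array([0.9, 0.1, 0.0, 0.3]),
    "doc_b": np.array([0.1, 0.8, 0.7, 0.0]),
    "doc_c": np.array([0.85, 0.2, 0.1, 0.25]),
}

def nearest(query: np.ndarray, k: int = 2) -> list[str]:
    """Return the k document ids closest to the query by Euclidean distance."""
    dists = {name: float(np.linalg.norm(vec - query)) for name, vec in corpus.items()}
    return sorted(dists, key=dists.get)[:k]

print(nearest(np.array([0.88, 0.15, 0.05, 0.28])))  # ['doc_a', 'doc_c']
```

An exhaustive scan like this is exact but scales linearly with corpus size, which is exactly why large deployments switch to the ANN indexes discussed below.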
Key Components of Vector Search
Vector Embedding: Numeric representations of data and its context, stored as high-dimensional vectors; the models that produce them can be trained on vast datasets for improved accuracy.
Similarity Score: The core idea that vectors of similar data and documents will exhibit similarity, enabling efficient identification of related content.
ANN Algorithm: Leveraging approximate nearest neighbor algorithms that sacrifice perfect accuracy for efficient execution in high-dimensional embedding spaces at scale.
In the example below, three artists are represented as 2-dimensional vectors of Genre and Country. If we convert Genre and Country into numbers, it is easy to plot them on a 2-dimensional graph. As you can see, Pink Floyd and Jethro Tull are more similar to each other than either is to Snoop Dogg. Fo shizzle!
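The same comparison can be done numerically. The (genre, country) encodings below are invented for illustration; any consistent numeric encoding would show the same pattern: the two prog-rock artists sit close together, far from Snoop Dogg.

```python
import numpy as np

# Made-up 2-D vectors of (genre, country) encoded as numbers, for illustration.
artists = {
    "Pink Floyd":  np.array([1.0, 1.0]),
    "Jethro Tull": np.array([1.2, 1.0]),
    "Snoop Dogg":  np.array([5.0, 4.0]),
}

def dist(a: str, b: str) -> float:
    """Euclidean distance between two artists' vectors."""
    return float(np.linalg.norm(artists[a] - artists[b]))

print(dist("Pink Floyd", "Jethro Tull"))  # small distance: similar artists
print(dist("Pink Floyd", "Snoop Dogg"))   # large distance: dissimilar artists
```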
Using only two dimensions falls short of representing the multitude of concepts found in substantial amounts of text or images. However, the mathematical techniques for computing a similarity score, such as cosine similarity or Euclidean distance, extend to N dimensions quite efficiently. Embeddings can have a dimensionality of 100, 1,000, or even more, ample to encapsulate most of the concepts within a text or image.
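Both measures are a few lines of NumPy at any dimensionality. The vectors below are random stand-ins for real embeddings (384 dimensions is a common size for sentence-embedding models, though that choice is incidental here).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b; ranges from -1 to 1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between a and b; 0 means identical vectors."""
    return float(np.linalg.norm(a - b))

# Random 384-dimensional vectors standing in for real embeddings.
rng = np.random.default_rng(0)
u = rng.normal(size=384)
v = rng.normal(size=384)

print(cosine_similarity(u, v))   # near 0 for unrelated random vectors
print(euclidean_distance(u, v))
```

Note the difference in direction: a higher cosine similarity means more similar, while a smaller Euclidean distance means more similar.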
A big list of search algorithms, including those used in vector search, can be found here.
As we continue to witness the transformative impact of large language models and the products built on them, the role of vector search cannot be overstated. By embracing machine learning and numeric representations, vector search opens new frontiers in information retrieval, offering a more intuitive, efficient, and relevant search experience across data mediums. Looking ahead, the continued evolution of vector search promises to redefine the way we interact with and extract knowledge from the vast sea of information available to us.
Some useful links