Embeddings are a powerful tool produced by AI models, enabling semantic comparison of strings. This makes them useful for numerous applications, such as search, clustering, recommendations, anomaly detection, diversity measurement, and classification.

Understanding Embeddings

In essence, an embedding is a list of numbers that describes a piece of text, according to a specific model.

For example, OpenAI’s text-embedding-ada-002 model produces an array of 1,536 numbers, where each number captures some aspect of the text. Two arrays are considered similar to the extent that they have similar values, element by element.
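
To make this concrete, here is a minimal sketch of fetching such an embedding, assuming the openai Python package with its v1-style client and an OPENAI_API_KEY environment variable; the input string is only an illustration:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="The food was delicious",
)

vector = response.data[0].embedding  # a plain Python list of floats
print(len(vector))  # 1536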

This similarity can be calculated even without understanding what each individual value represents, which is both the beauty and the mystery of embeddings. Note, however, that arrays should only be compared if they come from the same model, as different models produce vastly different arrays.
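
Concretely, the most common measure, cosine similarity, can be computed from the raw values in a few lines of NumPy. A sketch, assuming two same-length vectors a and b produced by the same model (the helper name cosine_sim is arbitrary):

import numpy as np

def cosine_sim(a, b):
    # Dot product divided by the product of the vector lengths.
    # Ranges from -1 (opposite) to 1 (same direction).
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))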

Generating Embeddings

To convert text into an embedding, several models can be used. Here is an example of using an open-source library, sentence-transformers, to generate embeddings in Python:

from sentence_transformers import SentenceTransformer

# Load a small, general-purpose embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample sentences
sentences = ["This is an example sentence", "This is another one"]

# encode() returns one embedding vector per input sentence
embeddings = model.encode(sentences)
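
Each row of embeddings is the vector for the corresponding sentence; all-MiniLM-L6-v2 produces 384-dimensional vectors, so embeddings.shape is (2, 384) here.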

Comparing Embeddings

Once the embeddings are generated, they can be compared using various similarity measures such as cosine similarity or Euclidean distance. Here’s an example of comparing embeddings using cosine similarity in Python:

from sklearn.metrics.pairwise import cosine_similarity

# embeddings is the 2D array from above, with one embedding per row;
# cosine_similarity returns the matrix of pairwise similarities
similarity_matrix = cosine_similarity(embeddings)

# Similarity between the first and second sentences
print(similarity_matrix[0][1])
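
Euclidean distance works analogously, except that smaller values mean the vectors are closer; one way to compute it is with scikit-learn’s euclidean_distances on the same embeddings array:

from sklearn.metrics.pairwise import euclidean_distances

# Pairwise distances between rows; unlike similarity, smaller means closer
distance_matrix = euclidean_distances(embeddings)
print(distance_matrix[0][1])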

Storing Embeddings

After obtaining an embedding vector, it can either be used immediately to search for similar strings or stored for future comparisons. If you plan to store many thousands of vectors, a dedicated vector database is recommended: it can rapidly find nearby vectors without comparing against every stored vector each time. If you only have a small number of vectors, however, they can be stored directly in a normal database, then fetched and compared in memory within milliseconds.
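
For a sense of the small-scale case, a brute-force nearest-neighbor search is just a matrix operation. A sketch, assuming stored is a NumPy array of saved vectors (one per row) and query is a vector from the same model; the helper name most_similar is arbitrary:

import numpy as np

def most_similar(query, stored, top_k=5):
    # Normalize rows so a dot product equals cosine similarity
    stored = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = stored @ query                  # one score per stored vector
    best = np.argsort(scores)[::-1][:top_k]  # indices of the highest scores
    return best, scores[best]

At a few thousand vectors this scan completes in milliseconds; a vector database earns its keep once the collection is large enough that scanning every vector, or holding them all in memory, becomes the bottleneck.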