This notebook takes you through a simple flow to download some data, embed it, and then index and search it using a selection of vector databases. This is a common requirement for customers who want to store and search our embeddings with their own data in a secure environment to support production use cases such as chatbots, topic modelling and more.
A vector database is a database made to store, manage and search embedding vectors. The use of embeddings to encode unstructured data (text, audio, video and more) as vectors for consumption by machine-learning models has exploded in recent years, due to the increasing effectiveness of AI in solving use cases involving natural language, image recognition and other unstructured forms of data. Vector databases have emerged as an effective solution for enterprises to deliver and scale these use cases.
Vector databases enable enterprises to take many of the embeddings use cases we've shared in this repo (question answering, chatbots and recommendation services, for example), and make use of them in a secure, scalable environment. Many of our customers use embeddings to solve their problems at small scale, but performance and security hold them back from going into production - we see vector databases as a key component in solving that, and in this guide we'll walk through the basics of embedding text data, storing it in a vector database and using it for semantic search.
Setup: Import packages and set any required variables
Load data: Load a dataset and embed it using OpenAI embeddings
Typesense
Setup: Set up the Typesense Python client. For more details, see the Typesense documentation.
Index Data: We'll create a collection and index it for both titles and content.
Search Data: Run a few example queries with various goals in mind.
Once you've run through this notebook you should have a basic understanding of how to set up and use vector databases, and can move on to more complex use cases that make use of our embeddings.
Import the required libraries and set the embedding model that we'd like to use.
# We'll need to install the Typesense client
!pip install typesense

# Install wget to pull the zip file
!pip install wget
import openai
from typing import List, Iterator
import pandas as pd
import numpy as np
import os
import wget
from ast import literal_eval

# Typesense's client library for Python
import typesense

# I've set this to our new embeddings model; this can be changed to the embedding model of your choice
EMBEDDING_MODEL = "text-embedding-3-small"

# Ignore unclosed SSL socket warnings - optional in case you get these errors
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
In this section we'll load embedded data that was prepared ahead of this session.
embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)
import zipfile

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("../data")
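Next, load the extracted CSV into a pandas DataFrame so we can work with the articles and their precomputed embeddings. The file name below is assumed to match the archive name; adjust the path if your extraction produced something different.

# Load the embedded articles into a DataFrame
# (file name assumed to match the zip above; adjust if needed)
article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')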
# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)
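As a quick sanity check, the vector columns should now contain Python lists of floats, and both vectors should share the model's embedding dimensionality.

# Sanity check: the vector columns should now hold lists of floats
print(len(article_df['title_vector'][0]), len(article_df['content_vector'][0]))
article_df.head(2)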
The next vector store we'll look at is Typesense, an open source, in-memory search engine that you can either self-host or run on Typesense Cloud.
Typesense focuses on performance by storing the entire index in RAM (with a backup on disk) and also focuses on providing an out-of-the-box developer experience by simplifying available options and setting good defaults. It also lets you combine attribute-based filtering together with vector queries.
For this example, we will set up a local, Docker-based Typesense server, index our vectors in Typesense and then run some nearest-neighbor search queries. If you use Typesense Cloud, you can skip the Docker setup and just obtain the hostname and API keys from your cluster dashboard.
To run Typesense locally, you'll need Docker. Following the instructions in the Typesense documentation, we created an example docker-compose.yml file in this repo, saved at ./typesense/docker-compose.yml.
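For reference, here's a minimal sketch of what such a compose file might contain; the image tag and volume path are assumptions, so treat the repo's ./typesense/docker-compose.yml as the source of truth.

version: "3.4"
services:
  typesense:
    image: typesense/typesense:0.24.1   # version tag is an assumption
    restart: on-failure
    ports:
      - "8108:8108"
    volumes:
      - ./typesense-data:/data          # local data directory (assumed path)
    command: "--data-dir /data --api-key=xyz --enable-cors"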
After starting Docker, you can start Typesense locally by navigating to the examples/vector_databases/typesense/ directory and running docker-compose up -d.
The default API key is set to xyz in the Docker compose file, and the default Typesense port to 8108.
import typesense

typesense_client = typesense.Client({
    "nodes": [{
        "host": "localhost",  # For Typesense Cloud use xxx.a1.typesense.net
        "port": "8108",       # For Typesense Cloud use 443
        "protocol": "http"    # For Typesense Cloud use https
    }],
    "api_key": "xyz",
    "connection_timeout_seconds": 60
})
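Before indexing anything, it's worth verifying the server is reachable. This quick check (not part of the original flow) hits Typesense's /health endpoint, which requires no API key and returns {"ok": true} when the server is up.

import requests

# Optional connectivity check against the local Typesense server
response = requests.get("http://localhost:8108/health")
print(response.json())  # expect: {'ok': True}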
To index vectors in Typesense, we'll first create a Collection (which is a collection of Documents) and turn on vector indexing for a particular field. You can even store multiple vector fields in a single document.
# Delete existing collections if they already exist
try:
    typesense_client.collections['wikipedia_articles'].delete()
except Exception as e:
    pass

# Create a new collection
schema = {
    "name": "wikipedia_articles",
    "fields": [
        {
            "name": "content_vector",
            "type": "float[]",
            "num_dim": len(article_df['content_vector'][0])
        },
        {
            "name": "title_vector",
            "type": "float[]",
            "num_dim": len(article_df['title_vector'][0])
        }
    ]
}

create_response = typesense_client.collections.create(schema)
print(create_response)
print("Created new collection wikipedia_articles")
# Upsert the vector data into the collection we just created
#
# Note: This can take a few minutes, especially if you're on an M1 and running Docker in emulated mode
print("Indexing vectors in Typesense...")

document_counter = 0
documents_batch = []

for k, v in article_df.iterrows():
    # Create a document with the vector data.
    # Notice how you can add fields that you haven't added to the schema to the document.
    # These will be stored on disk and returned when the document is a hit.
    # This is useful to store attributes required for display purposes.
    document = {
        "title_vector": v["title_vector"],
        "content_vector": v["content_vector"],
        "title": v["title"],
        "content": v["text"],
    }
    documents_batch.append(document)
    document_counter = document_counter + 1

    # Upsert a batch of 100 documents
    if document_counter % 100 == 0 or document_counter == len(article_df):
        response = typesense_client.collections['wikipedia_articles'].documents.import_(documents_batch)
        # print(response)
        documents_batch = []
        print(f"Processed {document_counter} / {len(article_df)}")

print(f"Imported ({len(article_df)}) articles.")
# Check the number of documents imported
collection = typesense_client.collections['wikipedia_articles'].retrieve()
print(f'Collection has {collection["num_documents"]} documents')
Now that we've imported the vectors into Typesense, we can do a nearest neighbor search on the title_vector or content_vector field.
def query_typesense(query, field='title', top_k=20):
    # Create an embedding vector from the user query
    # (reads your OpenAI API key from the environment; the second argument is a placeholder fallback)
    openai.api_key = os.getenv("OPENAI_API_KEY", "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
    embedded_query = openai.Embedding.create(
        input=query,
        model=EMBEDDING_MODEL,
    )['data'][0]['embedding']

    # Run a nearest-neighbor search against the chosen vector field
    typesense_results = typesense_client.multi_search.perform({
        "searches": [{
            "q": "*",
            "collection": "wikipedia_articles",
            "vector_query": f"{field}_vector:([{','.join(str(v) for v in embedded_query)}], k:{top_k})"
        }]
    }, {})

    return typesense_results
query_results = query_typesense('modern art in Europe', 'title')

for i, hit in enumerate(query_results['results'][0]['hits']):
    document = hit["document"]
    vector_distance = hit["vector_distance"]
    print(f'{i + 1}. {document["title"]} (Distance: {vector_distance})')
1. Museum of Modern Art (Distance: 0.12482291460037231)
2. Western Europe (Distance: 0.13255876302719116)
3. Renaissance art (Distance: 0.13584274053573608)
4. Pop art (Distance: 0.1396539807319641)
5. Northern Europe (Distance: 0.14534103870391846)
6. Hellenistic art (Distance: 0.1472070813179016)
7. Modernist literature (Distance: 0.15296930074691772)
8. Art film (Distance: 0.1567266583442688)
9. Central Europe (Distance: 0.15741699934005737)
10. European (Distance: 0.1585891842842102)
query_results = query_typesense('Famous battles in Scottish history', 'content')

for i, hit in enumerate(query_results['results'][0]['hits']):
    document = hit["document"]
    vector_distance = hit["vector_distance"]
    print(f'{i + 1}. {document["title"]} (Distance: {vector_distance})')
1. Battle of Bannockburn (Distance: 0.1306111216545105)
2. Wars of Scottish Independence (Distance: 0.1384994387626648)
3. 1651 (Distance: 0.14744246006011963)
4. First War of Scottish Independence (Distance: 0.15033596754074097)
5. Robert I of Scotland (Distance: 0.15376019477844238)
6. 841 (Distance: 0.15609073638916016)
7. 1716 (Distance: 0.15615153312683105)
8. 1314 (Distance: 0.16280347108840942)
9. 1263 (Distance: 0.16361045837402344)
10. William Wallace (Distance: 0.16464537382125854)
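As mentioned earlier, Typesense also lets you combine attribute-based filtering with vector queries. The sketch below is hypothetical: it assumes the schema had additionally declared an indexed scalar field (e.g. a views int32 field) and that each imported document carried a value for it; our schema above only indexes the two vector fields, so treat this as illustrative.

# Hypothetical: combine an attribute filter with a nearest-neighbor query.
# Assumes the schema also declared {"name": "views", "type": "int32"} and
# that documents were imported with a "views" value.
embedded_query = openai.Embedding.create(
    input='modern art in Europe',
    model=EMBEDDING_MODEL,
)['data'][0]['embedding']

filtered_results = typesense_client.multi_search.perform({
    "searches": [{
        "q": "*",
        "collection": "wikipedia_articles",
        "filter_by": "views:>1000",  # attribute filter applied alongside the vector query
        "vector_query": f"title_vector:([{','.join(str(v) for v in embedded_query)}], k:10)"
    }]
}, {})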
Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo.