Get embeddings from dataset

This notebook shows how to get embeddings for a large dataset.

1. Load the dataset

The dataset used in this example consists of fine-food reviews from Amazon. It contains a total of 568,454 food reviews that Amazon users left up to October 2012. For illustration purposes, we will use a subset consisting of the 1,000 most recent reviews. The reviews are in English and tend to be either positive or negative. Each review has a ProductId, UserId, Score, review title (Summary) and review body (Text).

We will combine the review summary and review body into a single combined text. The model encodes this combined text and outputs a single embedding vector.

To run this notebook, you will need to install: pandas, openai, transformers, plotly, matplotlib, scikit-learn, torch (a transformers dependency), torchvision, and scipy.
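For example, with pip (versions are left unpinned here; pin them as appropriate for your environment):

pip install pandas openai transformers plotly matplotlib scikit-learn torch torchvision scipy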

import pandas as pd
import tiktoken

from utils.embeddings_utils import get_embedding
embedding_model = "text-embedding-3-small"
embedding_encoding = "cl100k_base"
max_tokens = 8000  # the maximum for text-embedding-3-small is 8191
# load & inspect dataset
input_datapath = "data/fine_food_reviews_1k.csv"  # to save space, we provide a pre-filtered dataset
df = pd.read_csv(input_datapath, index_col=0)
df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]
df = df.dropna()
df["combined"] = (
    "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
)
df.head(2)
   Time        ProductId   UserId          Score  Summary                                            Text                                               combined
0  1351123200  B003XPF9BO  A3R7JR3FMEBXQB      5  where does one start...and stop... with a tre...  Wanted to save some to bring to my Chicago fam...  Title: where does one start...and stop... wit...
1  1351123200  B003JK537S  A3JBPC3WFUT5ZP      1  Arrived in pieces                                  Not pleased at all. When I opened the box, mos...  Title: Arrived in pieces; Content: Not pleased...
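As an optional sanity check, recent tiktoken releases can resolve the encoding directly from the model name, so you can confirm that the hard-coded embedding_encoding matches the model (this assumes your tiktoken version knows about text-embedding-3-small):

# optional: confirm the encoding for the model (requires a recent tiktoken)
assert tiktoken.encoding_for_model(embedding_model).name == embedding_encoding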
# subsample to 1k most recent reviews and remove samples that are too long
top_n = 1000
df = df.sort_values("Time").tail(top_n * 2)  # first cut to the 2,000 most recent entries, assuming fewer than half will be filtered out
df.drop("Time", axis=1, inplace=True)

encoding = tiktoken.get_encoding(embedding_encoding)

# omit reviews that are too long to embed
df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens <= max_tokens].tail(top_n)
len(df)
1000
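Here, get_embedding (imported from the cookbook's utils/embeddings_utils.py) is a thin wrapper around the OpenAI embeddings endpoint. A minimal sketch of an equivalent helper, assuming the current openai Python client (the names get_embedding_sketch and client are illustrative, not the cookbook's exact implementation):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding_sketch(text: str, model: str = "text-embedding-3-small") -> list[float]:
    # replacing newlines has historically helped embedding quality
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding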
# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage

# This may take a few minutes
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, model=embedding_model))
df.to_csv("data/fine_food_reviews_with_embeddings_1k.csv")
# quick sanity check: embed a single short string
a = get_embedding("hi", model=embedding_model)
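Note that to_csv stores each embedding as its string representation. If you later reload the CSV, the embedding column comes back as strings; one way to parse it back into numeric arrays:

# reload the saved file; embeddings come back as strings and must be parsed
import ast

import numpy as np
import pandas as pd

df = pd.read_csv("data/fine_food_reviews_with_embeddings_1k.csv", index_col=0)
df["embedding"] = df.embedding.apply(ast.literal_eval).apply(np.array)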