`tiktoken` is a fast open-source tokenizer by OpenAI. Given a text string (e.g., `"tiktoken is great!"`) and an encoding (e.g., `"cl100k_base"`), a tokenizer can split the text string into a list of tokens (e.g., `["t", "ik", "token", " is", " great", "!"]`).
Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token).
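For example, here is a minimal sketch of counting tokens with `tiktoken` (the helper name `num_tokens_from_string` is illustrative, not part of the library):

```python
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Return the number of tokens the string encodes to under the given encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))

print(num_tokens_from_string("tiktoken is great!", "cl100k_base"))  # 6
```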
Encodings
Encodings specify how text is converted into tokens. Different models use different encodings. `tiktoken` supports the following encodings used by OpenAI models:
| Encoding name | OpenAI models |
|---|---|
| `o200k_base` | `gpt-4o`, `gpt-4o-mini` |
| `cl100k_base` | `gpt-4-turbo`, `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large` |
| `p50k_base` | Codex models, `text-davinci-002`, `text-davinci-003` |
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci` |
You can retrieve the encoding for a model using `tiktoken.encoding_for_model()` as follows:

```python
encoding = tiktoken.encoding_for_model('gpt-4o-mini')
```
Note that `p50k_base` overlaps substantially with `r50k_base`, and for non-code applications, they will usually give the same tokens.
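As a quick sanity check, you can tokenize the same string under both encodings and compare the results (the sample sentence is illustrative):

```python
import tiktoken

text = "This sentence contains no code."
p50k = tiktoken.get_encoding("p50k_base")
r50k = tiktoken.get_encoding("r50k_base")

# For ordinary English prose the two encodings usually produce identical token lists.
print(p50k.encode(text))
print(r50k.encode(text))
```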
Tokenizer libraries by language
For `o200k_base`, `cl100k_base`, and `p50k_base` encodings:
- Python: tiktoken
- .NET / C#: SharpToken, TiktokenSharp
- Java: jtokkit
- Golang: tiktoken-go
- Rust: tiktoken-rs
For `r50k_base` (`gpt2`) encodings, tokenizers are available in many languages:
- Python: tiktoken (or alternatively GPT2TokenizerFast)
- JavaScript: gpt-3-encoder
- .NET / C#: GPT Tokenizer
- Java: gpt2-tokenizer-java
- PHP: GPT-3-Encoder-PHP
- Golang: tiktoken-go
- Rust: tiktoken-rs
(OpenAI makes no endorsements or guarantees of third-party libraries.)
How strings are typically tokenized
In English, tokens commonly range in length from one character to one word (e.g., `"t"` or `" great"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `" is"` instead of `"is "` or `" "` + `"is"`). You can quickly check how a string is tokenized at the OpenAI Tokenizer, or at the third-party Tiktokenizer webapp.
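You can also inspect a tokenization locally by decoding each token id back to the bytes it represents; a short sketch using the example string from the beginning of this page:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("tiktoken is great!")

# Turn each token id back into the bytes it stands for, to see where the
# string was split; note that spaces attach to the start of the next word.
print([encoding.decode_single_token_bytes(t) for t in tokens])
# [b't', b'ik', b'token', b' is', b' great', b'!']
```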