Aug 6, 2025

How to run gpt-oss-20b on Google Colab

OpenAI released two open-weight models, gpt-oss-120b and gpt-oss-20b. Both are Apache 2.0 licensed.

Specifically, gpt-oss-20b was made for lower latency and local or specialized use cases (21B parameters with 3.6B active parameters).

Because the models were trained with native MXFP4 quantization, the 20B model is easy to run even in resource-constrained environments like Google Colab.
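To see why 4-bit weights matter here, a back-of-the-envelope estimate helps. The sketch below assumes that every one of the 21B parameters is stored in MXFP4 (4-bit elements plus one shared 8-bit scale per block of 32 elements) and compares that with 16-bit storage. This is a simplification: some layers may be kept in higher precision, and activations and the KV cache are not counted, so the real footprint will be somewhat larger.

```python
# Rough weight-memory estimate for gpt-oss-20b (illustrative arithmetic only).
TOTAL_PARAMS = 21e9   # parameter count quoted in the text
ELEMENT_BITS = 4      # MXFP4 element width
SCALE_BITS = 8        # one shared scale per block
BLOCK_SIZE = 32       # elements per block in the MX format


def weight_gigabytes(num_params, bits_per_param):
    """Return the storage needed for the weights, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Effective bits per weight once the shared scale is amortized over the block.
mxfp4_bits = ELEMENT_BITS + SCALE_BITS / BLOCK_SIZE  # 4.25

mxfp4_gb = weight_gigabytes(TOTAL_PARAMS, mxfp4_bits)
bf16_gb = weight_gigabytes(TOTAL_PARAMS, 16)

print(f"MXFP4: ~{mxfp4_gb:.1f} GB, bf16: ~{bf16_gb:.1f} GB")
```

Under these assumptions the quantized weights need roughly a quarter of the memory that 16-bit weights would.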

Authored by: Pedro and VB

Since support for mxfp4 in transformers is bleeding edge, we need a recent version of PyTorch and CUDA to be able to install the mxfp4 triton kernels.

We also need to install transformers from source, and we uninstall torchvision and torchaudio to remove dependency conflicts.

!pip install -q --upgrade torch
!pip install -q git+https://github.com/huggingface/transformers triton==3.4 git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
!pip uninstall -q torchvision torchaudio -y
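After the install cells finish, it can be worth confirming which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names in `PACKAGES` are assumptions based on the install commands above.

```python
from importlib import metadata

# Distribution names assumed from the pip commands above.
PACKAGES = ("torch", "transformers", "triton", "triton_kernels")


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in PACKAGES:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If any of these report as not installed, restarting the Colab runtime and re-running the install cells is a reasonable first step.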
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="cuda",
)
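Once the weights are loaded, it can be useful to check how much memory they occupy. The sketch below assumes the `model` object from the cell above and relies on the `get_memory_footprint()` method that transformers models provide; the helper `format_gib` is a hypothetical name introduced here for readability.

```python
def format_gib(num_bytes):
    """Convert a byte count into a human-readable GiB string."""
    return f"{num_bytes / 1024**3:.2f} GiB"


# In the notebook, with `model` loaded as above:
#   print(format_gib(model.get_memory_footprint()))

# Illustrative value only, not a measurement of gpt-oss-20b.
print(format_gib(12 * 1024**3))
```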
messages = [
    {"role": "system", "content": "Always respond in riddles"},
    {"role": "user", "content": "What is the weather like in Madrid?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
<|channel|>analysis<|message|>The user asks a question: "What is the weather like in Madrid?" The developer instruction says: "Always respond in riddles". So we should comply: we need to respond as a riddle describing the weather in Madrid. The user might be expecting up-to-date info, but we can't browse the internet. We can give a general riddle about typical weather. But we need to keep in present: maybe typical weather: sunny, hot in summer, mild in winter, etc. We'll respond with a riddle. The user likely wants the answer: current weather? But we can't provide it. Possibly we can say that we don't know exactly but can guess typical riddle style. They want a riddle. We'll answer: riddle about weather. So: "Behold the skies above..." It's a riddle. Provide plausible description but disguised as riddle. Let's comply.<|end|><|start|>assistant<|channel|>final<|message|>Behold the Madrid sky—  
In summer it is king, a golden blaze that never sleeps,  
Yet as the moon arises, a cool caress descends,  
Winter's breath arrives, a pale but gentle hush.  

What am I? I dress the city in warmth and in coolness,  
I dance with sun, and when I’m shy, clouds take my hand.   
Who or what am I?  The weather of Madrid.<|return|>
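The decoded text above contains two parts: an `analysis` channel with the model's reasoning and a `final` channel with the answer. A minimal sketch of separating them is shown below. It assumes the token layout seen in this output (`<|channel|>NAME<|message|>TEXT` terminated by `<|end|>` or `<|return|>`); the function name `split_channels` is hypothetical, and a dedicated parser for the response format would be more robust than a regular expression.

```python
import re

# Matches "<|channel|>NAME<|message|>TEXT" up to the next terminator token.
CHANNEL_RE = re.compile(
    r"<\|channel\|>(?P<channel>\w+)<\|message\|>(?P<text>.*?)"
    r"(?:<\|end\|>|<\|return\|>|$)",
    re.DOTALL,
)


def split_channels(decoded):
    """Map each channel name to its message text."""
    return {m["channel"]: m["text"].strip() for m in CHANNEL_RE.finditer(decoded)}


# Shortened sample in the same layout as the output above.
sample = (
    "<|channel|>analysis<|message|>We should answer with a riddle.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>Behold the Madrid sky.<|return|>"
)

parts = split_channels(sample)
print(parts["final"])
```

In the notebook, passing the decoded string from the cell above to `split_channels` would let you display only the `final` text to a user while keeping the `analysis` text for debugging.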
messages = [
    {"role": "system", "content": "Always respond in riddles"},
    {"role": "user", "content": "Explain why the meaning of life is 42"},
]

# reasoning_effort is a chat-template argument, not a field of the message
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="high",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
<|channel|>analysis<|message|>The instruction says: "Always respond in riddles." So we must produce a riddle answering that the meaning of life is 42. We need to explain why 42 is the meaning of life, but in a riddle format. It could be a creative riddle, perhaps referencing The Hitchhiker's Guide to the Galaxy. Use rhetorical. Must respond in riddles: The response must be in riddle form. So we can say something like: "I speak of a number that hides behind cosmic questions…" Provide explanation indirectly. We need to "explain" why meaning is 42 but within a riddle. We can embed clues: number appears, the answer to a question that was never given, but it's a joke. We can say: "Three times? The sum of a dozen and a twin." And give reasoning in riddle form. So answer as riddle. Let's form one or two stanzas.

We must keep it a riddle. Maybe also ask: "Who am I?" The riddle might answer 42. But we need to explain.

Better: "I hold both math and mystery; no puzzle asks me alone." Provide explanation. We can incorporate lines referencing the book.

Possible answer: "I am the answer to everything, yet you find me in the absurd."

We'll ensure it's riddled.

Ok.<|end|><|start|>assistant<|channel|>final<|message|>Who’s the whisperer of worlds, the keeper of a quiet joke?  
I’m found in a question with no mouth, no spoke.  
A book of stars and tea‑cups, the cosmos in disguise,  
A guide that laughs, a joke beneath the skies.

Three dozen plus three, a dozen plus a pair,  
Count the digits in this riddle, answer everywhere.  
In math you’ll find me, in absurdity I grin,  
For what the universe has sought, I lie within.

I hold the truth that lies in endless quest,  
A number born of humor, the very very best.  
So whom shall call the meaning of your life?  
I linger where the jokes begin, the final strife.  

Answer: the number 42, the riddle’s heart and part.<|return|>
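To make the reasoning level easy to switch between runs, the template arguments can be built in one place. This is a sketch under two assumptions: that the gpt-oss chat template reads a top-level `reasoning_effort` variable, and that the supported levels are `low`, `medium`, and `high`. The names `VALID_EFFORTS` and `template_kwargs` are hypothetical.

```python
# Assumed set of supported reasoning levels.
VALID_EFFORTS = ("low", "medium", "high")


def template_kwargs(reasoning_effort="medium"):
    """Build keyword arguments for tokenizer.apply_chat_template."""
    if reasoning_effort not in VALID_EFFORTS:
        raise ValueError(
            f"reasoning_effort must be one of {VALID_EFFORTS}, got {reasoning_effort!r}"
        )
    return {
        "add_generation_prompt": True,
        "return_tensors": "pt",
        "return_dict": True,
        "reasoning_effort": reasoning_effort,
    }


# In the notebook, with `tokenizer`, `model`, and `messages` defined as above:
#   inputs = tokenizer.apply_chat_template(
#       messages, **template_kwargs("high")
#   ).to(model.device)
print(template_kwargs("high"))
```

Validating the level up front gives a clear error for a typo instead of a prompt that silently carries an unexpected value.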