This notebook provides a step-by-step guide to optimizing gpt-oss models with NVIDIA's TensorRT-LLM for high-performance inference. TensorRT-LLM offers an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also includes components for building Python and C++ runtimes that orchestrate inference execution in a performant way.
TensorRT-LLM supports both gpt-oss models:
gpt-oss-20b
gpt-oss-120b
In this guide, we will run gpt-oss-20b. If you want to try the larger model or need more customization, refer to this deployment guide.
Note: Your input prompts should use the harmony response format for the model to work properly; the simple completion prompts used in this guide do not require it (see the chat-template example after the quick-start code below).
You can simplify the environment setup by using NVIDIA Brev. Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.
Once deployed, click the "Open Notebook" button to get started with this guide.
There are multiple ways to install TensorRT-LLM. In this guide, we'll cover using a pre-built Docker container from NVIDIA NGC as well as building from source.
If you're using NVIDIA Brev, you can skip this section.
Pull the pre-built TensorRT-LLM container for GPT-OSS from NVIDIA NGC.
This is the easiest way to get started and ensures all dependencies are included.
Alternatively, you can build the TensorRT-LLM container from source.
This approach is useful if you want to modify the source code or use a custom branch.
For detailed instructions, see the official documentation.
TensorRT-LLM will be available through pip soon
Note on GPU Architecture: The first time you run the model, TensorRT-LLM will build an optimized engine for your specific GPU architecture (e.g., Hopper, Ada, or Blackwell). If you see warnings about your GPU's CUDA capability (e.g., sm_90, sm_120) not being compatible with the PyTorch installation, ensure you have the latest NVIDIA drivers and a matching CUDA Toolkit version for your version of PyTorch.
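As a quick sanity check before loading the model, you can confirm which GPU and CUDA compute capability PyTorch sees. This is a minimal sketch using standard PyTorch calls; it is not required by TensorRT-LLM, just a convenient way to spot driver or toolkit mismatches early.

import torch

# Report the GPU PyTorch can see and its CUDA compute capability.
# For example, (9, 0) corresponds to Hopper (sm_90) and (12, 0) to Blackwell (sm_120).
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU: {name}, compute capability sm_{major}{minor}")
    print(f"PyTorch CUDA version: {torch.version.cuda}")
else:
    print("No CUDA-capable GPU detected; check your NVIDIA driver installation.")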
In the next code cell, we will demonstrate how to use the TensorRT-LLM Python API to:
Download the specified model weights from Hugging Face (using your HF_TOKEN for authentication).
Automatically build the TensorRT engine for your GPU architecture if it does not already exist.
Load the model and prepare it for inference.
Run a simple text generation example to verify everything is working.
Note: The first run may take several minutes as it downloads the model and builds the engine.
Subsequent runs will be much faster, as the engine will be cached.
from tensorrt_llm import LLM, SamplingParams

# Download the weights (if needed), build the engine for this GPU, and load the model.
llm = LLM(model="openai/gpt-oss-20b")

# Simple completion prompts to verify that everything works.
prompts = ["Hello, my name is", "The capital of France is"]

# Sampling configuration: moderate temperature with nucleus sampling.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
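To follow the harmony response format mentioned earlier, one option is to let the model's Hugging Face tokenizer apply its chat template, which renders chat messages into the format the model was trained on. The sketch below assumes the transformers library is installed and that the openai/gpt-oss-20b tokenizer ships such a chat template; treat it as illustrative rather than the only way to build prompts.

from transformers import AutoTokenizer
from tensorrt_llm import SamplingParams

# Render a chat message with the tokenizer's built-in chat template.
# For gpt-oss this is expected to produce a harmony-formatted prompt string.
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
messages = [{"role": "user", "content": "Explain what TensorRT-LLM does in one sentence."}]
chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Reuse the llm object created above to generate a response.
for output in llm.generate([chat_prompt], SamplingParams(temperature=0.8, top_p=0.95)):
    print(output.outputs[0].text)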
Congratulations! You have successfully optimized and run a large language model using the TensorRT-LLM Python API.
In this notebook, you have learned how to:
Set up your environment with the necessary dependencies.
Use the tensorrt_llm.LLM API to download a model from the Hugging Face Hub.
Automatically build a high-performance TensorRT engine tailored to your GPU.
Run inference with the optimized model.
You can explore more advanced features to further improve performance and efficiency:
Benchmarking: Try running a benchmark to compare the latency and throughput of the TensorRT-LLM engine against the original Hugging Face model. You can do this by iterating over a larger number of prompts and measuring the execution time, as in the timing sketch after this list.
Quantization: TensorRT-LLM supports various quantization techniques (such as INT8 or FP8) to reduce model size and accelerate inference with minimal impact on accuracy. This is a powerful feature for deploying models on resource-constrained hardware; a configuration sketch follows this list.
Deploy with NVIDIA Dynamo: For production environments, you can deploy your TensorRT-LLM engine with NVIDIA Dynamo for robust, scalable, multi-model serving.
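For benchmarking, here is a rough timing sketch using only Python's standard library. It reuses the llm object and sampling_params from the quick-start cell, and the batch of repeated prompts is purely hypothetical; it measures wall-clock request throughput, so treat the numbers as indicative rather than a rigorous benchmark.

import time

# Hypothetical micro-benchmark: generate over a batch of prompts and time the run.
bench_prompts = ["Summarize the benefits of GPU inference."] * 32

start = time.perf_counter()
outputs = llm.generate(bench_prompts, sampling_params)
elapsed = time.perf_counter() - start

print(f"{len(bench_prompts)} prompts in {elapsed:.2f}s "
      f"({len(bench_prompts) / elapsed:.2f} requests/s)")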
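For quantization, the sketch below shows one way a quantization algorithm can be requested through the LLM API. The QuantConfig/QuantAlgo import path, the supported algorithms, and the llm_fp8 name are assumptions that may vary across TensorRT-LLM releases (and note that gpt-oss weights are already published in a quantized MXFP4 format), so check the documentation matching your installed version before relying on this.

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo  # assumed import path; may differ by version

# Illustrative example: request FP8 quantization when building the engine (Hopper or newer GPUs).
quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)
llm_fp8 = LLM(model="openai/gpt-oss-20b", quant_config=quant_config)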