Deploying OpenAI GPT-OSS Models on AWS
Complete step-by-step guide to deploying OpenAI's open-weight GPT-OSS language models on AWS, serving them as GPU-accelerated APIs with FastAPI, auto-scaling, and cost optimization strategies.
1. Overview: Choosing the Right GPT-OSS Model
OpenAI's GPT-OSS series includes two open-weight models optimized for reasoning tasks and agentic use-cases with full chain-of-thought outputs and tool-use capabilities.
GPT-OSS-20B
21B parameters
Trade-off: Lower resource requirements with faster responses, ideal for real-time services
GPT-OSS-120B
117B parameters
Trade-off: Higher accuracy and reasoning ability, requires powerful GPU hardware
Apache 2.0 License
Full access to weights and code
128k Context Window
Process extensive documents
Chain-of-Thought
Full reasoning outputs visible
Cost-Effective
Run on your own infrastructure
2. AWS Infrastructure Setup (GPU Instance, IAM, VPC)
Configure AWS EC2 GPU instances with proper networking, security, and storage for hosting GPT-OSS models.
For GPT-OSS-20B
g5.xlarge
1× A10G (24GB)
g4dn.xlarge
1× T4 (16GB)
p3.2xlarge
1× V100 (16GB)
For GPT-OSS-120B
p5.48xlarge
8× H100 (80GB)
p4de.24xlarge
8× A100 (80GB)
p4d.24xlarge
8× A100 (40GB)
Setup Checklist
Select GPU Instance
Choose based on model size and budget
Configure Storage
30-200GB EBS for model weights
Setup Networking
VPC, Security Groups, and Load Balancer
Attach IAM Role
Permissions for S3, CloudWatch, SSM
Pro Tip: Use AWS Deep Learning AMI (DLAMI) for Ubuntu or Amazon Linux. It comes pre-installed with NVIDIA GPU drivers, CUDA, and common ML frameworks, saving significant setup time.
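The checklist above can also be scripted. A minimal boto3 sketch for launching a g5.xlarge from a Deep Learning AMI; the AMI, subnet, security group, key pair, and instance-profile names below are hypothetical placeholders that must be replaced with your own region-specific values:
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",                      # DLAMI ID (hypothetical; look up per region)
    InstanceType="g5.xlarge",                             # 1x A10G, suitable for GPT-OSS-20B
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",                                # hypothetical key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],            # allow inbound API/SSH traffic as needed
    SubnetId="subnet-0123456789abcdef0",
    IamInstanceProfile={"Name": "gpt-oss-instance-role"}, # S3, CloudWatch, SSM permissions
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",
        "Ebs": {"VolumeSize": 200, "VolumeType": "gp3"},  # room for model weights
    }],
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "gpt-oss-server"}],
    }],
)
print("Launched:", response["Instances"][0]["InstanceId"])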
3. Environment Setup (CUDA, PyTorch, Libraries)
Configure the EC2 instance with all necessary software, drivers, and Python packages for running GPT-OSS models.
Verify NVIDIA Driver & CUDA
Check GPU recognition and driver installation
nvidia-smi
Create Python Environment
Set up isolated Python virtual environment
sudo apt update && sudo apt install -y python3-venv
python3 -m venv gpt-oss-env
source gpt-oss-env/bin/activate
Install PyTorch with CUDA
Install PyTorch with CUDA 11.8 support
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118
Install ML Libraries
Install Transformers, Accelerate, and MXFP4 kernel support
pip install -U transformers accelerate triton==3.4.0 kernels
Install FastAPI & Uvicorn
Set up API server dependencies
pip install fastapi "uvicorn[standard]"
Verify Installation
CUDA available
python -c "import torch; print('CUDA:', torch.cuda.is_available())"GPU recognized
python -c "import torch; print('GPU:', torch.cuda.get_device_name(0))"Transformers installed
python -c "import transformers; print('Version:', transformers.__version__)"Important: MXFP4 quantization requires NVIDIA Hopper (H100) or newer GPUs for native support. On older GPUs like A100/V100, the model will fall back to higher precision (bf16), requiring significantly more memory.
4. Downloading and Loading GPT-OSS Model
Obtain model weights from Hugging Face Hub and load them into GPU memory with optimized configurations.
Option 1: Auto Download
Let Transformers library handle the download automatically from Hugging Face Hub.
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "openai/gpt-oss-20b" # or "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # Auto precision
    device_map="auto"     # Auto GPU allocation
)
Option 2: Manual Download
Download model weights manually using Git LFS for offline installations.
# Install Git LFS
git lfs install
# Clone repository
git lfs clone https://huggingface.co/openai/gpt-oss-20b
# Load from local path
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto"
)
Verify Model Loading
# Quick inference test
input_ids = tokenizer("Hello, GPT-OSS!", return_tensors="pt").to(model.device)
output_ids = model.generate(**input_ids, max_new_tokens=5)
print(tokenizer.decode(output_ids[0]))
This should produce a continuation of the prompt, confirming the model is loaded and functional.
GPT-OSS-20B Memory
~15-16 GB
VRAM usage with 4-bit quantization
GPT-OSS-120B Memory
~60-70 GB
VRAM usage with 4-bit quantization (fits 80GB GPU)
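To confirm these memory figures on your own hardware, VRAM usage can be inspected right after the model load, in the same Python session (a quick check, not a precise profiler):
import torch

# Run after loading the model to see how much VRAM the weights actually occupy
allocated = torch.cuda.memory_allocated(0) / 1024**3   # tensors currently allocated
reserved = torch.cuda.memory_reserved(0) / 1024**3     # memory held by the caching allocator
total = torch.cuda.get_device_properties(0).total_memory / 1024**3

print(f"Allocated: {allocated:.1f} GB | Reserved: {reserved:.1f} GB | GPU total: {total:.1f} GB")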
5. Setting Up API Server with FastAPI
Expose the loaded GPT-OSS model via a high-performance REST API using FastAPI and Uvicorn.
FastAPI App (app.py)
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline
class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    messages: list[Message]
    max_tokens: int | None = 256
    temperature: float | None = 0.7

class ChatResponse(BaseModel):
    generated_text: str

app = FastAPI()

print("Loading model...")
generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto"
)
@app.post("/chat", response_model=ChatResponse)
async def generate_chat(request: ChatRequest):
messages = [{"role": msg.role, "content": msg.content} for msg in request.messages]
params = {}
if request.max_tokens is not None:
params["max_new_tokens"] = request.max_tokens
if request.temperature is not None:
params["temperature"] = request.temperature
outputs = generator(messages, **params)
generated_text = outputs[0]["generated_text"]
return {"generated_text": generated_text}Start the Server
uvicorn app:app --host 0.0.0.0 --port 8000
This starts FastAPI on port 8000, accessible via the EC2 instance's public IP. The model loads once at startup.
Test the API
curl -X POST "http://EC2_PUBLIC_DNS:8000/chat" \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Hello GPT-OSS, how are you?"}]}'
Alternative Deployment Options
Transformers Serve
Built-in OpenAI-compatible server
vLLM
High-throughput with continuous batching
NVIDIA Triton
Multi-framework production server
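For programmatic access to the FastAPI /chat endpoint defined above, a minimal Python client sketch using the requests library; the host and prompt values are illustrative:
import requests

# Replace with the EC2 public DNS or load balancer address (placeholder below)
API_URL = "http://EC2_PUBLIC_DNS:8000/chat"

payload = {
    "messages": [{"role": "user", "content": "Summarize the benefits of GPU inference."}],
    "max_tokens": 128,
    "temperature": 0.7,
}

resp = requests.post(API_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["generated_text"])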
6. Optimizing GPU Inference Performance
Maximize throughput and minimize latency with proven optimization techniques for large language model inference.
36×
CPU vs GPU Throughput
Higher inference speed with GPU
40×
Real-Time Capacity
More low-latency requests served
80%
Target GPU Utilization
Optimal sustained usage
Model Quantization
Use TensorRT to convert FP32 to INT8 precision
Dynamic Batching
Aggregate multiple requests before GPU processing (see the batching sketch after this list)
Memory Management
Optimize model loading and caching strategies
Profiling & Monitoring
Use Nsight Systems and CloudWatch metrics
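Dynamic batching can be approximated at the application layer even without Triton or vLLM. A minimal asyncio micro-batching sketch that groups incoming prompts for a short window before a single pipeline call; the batch size and wait time are hypothetical tuning values, and the prompts are plain strings rather than chat messages to keep output handling simple:
import asyncio

class MicroBatcher:
    """Collects prompts for a short window and runs them through the pipeline as one batch."""

    def __init__(self, generator, max_batch_size=8, max_wait_ms=25):
        self.generator = generator          # e.g. the transformers text-generation pipeline
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait_ms / 1000
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str) -> str:
        # Called from request handlers; resolves once the batch containing this prompt finishes
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def run(self):
        # Background task: gather requests for up to max_wait, then run one batched generation
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.max_batch_size:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout=remaining))
                except asyncio.TimeoutError:
                    break

            prompts = [p for p, _ in batch]
            # Run the blocking pipeline call in a worker thread so the event loop stays responsive
            outputs = await asyncio.to_thread(self.generator, prompts, max_new_tokens=128)
            for (_, fut), out in zip(batch, outputs):
                # For a list of string prompts, the pipeline returns one list of candidates per prompt
                fut.set_result(out[0]["generated_text"])
The run() coroutine would be started once at application startup (for example with asyncio.create_task), and request handlers would await submit() instead of calling the pipeline directly.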
7. Auto-Scaling and Load Balancing
Build a production-ready, scalable infrastructure that handles variable load and ensures high availability.
Application Load Balancer
Distribute traffic across multiple EC2 instances
HTTPS listener, health checks, automatic failover
Auto Scaling Group
Dynamically adjust capacity based on demand
Scale on GPU utilization, latency, or request count (see the policy sketch after these cards)
High Availability
Run instances across multiple Availability Zones
Resilience against data center failures
Managed Alternatives
Consider SageMaker or Bedrock for simplified ops
Automated scaling and infrastructure management
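To scale on GPU utilization rather than CPU, the Auto Scaling Group can target-track a custom CloudWatch metric. A boto3 sketch, assuming a hypothetical ASG name and the custom GPUUtilization metric published by the sketch after the best-practices list below:
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Keep average GPU utilization around 80% across the group (names below are hypothetical)
autoscaling.put_scaling_policy(
    AutoScalingGroupName="gpt-oss-asg",
    PolicyName="gpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "CustomizedMetricSpecification": {
            "MetricName": "GPUUtilization",
            "Namespace": "GPTOSS/Inference",
            "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "gpt-oss-asg"}],
            "Statistic": "Average",
            "Unit": "Percent",
        },
        "TargetValue": 80.0,
    },
)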
Scaling Best Practices
- Stateless Design: Keep API stateless so any instance can handle any request
- Predictive Scaling: Anticipate traffic spikes and scale proactively (model loading takes time)
- Custom Metrics: Use CloudWatch to monitor GPU utilization and trigger scaling (a publisher sketch follows this list)
- Baseline Capacity: Keep 1-2 instances always running to handle regular traffic
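The custom GPU metric itself can be published by a small sidecar process on each instance. A sketch using nvidia-smi and boto3; the namespace and dimension names are hypothetical and must match the scaling policy above:
import subprocess
import time

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def gpu_utilization_percent() -> float:
    # nvidia-smi reports one utilization value per GPU; average across them
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        text=True,
    )
    values = [float(line) for line in out.strip().splitlines()]
    return sum(values) / len(values)

while True:
    cloudwatch.put_metric_data(
        Namespace="GPTOSS/Inference",
        MetricData=[{
            "MetricName": "GPUUtilization",
            "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "gpt-oss-asg"}],
            "Value": gpu_utilization_percent(),
            "Unit": "Percent",
        }],
    )
    time.sleep(60)   # one data point per minute is enough for scaling decisions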
8. Cost Management and Monitoring
Optimize spending and monitor usage to keep GPU infrastructure costs under control while maintaining performance.
Right-Size Instances
- Start with g5.xlarge for 20B model (~$1/hr)
- Use 120B only when accuracy demands justify cost
- Avoid over-provisioning GPU resources
Leverage Spot Instances
- 70-90% discount for fault-tolerant workloads
- Use for batch processing or dev/test
- Implement checkpointing for interruptions
Reserved Capacity
- 1-3 year terms for steady workloads
- Save 50-72% vs on-demand pricing
- Savings Plans offer flexibility
Auto-Scaling & Monitoring
- Scale down during idle periods
- Monitor GPU utilization metrics
- Set up cost alerts and budgets
Essential Monitoring Metrics
- GPU Utilization: target 70-90% (via CloudWatch Agent)
- Response Latency: target <100ms (via application metrics)
- Cost per Request: track trends over time (via AWS Cost Explorer)
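These metrics can also drive alerting. A boto3 sketch that raises a CloudWatch alarm when the fleet sits nearly idle for an hour, so you can scale in or stop instances; the threshold, alarm name, and SNS topic ARN are hypothetical:
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alert (e.g. via an SNS topic) when average GPU utilization stays under 10% for an hour
cloudwatch.put_metric_alarm(
    AlarmName="gpt-oss-idle-gpus",
    Namespace="GPTOSS/Inference",
    MetricName="GPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "gpt-oss-asg"}],
    Statistic="Average",
    Period=300,                 # 5-minute datapoints
    EvaluationPeriods=12,       # 12 x 5 min = 1 hour
    Threshold=10.0,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:gpt-oss-alerts"],  # hypothetical topic
)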
Key Takeaways
Cost-Effective GPU Acceleration
Deploy open-weight GPT-OSS models on AWS with 83% cost savings vs proprietary alternatives
Production-Ready Performance
Achieve sub-100ms response latencies at high throughput using optimized GPU instances
Flexible & Scalable
Auto-scale infrastructure from development to enterprise production with full control