AI/ML Deployment
Featured Guide

Deploying OpenAI GPT-OSS Models on AWS

Complete step-by-step guide to deploying OpenAI's open-weight GPT-OSS language models on AWS, serving them as GPU-accelerated APIs with FastAPI, auto-scaling, and cost optimization strategies.

Ensar Solutions Team
January 20, 2024
18 min read
Tags: OpenAI, GPT-OSS, AWS, GPU, FastAPI, Auto-Scaling, MLOps

1. Overview: Choosing the Right GPT-OSS Model

OpenAI's GPT-OSS series includes two open-weight models optimized for reasoning tasks and agentic use-cases with full chain-of-thought outputs and tool-use capabilities.

GPT-OSS-20B (21B parameters)

  • Memory Required: ~16 GB GPU memory
  • Precision: MXFP4 (4-bit)
  • Best For: Lower latency, lightweight deployment
  • Performance: On par with o3-mini
  • Trade-off: Lower resource requirements with faster responses; ideal for real-time services

GPT-OSS-120B (117B parameters)

  • Memory Required: ~60-80 GB GPU memory
  • Precision: MXFP4 (4-bit)
  • Best For: Maximum accuracy, high reasoning quality
  • Performance: Comparable to o4-mini
  • Trade-off: Higher accuracy and reasoning ability; requires powerful GPU hardware

Shared features of both models:

  • Apache 2.0 License: Full access to weights and code
  • 128k Context Window: Process extensive documents
  • Chain-of-Thought: Full reasoning outputs visible
  • Cost-Effective: Run on your own infrastructure

2. AWS Infrastructure Setup (GPU Instance, IAM, VPC)

Configure AWS EC2 GPU instances with proper networking, security, and storage for hosting GPT-OSS models.

For GPT-OSS-20B

  • g5.xlarge (Recommended): 1× A10G (24 GB), $1.00/hr
  • g4dn.xlarge: 1× T4 (16 GB), $0.53/hr
  • p3.2xlarge: 1× V100 (16 GB), $3.06/hr

For GPT-OSS-120B

  • p5.48xlarge (Recommended): 8× H100 (80 GB), $98/hr
  • p4de.24xlarge (Recommended): 8× A100 (80 GB), $27.45/hr
  • p4d.24xlarge: 8× A100 (40 GB), $23/hr

Setup Checklist

  • Select GPU Instance: Choose based on model size and budget
  • Configure Storage: 30-200 GB EBS for model weights
  • Setup Networking: VPC, Security Groups, and Load Balancer
  • Attach IAM Role: Permissions for S3, CloudWatch, and SSM

Pro Tip: Use AWS Deep Learning AMI (DLAMI) for Ubuntu or Amazon Linux. It comes pre-installed with NVIDIA GPU drivers, CUDA, and common ML frameworks, saving significant setup time.
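If you prefer to script the checklist above, a rough boto3 sketch is shown below; the AMI ID, key pair, security group, and instance profile names are placeholders to replace with your own values, and the root device name depends on the AMI you choose.

# Hedged sketch: launch a g5.xlarge GPU instance from a Deep Learning AMI.
# All IDs/names below are placeholders, not values from this guide.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",            # DLAMI ID for your region
    InstanceType="g5.xlarge",                   # 1x A10G, suits GPT-OSS-20B
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",                      # existing EC2 key pair
    SecurityGroupIds=["sg-xxxxxxxxxxxxxxxxx"],  # allow SSH and the API port
    IamInstanceProfile={"Name": "gpt-oss-instance-role"},  # S3/CloudWatch/SSM
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",              # root device; varies by AMI
        "Ebs": {"VolumeSize": 200, "VolumeType": "gp3"},  # room for weights
    }],
)
print("Launched:", response["Instances"][0]["InstanceId"])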

3. Environment Setup (CUDA, PyTorch, Libraries)

Configure the EC2 instance with all necessary software, drivers, and Python packages for running GPT-OSS models.

Step 1: Verify NVIDIA Driver & CUDA
Check GPU recognition and driver installation:

nvidia-smi

Step 2: Create Python Environment
Set up an isolated Python virtual environment:

sudo apt update && sudo apt install -y python3-venv
python3 -m venv gpt-oss-env
source gpt-oss-env/bin/activate

Step 3: Install PyTorch with CUDA
Install PyTorch with CUDA 11.8 support:

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118

Step 4: Install ML Libraries
Install Transformers, Accelerate, and MXFP4 kernel support:

pip install -U transformers accelerate triton==3.4.0 kernels

Step 5: Install FastAPI & Uvicorn
Set up the API server dependencies:

pip install fastapi "uvicorn[standard]"

Verify Installation

CUDA available

python -c "import torch; print('CUDA:', torch.cuda.is_available())"

GPU recognized

python -c "import torch; print('GPU:', torch.cuda.get_device_name(0))"

Transformers installed

python -c "import transformers; print('Version:', transformers.__version__)"

Important: MXFP4 quantization requires NVIDIA Hopper (H100) or newer GPUs for native support. On older GPUs like A100/V100, the model will fall back to higher precision (bf16), requiring significantly more memory.
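If you are unsure what your GPU supports, a quick check of the CUDA compute capability before loading can save a failed or unexpectedly large load; Hopper reports 9.0 or higher. This is just a convenience check, not part of the official setup.

# Hopper (H100) and newer report compute capability (9, 0) or above.
import torch

major, minor = torch.cuda.get_device_capability(0)
if (major, minor) >= (9, 0):
    print("Native MXFP4 kernels supported on this GPU.")
else:
    print("Pre-Hopper GPU: expect a bf16 fallback and higher memory use.")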

4. Downloading and Loading GPT-OSS Model

Obtain model weights from Hugging Face Hub and load them into GPU memory with optimized configurations.

Option 1: Auto Download

Let Transformers library handle the download automatically from Hugging Face Hub.

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "openai/gpt-oss-20b"  # or "openai/gpt-oss-120b"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # Auto precision
    device_map="auto"    # Auto GPU allocation
)

Option 2: Manual Download

Download model weights manually using Git LFS for offline installations.

# Install Git LFS
git lfs install

# Clone repository
git lfs clone https://huggingface.co/openai/gpt-oss-20b

# Load tokenizer and model from the local path
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/path/to/gpt-oss-20b")
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto"
)

Verify Model Loading

# Quick inference test
input_ids = tokenizer("Hello, GPT-OSS!", return_tensors="pt").to(model.device)
output_ids = model.generate(**input_ids, max_new_tokens=5)
print(tokenizer.decode(output_ids[0]))

This should produce a continuation of the prompt, confirming the model is loaded and functional.
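For chat-style prompts, it is usually better to go through the tokenizer's chat template rather than raw text; a minimal sketch, assuming the tokenizer and model loaded above:

# Chat-formatted generation; reuses `tokenizer` and `model` from above.
messages = [{"role": "user", "content": "Explain MXFP4 quantization in one sentence."}]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn marker
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))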

GPT-OSS-20B Memory: ~15-16 GB VRAM with 4-bit quantization
GPT-OSS-120B Memory: ~60-70 GB VRAM with 4-bit quantization (fits on a single 80 GB GPU)
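To confirm the loaded model actually fits within these budgets on your instance, you can inspect the footprint after loading; this is a rough check and exact numbers vary by driver and library versions.

# Rough memory check; assumes `model` has already been loaded as above.
import torch

print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
print(f"GPU memory allocated: {torch.cuda.memory_allocated(0) / 1e9:.1f} GB")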

5. Setting Up API Server with FastAPI

Expose the loaded GPT-OSS model via a high-performance REST API using FastAPI and Uvicorn.

FastAPI App (app.py)

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    messages: list[Message]
    max_tokens: int | None = 256
    temperature: float | None = 0.7

class ChatResponse(BaseModel):
    generated_text: str

app = FastAPI()

print("Loading model...")
generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto"
)

@app.post("/chat", response_model=ChatResponse)
def generate_chat(request: ChatRequest):
    # Plain `def` (not async) so FastAPI runs this blocking call in a worker
    # thread instead of blocking the event loop.
    messages = [{"role": msg.role, "content": msg.content} for msg in request.messages]

    params = {}
    if request.max_tokens is not None:
        params["max_new_tokens"] = request.max_tokens
    if request.temperature is not None:
        params["temperature"] = request.temperature
        params["do_sample"] = True  # temperature only applies when sampling

    outputs = generator(messages, **params)
    # With chat-style input the pipeline returns the whole conversation;
    # the assistant's reply is the last message.
    generated_text = outputs[0]["generated_text"][-1]["content"]

    return {"generated_text": generated_text}

Start the Server

uvicorn app:app --host 0.0.0.0 --port 8000

This starts FastAPI on port 8000, reachable via the EC2 instance's public IP or DNS (make sure the security group allows inbound traffic on port 8000). The model loads once at startup.

Test the API

curl -X POST "http://EC2_PUBLIC_DNS:8000/chat" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello GPT-OSS, how are you?"}]}'

Alternative Deployment Options

  • Transformers Serve: Built-in OpenAI-compatible server
  • vLLM: High-throughput serving with continuous batching
  • NVIDIA Triton: Multi-framework production inference server
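If you go with one of the OpenAI-compatible servers above (for example vLLM or Transformers Serve), existing OpenAI SDK code can simply point at your endpoint. A hedged sketch, assuming such a server is listening locally on port 8000:

# Assumes an OpenAI-compatible server is serving openai/gpt-oss-20b at
# localhost:8000; the api_key is a dummy value since auth is handled locally.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Give me one use case for GPT-OSS-20B."}],
    max_tokens=128,
)
print(response.choices[0].message.content)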

6. Optimizing GPU Inference Performance

Maximize throughput and minimize latency with proven optimization techniques for large language model inference.

  • 36× CPU vs GPU Throughput: higher inference speed with GPU
  • 40× Real-Time Capacity: more low-latency requests served
  • 80% Target GPU Utilization: optimal sustained usage

  • Model Quantization: Use TensorRT to convert FP32 to INT8 precision (2-3× throughput improvement)
  • Dynamic Batching: Aggregate multiple requests before GPU processing for significantly improved GPU utilization (see the sketch after this list)
  • Memory Management: Optimize model loading and caching strategies to eliminate cold-start latency
  • Profiling & Monitoring: Use Nsight Systems and CloudWatch metrics to identify and eliminate bottlenecks
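Dynamic batching can be approximated at the application level with a small asyncio worker that coalesces concurrent requests before each GPU call. The sketch below is illustrative only (serving stacks like vLLM or Triton handle this far more robustly); it reuses the `generator` pipeline from Section 5, and the batch size and wait window are placeholder values.

# Illustrative micro-batching worker; assumes `generator` from Section 5.
import asyncio

MAX_BATCH_SIZE = 8       # max chats per GPU forward pass (placeholder)
MAX_WAIT_SECONDS = 0.02  # max time to wait for the batch to fill (placeholder)

request_queue: asyncio.Queue = asyncio.Queue()

async def batching_worker():
    loop = asyncio.get_running_loop()
    while True:
        # Wait for the first request, then gather more until the batch is
        # full or the wait window expires.
        batch = [await request_queue.get()]
        deadline = loop.time() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(request_queue.get(), remaining))
            except asyncio.TimeoutError:
                break

        chats = [chat for chat, _ in batch]
        # Run the blocking pipeline call in a worker thread so the event loop
        # stays responsive; batch_size lets the GPU process the chats together.
        results = await asyncio.to_thread(
            generator, chats, max_new_tokens=256, batch_size=len(chats)
        )
        for (_, future), result in zip(batch, results):
            # The last message of the returned conversation is the reply.
            future.set_result(result[0]["generated_text"][-1]["content"])

async def generate_batched(chat: list[dict]) -> str:
    # Called from the endpoint: enqueue the chat and await its batched result.
    future = asyncio.get_running_loop().create_future()
    await request_queue.put((chat, future))
    return await future

# Start the worker once at application startup, e.g.
# asyncio.create_task(batching_worker()) in a FastAPI startup hook.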

7. Auto-Scaling and Load Balancing

Build a production-ready, scalable infrastructure that handles variable load and ensures high availability.

  • Application Load Balancer: Distribute traffic across multiple EC2 instances (HTTPS listener, health checks, automatic failover)
  • Auto Scaling Group: Dynamically adjust capacity based on demand; scale on GPU utilization, latency, or request count
  • High Availability: Run instances across multiple Availability Zones for resilience against data center failures
  • Managed Alternatives: Consider SageMaker or Bedrock for simplified ops, with automated scaling and infrastructure management

Scaling Best Practices

  • Stateless Design: Keep API stateless so any instance can handle any request
  • Predictive Scaling: Anticipate traffic spikes and scale proactively (model loading takes time)
  • Custom Metrics: Use CloudWatch to monitor GPU utilization and trigger scaling (a publishing sketch follows this list)
  • Baseline Capacity: Keep 1-2 instances always running to handle regular traffic
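CloudWatch does not emit GPU utilization for EC2 out of the box (the CloudWatch Agent can collect NVIDIA GPU metrics, or you can publish your own). A hedged sketch of the do-it-yourself route is below; the namespace, metric name, and instance ID are placeholders, and the instance role must allow cloudwatch:PutMetricData.

# Publish average GPU utilization as a custom CloudWatch metric.
import subprocess
import time

import boto3

cloudwatch = boto3.client("cloudwatch")
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder: this instance's ID

def gpu_utilization_percent() -> float:
    # nvidia-smi prints one utilization value per GPU, e.g. "37"
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    values = [float(line) for line in out.strip().splitlines()]
    return sum(values) / len(values)

while True:
    cloudwatch.put_metric_data(
        Namespace="GPTOSS/Inference",  # custom namespace (assumption)
        MetricData=[{
            "MetricName": "GPUUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": INSTANCE_ID}],
            "Value": gpu_utilization_percent(),
            "Unit": "Percent",
        }],
    )
    time.sleep(60)  # publish once per minute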

8. Cost Management and Monitoring

Optimize spending and monitor usage to keep GPU infrastructure costs under control while maintaining performance.

Right-Size Instances

Match capacity to actual needs
  • Start with g5.xlarge for 20B model (~$1/hr)
  • Use 120B only when accuracy demands justify cost
  • Avoid over-provisioning GPU resources

Leverage Spot Instances

Up to 90% cost reduction
  • 70-90% discount for fault-tolerant workloads
  • Use for batch processing or dev/test
  • Implement checkpointing for interruptions

Reserved Capacity

~$0.15-0.20/hr effective cost
  • 1-3 year terms for steady workloads
  • Save 50-72% vs on-demand pricing
  • Savings Plans offer flexibility

Auto-Scaling & Monitoring

Eliminate idle resource waste
  • Scale down during idle periods
  • Monitor GPU utilization metrics
  • Set up cost alerts and budgets

Essential Monitoring Metrics

  • GPU Utilization: target 70-90% (CloudWatch Agent)
  • Response Latency: target <100 ms (application metrics)
  • Cost per Request: track trends (AWS Cost Explorer)
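For the cost-per-request trend, a back-of-the-envelope calculation is often enough to sanity-check what Cost Explorer reports; the figures below are illustrative assumptions, not measurements from this guide.

# Rough cost-per-request estimate with placeholder numbers.
HOURLY_INSTANCE_COST = 1.00  # e.g. g5.xlarge on-demand, USD/hr
REQUESTS_PER_HOUR = 3600     # assumed sustained throughput (1 request/sec)

cost_per_request = HOURLY_INSTANCE_COST / REQUESTS_PER_HOUR
print(f"Estimated cost per request: ${cost_per_request:.5f}")  # ~$0.00028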

Key Takeaways

  • Cost-Effective GPU Acceleration: Deploy open-weight GPT-OSS models on AWS with 83% cost savings vs proprietary alternatives
  • Production-Ready Performance: Achieve low-latency, high-throughput inference using optimized GPU instances, batching, and quantization
  • Flexible & Scalable: Auto-scale infrastructure from development to enterprise production with full control
