Fine-Tuning and Deploying LLaMA 3 with Unsloth
Written on February 28th, 2025 by Cody Snider
Introduction
So you want to fine-tune LLaMA 3 but don’t want to fight with Hugging Face’s API rate limits, RAM-hungry models, or mysterious CUDA errors? Enter Unsloth, a lightweight library that trains and serves models efficiently using 4-bit quantization and LoRA adapters. This guide walks you through fine-tuning a model and deploying it as an inference API using FastAPI.
Prerequisites
You’ll need:
- A GPU with at least 24GB of VRAM, or be ready to shrink the batch size (see the quick check after this list)
- Docker with the NVIDIA Container Toolkit (needed for the --gpus all flag)
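Not sure what your GPU has? Here’s a minimal sketch, assuming PyTorch is already installed on the host, that reports the card name and total VRAM:
import torch

# Quick sanity check: confirm CUDA is visible and report total VRAM.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA device found - check your driver and NVIDIA Container Toolkit setup.")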
1. Preparing the Dataset
We’re working with a dataset of cleaned emails, formatted as JSONL, to emulate someone’s writing style:
{"instruction": "Write me an email.", "response": "EMAIL GOES HERE"}
{"instruction": "Write me an email.", "response": "ANOTHER EMAIL GOES HERE"}
Make sure your dataset lives at ./data_in/dataset.jsonl. If it’s missing, the training script below raises a FileNotFoundError.
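Your raw data probably isn’t in this shape yet. Here’s a minimal sketch that builds dataset.jsonl from a folder of plain-text emails; the ./emails directory and the fixed instruction string are illustrative assumptions, not part of the pipeline above:
import json
from pathlib import Path

# Hypothetical input: one email per .txt file in ./emails
out_path = Path("data_in/dataset.jsonl")
out_path.parent.mkdir(parents=True, exist_ok=True)

with out_path.open("w", encoding="utf-8") as f:
    for email_file in sorted(Path("emails").glob("*.txt")):
        record = {
            "instruction": "Write me an email.",
            "response": email_file.read_text(encoding="utf-8").strip(),
        }
        f.write(json.dumps(record) + "\n")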
2. Fine-Tuning LLaMA 3
Training Script (train.py)
import os
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
# Paths
DATA_IN = "/data/in"
DATA_OUT = "/data/out"
dataset_path = os.path.join(DATA_IN, "dataset.jsonl")
if not os.path.exists(dataset_path):
    raise FileNotFoundError(f"Dataset not found: {dataset_path}")
# Load dataset
dataset = load_dataset("json", data_files={"train": dataset_path}, split="train")
# Format dataset
def format_dataset(example):
    example["text"] = f"Instruction: {example['instruction']}\nResponse: {example['response']}"
    return example
dataset = dataset.map(format_dataset)
# Load model
model_name = "unsloth/llama-3-8b-bnb-4bit"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name, max_seq_length=2048, load_in_4bit=True
)
# Apply LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)
# Configure supervised fine-tuning
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        warmup_ratio=0.05,
        learning_rate=2e-5,
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        save_strategy="epoch",
        save_total_limit=3,
        output_dir=DATA_OUT,
        optim="adamw_8bit",
        seed=1234,
    ),
)
trainer.train()
After training, your fine-tuned LoRA adapters will be saved in ./data_out. Make note of the latest checkpoint folder name and update inference_server.py with it.
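If you’re not sure which checkpoint is the latest, a quick sketch like this picks the highest-numbered checkpoint-&lt;step&gt; folder (the default naming used by the Hugging Face Trainer):
from pathlib import Path

# List checkpoint folders produced by the trainer and pick the highest step number.
checkpoints = sorted(
    Path("data_out").glob("checkpoint-*"),
    key=lambda p: int(p.name.split("-")[-1]),
)
print("Latest checkpoint:", checkpoints[-1].name if checkpoints else "none found")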
3. Deploying the Inference API
Inference Server (inference_server.py)
import os
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from unsloth import FastLanguageModel
from transformers import AutoTokenizer
MODEL_NAME = "unsloth/llama-3-8b-bnb-4bit"
OUTPUT_DIR = "/data/out"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model, _ = FastLanguageModel.from_pretrained(model_name=MODEL_NAME, max_seq_length=2048, load_in_4bit=True)
# Load LoRA adapters
lora_path = os.path.join(OUTPUT_DIR, "LATEST CHECKPOINT FOLDER GOES HERE")
if os.path.exists(lora_path):
    model.load_adapter(lora_path)
else:
    raise FileNotFoundError(f"LoRA checkpoint not found: {lora_path}")
FastLanguageModel.for_inference(model)
# Track the device for tokenized inputs; the 4-bit model is already placed on the GPU by Unsloth,
# and calling .to() on a bitsandbytes-quantized model is not supported.
device = "cuda" if torch.cuda.is_available() else "cpu"
# FastAPI setup
app = FastAPI(title="LLaMA 3 Inference API", version="1.0")
class TextRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 100
@app.post("/generate")
async def generate_text(request: TextRequest):
    try:
        inputs = tokenizer(request.prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=request.max_new_tokens, do_sample=True, temperature=0.8, top_p=0.9)
        return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
@app.get("/")
def home():
    return {"message": "LLaMA 3 Inference API is running!"}
4. Running Everything with Docker
Trainer Dockerfile (Dockerfile.trainer)
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
# Set up environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
# Install system dependencies
RUN apt-get update && apt-get install -y \
python3 python3-pip git wget && \
apt-get clean && rm -rf /var/lib/apt/lists/*
# Install PyTorch with CUDA support
RUN pip3 install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install required Python packages
RUN pip3 install --no-cache-dir \
transformers datasets trl unsloth accelerate peft bitsandbytes
# Create workspace
WORKDIR /app
# Set entrypoint
ENTRYPOINT ["python3", "train.py"]
Inference Dockerfile (Dockerfile.inference)
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
WORKDIR /app
RUN apt-get update && apt-get install -y python3 python3-pip && \
    apt-get clean && rm -rf /var/lib/apt/lists/* && \
    pip3 install --no-cache-dir fastapi uvicorn torch transformers unsloth
COPY ./inference_server.py /app/inference_server.py
CMD ["uvicorn", "inference_server:app", "--host", "0.0.0.0", "--port", "8000"]
Running with Docker
Training
docker build -t llama_trainer -f Dockerfile.trainer .
docker run --gpus all \
-v "$(pwd)/data_in:/data/in:rw" \
-v "$(pwd)/data_out:/data/out:rw" \
-v "$(pwd)/app:/app" \
llama_trainer
Inference
docker build -f Dockerfile.inference -t llama3-inference .
docker run --gpus all -v "$(pwd)/data_out:/data/out" -p 8000:8000 llama3-inference
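With the container running, you can smoke-test the API from any HTTP client. Here’s a minimal Python example using only the standard library; note that the prompt mirrors the Instruction:/Response: format used during training:
import json
import urllib.request

# Smoke-test the /generate endpoint on localhost:8000.
payload = {
    "prompt": "Instruction: Write me an email.\nResponse:",
    "max_new_tokens": 200,
}
req = urllib.request.Request(
    "http://localhost:8000/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])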
Conclusion
If you followed this guide, congrats - you now have:
- A fine-tuned LLaMA 3 model using Unsloth and LoRA.
- A FastAPI-based inference server running your fine-tuned model.
- A Dockerized setup, so you don’t have to mess with dependencies.
Now go forth and fine-tune everything - emails, support chatbots, or meme generators.