Fine-Tuning and Deploying LLaMA 3 with Unsloth

Introduction

So you want to fine-tune LLaMA 3 but don’t want to fight with Hugging Face’s API rate limits, RAM-hungry models, or mysterious CUDA errors? Enter Unsloth, a lightweight library for training and serving models efficiently using 4-bit quantization and LoRA adapters. This guide walks you through fine-tuning a model and deploying it as an inference API using FastAPI.

Prerequisites

You’ll need:

  • A GPU with at least 24GB VRAM, or be ready to tweak batch sizes (a quick VRAM check follows this list)
  • Docker with the NVIDIA Container Toolkit, so containers can access the GPU via --gpus all
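
Before diving in, it’s worth a quick sanity check that PyTorch can actually see your GPU and how much VRAM it reports. A minimal check you can run inside the training container (or any environment that has PyTorch installed):

import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible - check your driver and NVIDIA Container Toolkit setup")

props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GiB")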

1. Preparing the Dataset

We’re working with a dataset of cleaned emails, formatted as JSONL, that teaches the model to emulate someone’s writing style:

  {"instruction": "Write me an email.", "response": "EMAIL GOES HERE"}
  {"instruction": "Write me an email.", "response": "ANOTHER EMAIL GOES HERE"}

Make sure your dataset lives at ./data_in/dataset.jsonl; the training script below raises a FileNotFoundError if it’s missing.
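
If your emails start out as individual plain-text files, a short script like the one below can produce that JSONL file. This is only a sketch - the emails/ folder layout and the fixed instruction string are assumptions you should adapt to your own data:

import json
from pathlib import Path

# Hypothetical layout: one cleaned email per .txt file under ./emails/
EMAIL_DIR = Path("emails")
OUT_PATH = Path("data_in/dataset.jsonl")
OUT_PATH.parent.mkdir(parents=True, exist_ok=True)

with OUT_PATH.open("w", encoding="utf-8") as f:
    for txt_file in sorted(EMAIL_DIR.glob("*.txt")):
        body = txt_file.read_text(encoding="utf-8").strip()
        if not body:
            continue  # skip empty files
        record = {"instruction": "Write me an email.", "response": body}
        f.write(json.dumps(record) + "\n")

print(f"Wrote {OUT_PATH}")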

2. Fine-Tuning LLaMA 3

Training Script (train.py)

import os
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Paths
DATA_IN = "/data/in"
DATA_OUT = "/data/out"
dataset_path = os.path.join(DATA_IN, "dataset.jsonl")

if not os.path.exists(dataset_path):
    raise FileNotFoundError(f"Dataset not found: {dataset_path}")

# Load dataset
dataset = load_dataset("json", data_files={"train": dataset_path}, split="train")

# Format dataset
def format_dataset(example):
    example["text"] = f"Instruction: {example['instruction']}\nResponse: {example['response']}"
    return example

dataset = dataset.map(format_dataset)

# Load model
model_name = "unsloth/llama-3-8b-bnb-4bit"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name, max_seq_length=2048, load_in_4bit=True
)

# Apply LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        warmup_ratio=0.05,
        learning_rate=2e-5,
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        save_strategy="epoch",
        save_total_limit=3,
        output_dir=DATA_OUT,
        optim="adamw_8bit",
        seed=1234,
    ),
)

trainer.train()

After training, your fine-tuned LoRA adapters will be saved as checkpoint folders in ./data_out. Make note of the latest checkpoint folder name and update inference_server.py with it.
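
Because TrainingArguments saves checkpoints as checkpoint-<step> folders, you can also pick the newest one programmatically instead of hard-coding it. A small helper, assuming ./data_out is mounted at /data/out as in the Docker setup below:

import os

def latest_checkpoint(output_dir="/data/out"):
    """Return the checkpoint-<step> folder with the highest step, or None if there are none."""
    checkpoints = [
        d for d in os.listdir(output_dir)
        if d.startswith("checkpoint-") and os.path.isdir(os.path.join(output_dir, d))
    ]
    if not checkpoints:
        return None
    checkpoints.sort(key=lambda d: int(d.split("-")[-1]))
    return os.path.join(output_dir, checkpoints[-1])

print(latest_checkpoint())

Dropping this into inference_server.py in place of the hard-coded folder name is an easy way to avoid editing the file after every training run.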

3. Deploying the Inference API

Inference Server (inference_server.py)

import os
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from unsloth import FastLanguageModel

MODEL_NAME = "unsloth/llama-3-8b-bnb-4bit"
OUTPUT_DIR = "/data/out"

# Load model and tokenizer (Unsloth returns both, so a separate AutoTokenizer call isn't needed)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME, max_seq_length=2048, load_in_4bit=True
)

# Load LoRA adapters
lora_path = os.path.join(OUTPUT_DIR, "LATEST CHECKPOINT FOLDER GOES HERE")
if os.path.exists(lora_path):
    model.load_adapter(lora_path)
else:
    raise FileNotFoundError(f"LoRA checkpoint not found: {lora_path}")

# Prepare for inference; the 4-bit model is already placed on the GPU by from_pretrained,
# so don't call model.to() here (bitsandbytes-quantized models don't support it)
FastLanguageModel.for_inference(model)
device = "cuda" if torch.cuda.is_available() else "cpu"

# FastAPI setup
app = FastAPI(title="LLaMA 3 Inference API", version="1.0")

class TextRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 100

@app.post("/generate")
async def generate_text(request: TextRequest):
    try:
        inputs = tokenizer(request.prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=request.max_new_tokens, do_sample=True, temperature=0.8, top_p=0.9)
            return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/")
def home():
    return {"message": "LLaMA 3 Inference API is running!"}

4. Running Everything with Docker

Trainer Dockerfile (Dockerfile.trainer)

FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

# Set up environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

# Install system dependencies
RUN apt-get update && apt-get install -y \
    python3 python3-pip git wget && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Install PyTorch with CUDA support
RUN pip3 install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install required Python packages
RUN pip3 install --no-cache-dir \
    transformers datasets trl unsloth accelerate peft bitsandbytes

# Create workspace
WORKDIR /app

# Set entrypoint
ENTRYPOINT ["python3", "train.py"]

Inference Dockerfile (Dockerfile.inference)

FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

WORKDIR /app

RUN apt-get update && apt-get install -y python3 python3-pip && \
    apt-get clean && rm -rf /var/lib/apt/lists/* && \
    pip3 install --no-cache-dir fastapi uvicorn torch transformers unsloth

COPY ./inference_server.py /app/inference_server.py

CMD ["uvicorn", "inference_server:app", "--host", "0.0.0.0", "--port", "8000"]

Running with Docker

Training

docker build -t llama_trainer -f Dockerfile.trainer .
docker run --gpus all \
  -v "$(pwd)/data_in:/data/in:rw" \
  -v "$(pwd)/data_out:/data/out:rw" \
  -v "$(pwd)/app:/app" \
  llama_trainer

Inference

docker build -f Dockerfile.inference -t llama3-inference .
docker run --gpus all -v "$(pwd)/data_out:/data/out" -p 8000:8000 llama3-inference
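
Once the container is up, you can smoke-test the API with a few lines of Python (this assumes the requests package is installed on the host):

import requests

payload = {
    "prompt": "Instruction: Write me an email.\nResponse:",
    "max_new_tokens": 200,
}
resp = requests.post("http://localhost:8000/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["response"])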

Conclusion

If you followed this guide, congrats - you now have:

  • A fine-tuned LLaMA 3 model using Unsloth and LoRA.
  • A FastAPI-based inference server running your fine-tuned model.
  • Everything Dockerized, so you don’t have to mess with dependencies.

Now go forth and fine-tune everything - emails, support chatbots, or meme generators.