Containerized Voice Identification with Resemblyzer & QdrantDB

Introduction

In this guide, we build a containerized speaker identification system using Resemblyzer for voice embeddings, QdrantDB for vector storage, and a FastAPI interface that exposes both over HTTP. If you’re tired of dependency hell and want a setup that runs the same way wherever Docker does, read on.

System Overview

The system consists of two primary endpoints:

  • /upload_identity: Upload known audio clips (MP3/WAV) to register identities.
  • /identify: Submit an audio sample to identify the speaker against the stored identities.

Persistent data is maintained under /data/identities within the container.
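How clips are arranged under /data/identities is not spelled out here. The following is a minimal sketch, assuming one subdirectory per identity and a random file name per clip. The helper names (safe_identity_name, identity_clip_path) and the layout are hypothetical and are not taken from the project.

```python
import re
import uuid
from pathlib import Path

# Root directory named in the guide; the layout below it is an assumption.
IDENTITIES_ROOT = Path("/data/identities")
ALLOWED_SUFFIXES = {".wav", ".mp3"}


def safe_identity_name(name: str) -> str:
    """Reduce a user-supplied identity name to a filesystem-safe slug."""
    slug = re.sub(r"[^A-Za-z0-9_-]+", "_", name.strip()).strip("_")
    if not slug:
        raise ValueError("identity name has no usable characters")
    return slug


def identity_clip_path(name: str, original_filename: str,
                       root: Path = IDENTITIES_ROOT) -> Path:
    """Build a unique path such as <root>/<identity>/<random hex>.mp3."""
    suffix = Path(original_filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {suffix or 'none'}")
    # Only the extension of the uploaded name is reused, never the name itself.
    return root / safe_identity_name(name) / f"{uuid.uuid4().hex}{suffix}"


if __name__ == "__main__":
    print(identity_clip_path("Ronaldo", "Ronaldo.mp3"))
```

Deriving the path from a sanitized identity name and a generated file name keeps a crafted upload name from writing outside the data directory.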

Building the Container

The Dockerfile below sets up the environment with necessary dependencies like ffmpeg for audio processing and build-essential for compiling C extensions.

FROM python:3.9-slim

RUN apt-get update && apt-get install -y ffmpeg build-essential && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY main.py .

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

The accompanying docker-compose.yml file integrates QdrantDB:

version: '3.8'
services:
  api:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - data:/data
    depends_on:
      - qdrant
  qdrant:
    image: qdrant/qdrant:latest
    environment:
      - QDRANT__STORAGE__STORAGE_PATH=/data/qdrant
    volumes:
      - data:/data
    ports:
      - "6333:6333"
volumes:
  data:
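Note that depends_on in this form only controls start order: the api container can start before Qdrant is accepting connections. A minimal sketch of a startup wait, assuming the API reaches Qdrant at the compose service name qdrant on port 6333 (both taken from the file above). The helpers port_is_open and wait_until_ready are hypothetical and not part of the project.

```python
import socket
import time
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(check: Callable[[], bool], attempts: int = 30,
                     delay: float = 1.0) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # "qdrant" resolves through the compose network; 6333 is the port it exposes.
    if not wait_until_ready(lambda: port_is_open("qdrant", 6333)):
        raise SystemExit("Qdrant did not become reachable in time")
```

An open port shows only that the server is listening, not that a collection exists, so creating the collection would still need to happen after this check.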

FastAPI Interface and Endpoint Details

The API leverages Resemblyzer to preprocess audio files and generate 256-dimensional embeddings. These embeddings are stored in QdrantDB, enabling rapid similarity searches.
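To make "similarity search" concrete, here is a small pure-Python sketch of ranking stored embeddings against a query by cosine similarity. It assumes the collection is configured for cosine distance, which is not stated in this guide, and it uses 2-dimensional toy vectors in place of real 256-dimensional embeddings. Qdrant answers the same question with an index rather than the linear scan shown here, and the function names are illustrative only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_by_similarity(query: Sequence[float],
                       stored: Dict[str, Sequence[float]],
                       top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy vectors standing in for real speaker embeddings.
    known = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
    print(rank_by_similarity([0.9, 0.1], known, top_k=2))
```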

Below is the /identify endpoint with execution time logging:

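import os
import shutil
import tempfile
import time

from fastapi import File, HTTPException, UploadFile
from fastapi.responses import JSONResponse
from resemblyzer import preprocess_wav

# app, encoder, qdrant, and COLLECTION_NAME are created elsewhere in main.py.


@app.post("/identify")
async def identify(file: UploadFile = File(...), top_k: int = 3):
    # top_k is not declared as a form field, so FastAPI reads it from the query string.
    start_time = time.time()

    filename = (file.filename or "").lower()
    if not filename.endswith((".wav", ".mp3")):
        raise HTTPException(status_code=400, detail="Unsupported file type.")

    # Save to a unique temp file. Only the extension of the client-supplied
    # name is reused, so it cannot steer the path outside the temp directory.
    suffix = os.path.splitext(filename)[1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        shutil.copyfileobj(file.file, tmp)
        temp_path = tmp.name

    try:
        wav = preprocess_wav(temp_path)
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Error processing audio: {e}")
    finally:
        # Runs on success and on failure, so the temp file is always removed.
        os.remove(temp_path)

    embedding = encoder.embed_utterance(wav)

    # Fetch the top_k stored clips closest to the query embedding.
    search_results = qdrant.search(
        collection_name=COLLECTION_NAME,
        query_vector=embedding.tolist(),
        limit=top_k
    )

    if not search_results:
        exec_time = time.time() - start_time
        print(f"Identity check execution time: {exec_time:.3f} seconds")
        return JSONResponse(content={"message": "No identities found.", "execution_time": exec_time})

    # Group scores by identity name; one identity may have several stored clips.
    matches = {}
    for point in search_results:
        name = point.payload.get("name", "unknown")
        matches.setdefault(name, []).append(point.score)

    # Average each identity's scores and pick the highest average.
    avg_matches = {name: sum(scores) / len(scores) for name, scores in matches.items()}
    best_match = max(avg_matches.items(), key=lambda x: x[1])

    exec_time = time.time() - start_time
    print(f"Identity check execution time: {exec_time:.3f} seconds")

    return JSONResponse(content={
        "best_match": best_match[0],
        "score": best_match[1],
        "raw_results": [
            {"id": p.id, "name": p.payload.get("name", "unknown"), "score": p.score} for p in search_results
        ],
        "execution_time": exec_time
    })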
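The scoring step in the endpoint can be isolated and tried without audio or a database. The sketch below reimplements the same averaging on plain (name, score) pairs. The function name best_match and the optional min_score cutoff are additions for illustration, not part of the project; a cutoff is one possible way to report "no match" for a speaker who was never registered, since a top-k search otherwise always returns someone.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple


def best_match(hits: Iterable[Tuple[str, float]],
               min_score: Optional[float] = None) -> Optional[Tuple[str, float]]:
    """Average scores per name and return the (name, average) with the highest average.

    Returns None when there are no hits, or when min_score is given and the
    best average falls below it (hypothetical cutoff, not in the original code).
    """
    scores = defaultdict(list)
    for name, score in hits:
        scores[name].append(score)
    if not scores:
        return None

    averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    name, score = max(averages.items(), key=lambda item: item[1])
    if min_score is not None and score < min_score:
        return None
    return name, score


if __name__ == "__main__":
    # Made-up scores, chosen to show how averaging behaves.
    hits = [("Ronaldo", 0.9), ("Messi", 0.8), ("Ronaldo", 0.5)]
    print(best_match(hits))
```

With these made-up scores, the single strongest hit belongs to Ronaldo, yet Messi wins because Ronaldo's weaker second clip pulls the average down. Whether averaging or taking each identity's maximum suits your data better is a design choice worth testing on real recordings.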

Testing the API

You can test the endpoints using cURL commands:

# Upload a known identity
curl -X POST "http://localhost:8000/upload_identity" \
  -F "identity_name=Ronaldo" \
  -F "file=@/path/to/Ronaldo.mp3"

# Identify a speaker from a test audio clip
# (top_k is a query parameter on /identify, so it goes in the URL)
curl -X POST "http://localhost:8000/identify?top_k=3" \
  -F "file=@/path/to/test_audio.mp3"

Conclusion

This containerized voice identification system combines Resemblyzer’s efficient audio embeddings with QdrantDB’s fast vector search, all wrapped in a FastAPI interface for simplicity and scalability. By containerizing the solution, you avoid dependency conflicts and ensure that your environment is consistent across deployments. If you need a robust and repeatable voice biometric solution, this setup delivers.

GitHub: codysnider/resemblyzer-qdrantdb