Portfolio Project

Smart Sentence Retriever

NLP Embeddings & Serverless Retrieval

Machine Learning · Automation · Python · AWS · Docker · NLP

Context

I wanted a fast way to find sentences that match a question, even when the wording is different.

Approach

  • Cleaned the text of Alice in Wonderland, split it into sentences, and precomputed embeddings.
  • Tested 6+ embedding models on 800 sentences (k=2–6) and tracked both quality (silhouette score) and model size.
  • Deployed the best-quality model behind an AWS Lambda API (with CORS) that returns the top-k matches.
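The first step above can be sketched with a naive regex-based splitter (a simplified sketch; the actual cleaning rules and sentence splitter used in the project are not shown in this write-up):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: collapse whitespace, then break on
    ., !, or ? followed by whitespace. A real pipeline would also
    strip chapter headings, stray quotes, and similar noise."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    parts = re.split(r"(?<=[.!?])\s+", cleaned)
    return [p for p in parts if p]

sents = split_sentences(
    "Alice was beginning to get very tired. "
    "She peeped into the book her sister was reading! "
    "What is the use of a book without pictures?"
)
```

Each sentence is then embedded once, so query time only pays for a single forward pass over the query.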

Impact

  • Best silhouette score: 0.313 with Snowflake Arctic Embed L v2.0 (k=2).
  • Best score per million parameters: 0.0116 with Jina Embeddings v3 (k=6).
  • Deployed Arctic Embed L v2.0; the demo calls a Lambda endpoint to rank sentences by meaning.

System Design

This uses a fixed corpus (Alice in Wonderland): embed each sentence once, then embed each query and rank by cosine similarity.

  • Offline: clean the text, split into sentences, and store embeddings.
  • Online: embed the query, score against the cached matrix, and return the top-k with scores.
  • Frontend: a small demo that calls `/health` and `/rank` and shows the top matches.
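The online scoring step can be sketched in a few lines of NumPy (a minimal sketch; the deployed service also handles tokenization, pooling, and the `/health` and `/rank` routes):

```python
import numpy as np

def rank_top_k(query_vec: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Rank corpus rows against the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    C = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = C @ q                  # cosine similarity per sentence
    top = np.argsort(-scores)[:k]   # indices of the k best matches
    return [(int(i), float(scores[i])) for i in top]

# toy corpus of three 4-d "embeddings"
corpus = np.array([[1.0, 0, 0, 0],
                   [0, 1.0, 0, 0],
                   [0.9, 0.1, 0, 0]])
hits = rank_top_k(np.array([1.0, 0, 0, 0]), corpus, k=2)
```

Because the corpus matrix is precomputed and normalized once, a query costs one matrix-vector product plus a partial sort.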

Model Selection

  • Compared several embedding models using silhouette score.
  • Tracked silhouette score per million parameters to keep size and cost in mind.
  • Kept the setup fixed (same corpus, same sample, same k range) so results are comparable.
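The two metrics above can be reproduced with a small pure-NumPy implementation (a sketch; the real benchmark clustered 800 sentences at k=2–6, and the parameter counts below are illustrative, not the actual model sizes):

```python
import numpy as np

def silhouette(X: np.ndarray, labels: np.ndarray) -> float:
    """Mean silhouette coefficient: (b - a) / max(a, b) per point, where
    a = mean intra-cluster distance and b = nearest other-cluster mean."""
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    n = len(X)
    s = np.empty(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return float(s.mean())

def score_per_million(score: float, n_params: int) -> float:
    """Normalize quality by model size, as in the comparison above."""
    return score / (n_params / 1e6)

# two tight, well-separated toy clusters -> silhouette near 1
X = np.array([[0.0, 0], [0, 1], [10, 10], [10, 11]])
score = silhouette(X, np.array([0, 0, 1, 1]))
```

Holding the corpus, sample, and k range fixed means differences in these numbers come from the embedding model alone.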

Serverless Deployment

  • Built a Lambda container with CPU PyTorch, FastAPI/Mangum, and the precomputed artifacts.
  • Used the plain Hugging Face stack (no sentence-transformers) to keep cold starts smaller.
  • Exposed a Lambda Function URL with CORS so the website can call it from the browser.
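The container build can be sketched as a Dockerfile on the AWS Lambda Python base image (a hedged sketch; file names such as `app.py`, `embeddings.npz`, and `sentences.json` are illustrative, not the actual repository layout):

```dockerfile
# AWS Lambda Python base image
FROM public.ecr.aws/lambda/python:3.11

# CPU-only PyTorch wheels keep the image, and cold starts, smaller
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu \
    && pip install --no-cache-dir transformers fastapi mangum

# Precomputed artifacts: sentence list + embedding matrix
COPY embeddings.npz sentences.json ${LAMBDA_TASK_ROOT}/
COPY app.py ${LAMBDA_TASK_ROOT}/

# Mangum adapter object exported from app.py wraps the FastAPI app
CMD ["app.handler"]
```

Baking the artifacts into the image means no S3 fetch on cold start; CORS is then configured on the Function URL itself.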

What I'd Improve

  • Let users upload documents and build embeddings in the background.
  • Add ANN search (HNSW/FAISS) for bigger corpora and faster top-k.
  • Add a small labeled set and evaluate with metrics like nDCG@k.
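The evaluation idea in the last bullet is simple to implement; a minimal nDCG@k sketch (the graded relevance labels here are made up for illustration, not from a real labeled set):

```python
import numpy as np

def ndcg_at_k(relevances, k: int) -> float:
    """nDCG@k: discounted cumulative gain of the ranking, normalized
    by the gain of the ideal (relevance-sorted) ranking."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, len(rel) + 2))  # log2(rank + 1)
    dcg = float((rel / discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal / discounts[: len(ideal)]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# hypothetical graded labels for 6 retrieved sentences (3 = perfect match)
score = ndcg_at_k([3, 2, 3, 0, 1, 2], k=6)
```

A score of 1.0 means the retriever returned results in the ideal relevance order; anything lower quantifies how far off the ranking is.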

Links