Portfolio Project
Smart Sentence Retriever
NLP Embeddings & Serverless Retrieval
Context
I wanted a fast way to find sentences that match a question, even when the wording is different.
Approach
- Cleaned the text of Alice in Wonderland, split it into sentences, and precomputed embeddings.
- Tested 6+ embedding models on 800 sentences (clustering with k = 2–6) and tracked both quality (silhouette score) and model size.
- Deployed the best-quality model behind an AWS Lambda API (with CORS) that returns the top-k matches.
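The offline preprocessing can be sketched roughly like this (the regex splitter is illustrative, not the exact pipeline; a production splitter would handle abbreviations and quotes):

```python
import re

def clean_text(raw: str) -> str:
    """Collapse runs of whitespace and hard line breaks into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., !, or ? followed by a space."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p for p in parts if p]

# Toy excerpt standing in for the full novel text.
corpus = clean_text("Alice was beginning to get very tired.\nShe peeped into the book. No pictures!")
sentences = split_sentences(corpus)
# Each sentence is then embedded once and the vectors cached to disk.
```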
Impact
- Best silhouette score: 0.313 with Snowflake Arctic Embed L v2.0 (k=2).
- Best score per million parameters: 0.0116 with Jina Embeddings v3 (k=6).
- Deployed Arctic Embed L v2.0; the demo calls a Lambda endpoint to rank sentences by meaning.
System Design
This uses a fixed corpus (Alice in Wonderland): embed each sentence once, then embed each query and rank by cosine similarity.
- Offline: clean the text, split into sentences, and store embeddings.
- Online: embed the query, score against the cached matrix, and return the top-k with scores.
- Frontend: a small demo that calls `/health` and `/rank` and shows the top matches.
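The online step boils down to cosine similarity between the query vector and the cached matrix. A minimal pure-Python sketch (the 2-D vectors are toy stand-ins for real embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank(query_vec, corpus_vecs, sentences, k=3):
    """Score the query against every cached embedding, return the top-k."""
    scored = [(cosine(query_vec, v), s) for v, s in zip(corpus_vecs, sentences)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

# Toy "precomputed matrix" plus the sentences it was built from.
corpus_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
sentences = ["down the rabbit hole", "the pool of tears", "a mad tea-party"]
top = rank([0.9, 0.1], corpus_vecs, sentences, k=2)
```

In production the matrix is loaded once at cold start, so each request only pays for one query embedding plus a matrix-vector product.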
Model Selection
- Compared several embedding models using silhouette score.
- Tracked silhouette score per million parameters to keep size and cost in mind.
- Kept the setup fixed (same corpus, same sample, same k range) so results are comparable.
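Silhouette score was the quality metric. A minimal pure-Python version of the formula, for clarity (the actual comparison used a standard library implementation such as scikit-learn's):

```python
import math

def silhouette(points, labels):
    """Mean silhouette: s(i) = (b(i) - a(i)) / max(a(i), b(i)),
    where a(i) is the mean distance to i's own cluster and
    b(i) is the mean distance to the nearest other cluster."""
    scores = []
    for i, (p, l) in enumerate(zip(points, labels)):
        own = [q for j, q in enumerate(points) if labels[j] == l and j != i]
        if not own:
            scores.append(0.0)  # singleton cluster contributes 0 by convention
            continue
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == lab)
            / labels.count(lab)
            for lab in set(labels) if lab != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated 1-D clusters -> score close to 1.
points = [(0.0,), (0.1,), (10.0,), (10.1,)]
score = silhouette(points, [0, 0, 1, 1])
```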
Serverless Deployment
- Built a Lambda container with CPU PyTorch, FastAPI/Mangum, and the precomputed artifacts.
- Used the plain Hugging Face stack (no sentence-transformers) to keep cold starts smaller.
- Exposed a Lambda Function URL with CORS so the website can call it from the browser.
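A container image for this kind of deployment might look like the sketch below (file names such as `app.py` and `artifacts/` are placeholders, not the exact project layout):

```dockerfile
# AWS Lambda Python base image
FROM public.ecr.aws/lambda/python:3.11

# CPU-only PyTorch keeps the image far smaller than the default CUDA wheels
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu \
    && pip install --no-cache-dir transformers fastapi mangum

# Precomputed sentence embeddings plus the FastAPI app
COPY artifacts/ ${LAMBDA_TASK_ROOT}/artifacts/
COPY app.py ${LAMBDA_TASK_ROOT}/

# The Mangum adapter object defined in app.py wraps the FastAPI app for Lambda
CMD ["app.handler"]
```

CORS itself is handled either by FastAPI middleware or by the Function URL's CORS settings, so the browser demo can call the endpoint directly.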
What I'd Improve
- Let users upload documents and build embeddings in the background.
- Add ANN search (HNSW/FAISS) for bigger corpora and faster top-k.
- Add a small labeled set and evaluate with metrics like nDCG@k.
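For the labeled-set idea, nDCG@k is simple to compute once relevance labels exist. A small pure-Python sketch of the metric (the labels here are illustrative, not real data):

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """DCG of the ranking divided by DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Relevance labels in the order the retriever returned the sentences.
perfect = ndcg_at_k([3, 2, 1, 0], k=4)  # already ideal order -> 1.0
swapped = ndcg_at_k([2, 3, 1, 0], k=4)  # top two swapped -> below 1.0
```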