Portfolio Project

Baby Name Predictor

Python ML Pipeline

Machine Learning, Python, scikit-learn

Context

My wife asked me to suggest baby names. I wanted something that learns her taste instead of guessing.

Approach

  • Aggregated and cleaned 140+ years of SSA records and engineered trend features.
  • Built a simple 'quiz' script to collect like/dislike labels.
  • Trained several models and averaged their scores to produce recommendations.

Impact

  • Generated personalized top-50 name lists for boys and girls.
  • Helped us narrow the list when naming our child.

Data and Labeling

I combined Social Security Administration name data with preference labels to build a personalized recommender.

  • Aggregated 140+ years of SSA records and added recency/trend features so the model doesn’t just recommend the most common names.
  • Restricted the data to Colorado so suggestions stay close to the names my wife actually hears day to day.
  • Collected labels through quick quizzes so the model learns her taste over time.
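
The aggregation step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the SSA state files use the comma-separated layout state, sex, year, name, count, and both the `yearly_counts` helper and the sample rows (including their counts) are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in the assumed layout: state, sex, year, name, count.
SAMPLE = """CO,F,2019,Olivia,250
CO,F,2020,Olivia,240
CO,M,2020,Liam,260
TX,F,2020,Olivia,1900
"""

def yearly_counts(lines, state="CO"):
    """Return {(name, sex): {year: count}} for a single state."""
    counts = defaultdict(dict)
    for row in csv.reader(lines):
        if len(row) != 5:
            continue  # skip blank or malformed rows
        st, sex, year, name, n = row
        if st != state:
            continue  # keep only the state of interest
        by_year = counts[(name, sex)]
        by_year[int(year)] = by_year.get(int(year), 0) + int(n)
    return dict(counts)

co = yearly_counts(io.StringIO(SAMPLE))
print(co[("Olivia", "F")])
```

Keying on (name, sex) keeps names used for both boys and girls as separate series, which matters when the final lists are produced per sex.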

Feature Engineering

  • Name shape features: length, vowel/consonant mix, syllable count, start/end vowel flags, and entropy.
  • Popularity features: total count, peak year, and recent-count features to capture saturation and momentum.
  • A rough origin signal: the language `langdetect` guesses for each name, used as a lightweight proxy for linguistic origin.
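
As a sketch of what the shape and popularity features might look like, the code below computes them with the standard library only. Several details are assumptions for illustration: 'y' is treated as a vowel, syllables are estimated by counting vowel runs, "entropy" is taken to mean Shannon entropy over a name's letters, and the function names and the five-year recent window are hypothetical. Names are assumed to be non-empty.

```python
import math
from collections import Counter

VOWELS = set("aeiouy")  # assumption: treat 'y' as a vowel

def count_syllables(name):
    """Rough syllable estimate: count runs of consecutive vowels."""
    runs, prev_vowel = 0, False
    for ch in name.lower():
        is_vowel = ch in VOWELS
        if is_vowel and not prev_vowel:
            runs += 1
        prev_vowel = is_vowel
    return max(runs, 1)

def char_entropy(name):
    """Shannon entropy (in bits) of the name's letter distribution."""
    letters = name.lower()
    total = len(letters)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(letters).values())

def shape_features(name):
    """Hand-crafted features describing how a name is spelled."""
    lower = name.lower()
    n_vowels = sum(ch in VOWELS for ch in lower)
    return {
        "length": len(lower),
        "vowel_ratio": n_vowels / len(lower),
        "syllables": count_syllables(lower),
        "starts_with_vowel": lower[0] in VOWELS,
        "ends_with_vowel": lower[-1] in VOWELS,
        "entropy": char_entropy(lower),
    }

def popularity_features(by_year, current_year, recent_window=5):
    """by_year maps year -> count for one name."""
    recent = sum(n for y, n in by_year.items()
                 if y > current_year - recent_window)
    return {
        "total_count": sum(by_year.values()),
        "peak_year": max(by_year, key=by_year.get),
        "recent_count": recent,
    }

print(shape_features("Anna"))
print(popularity_features({2000: 10, 2019: 30, 2020: 20}, current_year=2020))
```

Comparing `recent_count` with `total_count` is one simple way to separate a name that is rising now from one that peaked decades ago.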

Modeling Strategy

  • Trained multiple models (Random Forest, XGBoost, SVM, KNN, plus a deep learning baseline) with randomized hyperparameter search.
  • Optimized for weighted F1 to handle class imbalance in 'liked' vs. 'not liked' labels.
  • Averaged predictions across models to reduce quirks from any single model.
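
To make the metric and the averaging step concrete, here is a small standard-library sketch. Weighted F1 is the per-class F1 averaged with weights proportional to each class's share of the true labels; scikit-learn exposes the same metric as `f1_score(..., average="weighted")` and as the `"f1_weighted"` scoring string. The function names and sample values below are hypothetical, and the actual project may average model outputs differently.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1, weighted by each class's share of the true labels."""
    total = len(y_true)
    score = 0.0
    for cls in set(y_true):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn  # number of true examples of this class
        score += f1 * support / total
    return score

def average_scores(per_model_scores):
    """Average each name's predicted 'liked' probability across models."""
    names = per_model_scores[0].keys()
    k = len(per_model_scores)
    return {name: sum(m[name] for m in per_model_scores) / k for name in names}

# Made-up labels (1 = liked, 0 = not liked) and made-up model outputs.
print(weighted_f1([1, 0, 0, 0], [1, 1, 0, 0]))
print(average_scores([{"Ada": 0.9, "Mae": 0.2}, {"Ada": 0.7, "Mae": 0.4}]))
```

A plain mean treats every model equally; weighting each model by its validation score would be a natural variation.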

Recommendation Workflow

  • Generated ranked top-50 lists for boys and girls and exported the results for easy review.
  • Designed the loop so new feedback becomes new training data.
  • Kept it explainable by surfacing which features correlated with higher predicted preference.
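
The ranking and export step could look something like the sketch below. It assumes each candidate arrives as a (name, sex, score) tuple; the helper names, CSV columns, and sample scores are illustrative rather than taken from the project.

```python
import csv
import io

def top_n_by_sex(scored, n=50):
    """scored: iterable of (name, sex, score).
    Returns {sex: [(name, score), ...]} sorted by descending score."""
    by_sex = {}
    for name, sex, score in scored:
        by_sex.setdefault(sex, []).append((name, score))
    return {
        sex: sorted(rows, key=lambda r: r[1], reverse=True)[:n]
        for sex, rows in by_sex.items()
    }

def to_csv(ranked):
    """Serialize the ranked lists as CSV text for review in a spreadsheet."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sex", "rank", "name", "score"])
    for sex in sorted(ranked):
        for rank, (name, score) in enumerate(ranked[sex], start=1):
            writer.writerow([sex, rank, name, f"{score:.3f}"])
    return buf.getvalue()

# Made-up scores for illustration.
scored = [("Ada", "F", 0.8), ("Mae", "F", 0.9), ("Leo", "M", 0.7)]
print(to_csv(top_n_by_sex(scored, n=1)))
```

Names reviewed from the exported list can be appended to the label set with their like/dislike outcome, which is how feedback turns into new training data.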

What I'd Improve

  • Collect more labels and add calibration so probabilities map to real acceptance rates.
  • Use name embeddings (character-level or phonetic) instead of hand-crafted features alone.
  • Add diversity constraints so recommendations aren’t overly similar to each other.
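
One possible shape for the diversity constraint is a greedy filter over the ranked list, sketched below. This is only an idea, not something the project implements: it uses `difflib.SequenceMatcher` spelling similarity as a stand-in for a phonetic or embedding-based distance, and the `diversify` helper and the 0.8 threshold are arbitrary assumptions.

```python
from difflib import SequenceMatcher

def diversify(ranked_names, max_similarity=0.8, limit=50):
    """Walk the list in rank order and keep a name only if it is not too
    similar in spelling to any name already kept."""
    kept = []
    for name in ranked_names:
        too_close = any(
            SequenceMatcher(None, name.lower(), k.lower()).ratio() >= max_similarity
            for k in kept
        )
        if not too_close:
            kept.append(name)
        if len(kept) == limit:
            break
    return kept

print(diversify(["Anna", "Ana", "Leo"]))
```

Because the filter walks the list in rank order, the highest-scoring name in each cluster of near-duplicates is the one that survives.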
