Data Science & Analytics | IT Operations
A data generalist with experience in IT operations and business automation. My interest is in real-life applications of data to business and everyday problems.
Let's Connect
LinkedIn | Twitter | Blog
View my Resume

Before there was an architecture diagram, a matching algorithm, or a single API endpoint, there was a problem I noticed before anyone else did.
The ALX Nigeria community was growing fast. What started as a network of learners quickly became tens of thousands of graduates spread across every state of Nigeria, each on a different path, at a different stage, with different goals. And yet, they all had one thing in common: the need to connect with the right people.
Not just anyone, but people who understood their journey.
At that growth rate, meaningful connections became harder to facilitate. My early attempts at connecting people relied on manual processes: spreadsheets, bulk emails, and hand-coordinated meetings. These proved the value of structured matching but quickly became unsustainable.
The challenges were consistent: mismatched pairings, no-shows, limited feedback, and increasing operational overhead with each cycle.
It became clear that this wasn’t just a coordination issue; it required a purpose-built system.
What we needed was an engine that could intelligently match people, automate the interaction flow, and scale without manual intervention.
This document outlines how that system, ALX Connect, was designed and built.
Early-career professionals often struggle to find the right guidance, mentors, and opportunities. ALX Connect solves this by enabling users to discover and connect with other members of the community based on skills, experience, and career interests.
ALX Connect is a networking and mentorship discovery platform built for the ALX community. It helps learners and alumni connect based on skills, experience, and career interests, creating a collaborative ecosystem where knowledge and opportunities are shared.
ALX Connect is unique through two key features: an automated matching system and an automated feedback system. The platform automatically matches community members based on shared skills, experience, and career interests. Furthermore, the feedback system is designed not only to improve the platform but also to help identify and highlight emerging mentors within the community who consistently offer valuable guidance.
By combining machine learning, data-driven recommendations, and automated feedback mechanisms, ALX Connect helps community members find the right peers and mentors, share knowledge, and surface opportunities.
Ultimately, ALX Connect transforms the ALX community into a living network of knowledge, support, and opportunity, helping talented individuals grow and succeed in the global workforce. And the best part? It requires little ongoing effort from the team.
The implementation is in two versions: Cloud Run (full semantic model) and Render (lightweight TF-IDF).
Sentence Transformers (all-MiniLM-L6-v2), a pretrained transformer model that generates dense semantic vectors, was chosen for its balance of speed and accuracy, and its 384-dimensional output aligns well with the target dimension. For the Render version, TF-IDF + TruncatedSVD was used in place of the sentence transformer (Render’s memory constraints prevent loading large transformer models). This combination converts text to sparse term-frequency vectors, then reduces them to 384 dimensions. It captures keyword overlap but, unlike sentence transformers, lacks deep semantic understanding.
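The Render-side pipeline can be sketched in a few lines with scikit-learn. This is a minimal illustration, not the production code; the helper name `embed_profiles` is mine, and the real system targets 384 dimensions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

def embed_profiles(texts, dim=384):
    """Convert profile texts to dense vectors: TF-IDF, then TruncatedSVD."""
    # Sparse term-frequency vectors: captures keyword overlap only
    tfidf = TfidfVectorizer(stop_words="english")
    sparse = tfidf.fit_transform(texts)
    # SVD cannot produce more components than the data supports
    k = max(1, min(dim, sparse.shape[1] - 1, sparse.shape[0] - 1))
    svd = TruncatedSVD(n_components=k, random_state=42)
    dense = svd.fit_transform(sparse)
    # L2-normalise so a dot product equals cosine similarity downstream
    return normalize(dense)
```

The L2-normalisation at the end is what lets the index layer treat inner products as cosine similarity.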
HNSWlib (Hierarchical Navigable Small World), a dedicated approximate nearest-neighbour index optimised for cosine similarity, was chosen for its simple API and good performance at this scale. A scikit-learn fallback was also implemented for when HNSWlib fails (primarily for Render); it is slower but guarantees exact nearest neighbours.
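The index-with-fallback pattern might look something like the sketch below. The function name `build_knn` and the specific HNSW parameters (`M`, `ef_construction`) are illustrative choices, not the production values; the sketch assumes the first query hit for each vector is the vector itself:

```python
import numpy as np

def build_knn(vectors, k=5):
    """Return (indices, distances) of each row's k nearest neighbours, cosine metric."""
    try:
        import hnswlib  # fast approximate index
        index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
        index.init_index(max_elements=len(vectors), ef_construction=200, M=16)
        index.add_items(vectors, np.arange(len(vectors)))
        index.set_ef(max(50, 2 * k))  # search breadth: higher = more accurate
        labels, dists = index.knn_query(vectors, k=k + 1)
    except ImportError:
        # Exact but slower fallback (the Render path)
        from sklearn.neighbors import NearestNeighbors
        nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(vectors)
        dists, labels = nn.kneighbors(vectors)
    # Drop the first column: each vector's nearest neighbour is itself
    return labels[:, 1:], dists[:, 1:]
```

Both branches return the same shape, so downstream matching code never needs to know which index served the query.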
The matching system uses a multi-stage similarity and clustering pipeline designed to maximize high-confidence matches while ensuring every profile is paired. The process progressively relaxes constraints to maintain match quality before falling back to forced pairing strategies.
The system progressively moves through four stages:
This layered approach prioritises high-similarity matches while ensuring complete coverage of all profiles.
Mutual Nearest-Neighbour Matching: Two profiles that are each other’s nearest neighbour are paired. These matches are considered high-confidence pairs because both participants are each other’s strongest similarity signal. This stage is executed first to capture the most reliable matches in the dataset.
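Given a nearest-neighbour index (one entry per profile), the mutual check reduces to a short loop. This is a sketch with an illustrative function name:

```python
def mutual_nearest_pairs(nn_index):
    """Pair i and j when nn_index[i] == j and nn_index[j] == i.

    nn_index[i] is the index of profile i's single nearest neighbour.
    """
    pairs, used = [], set()
    for i, j in enumerate(nn_index):
        if i in used or j in used:
            continue  # each profile joins at most one pair
        if nn_index[j] == i and i < j:
            pairs.append((i, j))
            used.update((i, j))
    return pairs
```

Profiles whose affinity is one-sided (i points to j, but j points elsewhere) fall through to the greedy stage below.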
Greedy Matching from Top Neighbours: After mutual pairs, there are still many possible good matches. A greedy approach maximises overall similarity among the remaining profiles without requiring exhaustive search. Once a match is made, both profiles are removed from the pool. This step helps capture strong but non-mutual matches that were missed in the first stage.
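The greedy idea can be sketched as repeatedly taking the highest-similarity remaining pair. For clarity this sketch scans all remaining pairs each round (O(n²) per match); the actual system restricts the search to each profile's top neighbours, which is much cheaper:

```python
def greedy_match(sim, pool):
    """Repeatedly pair the two most similar remaining profiles.

    sim: symmetric similarity matrix (indexable as sim[i][j]);
    pool: indices of profiles still unmatched after the mutual stage.
    """
    pool = sorted(pool)
    pairs = []
    while len(pool) >= 2:
        best, best_pair = -1.0, None
        for a, i in enumerate(pool):
            for j in pool[a + 1:]:
                if sim[i][j] > best:
                    best, best_pair = sim[i][j], (i, j)
        pairs.append(best_pair)
        # Once matched, both profiles leave the pool
        pool = [p for p in pool if p not in best_pair]
    return pairs
```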
Cluster-Then-Match Strategy: This stage handles the leftovers that were not matched in the previous stages. These profiles are typically less similar to anyone, so we try to find any reasonable groupings using K-Means clustering. Clustering provides a coarse grouping based on overall similarity, which then allows fine‑grained pairing within each group. Unlike centroid-based assignment, this method matches profiles based on direct pairwise similarity, preventing centroid bias. Profiles that remain unmatched within clusters are moved to a global singleton pool.
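A minimal sketch of the cluster-then-match stage, assuming scikit-learn; the helper name and cluster count are illustrative. Note that pairing inside each cluster uses direct pairwise cosine similarity, not distance to the centroid:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def cluster_then_match(vectors, leftovers, n_clusters=4):
    """Group leftover profiles with K-Means, then pair within each cluster."""
    leftovers = list(leftovers)
    n_clusters = min(n_clusters, len(leftovers))
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(vectors[leftovers])
    pairs, singletons = [], []
    for c in range(n_clusters):
        members = [p for p, lab in zip(leftovers, labels) if lab == c]
        sim = cosine_similarity(vectors[members])
        np.fill_diagonal(sim, -1.0)  # never match a profile with itself
        remaining = set(range(len(members)))
        while len(remaining) >= 2:
            # Best remaining pair by direct pairwise similarity (avoids centroid bias)
            candidates = [(i, j) for i in remaining for j in remaining if i < j]
            i, j = max(candidates, key=lambda p: sim[p])
            pairs.append((members[i], members[j]))
            remaining -= {i, j}
        # An odd-sized cluster leaves one profile for the global singleton pool
        singletons += [members[i] for i in remaining]
    return pairs, singletons
```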
Forced Matching: Profiles left in the global singleton pool are paired directly and labelled forced-match to distinguish them from similarity-based matches. If an odd number of singletons remains, the final singleton is added to an existing pair, forming a trio. This guarantees no profile remains unmatched.
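The coverage guarantee is simple to state in code. A sketch, with an illustrative helper name:

```python
def force_match(singletons, pairs):
    """Pair leftover singletons arbitrarily (these are labelled 'forced-match').

    If one singleton remains after pairing, attach it to an existing
    pair to form a trio, so every profile ends up in a group.
    """
    groups = [list(p) for p in pairs]
    leftover = list(singletons)
    while len(leftover) >= 2:
        groups.append([leftover.pop(), leftover.pop()])
    if leftover and groups:
        groups[-1].append(leftover.pop())  # the trio case
    return groups
```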
The process for generating profile embeddings and identifying similar profiles involves several distinct, sequential steps, ensuring robust similarity matching and detailed output reporting.
Validate the input to ensure each profile includes an id, personal_summary, professional_summary, and years_experience, each in the right format. This validation is performed with Pydantic inside FastAPI.
class ProfileInput(BaseModel):
    id: str
    personal_summary: str
    professional_summary: str
    years_experience: int
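FastAPI applies this validation automatically at the endpoint boundary, but the same model works in batch scripts too. A sketch of per-item validation (the helper `validate_profiles` is illustrative, not part of the codebase):

```python
from pydantic import BaseModel, ValidationError

class ProfileInput(BaseModel):
    id: str
    personal_summary: str
    professional_summary: str
    years_experience: int

def validate_profiles(raw_profiles):
    """Split raw dicts into validated ProfileInput objects and (id, error) rejects."""
    valid, errors = [], []
    for raw in raw_profiles:
        try:
            valid.append(ProfileInput(**raw))
        except ValidationError as exc:
            # Keep the offending id so the caller can report which profile failed
            errors.append((raw.get("id"), str(exc)))
    return valid, errors
```

Collecting errors per profile (rather than failing the whole batch) keeps one malformed record from blocking a matching run.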
Each match record in the output contains the profile data (id, years_experience, personal_summary, and professional_summary, prefixed with a_, b_, and optionally c_ for trio members), similarity (the similarity score of the match), and source (one of “nearest”, “greedy”, “cluster”, or “forced-match”). The output also reports the number of doubles and trios for monitoring group composition.
Throughout the pipeline, memory usage is logged with psutil. After matching completes, gc.collect() is called explicitly to free temporary objects (e.g., large similarity matrices). This matters because unchecked memory growth causes constant shutdowns on memory-constrained platforms like Render, or increased costs on auto-scaling infrastructure like Cloud Run.
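The pattern is small enough to show in full. This is a sketch of the logging/cleanup idiom, with an illustrative helper name:

```python
import gc
import logging
import os

import psutil

def log_memory(stage: str) -> float:
    """Log and return the current process's resident set size in MB."""
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
    logging.info("memory at %s: %.1f MB", stage, rss_mb)
    return rss_mb

# Typical usage around the matching step:
#   log_memory("before matching")
#   ... matching builds large similarity matrices ...
#   del similarity_matrix   # drop the last reference
#   gc.collect()            # force collection of the freed temporaries
#   log_memory("after cleanup")
```

Note that gc.collect() only reclaims objects with no remaining references, so explicitly deleting (or letting go out of scope) the large temporaries first is what actually releases the memory.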
Many trade-offs were made so the model could initially run on the free tiers of Render and Cloud Run.
This was built for a maximum of 10k profiles and so far tested with about 1k. Scaling beyond that would require further testing to ensure it remains fit for purpose, along with a few extra design considerations.
ALX Connect stands as proof that powerful community tools can emerge from understanding a problem deeply and applying technology thoughtfully. It’s a foundation designed not just to match profiles, but to scale the very human need for connection and mutual growth. The real measure of its success isn’t just in the lines of code or the choice of library, but in the sustained, meaningful interactions it enables within a vibrant community.