REngine: Graph-Based Recommendation Engine
A recommendation engine using Neo4j and Python to analyze purchase behavior and generate personalized suggestions.
The Challenge
The goal of REngine is to transform transactional data from an e-commerce store (Sylius/MySQL) into an intelligent recommendation engine. The core challenge was migrating from a traditional relational database to a Neo4j knowledge graph, enabling real-time analysis of complex relationships between customers and products.
System Architecture
The project is built on an analytical pipeline divided into four specialized services:
- Synchronization (ETL): Migrating MySQL data (customers, products, orders) to Neo4j while filtering “noise” (excluding the top 1% best-selling products to avoid false affinities).
- Affinities: Calculating centers of interest via INTERESTED_IN relationships.
- Similarities: Identifying customers with similar purchase profiles using collaborative filtering.
- REST API: Serving recommendations through FastAPI.
Technical Choices
Why Neo4j & FastAPI?
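- Neo4j: Unlike SQL, where each extra hop means another join, a graph follows stored relationships directly. Queries such as “friends of friends” or “customers who bought this also bought…” have a cost that depends mainly on the part of the graph they traverse rather than on the total number of nodes, so they stay fast even with millions of nodes.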
- FastAPI: Chosen for its speed and native asynchronous support, making it ideal for serving graph results via a REST interface.
- Pandas: Used for in-memory customer-to-customer similarity calculations, efficiently processing large relationship matrices.
Recommendation Analysis
1. Selective Filtering: Eliminating “Noise”
Before any analysis, the system cleans the data to keep only meaningful purchases. I chose to ignore best-seller products (the top 1% by sales) because they are bought by profiles too diverse to be discriminative. Similarly, free or promotional items are excluded so that they do not skew customers’ true preferences.
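A minimal sketch of this filter in plain Python, under a few assumptions: order lines are represented as hypothetical `(product_id, quantity, unit_price)` tuples, "sales" means units sold, and only free items (price of zero) are detected, since the rule used to recognize promotional items is not described here. The function names are illustrative, not the project's actual code.

```python
import math
from collections import Counter


def find_best_sellers(order_lines, top_fraction=0.01):
    """Return the product ids in the top `top_fraction` by units sold.

    `order_lines` is an iterable of (product_id, quantity, unit_price)
    tuples; this shape is an assumption made for the example.
    """
    units = Counter()
    for product_id, quantity, _unit_price in order_lines:
        units[product_id] += quantity
    if not units:
        return set()
    # Flag at least one product so that small catalogs are still filtered.
    cutoff = max(1, math.ceil(len(units) * top_fraction))
    return {product_id for product_id, _ in units.most_common(cutoff)}


def meaningful_lines(order_lines, top_fraction=0.01):
    """Drop best-sellers and free items, keeping the original order."""
    order_lines = list(order_lines)
    best_sellers = find_best_sellers(order_lines, top_fraction)
    return [
        (product_id, quantity, unit_price)
        for product_id, quantity, unit_price in order_lines
        if product_id not in best_sellers and unit_price > 0
    ]
```

With a four-product catalog, the single best-selling product is flagged and the free sample is dropped, leaving only the purchases that say something about a customer's interests.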
2. Customer Similarity: The Power of Neo4j GDS
To identify similar profiles, the algorithm finds customers who share at least three products. Moving beyond a simple Pandas join, I utilized the Neo4j Graph Data Science (GDS) extension. By leveraging similarity algorithms, the system generates weighted SIMILAR_TO relationships directly within the graph, precisely quantifying the “proximity” between two customers.
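To make the "at least three shared products" rule concrete, here is a small in-memory sketch. It is not the GDS procedure itself: it assumes a Jaccard score (shared products divided by all products bought by either customer), which is one common choice; the exact metric used in the project is not stated. The `purchases` mapping and the function name are hypothetical.

```python
from itertools import combinations


def similar_customers(purchases, min_shared=3):
    """Return (customer_a, customer_b, score) for sufficiently similar pairs.

    `purchases` maps a customer id to the set of product ids they bought.
    The score is the Jaccard index, an assumption made for this example.
    """
    links = []
    for a, b in combinations(sorted(purchases), 2):
        shared = purchases[a] & purchases[b]
        if len(shared) >= min_shared:
            union = purchases[a] | purchases[b]
            links.append((a, b, len(shared) / len(union)))
    return links
```

Each returned tuple corresponds to one weighted SIMILAR_TO relationship. Comparing every pair is quadratic in the number of customers, which is one reason to delegate this step to the graph rather than run it in application code.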
3. Recommendation Engine: GraphAware & APOC
To structure suggestions, I relied on the GraphAware Recommendation Engine framework. This tool allowed me to define robust recommendation engines and use optimized procedures for score calculations. Additionally, APOC extension functions facilitated bulk calculations for Market Basket Analysis.
4. Creating “Tribes” (The Mirror Effect)
This is the heart of the system: the engine compares purchase profiles to create SIMILAR_TO kinship links.
Figure: A customer profile and its products of interest (INTERESTED_IN relationships).
If you have purchased the same products as ten other people, you join their “tribe.” The system then identifies what your tribe buys most frequently to recommend it to you.
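The tribe logic can be sketched as a frequency count over what similar customers bought, minus what the customer already owns. This is an illustration under assumptions: the tribe is passed in as a ready-made list of customer ids (in the graph it would come from SIMILAR_TO links), and the names `recommend_from_tribe`, `purchases`, and `limit` are hypothetical.

```python
from collections import Counter


def recommend_from_tribe(customer, purchases, tribe, limit=5):
    """Rank products bought by the tribe that the customer does not own yet.

    `purchases` maps a customer id to a set of product ids; `tribe` is the
    list of customer ids considered similar to `customer`.
    """
    owned = purchases.get(customer, set())
    counts = Counter()
    for member in tribe:
        for product in purchases.get(member, set()) - owned:
            counts[product] += 1
    # Most frequent first; ties are broken by product id for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [product for product, _ in ranked[:limit]]
```

A weighted variant could add the SIMILAR_TO score of each member instead of 1, so that closer profiles count for more.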
Anti-Duplicate Intelligence
Nothing is more frustrating than a useless recommendation. I integrated a safety layer to avoid redundant suggestions.
- Example: If you are viewing a 30-capsule format of a supplement, the engine understands that suggesting the 60-capsule version is not a helpful recommendation, but a duplicate. It filters it out to suggest a more relevant complementary product.
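One possible way to implement such a safety layer is to group variants under a shared "family" key and keep at most one product per family. The sketch below assumes a hypothetical `family_of` mapping from product id to family; how the engine actually detects that two products are variants of each other is not described here.

```python
def drop_variants(viewed_id, candidates, family_of):
    """Remove candidates that are variants of the viewed product.

    `family_of` maps a product id to a family key (an assumption for this
    example); products missing from the mapping form their own family.
    Only the first candidate of each family is kept.
    """
    seen = {family_of.get(viewed_id, viewed_id)}
    kept = []
    for product_id in candidates:
        family = family_of.get(product_id, product_id)
        if family in seen:
            continue  # Same family as the viewed product or an earlier pick.
        seen.add(family)
        kept.append(product_id)
    return kept
```

Because candidates are processed in ranked order, the highest-scoring variant of each family survives and the slots freed by duplicates go to complementary products.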
Performance: 19 Milliseconds
This is where the choice of a graph database truly shines. Because the affinity links are precomputed and held in memory, a recommendation query only has to traverse existing relationships instead of computing them on demand. In local testing, recommendation queries execute in 19 ms on average, which gives the end user a near-instant response.
The Problem: “Common Product” Noise
During early testing, reusable bags or basic utility items created links between almost every customer, completely skewing the relevance of recommendations.
The Solution: Analytics Service
I implemented a filtering service that identifies the top 1% of sales. These products are assigned a special CommonProduct label and are excluded from similarity calculations to ensure that recommendations are based on genuine interests rather than necessity purchases.
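In the graph, the labeling and the exclusion could look like the two Cypher statements below, wrapped in Python so they can be sent with parameters. Everything specific in them is an assumption for illustration: the `Product` and `Customer` labels, the `id` property, the `PURCHASED` relationship, the `shared` property, and the helper name are not taken from the project.

```python
# Cypher sketch: tag the products identified as "common" by the analytics step.
LABEL_COMMON_PRODUCTS = """
UNWIND $product_ids AS product_id
MATCH (p:Product {id: product_id})
SET p:CommonProduct
"""

# Cypher sketch: link customers who share enough non-common products.
BUILD_SIMILARITY_LINKS = """
MATCH (a:Customer)-[:PURCHASED]->(p:Product)<-[:PURCHASED]-(b:Customer)
WHERE NOT p:CommonProduct AND a.id < b.id
WITH a, b, count(DISTINCT p) AS shared
WHERE shared >= $min_shared
MERGE (a)-[s:SIMILAR_TO]-(b)
SET s.shared = shared
"""


def label_statement(product_ids):
    """Return the labeling query and its parameters, with ids deduplicated."""
    return LABEL_COMMON_PRODUCTS, {"product_ids": sorted(set(product_ids))}
```

With the official Neo4j Python driver, the returned pair can be passed to `session.run(query, params)`. The `a.id < b.id` condition keeps each pair of customers from being processed twice.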
Lessons Learned
- Cypher Query Power: Discovering Neo4j’s query language allowed me to replace hundreds of lines of procedural code with a single graph traversal query.
- In-Memory Performance: Using Pandas to orchestrate data between two databases (MySQL and Neo4j) was an excellent exercise in memory optimization.
- Layered Architecture: The strict decoupling between Migrators, Services, and Repositories makes the engine easy to maintain and evolve.
This project demonstrated that data structure is just as vital as the algorithm itself. By shifting from a “table” logic to a “relationship” logic, complex problems become radically simpler. I also learned to use specialized tools like GraphAware and GDS to turn a simple graph into a powerful predictive intelligence tool.
Retrospective
This Proof of Concept (PoC) validates the technical feasibility of the graph-based approach. For a full-scale production deployment, the next step would be to automate the analytical cycle so that periodic recalculations integrate new purchase behavior in near real time.