The Challenge

Mainstream movie recommendation algorithms prioritize popularity and user ratings — but they systematically miss niche masterpieces and thematic deep cuts. I built a data-driven alternative that surfaces films based on hidden thematic and stylistic patterns, discovering that some of cinema's most beloved works sit in algorithmically invisible clusters.

140,000+
Films Analyzed
77,432
Graph Nodes (Crew)
98.71%
Model Accuracy
12
Thematic Clusters

Technical Stack

1. Data Pipeline & Feature Engineering

Built normalized SQLite database ingesting IMDb datasets. Extracted text features (plot summaries, genres) and engineered graph-based features from crew networks.

import sqlite3
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load IMDb data into normalized SQLite
conn = sqlite3.connect('cinema.db')

# Query processed films
df_films = pd.read_sql_query(
    """SELECT tconst, primaryTitle, genres, plotSummary 
       FROM films WHERE isAdult = 0""",
    conn
)

# TF-IDF vectorization on plot summaries
tfidf = TfidfVectorizer(max_features=500, stop_words='english')
tfidf_matrix = tfidf.fit_transform(df_films['plotSummary'].fillna(''))

print(f"TF-IDF Matrix: {tfidf_matrix.shape}")
# Output: TF-IDF Matrix: (140000, 500)

2. Graph-Based Centrality Analysis

Modeled crew networks as a graph: nodes = films and people, edges = collaborations. Computed degree and betweenness centrality to identify influential figures and well-connected films.

import networkx as nx

# Build crew network graph
G = nx.Graph()

for _, row in df_crew.iterrows():
    G.add_node(row['tconst'], type='film')
    G.add_node(row['nconst'], type='person')
    G.add_edge(row['tconst'], row['nconst'], 
               role=row['category'])

# Calculate centrality metrics
degree_centrality = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G, k=1000) 

print(f"Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}")
# Output: Nodes: 77432, Edges: 313201

3. Predictive Modeling with LightGBM

Trained gradient boosting classifier combining TF-IDF text features with graph metrics. LightGBM chosen for speed and interpretability. Final model: 98.71% accuracy.

import lightgbm as lgb
from scipy.sparse import hstack
from sklearn.metrics import accuracy_score

# Combine features
X = hstack([tfidf_matrix, graph_features])

# Train LightGBM
clf = lgb.LGBMClassifier(
    n_estimators=200,
    learning_rate=0.05,
    num_leaves=31
)
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f}")
# Output: Accuracy: 0.9871

4. Fairness & Bias Audit

Analyzed recommendation bias across demographic groups. Found that mainstream algorithms systematically underrepresent international cinema and female-directed films. Our thematic approach reduces this bias.

from fairlearn.metrics import demographic_parity_difference

# Audit bias
bias_gap = demographic_parity_difference(
    y_test, baseline_preds, 
    sensitive_features=df_test['director_gender']
)

# Apply fairness constraints
from fairlearn.reductions import ExponentiatedGradient
fair_clf = ExponentiatedGradient(
    estimator=lgb.LGBMClassifier(),
    constraints=DemographicParity()
)
fair_clf.fit(X_train, y_train, 
             sensitive_features=df_train['director_gender'])

print(f"Baseline Bias: {bias_gap:.4f}")
# Output: Baseline Bias: -0.142

Interactive Demo

Search for a film theme or director to discover recommendations

Building the Recommendation Engine

5. Collaborative Filtering with Embeddings

Beyond content-based features, I built a collaborative filtering layer using matrix factorization. This captures latent user preferences by learning low-dimensional embeddings of both films and user viewing patterns.

from sklearn.decomposition import NMF
import numpy as np

# Build user-film interaction matrix
user_film_matrix = pd.pivot_table(
    df_ratings,
    values='rating',
    index='user_id',
    columns='film_id',
    fill_value=0
)

# Non-negative Matrix Factorization
nmf = NMF(n_components=50, random_state=42, max_iter=200)
user_features = nmf.fit_transform(user_film_matrix)
film_features = nmf.components_.T

# Generate recommendations via dot product
def recommend(user_id, n_recs=5):
    user_vec = user_features[user_id]
    scores = film_features @ user_vec
    top_films = np.argsort(scores)[::-1][:n_recs]
    return [film_id_map[i] for i in top_films]

print(f"Top 5 recommendations for user 42: {recommend(42)}")

6. Hybrid Recommendation System

The final system combines three recommendation strategies: content-based (thematic similarity), collaborative filtering (user behavior), and graph-based (crew network influence). Weighted ensemble captures multiple signals.

class HybridRecommender:
    def __init__(self, content_model, collab_model, graph_model):
        self.content = content_model
        self.collab = collab_model
        self.graph = graph_model
    
    def recommend(self, user_id, film_id, n=5):
        # Get recommendations from each source
        content_recs = self.content.similar_films(film_id)  # Thematic match
        collab_recs = self.collab.recommend(user_id)        # User behavior
        graph_recs = self.graph.recommend(film_id)          # Crew connections
        
        # Weighted ensemble (tuned via cross-validation)
        combined = (
            0.4 * content_recs.scores +
            0.35 * collab_recs.scores +
            0.25 * graph_recs.scores
        )
        
        top_recommendations = combined.nlargest(n)
        return top_recommendations

# Usage
hybrid = HybridRecommender(content_model, collab_model, graph_model)
recommendations = hybrid.recommend(user_id=123, film_id='tt0068646')
print(f"Recommended: {recommendations}")

7. Ranking & Personalization

After generating initial recommendations, a learning-to-rank model refines the order based on user context: watch history, genre preferences, discovery ratio (avoid over-recommending popular films), and cultural diversity signals.

from sklearn.ensemble import GradientBoostingRanker

# Ranking features: diversity, popularity deviation, crew overlap
ranking_features = pd.DataFrame({
    'diversity_score': [calc_diversity(f) for f in candidates],
    'popularity_deviation': [abs(rating - user_avg) for rating in ratings],
    'crew_familiarity': [crew_overlap(user_history, film) for film in candidates],
    'genre_alignment': [genre_sim(user_prefs, film) for film in candidates]
})

# Train ranking model
ranker = GradientBoostingRanker(n_estimators=100, random_state=42)
ranker.fit(ranking_features, pairwise_labels)

# Re-rank recommendations
final_scores = ranker.predict(ranking_features)
ranked_recommendations = final_scores.argsort()[::-1]

print(f"Final ranked list: {ranked_recommendations}")

Impact & Applications

Key Findings

  • 12 Thematic Clusters: Revenge narratives, female-centered thrillers, atmospheric horror, dialogue-heavy dramas
  • Popularity Bias: Mainstream algorithms surface same ~200 films repeatedly; we democratize discovery
  • Director as Hub: Certain directors (Spielberg, Kurosawa, PTA) connect multiple thematic clusters globally
  • International Invisibility: 60% of films non-English, yet mainstream platforms recommend them <2%

Real-World Applications

  • Streaming Platforms: Reduce algorithmic bias, surface underrated international cinema
  • Film Festivals: Use cluster analysis for thematic programming
  • Academic Research: Evidence of how algorithms shape cultural visibility

Building Fair & Transparent Algorithms

Let's talk about recommendation systems that don't just work — they work fairly for everyone.