FAQ Generation via Clustering

This directory contains experiments for generating FAQs by clustering student questions and computing representative centroids.

Overview

Goal: Automatically generate FAQ structures by:

  1. Clustering similar student questions using K-Means and Fuzzy C-Means
  2. Computing cluster centroids representing typical question patterns
  3. Extracting representative questions closest to each centroid
  4. Organizing questions into FAQ categories
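Step 3 above (picking a representative question per cluster) can be sketched as follows; `X`, `labels`, and `centroids` are assumed to come from a prior vectorization and clustering run, and the helper name is illustrative:

```python
import numpy as np

def representative_questions(X, labels, centroids, questions):
    """For each cluster, pick the question whose vector lies closest to the centroid."""
    reps = {}
    for k, c in enumerate(centroids):
        idx = np.where(labels == k)[0]          # indices of cluster k's members
        if idx.size == 0:
            continue                            # skip empty clusters
        dists = np.linalg.norm(X[idx] - c, axis=1)
        reps[k] = questions[idx[np.argmin(dists)]]
    return reps
```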

Applications:

  • Question classification: Route new questions to appropriate handlers
  • Difficulty estimation: Predict question difficulty using labeled centroids
  • Automated FAQ generation: Create FAQ entries for new subjects/topics
  • Chatbot integration: Use centroids for real-time question classification

Experiments

CIMA Dataset

Notebook: faq_clustering_cima.ipynb
Dataset: Italian language tutoring dialogues
Status: Ready for execution

Clustering Methods:

  • K-Means: Hard clustering with K-Means++ initialization
  • Fuzzy C-Means (FCM): Soft clustering with membership degrees

Expected FAQ Categories:

  1. Vocabulary translation requests (“What is X in Italian?”)
  2. Grammar rule clarifications (“How do I say X?”)
  3. Verification questions (“Is X correct?”)
  4. Metacognitive help-seeking (“What’s the answer?”)

Future Datasets

  • faq_clustering_<dataset_name>.ipynb - Follow the same structure
  • Ensure consistent naming: {dataset}_faq_YYYYMMDD_HHMMSS
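A run identifier following this convention can be generated as in this minimal sketch (`dataset` is a placeholder):

```python
from datetime import datetime

# Build a run identifier following the {dataset}_faq_YYYYMMDD_HHMMSS convention
dataset = "cima"
run_id = f"{dataset}_faq_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
```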

Mathematical Foundation

K-Means Clustering

Minimizes within-cluster sum of squared errors (SSE):

\[SSE(S, C) = \sum_{i=1}^{k} \sum_{x \in S_i} ||x - c_i||^2\]

Where:

  • $S_i$: Set of points in cluster $i$
  • $c_i$: Centroid of cluster $i$
  • $k$: Number of clusters

Initialization: K-Means++ for better convergence
Implementation: math_investigation/clustering/kmeans.py
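One Lloyd iteration of this objective can be sketched as below; this is a simplified illustration, not the repository's kmeans.py:

```python
import numpy as np

def kmeans_step(X, centroids):
    """One Lloyd iteration: assign points to the nearest centroid, then recompute means."""
    # d2: (n, k) matrix of squared Euclidean distances to each centroid
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    # SSE as defined above: sum of squared distances to assigned centroids
    sse = d2[np.arange(len(X)), labels].sum()
    return labels, new_centroids, sse
```

Iterating this step until the labels stop changing minimizes the SSE locally; K-Means++ only changes how the initial centroids are chosen.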

Fuzzy C-Means (FCM)

Minimizes fuzzy objective function:

\[J_m(U, C) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m ||x_j - c_i||^2\]

Where:

  • $u_{ij} \in [0, 1]$: Membership of point $j$ to cluster $i$
  • $m > 1$: Fuzziness parameter (typically 1.5-2.0)
  • $\sum_{i=1}^{c} u_{ij} = 1$ (sum of memberships equals 1)

Advantage: Soft assignments allow questions to belong to multiple categories
Implementation: math_investigation/clustering/fcm.py
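The membership update that minimizes $J_m$ for fixed centroids is $u_{ij} = 1 / \sum_{l=1}^{c} (d_{ij}/d_{lj})^{2/(m-1)}$. A minimal sketch (returning memberships as an $(n, c)$ matrix, i.e. the transpose of the $u_{ij}$ indexing above; not the repository's fcm.py):

```python
import numpy as np

def fcm_memberships(X, centroids, m=2.0):
    """Membership update for fixed centroids: u ~ d^(-2/(m-1)), normalized per point."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n, c)
    d = np.fmax(d, 1e-12)                    # avoid division by zero at a centroid
    inv = d ** (-2.0 / (m - 1.0))
    U = inv / inv.sum(axis=1, keepdims=True)  # each row sums to 1
    return U
```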

Evaluation Metrics

Internal Metrics (no ground truth needed)

  • Silhouette Score: Measures cluster cohesion and separation (-1 to 1, higher is better)
  • Davies-Bouldin Index: Ratio of within-cluster to between-cluster distances (lower is better)
  • SSE/Inertia: Within-cluster variance (lower is better, use elbow method)
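As a sanity check on these metrics, a minimal from-scratch silhouette computation (the repository's clustering/metrics.py presumably has its own implementation):

```python
import numpy as np

def silhouette_score_simple(X, labels):
    """Mean silhouette: s(x) = (b - a) / max(a, b), where a is the mean
    intra-cluster distance and b the lowest mean distance to another cluster."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    scores = []
    for i in range(n):
        same = (labels == labels[i])
        if same.sum() <= 1:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        a = D[i, same].sum() / (same.sum() - 1)  # exclude D[i, i] = 0
        b = min(D[i, labels == k].mean() for k in set(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```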

FCM-Specific Metrics

  • Fuzzy Partition Coefficient (FPC): Measures crispness of the partition (ranges from 1/c to 1; higher is crisper)
  • Objective Function $J_m$: Total weighted distance to centroids
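The FPC reduces to a one-liner over the membership matrix; a sketch assuming the $(n, c)$ layout with rows summing to 1:

```python
import numpy as np

def fuzzy_partition_coefficient(U):
    """FPC = (1/n) * sum of squared memberships: 1.0 for a crisp partition,
    1/c when every point is spread uniformly over all c clusters."""
    n = U.shape[0]
    return float((U ** 2).sum() / n)
```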

Usage

# Activate environment
source .venv/bin/activate

# Run Jupyter notebook
jupyter notebook faq_clustering_cima.ipynb

# Or use VS Code Jupyter extension
code faq_clustering_cima.ipynb

Output Structure

Results are saved to math_investigation/results/faq_generation/:

results/faq_generation/
├── cima_faq_20260204_HHMMSS_centroids_k5.npy      # Cluster centroids
├── cima_faq_20260204_HHMMSS_labels_k5.npy         # Cluster assignments
├── cima_faq_20260204_HHMMSS_vocabulary.txt        # Feature names
├── cima_faq_20260204_HHMMSS_faq_structure.json    # FAQ entries by category
├── cima_faq_20260204_HHMMSS_analysis.json         # Cluster analysis
├── cima_faq_20260204_HHMMSS_question_lengths.png  # Question statistics
├── cima_faq_20260204_HHMMSS_clustering_metrics.png # Comparison plots
└── cima_faq_20260204_HHMMSS_cluster_distribution_k5.png

Integration with TFG-Chatbot

1. Load Centroids for Classification

# In chatbot/logic/tools/tools.py
import numpy as np
from langchain_core.tools import tool  # assuming LangChain's @tool decorator
from math_investigation.nlp.tfidf import TFIDFVectorizer

# Load pre-trained centroids and vectorizer
centroids = np.load("results/faq_generation/cima_faq_..._centroids_k5.npy")
vectorizer = TFIDFVectorizer()
# Fit the vectorizer with the same vocabulary used during training

@tool
def classify_question_difficulty(question: str) -> dict:
    """Classify question difficulty using pre-trained cluster centroids."""
    q_vec = vectorizer.transform([question])
    distances = np.linalg.norm(q_vec - centroids, axis=1)
    cluster_id = int(np.argmin(distances))

    # Map cluster_id to difficulty level (from labeled centroids);
    # placeholder mapping -- extend to cover all k clusters
    difficulty_map = {0: "easy", 1: "medium", 2: "hard"}

    return {
        "cluster_id": cluster_id,
        "difficulty": difficulty_map.get(cluster_id, "unknown"),
        "confidence": float(1.0 / (distances[cluster_id] + 1e-6)),
    }

2. Use FAQ Structure for Retrieval

# Load FAQ structure into MongoDB or use in RAG
import json

with open("results/faq_generation/cima_faq_..._faq_structure.json") as f:
    faq_data = json.load(f)

# Store in MongoDB
from chatbot.db.connection import get_database
db = get_database()
db.faqs.insert_one(faq_data)

# Retrieve similar FAQs in RAG tool
@tool
def search_faq(question: str) -> list:
    """Search FAQ entries similar to the question."""
    # Classify the question into a cluster (call the underlying function,
    # not the tool wrapper, if the decorator changes the call signature)
    cluster_id = classify_question_difficulty(question)["cluster_id"]

    # Retrieve the matching FAQ category
    faq = db.faqs.find_one({"faq_categories.category_id": cluster_id})
    return faq["faq_categories"][cluster_id]["faq_entries"]

3. Train Difficulty Centroids

See scripts/train_difficulty_centroids.py for labeling and training difficulty classifiers using these centroids.

Comparison: Topic Modeling vs. FAQ Clustering

| Aspect           | Topic Modeling (NMF)                        | FAQ Clustering (K-Means/FCM)               |
|------------------|---------------------------------------------|--------------------------------------------|
| Goal             | Discover latent themes in documents         | Group similar questions for classification |
| Output           | Topic-word distributions                    | Cluster centroids and assignments          |
| Use case         | Content organization, document understanding | Question routing, difficulty prediction    |
| Method           | Matrix factorization ($V \approx WH$)       | Distance-based clustering                  |
| Interpretability | Topics = word distributions                 | Clusters = representative questions        |

Both approaches complement each other:

  • Topic modeling: Understand what students talk about
  • Clustering: Classify how students ask questions

Dependencies

All implementations are from-scratch in math_investigation/:

  • nlp/tfidf.py - TF-IDF vectorizer
  • clustering/kmeans.py - K-Means with K-Means++
  • clustering/fcm.py - Fuzzy C-Means
  • clustering/metrics.py - Silhouette, Davies-Bouldin, ARI, NMI

References

  • TFG Mathematics thesis: Chapter on clustering algorithms
  • K-Means paper: MacQueen (1967)
  • FCM paper: Dunn (1973), Bezdek (1981)
  • Integration: scripts/train_difficulty_centroids.py