FAQ Generation via Clustering
This directory contains experiments for generating FAQs by clustering student questions and computing representative centroids.
Overview
Goal: Automatically generate FAQ structures by:
- Clustering similar student questions using K-Means and Fuzzy C-Means
- Computing cluster centroids representing typical question patterns
- Extracting representative questions closest to each centroid
- Organizing questions into FAQ categories
Applications:
- Question classification: Route new questions to appropriate handlers
- Difficulty estimation: Predict question difficulty using labeled centroids
- Automated FAQ generation: Create FAQ entries for new subjects/topics
- Chatbot integration: Use centroids for real-time question classification
Experiments
CIMA Dataset
Notebook: faq_clustering_cima.ipynb
Dataset: Italian language tutoring dialogues
Status: Ready for execution
Clustering Methods:
- K-Means: Hard clustering with K-Means++ initialization
- Fuzzy C-Means (FCM): Soft clustering with membership degrees
Expected FAQ Categories:
- Vocabulary translation requests (“What is X in Italian?”)
- Grammar rule clarifications (“How do I say X?”)
- Verification questions (“Is X correct?”)
- Metacognitive help-seeking (“What’s the answer?”)
Future Datasets
- Name new notebooks faq_clustering_<dataset_name>.ipynb
- Follow the same notebook structure
- Ensure consistent naming for result files: {dataset}_faq_YYYYMMDD_HHMMSS
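A small helper can build prefixes that follow this convention (the function name `make_run_prefix` is illustrative, not part of the codebase):

```python
from datetime import datetime

def make_run_prefix(dataset: str) -> str:
    """Build a result-file prefix following {dataset}_faq_YYYYMMDD_HHMMSS."""
    return f"{dataset}_faq_{datetime.now().strftime('%Y%m%d_%H%M%S')}"

# e.g. make_run_prefix("cima") might return "cima_faq_20260204_153012"
```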
Mathematical Foundation
K-Means Clustering
Minimizes within-cluster sum of squared errors (SSE):
\[SSE(S, C) = \sum_{i=1}^{k} \sum_{x \in S_i} ||x - c_i||^2\]
Where:
- $S_i$: Set of points in cluster $i$
- $c_i$: Centroid of cluster $i$
- $k$: Number of clusters
Initialization: K-Means++ for better convergence
Implementation: math_investigation/clustering/kmeans.py
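The SSE objective above can be computed directly from data, labels, and centroids; a minimal NumPy sketch (separate from the from-scratch kmeans.py implementation):

```python
import numpy as np

def sse(X: np.ndarray, labels: np.ndarray, centroids: np.ndarray) -> float:
    """Within-cluster sum of squared errors: sum over clusters of ||x - c_i||^2."""
    return float(sum(
        np.sum((X[labels == i] - c) ** 2)
        for i, c in enumerate(centroids)
    ))

# Toy example: two well-separated clusters of two points each
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
print(sse(X, labels, centroids))  # → 4.0
```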
Fuzzy C-Means (FCM)
Minimizes fuzzy objective function:
\[J_m(U, C) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m ||x_j - c_i||^2\]
Where:
- $u_{ij} \in [0, 1]$: Membership of point $j$ to cluster $i$
- $m > 1$: Fuzziness parameter (typically 1.5-2.0)
- $\sum_{i=1}^{c} u_{ij} = 1$ (sum of memberships equals 1)
Advantage: Soft assignments allow questions to belong to multiple categories
Implementation: math_investigation/clustering/fcm.py
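The standard alternating updates that minimize $J_m$ are: memberships $u_{ij} = 1 / \sum_{k} (d_{ij}/d_{kj})^{2/(m-1)}$ with Euclidean distances $d_{ij}$, then centroids as $u^m$-weighted means. A minimal NumPy sketch of one iteration (illustrative, not the fcm.py implementation):

```python
import numpy as np

def fcm_step(X, centroids, m=2.0, eps=1e-9):
    """One FCM iteration: update memberships u_ij, then centroids."""
    # d[i, j]: Euclidean distance from centroid i to point j (eps avoids div by 0)
    d = np.linalg.norm(X[None, :, :] - centroids[:, None, :], axis=2) + eps
    # u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)); columns of u sum to 1
    power = 2.0 / (m - 1.0)
    u = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** power, axis=1)
    # New centroids: means of the points weighted by u_ij^m
    w = u ** m
    centroids_new = (w @ X) / w.sum(axis=1, keepdims=True)
    return u, centroids_new

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids0 = np.array([[1.0, 1.0], [9.0, 1.0]])
u, centroids1 = fcm_step(X, centroids0)
```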
Evaluation Metrics
Internal Metrics (no ground truth needed)
- Silhouette Score: Measures cluster cohesion and separation (-1 to 1, higher is better)
- Davies-Bouldin Index: Ratio of within-cluster to between-cluster distances (lower is better)
- SSE/Inertia: Within-cluster variance (lower is better, use elbow method)
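As an illustration of how these internal metrics are computed, here is a minimal NumPy sketch of the Davies-Bouldin index (illustrative only; the project's version lives in clustering/metrics.py):

```python
import numpy as np

def davies_bouldin(X, labels, centroids):
    """DB index: mean over clusters of the worst (s_i + s_j) / d(c_i, c_j) ratio."""
    k = len(centroids)
    # s_i: mean distance from each point in cluster i to its centroid
    s = np.array([
        np.mean(np.linalg.norm(X[labels == i] - centroids[i], axis=1))
        for i in range(k)
    ])
    ratios = []
    for i in range(k):
        r = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
             for j in range(k) if j != i]
        ratios.append(max(r))
    return float(np.mean(ratios))

# Tight, well-separated clusters yield a low (good) score
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
print(davies_bouldin(X, labels, centroids))  # → 0.2
```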
FCM-Specific Metrics
- Fuzzy Partition Coefficient (FPC): Measures crispness of the partition (ranges from $1/c$ to 1, higher is crisper)
- Objective Function $J_m$: Total weighted distance to centroids
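The FPC is simply the mean squared membership, $\frac{1}{n}\sum_{i}\sum_{j} u_{ij}^2$; a quick sketch (illustrative, separate from clustering/metrics.py):

```python
import numpy as np

def fuzzy_partition_coefficient(u: np.ndarray) -> float:
    """FPC = (1/n) * sum_ij u_ij^2 for a membership matrix u of shape (c, n)."""
    return float(np.sum(u ** 2) / u.shape[1])

# A crisp partition gives FPC = 1; a maximally fuzzy one with c clusters gives 1/c
u_crisp = np.array([[1.0, 0.0], [0.0, 1.0]])
u_fuzzy = np.full((2, 2), 0.5)
print(fuzzy_partition_coefficient(u_crisp), fuzzy_partition_coefficient(u_fuzzy))  # → 1.0 0.5
```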
Usage
```bash
# Activate environment
source .venv/bin/activate

# Run Jupyter notebook
jupyter notebook faq_clustering_cima.ipynb

# Or use the VS Code Jupyter extension
code faq_clustering_cima.ipynb
```
Output Structure
Results are saved to math_investigation/results/faq_generation/:
```text
results/faq_generation/
├── cima_faq_20260204_HHMMSS_centroids_k5.npy   # Cluster centroids
├── cima_faq_20260204_HHMMSS_labels_k5.npy      # Cluster assignments
├── cima_faq_20260204_HHMMSS_vocabulary.txt     # Feature names
├── cima_faq_20260204_HHMMSS_faq_structure.json # FAQ entries by category
├── cima_faq_20260204_HHMMSS_analysis.json      # Cluster analysis
├── cima_faq_20260204_HHMMSS_question_lengths.png   # Question statistics
├── cima_faq_20260204_HHMMSS_clustering_metrics.png # Comparison plots
└── cima_faq_20260204_HHMMSS_cluster_distribution_k5.png
```
Integration with TFG-Chatbot
1. Load Centroids for Classification
```python
# In chatbot/logic/tools/tools.py
import numpy as np

from math_investigation.nlp.tfidf import TFIDFVectorizer

# Load pre-trained centroids; fit the vectorizer with the same
# vocabulary used during training
centroids = np.load("results/faq_generation/cima_faq_..._centroids_k5.npy")
vectorizer = TFIDFVectorizer()

@tool
def classify_question_difficulty(question: str) -> dict:
    """Classify question difficulty using pre-trained cluster centroids."""
    q_vec = vectorizer.transform([question])
    distances = np.linalg.norm(q_vec - centroids, axis=1)
    cluster_id = int(np.argmin(distances))
    # Map cluster_id to difficulty level (from labeled centroids);
    # extend for the remaining clusters
    difficulty_map = {0: "easy", 1: "medium", 2: "hard"}
    return {
        "cluster_id": cluster_id,
        "difficulty": difficulty_map.get(cluster_id, "unknown"),
        "confidence": float(1.0 / (distances[cluster_id] + 1e-6)),
    }
```
2. Use FAQ Structure for Retrieval
```python
import json

from chatbot.db.connection import get_database

# Load FAQ structure into MongoDB or use in RAG
with open("results/faq_generation/cima_faq_..._faq_structure.json") as f:
    faq_data = json.load(f)

# Store in MongoDB
db = get_database()
db.faqs.insert_one(faq_data)

# Retrieve similar FAQs in a RAG tool
@tool
def search_faq(question: str) -> list:
    """Search FAQ entries similar to the question."""
    # Classify the question to a cluster
    cluster_id = classify_question_difficulty(question)["cluster_id"]
    # Retrieve the matching FAQ category
    faq = db.faqs.find_one({"faq_categories.category_id": cluster_id})
    return faq["faq_categories"][cluster_id]["faq_entries"]
```
3. Train Difficulty Centroids
See scripts/train_difficulty_centroids.py for labeling and training difficulty classifiers using these centroids.
Comparison: Topic Modeling vs. FAQ Clustering
| Aspect | Topic Modeling (NMF) | FAQ Clustering (K-Means/FCM) |
|---|---|---|
| Goal | Discover latent themes in documents | Group similar questions for classification |
| Output | Topic-word distributions | Cluster centroids and assignments |
| Use Case | Content organization, document understanding | Question routing, difficulty prediction |
| Method | Matrix factorization ($V \approx WH$) | Distance-based clustering |
| Interpretability | Topics = word distributions | Clusters = representative questions |
Both approaches complement each other:
- Topic modeling: Understand what students talk about
- Clustering: Classify how students ask questions
Dependencies
All implementations are from-scratch in math_investigation/:
- nlp/tfidf.py - TF-IDF vectorizer
- clustering/kmeans.py - K-Means with K-Means++ initialization
- clustering/fcm.py - Fuzzy C-Means
- clustering/metrics.py - Silhouette, Davies-Bouldin, ARI, NMI
References
- TFG Mathematics thesis: Chapter on clustering algorithms
- K-Means paper: MacQueen (1967)
- FCM paper: Dunn (1973), Bezdek (1981)
- Integration: scripts/train_difficulty_centroids.py