Topic Modeling Coherence Metrics
Overview
This document provides a technical and scientific summary of the topic coherence metrics implemented in the math_investigation/topic_modeling/coherence.py module. Topic coherence metrics evaluate the semantic quality of topics discovered by Non-negative Matrix Factorization (NMF) and other topic modeling algorithms.
Unlike clustering metrics that focus on partition quality, coherence metrics assess whether the top words in a topic are semantically related and interpretable by humans. These metrics are crucial for:
- Selecting the optimal number of topics
- Comparing different topic modeling algorithms
- Tuning hyperparameters (e.g., regularization)
- Evaluating topic interpretability
What is Topic Coherence?
Definition: Topic coherence measures the degree of semantic similarity between high-scoring words in a topic.
Intuition: A coherent topic should contain words that frequently co-occur in documents and are semantically related. For example:
- High coherence: {car, vehicle, drive, road, engine}
- Low coherence: {car, apple, theory, blue, database}
Implemented Metrics
1. UCI Coherence
Mathematical Definition
UCI (University of California, Irvine) coherence is based on Pointwise Mutual Information (PMI) computed over sliding windows:
\[C_{\text{UCI}} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \log \frac{P(w_i, w_j) + \epsilon}{P(w_i) \cdot P(w_j)}\]

where:
- $n$ = number of top words in the topic (typically 10-20)
- $w_i, w_j$ = words in the topic
- $P(w_i)$ = probability of word $w_i$ appearing in a sliding window
- $P(w_i, w_j)$ = probability of words $w_i$ and $w_j$ co-occurring in a sliding window
- $\epsilon$ = smoothing factor (typically $10^{-10}$) to avoid log(0)
Sliding Window Approach
The probabilities are estimated using a sliding window over documents:
- Window size: Typically 10 tokens (configurable parameter)
- Count word occurrences: Each word presence in a window is counted once
- Count co-occurrences: Each word pair co-occurrence in a window is counted once
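The window counting above can be sketched as follows. This is a minimal illustration, not the module's actual implementation; `window_probabilities` is a hypothetical helper, and documents are assumed to be pre-tokenized lists of words.

```python
from collections import Counter
from itertools import combinations

def window_probabilities(documents, window_size=10):
    """Estimate word and word-pair probabilities from sliding windows.

    Hypothetical helper: each word (or pair) is counted at most once
    per window, matching the counting rules described above.
    """
    word_counts, pair_counts, n_windows = Counter(), Counter(), 0
    for tokens in documents:
        # A document shorter than the window contributes one window.
        for start in range(max(1, len(tokens) - window_size + 1)):
            window = set(tokens[start:start + window_size])
            n_windows += 1
            for w in window:
                word_counts[w] += 1
            for a, b in combinations(sorted(window), 2):
                pair_counts[(a, b)] += 1
    p_word = {w: c / n_windows for w, c in word_counts.items()}
    p_pair = {p: c / n_windows for p, c in pair_counts.items()}
    return p_word, p_pair

docs = [["car", "engine", "road"], ["car", "road", "drive"]]
p_word, p_pair = window_probabilities(docs, window_size=10)
# Each 3-token document fits in a single window, so 2 windows total.
print(p_word["car"])             # 1.0 (appears in both windows)
print(p_pair[("car", "road")])   # 1.0 (co-occurs in both windows)
```

Pairs are stored with sorted keys so `(car, road)` and `(road, car)` count as the same co-occurrence.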
Interpretation
- Range: $(-\infty, +\infty)$ in theory, typically $[-15, 15]$ in practice
- Higher is better: Positive values indicate words co-occur more than expected by chance
- Negative values: Words co-occur less than expected (poor coherence)
- Zero: Independent co-occurrence (random association)
PMI Intuition
PMI measures the association strength between word pairs:
\[\text{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_i) \cdot P(w_j)}\]

- Positive PMI: Words co-occur more than expected → related
- Zero PMI: Words co-occur as expected → independent
- Negative PMI: Words co-occur less than expected → unrelated
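The three cases above can be checked numerically. The probabilities below are made-up toy values; the `pmi` helper applies the same $\epsilon$ smoothing used in the UCI formula.

```python
from math import log

def pmi(p_joint, p_i, p_j, eps=1e-10):
    # PMI with the epsilon smoothing used by UCI coherence.
    return log((p_joint + eps) / (p_i * p_j))

# Related words: co-occur 5x more often than independence predicts.
print(round(pmi(0.05, 0.1, 0.1), 3))   # 1.609 (positive -> associated)
# Independent words: joint probability equals the product of marginals.
print(round(pmi(0.01, 0.1, 0.1), 3))   # 0.0
# Avoided words: co-occur 5x less often than chance.
print(round(pmi(0.002, 0.1, 0.1), 3))  # -1.609 (negative -> unrelated)
```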
When to Use
- Best for: External corpus validation (using original documents)
- Advantages:
- Sensitive to local word associations
- Reflects human perception of coherence
- Works with raw text (no preprocessing artifacts)
- Use cases:
- Selecting optimal number of topics
- Comparing topic models on same corpus
- Evaluating topic interpretability
Computational Complexity
- Window extraction: $O(D \cdot L)$ where $D$ = documents, $L$ = average document length
- Coherence computation: $O(T \cdot n^2)$ where $T$ = topics, $n$ = top words per topic
- Total: $O(D \cdot L + T \cdot n^2)$
2. UMass Coherence
Mathematical Definition
UMass (University of Massachusetts) coherence uses document-level co-occurrence with conditional probabilities:
\[C_{\text{UMass}} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \log \frac{D(w_i, w_j) + 1}{D(w_i)}\]

where:
- $D(w_i)$ = number of documents containing word $w_i$
- $D(w_i, w_j)$ = number of documents containing both words $w_i$ and $w_j$
- Smoothing factor +1 prevents log(0)
This can be interpreted as:
\[C_{\text{UMass}} = \frac{2}{n(n-1)} \sum_{i<j} \log P(w_j \mid w_i)\]

where $P(w_j \mid w_i)$ is the conditional probability of seeing $w_j$ given that $w_i$ appears.
Document Co-occurrence Approach
Unlike UCI, UMass uses document-level co-occurrence:
- Binary presence: Check if word appears in document (frequency doesn’t matter)
- Document frequency: Count documents containing each word
- Co-document frequency: Count documents containing both words
where $D(w)$ represents document frequency (number of documents containing the word).
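The document-level counting reduces to a pair score per word pair. Below is a minimal sketch over documents represented as sets of words; `umass_pair_score` is a hypothetical helper, not the module's actual API.

```python
from math import log

def umass_pair_score(docs, w_i, w_j):
    """log((D(wi, wj) + 1) / D(wi)) from binary document presence.

    `docs` is a list of word sets; frequency within a document is
    ignored, as in UMass coherence.
    """
    d_i = sum(1 for d in docs if w_i in d)                 # D(wi)
    d_ij = sum(1 for d in docs if w_i in d and w_j in d)   # D(wi, wj)
    return log((d_ij + 1) / d_i)

docs = [{"car", "engine"}, {"car", "road"}, {"road", "drive"}]
# D(car) = 2, D(car, engine) = 1 -> log(2/2) = 0.0
print(umass_pair_score(docs, "car", "engine"))  # 0.0
# D(car, drive) = 0 -> log(1/2), about -0.693
print(round(umass_pair_score(docs, "car", "drive"), 3))  # -0.693
```

The full topic score averages these pair scores over all top-word pairs, as in the formula above.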
Interpretation
- Range: $(-\infty, 0]$ typically
- Values closer to 0 are better: Less negative = higher coherence
- Negative values: Reflect a conditional probability below 1 (always the case unless co-occurrence is perfect)
- Very negative values (< -10): Poor topic coherence
Why Negative Values?
Since $D(w_i, w_j) \leq D(w_i)$, we have:
\[\frac{D(w_i, w_j) + 1}{D(w_i)} \leq \frac{D(w_i) + 1}{D(w_i)} \approx 1\]

Thus $\log(\cdot) \leq 0$ in most cases. The metric measures how far below 1 this conditional probability falls.
When to Use
- Best for: Quick evaluation using term-document matrix
- Advantages:
- Computationally efficient (no sliding window)
- Works directly with document-term matrix
- Stable with small corpora
- Correlates well with human judgments
- Use cases:
- Fast topic quality assessment
- Grid search over hyperparameters
- Large-scale topic modeling
Computational Complexity
- Binary matrix conversion: $O(D \cdot V)$ where $V$ = vocabulary size
- Coherence computation: $O(T \cdot n^2)$
- Total: $O(D \cdot V + T \cdot n^2)$
- Generally faster than UCI (no window sliding)
Comparison: UCI vs UMass
| Aspect | UCI Coherence | UMass Coherence |
|---|---|---|
| Co-occurrence level | Sliding window (local) | Document-level (global) |
| Value range | $(-\infty, +\infty)$, often $[-15, 15]$ | $(-\infty, 0]$, often $[-20, 0]$ |
| Interpretation | Higher = better | Closer to 0 = better |
| Requires | Raw documents + tokenization | Document-term matrix |
| Computation | Slower (window sliding) | Faster (matrix operations) |
| Sensitivity | Local context, word ordering | Global co-occurrence patterns |
| Corpus size | Works with small corpora | Needs reasonable document counts |
| Human correlation | High (0.6-0.8) | High (0.6-0.7) |
Theoretical Foundations
Pointwise Mutual Information (PMI)
PMI is an information-theoretic measure of association:
\[\text{PMI}(x, y) = \log \frac{P(x, y)}{P(x) \cdot P(y)} = \log \frac{P(x|y)}{P(x)} = \log \frac{P(y|x)}{P(y)}\]

Interpretation:
- Measures how much more (or less) likely $x$ and $y$ are to co-occur than if they were independent
- Related to mutual information: $I(X;Y) = \sum \sum P(x,y) \cdot \text{PMI}(x,y)$
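As a quick sanity check of that identity, summing PMI weighted by the joint probability over a toy 2×2 distribution recovers the mutual information. The distribution below is made up for illustration.

```python
from math import log

# Joint distribution of two correlated binary variables.
p_joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {0: 0.5, 1: 0.5}  # marginal of X
p_y = {0: 0.5, 1: 0.5}  # marginal of Y

# Mutual information = expectation of PMI under the joint distribution.
mi = sum(p * log(p / (p_x[x] * p_y[y])) for (x, y), p in p_joint.items())
print(round(mi, 4))  # 0.1927 nats
```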
Conditional Probability Approach
UMass coherence uses conditional probability:
\[P(w_j \mid w_i) = \frac{P(w_i, w_j)}{P(w_i)}\]

This asks: "If I see word $w_i$, what is the probability that I also see $w_j$?"
High conditional probabilities → strong word associations → coherent topics
Usage Guidelines
Selecting Number of Topics
Approach: Compute coherence for different values of $k$ (number of topics)
```python
import numpy as np
from sklearn.decomposition import NMF  # or the project's own NMF implementation

from math_investigation.topic_modeling.coherence import uci_coherence, umass_coherence

# Range of topic counts to test
k_values = [3, 5, 7, 10, 15, 20]
coherence_scores = {'k': [], 'uci': [], 'umass': []}

for k in k_values:
    # Train NMF with k topics
    nmf = NMF(n_components=k)
    W = nmf.fit_transform(X)
    H = nmf.components_

    # Get top words per topic
    topics = extract_top_words(H, feature_names, n_words=10)

    # Compute coherence
    uci_scores = uci_coherence(topics, documents, window_size=10)
    umass_scores = umass_coherence(topics, X, feature_names)

    coherence_scores['k'].append(k)
    coherence_scores['uci'].append(np.mean(list(uci_scores.values())))
    coherence_scores['umass'].append(np.mean(list(umass_scores.values())))

# Select k with the highest mean coherence
# (for UMass, highest = closest to 0, so argmax still applies)
optimal_k_uci = k_values[np.argmax(coherence_scores['uci'])]
optimal_k_umass = k_values[np.argmax(coherence_scores['umass'])]
```
Plot: Coherence vs. number of topics (look for peak or elbow)
Evaluating Individual Topics
```python
# Per-topic coherence for interpretation
topics = extract_top_words(H, feature_names, n_words=10)
uci_scores = uci_coherence(topics, documents)

for topic_id, score in uci_scores.items():
    print(f"Topic {topic_id}: {score:.3f}")
    print(f"  Top words: {topics[topic_id]}")
    print()

# Filter out low-coherence topics
good_topics = {tid: words for tid, words in topics.items()
               if uci_scores[tid] > threshold}
```
Hyperparameter Tuning
Use coherence metrics for:
- NMF regularization: Test different $\alpha$ and $\beta$ values
- Initialization methods: Compare random, NNDSVD, etc.
- Preprocessing: Compare different stopword lists, min_df/max_df
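A generic grid-search loop over such hyperparameters might look like the sketch below. `train_fn` and `score_fn` are placeholders for the project's NMF training and coherence calls; the toy stand-ins at the bottom exist only so the sketch runs end to end.

```python
from itertools import product

def grid_search(train_fn, score_fn, grid):
    """Pick the hyperparameter combination with the highest mean coherence.

    Hypothetical helper: `train_fn(params)` returns a fitted model's
    topics; `score_fn(topics)` returns a {topic_id: coherence} dict.
    """
    best_params, best_score = None, float("-inf")
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        scores = score_fn(train_fn(params))
        mean_score = sum(scores.values()) / len(scores)
        # Works for both UCI (higher = better) and UMass (closer to 0 = better).
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy stand-ins so the sketch is runnable without a trained model:
grid = {"k": [3, 5], "alpha": [0.0, 0.1]}
fake_scores = {(3, 0.0): -6.0, (3, 0.1): -5.0, (5, 0.0): -4.0, (5, 0.1): -7.0}
best, score = grid_search(lambda p: (p["k"], p["alpha"]),
                          lambda key: {0: fake_scores[key]},
                          grid)
print(best, score)  # {'k': 5, 'alpha': 0.0} -4.0
```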
Implementation Details
Module: math_investigation/topic_modeling/coherence.py
Design Principles
- From-scratch implementation: No scikit-learn or Gensim dependencies
- Educational focus: Clear, readable code with explicit formulas
- Efficient NumPy operations: Vectorized computations where possible
- Type hints and documentation: Comprehensive docstrings
Key Functions
```python
def uci_coherence(
    topics: dict[int, list[str]],
    documents: list[str],
    window_size: int = 10,
) -> dict[int, float]:
    """
    Args:
        topics: {topic_idx: [top_words]}
        documents: Original document texts
        window_size: Sliding window size (default: 10)

    Returns:
        {topic_idx: coherence_score}
    """


def umass_coherence(
    topics: dict[int, list[str]],
    doc_term_matrix: np.ndarray,
    feature_names: list[str],
) -> dict[int, float]:
    """
    Args:
        topics: {topic_idx: [top_words]}
        doc_term_matrix: Document-term matrix (will be binarized)
        feature_names: Vocabulary in order

    Returns:
        {topic_idx: coherence_score}
    """
```
Preprocessing Steps
Both metrics include:
- Tokenization: Lowercase, remove punctuation, extract words
- Stopword removal: Filter common words using the STOPWORDS set
- Minimum word length: 2+ characters (configurable)
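The preprocessing steps above amount to a tokenizer along these lines. This is a sketch, not the module's code; the stopword set here is a tiny stand-in for the real STOPWORDS set.

```python
import re

STOPWORDS = {"the", "a", "is", "of"}  # stand-in for the module's STOPWORDS set

def tokenize(text, min_len=2):
    """Lowercase, strip punctuation, drop stopwords and short tokens.

    Hypothetical helper mirroring the preprocessing steps listed above.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in STOPWORDS]

print(tokenize("The engine of a car is loud!"))  # ['engine', 'car', 'loud']
```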
Optimization Considerations
For large corpora:
- UCI: Consider sampling documents or limiting window count
- UMass: Sparse matrix operations for efficiency
- Caching: Store co-occurrence counts if computing multiple coherence values
Validation and Benchmarks
Human Correlation Studies
Research shows coherence metrics correlate with human topic interpretability:
| Metric | Correlation with Human Judgments |
|---|---|
| UCI Coherence | $r = 0.65 - 0.80$ |
| UMass Coherence | $r = 0.60 - 0.75$ |
| C_V (not implemented) | $r = 0.70 - 0.85$ |
Source: Röder et al. (2015)
Expected Value Ranges
Typical coherence values for real-world topics:
| Quality | UCI Coherence | UMass Coherence |
|---|---|---|
| Excellent | > 5 | > -2 |
| Good | 2 to 5 | -5 to -2 |
| Moderate | 0 to 2 | -10 to -5 |
| Poor | < 0 | < -10 |
Note: These are approximate guidelines; actual ranges depend on corpus characteristics
Limitations and Considerations
UCI Coherence
Limitations:
- Sensitive to window size parameter
- Computationally expensive for large corpora
- Requires well-preprocessed text
- May be affected by document length distribution
Best practices:
- Use window_size=10 as default (validated in literature)
- Ensure consistent tokenization
- Remove very frequent and very rare words
- Consider downsampling for very large corpora
UMass Coherence
Limitations:
- Less sensitive to local context
- Assumes binary word presence (ignores frequency)
- Can be unstable with very small corpora
- Biased toward frequent words
Best practices:
- Ensure adequate corpus size (100+ documents minimum)
- Use appropriate min_df/max_df thresholds
- Validate results with multiple metrics
- Consider document length normalization
Advanced Topics
C_V Coherence (Not Implemented)
C_V coherence combines multiple measures and typically achieves highest human correlation:
\[C_V = \text{cosine}(\vec{s}_{\text{NPMI}}, \vec{s}_{\text{context}})\]

where NPMI is normalized PMI and context vectors capture semantic relationships.
Why not implemented: Requires external semantic models (word embeddings) and is more complex. UCI and UMass are sufficient for most TFG applications.
Topic Diversity Metrics
In addition to coherence, topic diversity measures uniqueness:
\[\text{Diversity} = \frac{1}{T} \sum_{t=1}^{T} \frac{|\text{unique words in topic } t|}{n}\]

High diversity + high coherence = optimal topic model.
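Diversity is straightforward to compute from the topics dict used elsewhere in this document. A word is "unique" if it appears in the top-n list of exactly one topic; when every topic has the same n, the formula reduces to unique words divided by total top-word slots, as this sketch implements.

```python
from collections import Counter

def topic_diversity(topics):
    """Fraction of top-word slots occupied by words unique to one topic.

    `topics` maps topic id -> list of top-n words, assuming equal n
    per topic (the common case).
    """
    counts = Counter(w for words in topics.values() for w in words)
    total = sum(len(words) for words in topics.values())
    unique = sum(1 for words in topics.values()
                 for w in words if counts[w] == 1)
    return unique / total

topics = {0: ["car", "road", "engine"], 1: ["fruit", "apple", "road"]}
# "road" appears in both topics, so 4 of 6 slots are unique.
print(topic_diversity(topics))  # 0.666...
```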
Integration with Math Investigation
Question Classification Pipeline
- NMF Topic Modeling: Decompose question embeddings
- Coherence Evaluation: Select optimal number of topics
- Topic Assignment: Map questions to dominant topic
- Difficulty Clustering: Within-topic K-Means/FCM clustering
Example workflow:
```python
# scripts/train_difficulty_centroids.py integration
from math_investigation.topic_modeling.coherence import uci_coherence
from math_investigation.clustering.kmeans import KMeans

# 1. Topic modeling
topics = extract_top_words(H, feature_names, n_words=10)
coherence = uci_coherence(topics, questions)

# 2. Filter high-coherence topics
good_topics = [t for t in range(n_topics) if coherence[t] > threshold]

# 3. Cluster within topics
for topic_id in good_topics:
    topic_questions = W[:, topic_id] > threshold
    kmeans = KMeans(n_clusters=3)  # Easy/Medium/Hard
    labels = kmeans.fit_predict(X[topic_questions])
```
Chatbot Enhancement
Topic coherence helps the chatbot:
- Question routing: High-coherence topics → reliable classification
- Content organization: Group similar questions by topic
- Quality assessment: Monitor topic coherence over time as new questions arrive
References
Academic Sources
- UCI Coherence: Newman, D., Lau, J.H., Grieser, K., & Baldwin, T. (2010). "Automatic evaluation of topic coherence". Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, 100-108.
- UMass Coherence: Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). "Optimizing semantic coherence in topic models". Proceedings of EMNLP 2011, 262-272.
- Coherence Survey: Röder, M., Both, A., & Hinneburg, A. (2015). "Exploring the space of topic coherence measures". Proceedings of WSDM 2015, 399-408.
- PMI Theory: Church, K.W., & Hanks, P. (1990). "Word association norms, mutual information, and lexicography". Computational Linguistics, 16(1), 22-29.
Related Documentation
- NMF Algorithm Implementation (to be created)
- Clustering Validation Metrics (created)
- Topic Modeling Experiments (to be created)
- Question Classification Pipeline (to be created)
Appendix: Formula Summary
UCI Coherence
\[C_{\text{UCI}}(T) = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \text{PMI}(w_i, w_j)\]

\[\text{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j) + \epsilon}{P(w_i) \cdot P(w_j)}\]

\[P(w) = \frac{\text{\# windows containing } w}{\text{total windows}}\]

UMass Coherence

\[C_{\text{UMass}}(T) = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \log \frac{D(w_i, w_j) + 1}{D(w_i)}\]

\[D(w) = \text{\# documents containing } w\]

\[D(w_i, w_j) = \text{\# documents containing both } w_i \text{ and } w_j\]

Last updated: February 5, 2026