Topic Modeling Coherence Metrics

Overview

This document provides a technical and scientific summary of the topic coherence metrics implemented in the math_investigation/topic_modeling/coherence.py module. Topic coherence metrics evaluate the semantic quality of topics discovered by Non-negative Matrix Factorization (NMF) and other topic modeling algorithms.

Unlike clustering metrics that focus on partition quality, coherence metrics assess whether the top words in a topic are semantically related and interpretable by humans. These metrics are crucial for:

  • Selecting the optimal number of topics
  • Comparing different topic modeling algorithms
  • Tuning hyperparameters (e.g., regularization)
  • Evaluating topic interpretability

What is Topic Coherence?

Definition: Topic coherence measures the degree of semantic similarity between high-scoring words in a topic.

Intuition: A coherent topic should contain words that frequently co-occur in documents and are semantically related. For example:

  • High coherence: {car, vehicle, drive, road, engine}
  • Low coherence: {car, apple, theory, blue, database}

Implemented Metrics

1. UCI Coherence

Mathematical Definition

UCI (University of California, Irvine) coherence is based on Pointwise Mutual Information (PMI) computed over sliding windows:

\[C_{\text{UCI}} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \log \frac{P(w_i, w_j) + \epsilon}{P(w_i) \cdot P(w_j)}\]

where:

  • $n$ = number of top words in the topic (typically 10-20)
  • $w_i, w_j$ = words in the topic
  • $P(w_i)$ = probability of word $w_i$ appearing in a sliding window
  • $P(w_i, w_j)$ = probability of words $w_i$ and $w_j$ co-occurring in a sliding window
  • $\epsilon$ = smoothing factor (typically $10^{-10}$) to avoid log(0)

Sliding Window Approach

The probabilities are estimated using a sliding window over documents:

  1. Window size: Typically 10 tokens (configurable parameter)
  2. Count word occurrences: Each word presence in a window is counted once
  3. Count co-occurrences: Each word pair co-occurrence in a window is counted once
\[P(w_i) = \frac{\text{\# windows containing } w_i}{\text{total windows}}\]

\[P(w_i, w_j) = \frac{\text{\# windows containing both } w_i \text{ and } w_j}{\text{total windows}}\]
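
The window-counting procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the module's actual API; the function and variable names are hypothetical:

```python
from itertools import combinations

def window_cooccurrence(tokens_per_doc, vocab, window_size=10):
    """Count, per word and per word pair, the number of sliding windows
    containing it. Each window contributes at most once per word/pair."""
    vocab = set(vocab)
    word_windows = {w: 0 for w in vocab}
    pair_windows = {}
    total_windows = 0
    for tokens in tokens_per_doc:
        # A document shorter than the window yields a single window.
        n_windows = max(1, len(tokens) - window_size + 1)
        for start in range(n_windows):
            window = set(tokens[start:start + window_size]) & vocab
            total_windows += 1
            for w in window:
                word_windows[w] += 1
            # sorted() gives each unordered pair a canonical key
            for pair in combinations(sorted(window), 2):
                pair_windows[pair] = pair_windows.get(pair, 0) + 1
    return word_windows, pair_windows, total_windows
```

Dividing the returned counts by `total_windows` yields the $P(w_i)$ and $P(w_i, w_j)$ estimates defined above.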

Interpretation

  • Range: $(-\infty, +\infty)$ in theory, typically $[-15, 15]$ in practice
  • Higher is better: Positive values indicate words co-occur more than expected by chance
  • Negative values: Words co-occur less than expected (poor coherence)
  • Zero: Independent co-occurrence (random association)

PMI Intuition

PMI measures the association strength between word pairs:

\[\text{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_i) \cdot P(w_j)}\]
  • Positive PMI: Words co-occur more than expected → related
  • Zero PMI: Words co-occur as expected → independent
  • Negative PMI: Words co-occur less than expected → unrelated
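
The three cases above follow directly from the formula; a one-line helper makes them concrete (the `eps` smoothing mirrors the $\epsilon$ in the UCI definition; the function name is illustrative):

```python
import math

def pmi(p_ij, p_i, p_j, eps=1e-10):
    """Smoothed pointwise mutual information: log of the ratio between
    the observed joint probability and the product of the marginals."""
    return math.log((p_ij + eps) / (p_i * p_j))

# Independence: P(w_i, w_j) = P(w_i) * P(w_j)  ->  PMI near 0
# Positive association: joint exceeds the product  ->  PMI > 0
# Negative association: joint below the product   ->  PMI < 0
```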

When to Use

  • Best for: External corpus validation (using original documents)
  • Advantages:
    • Sensitive to local word associations
    • Reflects human perception of coherence
    • Works with raw text (no preprocessing artifacts)
  • Use cases:
    • Selecting optimal number of topics
    • Comparing topic models on same corpus
    • Evaluating topic interpretability

Computational Complexity

  • Window extraction: $O(D \cdot L)$ where $D$ = documents, $L$ = average document length
  • Coherence computation: $O(T \cdot n^2)$ where $T$ = topics, $n$ = top words per topic
  • Total: $O(D \cdot L + T \cdot n^2)$

2. UMass Coherence

Mathematical Definition

UMass (University of Massachusetts) coherence uses document-level co-occurrence with conditional probabilities:

\[C_{\text{UMass}} = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \log \frac{D(w_i, w_j) + 1}{D(w_i)}\]

where:

  • $D(w_i)$ = number of documents containing word $w_i$
  • $D(w_i, w_j)$ = number of documents containing both words $w_i$ and $w_j$
  • Smoothing factor +1 prevents log(0)

This can be interpreted as:

\[C_{\text{UMass}} = \frac{2}{n(n-1)} \sum_{i<j} \log P(w_j | w_i)\]
where $P(w_j \mid w_i)$ is the conditional probability of seeing $w_j$ given that $w_i$ appears.

Document Co-occurrence Approach

Unlike UCI, UMass uses document-level co-occurrence:

  1. Binary presence: Check if word appears in document (frequency doesn’t matter)
  2. Document frequency: Count documents containing each word
  3. Co-document frequency: Count documents containing both words
\[P(w_j | w_i) = \frac{D(w_i, w_j) + 1}{D(w_i)}\]

where $D(w)$ represents document frequency (number of documents containing the word).
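
These document-level counts can be read directly off a binarized document-term matrix. The helper below is a minimal sketch with illustrative names, not the module's actual implementation:

```python
import numpy as np

def document_frequencies(doc_term_matrix, idx_i, idx_j):
    """Document frequency D(w_i) and co-document frequency D(w_i, w_j)
    from a (documents x vocabulary) count matrix, binarized first."""
    binary = (np.asarray(doc_term_matrix) > 0).astype(int)
    d_i = int(binary[:, idx_i].sum())
    d_ij = int((binary[:, idx_i] & binary[:, idx_j]).sum())
    return d_i, d_ij
```

The UMass pair score is then `log((d_ij + 1) / d_i)`.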

Interpretation

  • Range: $(-\infty, 0]$ typically
  • Values closer to 0 are better: Less negative = higher coherence
  • Negative values: Indicate that the conditional probability is < 1 (which holds unless co-occurrence is perfect)
  • Very negative values (< -10): Poor topic coherence

Why Negative Values?

Since $D(w_i, w_j) \leq D(w_i)$, we have:

\[\frac{D(w_i, w_j) + 1}{D(w_i)} \leq \frac{D(w_i) + 1}{D(w_i)} = 1 + \frac{1}{D(w_i)} \approx 1\]

Thus $\log(\cdot) \leq 0$ in most cases. The metric measures how much less than 1 this conditional probability is.
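
For a concrete instance, suppose $w_i$ appears in $D(w_i) = 50$ documents and the pair co-occurs in $D(w_i, w_j) = 10$ of them. The pair then contributes

\[\log \frac{10 + 1}{50} = \log 0.22 \approx -1.51\]

to the UMass average. A weaker association (smaller $D(w_i, w_j)$ relative to $D(w_i)$) pushes this contribution further below zero.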

When to Use

  • Best for: Quick evaluation using term-document matrix
  • Advantages:
    • Computationally efficient (no sliding window)
    • Works directly with document-term matrix
    • Stable with small corpora
    • Correlates well with human judgments
  • Use cases:
    • Fast topic quality assessment
    • Grid search over hyperparameters
    • Large-scale topic modeling

Computational Complexity

  • Binary matrix conversion: $O(D \cdot V)$ where $V$ = vocabulary size
  • Coherence computation: $O(T \cdot n^2)$
  • Total: $O(D \cdot V + T \cdot n^2)$
  • Generally faster than UCI (no window sliding)

Comparison: UCI vs UMass

| Aspect | UCI Coherence | UMass Coherence |
| --- | --- | --- |
| Co-occurrence level | Sliding window (local) | Document-level (global) |
| Value range | $(-\infty, +\infty)$, often $[-15, 15]$ | $(-\infty, 0]$, often $[-20, 0]$ |
| Interpretation | Higher = better | Closer to 0 = better |
| Requires | Raw documents + tokenization | Document-term matrix |
| Computation | Slower (window sliding) | Faster (matrix operations) |
| Sensitivity | Local context, word ordering | Global co-occurrence patterns |
| Corpus size | Works with small corpora | Needs reasonable document counts |
| Human correlation | High (0.6-0.8) | High (0.6-0.7) |

Theoretical Foundations

Pointwise Mutual Information (PMI)

PMI is an information-theoretic measure of association:

\[\text{PMI}(x, y) = \log \frac{P(x, y)}{P(x) \cdot P(y)} = \log \frac{P(x|y)}{P(x)} = \log \frac{P(y|x)}{P(y)}\]

Interpretation:

  • Measures how much more (or less) likely $x$ and $y$ are to co-occur than if they were independent
  • Related to mutual information: $I(X;Y) = \sum \sum P(x,y) \cdot \text{PMI}(x,y)$

Conditional Probability Approach

UMass coherence uses conditional probability:

\[P(w_j | w_i) = \frac{P(w_i, w_j)}{P(w_i)}\]

This measures: “If I see word $w_i$, what’s the probability I’ll also see $w_j$?”

High conditional probabilities → strong word associations → coherent topics


Usage Guidelines

Selecting Number of Topics

Approach: Compute coherence for different values of $k$ (number of topics)

from math_investigation.topic_modeling.coherence import uci_coherence, umass_coherence

import numpy as np
from sklearn.decomposition import NMF  # any NMF with fit_transform/components_ works

# Assumes X (document-term matrix), documents (raw texts), feature_names
# (vocabulary) and extract_top_words are defined by the surrounding pipeline.

# Range of topics to test
k_values = [3, 5, 7, 10, 15, 20]
coherence_scores = {'k': [], 'uci': [], 'umass': []}

for k in k_values:
    # Train NMF with k topics
    nmf = NMF(n_components=k)
    W = nmf.fit_transform(X)
    H = nmf.components_
    
    # Get top words per topic
    topics = extract_top_words(H, feature_names, n_words=10)
    
    # Compute coherence
    uci_scores = uci_coherence(topics, documents, window_size=10)
    umass_scores = umass_coherence(topics, X, feature_names)
    
    coherence_scores['k'].append(k)
    coherence_scores['uci'].append(np.mean(list(uci_scores.values())))
    coherence_scores['umass'].append(np.mean(list(umass_scores.values())))

# Select k with highest coherence (UMass scores are negative,
# so argmax picks the value closest to 0)
optimal_k_uci = k_values[np.argmax(coherence_scores['uci'])]
optimal_k_umass = k_values[np.argmax(coherence_scores['umass'])]

Plot: Coherence vs. number of topics (look for peak or elbow)

Evaluating Individual Topics

# Per-topic coherence for interpretation
topics = extract_top_words(H, feature_names, n_words=10)
uci_scores = uci_coherence(topics, documents)

for topic_id, score in uci_scores.items():
    print(f"Topic {topic_id}: {score:.3f}")
    print(f"  Top words: {topics[topic_id]}")
    print()

# Filter out low-coherence topics (threshold is corpus-dependent;
# 0.0 keeps topics whose words co-occur more than chance predicts)
threshold = 0.0
good_topics = {tid: words for tid, words in topics.items() 
               if uci_scores[tid] > threshold}

Hyperparameter Tuning

Use coherence metrics for:

  • NMF regularization: Test different $\alpha$ and $\beta$ values
  • Initialization methods: Compare random, NNDSVD, etc.
  • Preprocessing: Compare different stopword lists, min_df/max_df
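
The search pattern is the same for all three: score each candidate setting by mean coherence and keep the best. The sketch below shows the pattern generically; the toy score table stands in for the real train-and-score step (training NMF at each setting and averaging umass_coherence over topics), and all names here are illustrative:

```python
def grid_search(param_grid, score_fn):
    """Return the parameter value with the highest score, plus all scores.
    UMass scores are <= 0, so argmax still picks the value closest to 0."""
    scores = {p: score_fn(p) for p in param_grid}
    best = max(scores, key=scores.get)
    return best, scores

# Stand-in scores for illustration: mean UMass coherence per
# regularization strength, as the real pipeline would produce.
toy_scores = {0.0: -6.2, 0.1: -4.8, 0.5: -5.9}
best, scores = grid_search([0.0, 0.1, 0.5], toy_scores.get)
```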

Implementation Details

Module: math_investigation/topic_modeling/coherence.py

Design Principles

  • From-scratch implementation: No scikit-learn or Gensim dependencies
  • Educational focus: Clear, readable code with explicit formulas
  • Efficient NumPy operations: Vectorized computations where possible
  • Type hints and documentation: Comprehensive docstrings

Key Functions

def uci_coherence(
    topics: dict[int, list[str]],
    documents: list[str],
    window_size: int = 10,
) -> dict[int, float]:
    """
    Args:
        topics: {topic_idx: [top_words]}
        documents: Original document texts
        window_size: Sliding window size (default: 10)
    
    Returns:
        {topic_idx: coherence_score}
    """

def umass_coherence(
    topics: dict[int, list[str]],
    doc_term_matrix: np.ndarray,
    feature_names: list[str],
) -> dict[int, float]:
    """
    Args:
        topics: {topic_idx: [top_words]}
        doc_term_matrix: Document-term matrix (will be binarized)
        feature_names: Vocabulary in order
    
    Returns:
        {topic_idx: coherence_score}
    """

Preprocessing Steps

Both metrics include:

  • Tokenization: Lowercase, remove punctuation, extract words
  • Stopword removal: Filter common words using STOPWORDS set
  • Minimum word length: 2+ characters (configurable)
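
A minimal sketch of these steps follows; the module's actual tokenizer and STOPWORDS set may differ, and the stopword list here is an illustrative subset:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # illustrative subset

def tokenize(text, min_length=2):
    """Lowercase, keep alphabetic tokens, then drop stopwords and
    tokens shorter than min_length."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= min_length and w not in STOPWORDS]
```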

Optimization Considerations

For large corpora:

  • UCI: Consider sampling documents or limiting window count
  • UMass: Sparse matrix operations for efficiency
  • Caching: Store co-occurrence counts if computing multiple coherence values

Validation and Benchmarks

Human Correlation Studies

Research shows coherence metrics correlate with human topic interpretability:

| Metric | Correlation with Human Judgments |
| --- | --- |
| UCI Coherence | $r = 0.65 - 0.80$ |
| UMass Coherence | $r = 0.60 - 0.75$ |
| C_V (not implemented) | $r = 0.70 - 0.85$ |

Source: Röder et al. (2015)

Expected Value Ranges

Typical coherence values for real-world topics:

| Quality | UCI Coherence | UMass Coherence |
| --- | --- | --- |
| Excellent | > 5 | > -2 |
| Good | 2 to 5 | -2 to -5 |
| Moderate | 0 to 2 | -5 to -10 |
| Poor | < 0 | < -10 |

Note: These are approximate guidelines; actual ranges depend on corpus characteristics


Limitations and Considerations

UCI Coherence

Limitations:

  • Sensitive to window size parameter
  • Computationally expensive for large corpora
  • Requires well-preprocessed text
  • May be affected by document length distribution

Best practices:

  • Use window_size=10 as default (validated in literature)
  • Ensure consistent tokenization
  • Remove very frequent and very rare words
  • Consider downsampling for very large corpora

UMass Coherence

Limitations:

  • Less sensitive to local context
  • Assumes binary word presence (ignores frequency)
  • Can be unstable with very small corpora
  • Biased toward frequent words

Best practices:

  • Ensure adequate corpus size (100+ documents minimum)
  • Use appropriate min_df/max_df thresholds
  • Validate results with multiple metrics
  • Consider document length normalization

Advanced Topics

C_V Coherence (Not Implemented)

C_V coherence combines multiple measures and typically achieves highest human correlation:

\[C_V = \text{cosine}(\vec{s}_{\text{NPMI}}, \vec{s}_{\text{context}})\]

where NPMI is normalized PMI and context vectors capture semantic relationships.

Why not implemented: Requires external semantic models (word embeddings) and is more complex. UCI and UMass are sufficient for most TFG applications.

Topic Diversity Metrics

In addition to coherence, topic diversity measures uniqueness:

\[\text{Diversity} = \frac{1}{T} \sum_{t=1}^{T} \frac{|\text{unique words in topic } t|}{n}\]

High diversity + high coherence = optimal topic model
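
The formula can be computed directly from the topics' top-word lists. This sketch interprets "unique words" as words appearing in only one topic's list (an assumption; the formula does not pin this down), and the function name is illustrative:

```python
from collections import Counter

def topic_diversity(topics):
    """Average fraction of each topic's top words that appear in no
    other topic. topics: {topic_idx: [top_words]}."""
    counts = Counter(w for words in topics.values() for w in words)
    fractions = [
        sum(counts[w] == 1 for w in words) / len(words)
        for words in topics.values()
    ]
    return sum(fractions) / len(fractions)
```

Fully disjoint topics score 1.0; heavy word overlap drives the score toward 0.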


Integration with Math Investigation

Question Classification Pipeline

  1. NMF Topic Modeling: Decompose question embeddings
  2. Coherence Evaluation: Select optimal number of topics
  3. Topic Assignment: Map questions to dominant topic
  4. Difficulty Clustering: Within-topic K-Means/FCM clustering

Example workflow:

# scripts/train_difficulty_centroids.py integration
from math_investigation.topic_modeling.coherence import uci_coherence
from math_investigation.clustering.kmeans import KMeans

# Assumes W, H (NMF factors), X (question embeddings), feature_names,
# questions, n_topics and the thresholds come from the surrounding script.

# 1. Topic modeling
topics = extract_top_words(H, feature_names, n_words=10)
coherence = uci_coherence(topics, questions)

# 2. Filter high-coherence topics
good_topics = [t for t in range(n_topics) if coherence[t] > threshold]

# 3. Cluster within topics
for topic_id in good_topics:
    topic_mask = W[:, topic_id] > weight_threshold  # questions dominated by this topic
    kmeans = KMeans(n_clusters=3)  # Easy/Medium/Hard
    labels = kmeans.fit_predict(X[topic_mask])

Chatbot Enhancement

Topic coherence helps the chatbot:

  • Question routing: High-coherence topics → reliable classification
  • Content organization: Group similar questions by topic
  • Quality assessment: Monitor topic coherence over time as new questions arrive

References

Academic Sources

  1. UCI Coherence: Newman, D., Lau, J.H., Grieser, K., & Baldwin, T. (2010). “Automatic evaluation of topic coherence”. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, 100-108.

  2. UMass Coherence: Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). “Optimizing semantic coherence in topic models”. Proceedings of EMNLP 2011, 262-272.

  3. Coherence Survey: Röder, M., Both, A., & Hinneburg, A. (2015). “Exploring the space of topic coherence measures”. Proceedings of WSDM 2015, 399-408.

  4. PMI Theory: Church, K.W., & Hanks, P. (1990). “Word association norms, mutual information, and lexicography”. Computational Linguistics, 16(1), 22-29.


Appendix: Formula Summary

UCI Coherence

\[C_{\text{UCI}}(T) = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \text{PMI}(w_i, w_j)\]

\[\text{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j) + \epsilon}{P(w_i) \cdot P(w_j)}\]

\[P(w) = \frac{\text{\# windows containing } w}{\text{total windows}}\]

UMass Coherence

\[C_{\text{UMass}}(T) = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \log \frac{D(w_i, w_j) + 1}{D(w_i)}\]

\[D(w) = \text{\# documents containing } w\]

\[D(w_i, w_j) = \text{\# documents containing both } w_i \text{ and } w_j\]

Last updated: February 5, 2026