Paper: Density Measures for Language Generation
Authors: Jon Kleinberg, Fan Wei (Cornell University)
Publication Date: April 19, 2025
Problem Background
Large language models (LLMs) excel at generating fluent text but struggle to balance breadth (diversity of generated content across the target language) and validity (ensuring generated content belongs to the target language). The Kleinberg–Mullainathan (KM) algorithm achieves generation in the limit, meaning generated strings eventually all belong to the target language $K$, but its outputs can have density approaching zero in $K$, limiting diversity. These failure modes mirror "mode collapse" (overly similar outputs) and "hallucination" (invalid outputs) in LLMs. This paper introduces density measures to quantify the trade-off and proposes an algorithm that optimizes it.
Proposed Method: Density Measures and Optimized Algorithm
Density Measures
- Definition: For a target language $K$ enumerated as $\ell_1, \ell_2, \ldots$ and an algorithm's output set $O$:
- Upper density: $$\mu_{\text{up}}(O, K) = \limsup_{N \to \infty} \frac{|O \cap \{\ell_1, \ldots, \ell_N\}|}{N}$$
- Lower density: $$\mu_{\text{low}}(O, K) = \liminf_{N \to \infty} \frac{|O \cap \{\ell_1, \ldots, \ell_N\}|}{N}$$
- Purpose: These measure the proportion of $K$ covered by $O$; a positive lower density guarantees a minimum level of breadth. A minimal computational sketch follows.
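The Python sketch below (ours, not from the paper) computes the running ratio whose $\limsup$ and $\liminf$ are exactly the upper and lower densities:

```python
from typing import List, Set

def running_density(output: Set[str], enumeration: List[str]) -> List[float]:
    """Running ratio |O ∩ {l_1, ..., l_N}| / N for N = 1 .. len(enumeration).

    Upper and lower density are the limsup and liminf of this sequence
    as N grows without bound.
    """
    hits, ratios = 0, []
    for n, s in enumerate(enumeration, start=1):
        hits += s in output
        ratios.append(hits / n)
    return ratios

# Toy check: if O contains every other string of K's enumeration, the ratio
# tends to 1/2, so upper and lower density coincide at 1/2.
K_prefix = [f"l{i}" for i in range(1, 1001)]
O = {f"l{i}" for i in range(1, 1001, 2)}
print(running_density(O, K_prefix)[-1])  # ~0.5
```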
Theoretical Framework
- KM Model: An adversarial game in which an algorithm, after observing strings $S_t = \{w_1, \ldots, w_t\}$ from an unknown target language $K$, must generate unseen strings, aiming for generation in the limit.
- KM Limitation: Its nested-chain strategy, which retreats to ever-smaller candidate languages, produces output sets of density zero in $K$, sacrificing breadth.
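To build intuition (this toy example is ours, not the KM construction): an output set that keeps thinning out along the enumeration of $K$, say hitting only positions $1, 2, 4, 8, \ldots$, has prefix counts growing like $\log_2 N$ and hence density zero.

```python
# Hypothetical output set O = {l_1, l_2, l_4, l_8, ...}: it hits the first N
# strings of K only about log2(N) times, so its density log2(N)/N -> 0.
for N in (10, 1_000, 1_000_000):
    hits = sum(1 for i in range(60) if 2 ** i <= N)
    print(N, hits / N)  # 0.4, then 0.01, then 2e-05
```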
New Algorithm
- Goal: Achieve generation in the limit with $\mu_{\text{low}}(O, K) \geq \frac{1}{8}$.
- Mechanisms:
- Dynamic Adjustment: Flexibly selects languages based on $S_t$.
- Fallback Mechanism: Switches to larger languages when needed.
- Token System: Allocates tokens that ration how often each candidate language may be used for output (a schematic sketch follows this list).
- Tree Structure: Uses a dynamic forest to organize and navigate languages.
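Since the paper's full construction is only summarized above, the sketch below is a deliberately simplified schematic of the rate-control idea behind the token system, with hypothetical names; it is not the authors' algorithm. Reserving one output slot in every $c$ for a fallback language shows how a constant fraction such as $\frac{1}{8}$ can arise; the hard part, which this sketch ignores, is doing so while still guaranteeing validity.

```python
from itertools import count
from typing import Iterator

def interleaved_generation(best_guess: Iterator[str],
                           fallback: Iterator[str],
                           c: int = 8) -> Iterator[str]:
    """Schematic sketch: reserve one slot in every c outputs for a fallback
    language believed to contain the target K.

    The reserved slots alone contribute a 1/c fraction of all outputs; the
    paper's token and dynamic-forest machinery is what makes such a
    reservation compatible with generation in the limit, which this naive
    interleaving does not address.
    """
    for step in count(1):
        yield next(fallback) if step % c == 0 else next(best_guess)
```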
Key Questions and Results
- Can positive lower density be achieved? Yes, the new algorithm ensures $\mu_{\text{low}}(O, K) \geq \frac{1}{8}$.
- Is zero density inevitable? No; the new algorithm shows it can be avoided.
- Index-based generation: the running density can oscillate, so upper and lower densities may differ, but it need not tend to zero (a small numerical demo follows).
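A small numerical demo (our own construction, not the paper's) of how such oscillation looks: building the output set in exponentially growing on/off blocks makes the running density swing between roughly $1/3$ and $2/3$, so $\limsup$ and $\liminf$ differ while both stay bounded away from zero.

```python
def in_output(n: int) -> bool:
    """Block-built output set: position n is included iff it lies in an
    'on' block [4^k, 2 * 4^k) of the enumeration of K."""
    k = 0
    while 4 ** (k + 1) <= n:
        k += 1
    return 4 ** k <= n < 2 * 4 ** k

hits = 0
for n in range(1, 131_073):
    hits += in_output(n)
    if n in (2 * 4 ** 7, 4 ** 8):  # a density peak, then the following trough
        print(n, round(hits / n, 3))  # ~0.667 then ~0.333: limsup != liminf
```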
Main Theoretical Results
- Guarantees $\mu_{\text{low}}(O, K) \geq \frac{1}{8}$ while retaining generation in the limit, whereas the KM algorithm's output density can tend to zero.
- Introduces "infinite perfect towers" as a structural tool for analyzing when low density is unavoidable.
Summary
This paper quantifies the breadth-validity trade-off via density measures and presents an algorithm that achieves generation in the limit while guaranteeing lower density of at least $\frac{1}{8}$, thereby improving breadth without sacrificing validity.