I've encountered the term "accuracy" used differently across several evaluation contexts, and I want to understand the mathematical and conceptual distinctions clearly, using consistent notation.
Consider a model $p_\theta(y \mid x)$ over outputs $y$ given input $x$, and let $o^\star$ denote the correct output. In terms of indicator functions and expectations, here are three definitions of accuracy:
**Machine Learning (Classification Accuracy):** $$ \text{Acc}_{\text{ML}}(x;\theta) = \mathbf{1}\left[ o^\star = \arg\max_{y} p_\theta(y \mid x) \right]. $$ Intuitively: Checks whether the single most probable prediction exactly matches the correct output.
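To make the argmax behavior concrete, here is a minimal NumPy sketch; the 4-class probability vector and the correct index are made-up illustration values, not from any particular model:

```python
import numpy as np

# Hypothetical model output for one input x: a probability vector over 4 classes.
probs = np.array([0.1, 0.6, 0.2, 0.1])
o_star = 1  # index of the correct output (assumed for illustration)

# Acc_ML: 1 if the single most probable class equals the correct output, else 0.
acc_ml = float(np.argmax(probs) == o_star)
print(acc_ml)  # 1.0
```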
**Statistical Accuracy (Expectation of correctness):** $$ \text{Acc}_{\text{Stats}}(x;\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\left[\mathbf{1}[y = o^\star]\right] = p_\theta(o^\star \mid x). $$ Intuitively: The probability that a single prediction sampled from the model is correct.
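Under the same illustrative setup, the statistical accuracy is just the probability mass on the correct label:

```python
import numpy as np

probs = np.array([0.1, 0.6, 0.2, 0.1])  # same hypothetical model output
o_star = 1

# Acc_Stats: probability that a single sample y ~ p_theta(.|x) is correct,
# which is simply the mass the model assigns to o_star.
acc_stats = probs[o_star]
print(acc_stats)  # 0.6
```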
**Generative Modeling (pass@k):** Define a correctness checker $g_x(y) \in \{0,1\}$ indicating whether $y$ is acceptable. The single-sample success probability and the expected pass@k accuracy over $k$ independent samples are then: $$ p_{\text{succ}}(x) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}[g_x(y)], \qquad \mathbb{E}[\text{pass@}k(x)] = 1 - \left(1 - p_{\text{succ}}(x)\right)^k. $$
Intuitively: The probability that at least one of $k$ independently sampled predictions is correct.
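And for pass@k, a sketch comparing the closed form against a Monte Carlo estimate, again on the same made-up distribution, with $g_x(y)$ taken to be exact match with $o^\star$:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.6, 0.2, 0.1])  # same hypothetical model output
o_star, k = 1, 5

# With g_x(y) = 1[y == o_star], the single-sample success probability p_succ
# coincides with Acc_Stats, and E[pass@k] has the closed form below.
p_succ = probs[o_star]
pass_at_k = 1 - (1 - p_succ) ** k

# Monte Carlo check: draw k i.i.d. samples per trial and record whether
# at least one of them is correct.
trials = 100_000
samples = rng.choice(len(probs), size=(trials, k), p=probs)
mc_estimate = (samples == o_star).any(axis=1).mean()

print(pass_at_k, mc_estimate)  # ~0.9898 for both
```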
Given these definitions, could someone clarify:
- The explicit mathematical and conceptual distinctions among these three types of accuracy?
- Under which specific conditions, if any, would these measures coincide?
- Practical considerations when choosing one definition of accuracy over another for different evaluation tasks?