
I've encountered the term "accuracy" used differently across several evaluation contexts, and I want to clearly understand their mathematical and conceptual distinctions using consistent notation.

Consider a model $p_\theta(y \mid x)$ for an input $x$, and let $o^\star$ denote the correct output. Using indicator functions and expectations, here are three definitions of accuracy:

  1. Machine Learning (Classification Accuracy): $$ \text{Acc}_{\text{ML}}(x;\theta) = \mathbf{1}\left[ o^\star = \arg\max_{y} p_\theta(y \mid x) \right]. $$ Intuitively: checks whether the most probable prediction exactly matches the correct output.

  2. Statistical Accuracy (Expectation of correctness): $$ \text{Acc}_{\text{Stats}}(x;\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\left[\mathbf{1}\{y = o^\star\}\right] = p_\theta(o^\star \mid x). $$ Intuitively: the probability that a single prediction sampled from the model is correct.

  3. Generative Modeling (pass@k): Define a correctness checker $g_x(y) \in \{0,1\}$ indicating whether $y$ is acceptable. Then the pass@k accuracy is: $$ p_{\text{succ}}(x) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}[g_x(y)], \quad \mathbb{E}[\text{pass@}k(x)] = 1 - (1 - p_{\text{succ}}(x))^k. $$

    Intuitively: the probability that at least one of $k$ independently sampled predictions is correct. (A small numeric sketch of all three definitions follows this list.)
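To make the three definitions concrete, here is a minimal numeric sketch. The distribution, the choice of $k$, and the assumption that the checker is $g_x(y) = \mathbf{1}\{y = o^\star\}$ (so that $p_{\text{succ}} = \text{Acc}_{\text{Stats}}$) are all made up for illustration:

```python
import numpy as np

# Toy categorical p_theta(. | x) over four candidate outputs;
# suppose index 2 is the correct output o*.
probs = np.array([0.1, 0.3, 0.4, 0.2])
o_star = 2

# (1) Classification accuracy: 1 if the argmax prediction equals o*, else 0.
acc_ml = float(np.argmax(probs) == o_star)      # 1.0 here

# (2) "Statistical" accuracy: probability that one sampled prediction is correct.
acc_stats = probs[o_star]                       # 0.4 here

# (3) Expected pass@k: probability that at least one of k i.i.d. samples is
#     correct, using the checker g_x(y) = 1[y = o*] so p_succ = acc_stats.
k = 5
pass_at_k = 1.0 - (1.0 - acc_stats) ** k        # ~0.92 here

print(acc_ml, acc_stats, pass_at_k)
```

Note that (1) and (3) coincide with (2) only in degenerate cases, e.g. when the model puts all its mass on a single output.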

Given these definitions, could someone clarify:

  • The explicit mathematical and conceptual distinctions among these three types of accuracy?
  • Under which specific conditions, if any, would these measures coincide?
  • Practical reasons and considerations behind choosing one definition of accuracy over another for different evaluation tasks?

cross: https://www.reddit.com/r/learnmachinelearning/

  • You missed a few $ signs in your MathJax. Commented Nov 18 at 0:40
  • Also, I think your machine learning accuracy would be an estimate of "statistical" accuracy. If you want consistency of predictions, you'll have to do a calibration analysis. Commented Nov 18 at 0:41
  • Your mathematical expressions are very verbose and unclear, which makes the question hard to follow. Your intuition "The probability that ... is correct" seems to be the same for all three cases, so what is the difference you are pointing to? At the same time it is a weird characterisation of accuracy. Many things/problems seem to be going on here. Commented 6 hours ago

1 Answer


I'm going to change the names a bit, because the distinction here is not between machine learning and statistics: both fields use both concepts.

(2) is essentially the complement of the "generalisation error" of a classifier system: the generalisation error is the probability that the classifier makes a mistake on an unseen example drawn from the same underlying distribution from which the training data were drawn, and (2), written as an accuracy, is one minus that. This is what both statisticians and machine learning practitioners really want to know. However, we can't measure it directly because we don't know what that distribution is (if we did, designing a classifier would be very much easier), so we have to use an estimator instead.

(1), averaged over a sample of data (usually a sample of unseen examples from the same distribution), gives the "empirical accuracy", whose complement is the "empirical error": the error rate observed on that sample. It is an unbiased estimate of the generalisation error, but it has a variance due to sampling: it may come out higher than the true generalisation error (because, by luck, the sample contained more "easy" examples) or lower (for the opposite reason). Both the statistics and machine learning communities use it to estimate (2); the bigger the sample, the lower the variance and the better the estimate of (2).
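A quick way to see the unbiasedness/variance point is a simulation. This is only a sketch with invented numbers: a hypothetical classifier whose true generalisation accuracy is 0.8, evaluated on simulated test sets of increasing size.

```python
import numpy as np

rng = np.random.default_rng(0)

true_acc = 0.8   # hypothetical generalisation accuracy of a fixed classifier

for n in (30, 300, 3000):
    # 10,000 simulated test sets of size n: each example is classified
    # correctly with probability true_acc, so the number of correct
    # predictions per test set is Binomial(n, true_acc).
    empirical_acc = rng.binomial(n, true_acc, size=10_000) / n
    print(f"n={n:5d}  mean={empirical_acc.mean():.4f}  std={empirical_acc.std():.4f}")

# The mean stays near 0.8 (the estimator is unbiased), while the spread
# shrinks roughly as 1/sqrt(n), matching the point about sample size above.
```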

You should be able to find definitions of those terms in any good machine learning (or computational learning theory) textbook, such as those by Bishop or Murphy (or Vapnik).

(3) is a special metric appropriate for evaluating generative systems such as LLMs, where it is assumed the user can pick the best result from the sample of $k$ generated predictions. It is a specialisation of (1) for a particular application.
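As a sketch of that evaluation protocol (again with invented numbers), the following simulates drawing $k$ generations per prompt, counts the prompt as solved if any one of them passes the checker, and compares the result with the closed form $1 - (1 - p_{\text{succ}})^k$ from the question:

```python
import numpy as np

rng = np.random.default_rng(1)

p_succ, k = 0.25, 10   # hypothetical per-sample success probability and k
trials = 100_000       # number of simulated prompts

# For each simulated prompt, draw k independent generations; the prompt counts
# as solved if at least one generation passes the correctness checker.
solved = (rng.random((trials, k)) < p_succ).any(axis=1).mean()

print("simulated pass@k:     ", solved)
print("closed form 1-(1-p)^k:", 1 - (1 - p_succ) ** k)   # ~0.944
```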

When choosing a metric, it is probably better to focus on the important properties of the practical application and then determine whether a metric measures them, rather than to follow recipes of the form "use this metric for tasks like this": recipes encourage practitioners not to think deeply about the needs of the particular application (or about the properties of the metric and why it is or is not appropriate).

