Fun ways to teach reliability and validity

3/13/2024

Suppose model A gets a BLEU score one point higher than model B. Is that difference reliable? If you used a slightly different dataset for training and evaluation, would the one-point difference still hold? Would it even survive running the same models on the same datasets with different random seeds? In fields such as psychology and biology, it is standard to answer such questions with established statistical procedures that check whether a difference of interest is larger than some estimate of the measurement noise.

For anyone with a background in statistics, or in a field where conclusions must be drawn from noisy data, the usual evaluation procedure in NLP is frankly shocking. When we come up with a new model in NLP, or in machine learning more generally, we usually look at some performance metric (one number), compare it against the same metric for a strong baseline model (one number), and if the new model gets a better number, we mark it in bold and declare it the winner.
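One way to make the random-seed question concrete is to train each model several times and ask whether the gap between the two sets of scores is larger than the seed-to-seed noise. The sketch below is only an illustration under stated assumptions: the function name `seed_difference_interval` is hypothetical, the BLEU scores are made-up numbers, and a percentile bootstrap is just one of several reasonable procedures, not a method prescribed by this post.

```python
import random
import statistics


def seed_difference_interval(scores_a, scores_b, n_resamples=10_000,
                             alpha=0.05, rng_seed=0):
    """Percentile-bootstrap interval for mean(scores_a) - mean(scores_b).

    Each list holds one score per training run (one run per random seed).
    """
    rng = random.Random(rng_seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample the runs of each model with replacement.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled differences.
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    observed = statistics.mean(scores_a) - statistics.mean(scores_b)
    return observed, lower, upper


# Hypothetical BLEU scores from five seeds per model (made-up numbers).
bleu_a = [27.9, 28.6, 27.4, 28.8, 28.3]
bleu_b = [27.1, 27.8, 26.7, 27.4, 27.0]

observed, lower, upper = seed_difference_interval(bleu_a, bleu_b)
print(f"observed difference: {observed:.2f}")
print(f"95% bootstrap interval: [{lower:.2f}, {upper:.2f}]")
```

If the interval comfortably excludes zero, the difference is larger than what seed noise alone would typically produce. With only a handful of runs per model the interval is rough, so it is better read as a sanity check than as a definitive test.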
