|
Summary:
|
This Recommendation will describe subjective evaluation protocols for quantifying the quality of text that is generated by machine-learning (ML) algorithms, in particular by so-called “Large Language Models” (LLMs) from the field of generative Artificial Intelligence (AI), and that is used in passive or reactive applications. Targeted applications include machine summarization, machine translation, and text synthesis from notes. In such applications, the text is generated by an ML model after being triggered by the user, but the text itself is not part of the user-system interaction (unlike in chatbots, which are out of scope for the present proposal). Evaluation methods are expected to address both extrinsic quality, i.e. quality in relation to the input text, and intrinsic quality, which concerns only the ML-generated text. They should capture the surface form (grammaticality, clarity, complexity, etc.), the stylistic appropriateness (register, tone, formality), and the content (truthfulness, correctness) of the generated text. Evaluation protocols should account for the context in which passive text is consumed: user expectations may differ between near-real-time applications (e.g., live captions, simultaneous interpretation) and offline ones (e.g., document translation). The display of the text, as well as the design of any applications used for the processing, are out of scope of the present Recommendation.
|