Editor's Quick Take
In an August 13, 2025 paper, researchers from Sapienza University of Rome and other institutions point out that current AI translation systems score so well on standard tests that a “ceiling effect” has emerged. They propose building more discriminative benchmarks by automatically selecting more challenging text samples, so that systems' real capabilities can be assessed. The study also compares several methods for estimating translation difficulty and finds that trained models are the most accurate.
The Problem with Evaluating State-of-the-Art AI Translation
In an August 13, 2025 paper, researchers Lorenzo Proietti, Stefano Perrella, and Roberto Navigli from Sapienza University of Rome (Sapienza NLP Group), along with Vilém Zouhar from ETH Zurich and Tom Kocmi from Cohere, highlighted a problem in evaluating state-of-the-art AI translation systems.
They explained that leading AI translation systems receive “near-perfect scores” on widely used benchmarks such as the WMT shared tasks, performing “close to human level.”
According to the researchers, this is because current test sets are simply “too easy” for today’s models.
While impressive on paper, this ceiling effect creates a challenge: if all systems look equally good, it becomes increasingly difficult for researchers to track progress and for enterprise buyers to make informed choices between vendors.
The industry risks assuming AI translation is solved when, in fact, weaknesses remain hidden by overly easy test sets.
Creating More Challenging Benchmarks
To address this, the researchers propose creating more discriminative benchmarks by automatically selecting harder samples.
Instead of simply judging translations, they suggest predicting how difficult a text will be to translate beforehand.
They formalize this as a new task, translation difficulty estimation, defining a text’s difficulty in terms of the expected quality of its translations: the lower the expected score, the higher the difficulty. The task comes with a dedicated metric, difficulty estimation correlation (DEC), which measures how well systems rank texts by difficulty compared to human judgments.
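To make the definitions concrete, here is a minimal sketch (not the authors' code) of how a text's difficulty and a DEC-style score can be computed. The quality scores, the pool of texts, and the choice of Spearman correlation are assumptions made for illustration.

```python
# Minimal sketch (not the paper's implementation): difficulty as expected
# translation quality, and DEC as a rank correlation with reference difficulties.
from statistics import mean
from scipy.stats import spearmanr  # assumed choice of rank correlation

# Hypothetical quality scores for each source text from several systems.
quality_scores = {
    "src_1": [95.0, 92.0, 97.0],   # easy: all systems translate it well
    "src_2": [70.0, 55.0, 62.0],   # hard: expected quality is low
    "src_3": [88.0, 90.0, 85.0],
}

# A text's difficulty is derived from the expected quality of its translations:
# the lower the expected quality, the higher the difficulty.
reference_difficulty = {src: -mean(scores) for src, scores in quality_scores.items()}

# A difficulty estimator's predictions for the same texts (hypothetical values).
estimated_difficulty = {"src_1": 0.1, "src_2": 0.9, "src_3": 0.3}

# DEC-style evaluation: how well does the estimator's ranking of texts by
# difficulty agree with the reference ranking?
texts = sorted(quality_scores)
dec, _ = spearmanr(
    [estimated_difficulty[t] for t in texts],
    [reference_difficulty[t] for t in texts],
)
print(f"difficulty estimation correlation (sketch): {dec:.2f}")
```

In this toy example the estimator ranks the three texts exactly as the reference difficulties do, so the correlation comes out at 1.0.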
In practice, this means building test sets around texts that are genuinely challenging for AI translation systems, rather than confirming strengths on simpler cases. By identifying samples where AI translation models still struggle, it is possible to “expose their shortcomings and guide improvements in future iterations,” the researchers explained.
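A minimal sketch of that selection step, assuming a generic difficulty estimator, might look like the following; the `estimate_difficulty` callable and the candidate pool are placeholders, not the paper's tooling.

```python
# Sketch: build a harder test set by keeping the texts an estimator ranks as
# most difficult. `estimate_difficulty` is a placeholder for any estimator.
from typing import Callable, List

def build_hard_benchmark(
    candidate_texts: List[str],
    estimate_difficulty: Callable[[str], float],
    keep_fraction: float = 0.2,
) -> List[str]:
    """Return the hardest `keep_fraction` of candidates, hardest first."""
    ranked = sorted(candidate_texts, key=estimate_difficulty, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

# Usage with a trivial stand-in estimator (text length as a weak heuristic).
pool = [
    "Short sentence.",
    "A much longer, clause-heavy sentence full of rare domain terminology.",
]
print(build_hard_benchmark(pool, estimate_difficulty=lambda text: len(text)))
```

Any of the estimators discussed in the next section could be plugged in as the difficulty function.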
Which Difficulty Estimators Work Best?
They compared four types of difficulty estimators:
1. Heuristics — such as sentence length, word rarity, and syntactic complexity (a toy sketch of these signals appears after this list).
2. Learned models — trained directly to predict difficulty, including their own Sentinel-src series.
3. LLM-as-a-Judge methods — large language models (LLMs) like GPT-4o or Cohere’s CommandA prompted to score difficulty.
4. Crowd-based approaches — which generate translations from several models and score them with reference-less metrics like XCOMET or MetricX.
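As a toy illustration of the first, heuristic family, the sketch below scores length, word rarity, and a rough clause-based proxy for syntactic complexity. The frequency table and regular expressions are invented for the example and are not taken from the paper.

```python
# Sketch of simple heuristic difficulty signals: length, word rarity, and a
# crude proxy for syntactic complexity. The frequency table is hypothetical.
import re

WORD_FREQ = {"the": 0.05, "cat": 0.001, "sat": 0.0008, "on": 0.02, "mat": 0.0004}

def heuristic_difficulty(text: str) -> dict:
    words = re.findall(r"[a-zA-Z']+", text.lower())
    length = len(words)
    # Rarity: average inverse frequency; unseen words get a tiny default frequency.
    rarity = sum(1.0 / WORD_FREQ.get(w, 1e-6) for w in words) / max(length, 1)
    # Very rough syntactic-complexity proxy: commas and subordinators per word.
    markers = re.findall(r",|\b(?:which|that|because|although)\b", text.lower())
    complexity = len(markers) / max(length, 1)
    return {"length": length, "rarity": rarity, "complexity": complexity}

print(heuristic_difficulty("The cat, which sat on the mat, purred."))
```

As the results below indicate, such surface signals are cheap to compute but capture translation difficulty only weakly.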
They found that LLMs such as OpenAI’s GPT-4o and Cohere’s CommandA, when used as “judges,” performed poorly, in some cases even worse than simple length-based heuristics.
Traditional heuristics, including word rarity and syntactic complexity, also proved weak, failing to capture the nuances of translation difficulty.
Crowd methods delivered stronger results but are computationally expensive and not practical for everyday use.
By contrast, the learned models consistently delivered the best results.
Another notable finding is that humans and machines often disagree on what counts as “difficult” and do not struggle with the same texts, underscoring the importance of designing test sets that reflect both human and machine challenges.