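Evaluating OCR systems that convert PDFs or document images into Markdown is far more complex than it appears. Unlike plain-text OCR, OCR-to-Markdown requires a model to recover content, layout, reading order, and representation choices all at once. Today's benchmarks try to score this with a mix of string matching, heuristic alignment, and format-specific rules, but in practice these methods routinely misclassify correct outputs as failures.
This article outlines why OCR-to-Markdown evaluation is inherently underspecified, examines common evaluation techniques and their failure modes, highlights concrete issues observed in two widely used benchmarks, and explains why using an LLM as the evaluator is currently the most practical approach despite its imperfections.
The core problem is that OCR-to-Markdown has no single correct output. Multiple outputs can be equally valid:
From the perspective of a human or a downstream system, these outputs are equivalent. From a benchmark's perspective, they usually are not.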
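1. String-based metrics (edit distance, exact match)
Most OCR-to-Markdown benchmarks rely on normalized string comparison or edit distance.
Limitations
These metrics reward format compliance, not correctness.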
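As a minimal sketch of this effect, the snippet below computes a plain Levenshtein distance between two renderings of the same list. The two strings are hypothetical examples rather than benchmark data, and the normalization used here (dividing by the longer length) is only one common convention, not necessarily what any particular benchmark does.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1.0 means identical strings; lower values mean more edits are needed."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical reference and prediction: same list, different bullet marker.
reference = "- apples\n- pears\n"
candidate = "* apples\n* pears\n"

distance = levenshtein(reference, candidate)
similarity = normalized_similarity(reference, candidate)
print(distance, similarity)
```

Both strings render as the same bulleted list, yet the metric reports a non-zero distance, so the prediction loses points purely for its choice of bullet marker.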
2. Order-sensitive block matching
Some benchmarks split the document into blocks and score their order and proximity.
Limitations
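To illustrate how an order-based score reacts to a defensible reading-order choice, here is a rough sketch. The scoring function (fraction of block pairs kept in the same relative order) and the block names are assumptions made for illustration; actual benchmarks may use a different order metric.

```python
from itertools import combinations


def pairwise_order_agreement(reference: list[str], predicted: list[str]) -> float:
    """Fraction of shared block pairs that keep the same relative order."""
    position = {block: i for i, block in enumerate(predicted)}
    shared = [block for block in reference if block in position]
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 1.0
    agree = sum(1 for a, b in pairs if position[a] < position[b])
    return agree / len(pairs)


# Hypothetical page with a sidebar: the annotator reads it last,
# the model reads it right after the title. Both orders are defensible.
reference = ["title", "para_1", "para_2", "sidebar"]
predicted = ["title", "sidebar", "para_1", "para_2"]

score = pairwise_order_agreement(reference, predicted)
print(score)
```

Every block is recovered, but two of the six block pairs are ordered differently, so the score drops even though no content is wrong.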
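3. Formula matching via LaTeX normalization
Math-focused benchmarks typically expect formulas to be rendered as complete LaTeX.
Limitations
This conflates the choice of representation with mathematical correctness.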
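The sketch below makes this concrete using the silica/alumina strings from the OmniDocBench example later in this article. It assumes the subscript markers in the ground truth are underscores. The `loose_formula_key` helper is hypothetical and deliberately lossy: it is not a LaTeX parser, and it could also conflate formulas that really differ, so it only serves to show that the two strings differ in notation rather than content.

```python
import re


def loose_formula_key(text: str) -> str:
    """Collapse a few purely notational differences into a comparison key."""
    # Drop inline-math delimiters and the \mathrm font command.
    text = text.replace("$", "").replace("\\mathrm", "")
    # Unwrap braces from the innermost level outward until nothing changes.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{([^{}]*)\}", r"\1", text)
    # Remove whitespace and leftover backslashes (e.g. LaTeX "\ " spacing).
    return re.sub(r"[\s\\]", "", text)


pred = "5 g silica + 3 g Al$_2$O$_3$"
gt = r"$ 5g \mathrm{\ s i l i c a}+3g \mathrm{\ A l}_{2} \mathrm{O_{3}} $"

print(pred == gt)
print(loose_formula_key(pred) == loose_formula_key(gt))
```

An exact comparison rejects the prediction, while the loose key treats both strings as the same formula, which is the distinction a purely string-based formula check cannot make.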
4. Format-specific assumptions
Benchmarks implicitly encode a preferred output style.
Limitations
<sub>) causes the match to fail.
Benchmark A: olmOCRBench
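Manual inspection shows that several subsets embed implicit content-omission rules:
These subsets effectively evaluate selective suppression rather than OCR quality.
In addition:
As a result, scores depend heavily on whether a model's output philosophy matches the benchmark's hidden assumptions.
Example 1
For the image above, an OCR2 model correctly predicted the watermark on the right side of the image, yet under the ground truth annotation it is penalized for that correct prediction.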
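The following is a rough sketch of how a test record of type `absent` (the kind shown in the examples below) might be checked. It is not the benchmark's actual implementation: the matching logic, the helper names, and the interpretation of `max_diffs` as an edit-distance tolerance are all assumptions made for illustration.

```python
def min_substring_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between `pattern` and any substring of `text`."""
    prev = [0] * (len(text) + 1)  # an empty pattern matches anywhere at cost 0
    for i, pc in enumerate(pattern, start=1):
        curr = [i]  # pattern[:i] against an empty substring costs i deletions
        for j, tc in enumerate(text, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (0 if pc == tc else 1)))
        prev = curr
    return min(prev)


def absent_test_passes(test: dict, ocr_output: str) -> bool:
    """Pass only if the text is NOT found within `max_diffs` edits (assumed semantics)."""
    pattern, text = test["text"], ocr_output
    if not test.get("case_sensitive", True):
        pattern, text = pattern.lower(), text.lower()
    return min_substring_distance(pattern, text) > test.get("max_diffs", 0)


# Record modeled on the fields of the benchmark's test cases.
test = {"type": "absent", "text": "www.aa.org", "max_diffs": 0, "case_sensitive": False}

faithful_output = "Body text of the page.\nAlcoholics Anonymous www.aa.org"  # hypothetical
suppressed_output = "Body text of the page."                                 # hypothetical

print(absent_test_passes(test, faithful_output))
print(absent_test_passes(test, suppressed_output))
```

Under this reading, a model that faithfully transcribes everything on the page fails the test, while a model that silently drops the footer passes it.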
{
  "pdf": "headers_footers/ef5e1f5960b9f865c8257f9ce4ff152a13a2559c_page_26.pdf",
  "page": 1,
  "id": "ef5e1f5960b9f865c8257f9ce4ff152a13a2559c_page_26.pdf_manual_01",
  "type": "absent",
  "text": "Document t\u00e9l\u00e9charg\u00e9 depuis www.cairn.info - Universit\u00e9 de Marne-la-Vall\u00e9e - - 193.50.159.70 - 20/03/2014 09h07. \u00a9 S.A.C.",
  "case_sensitive": false,
  "max_diffs": 3,
  "checked": "verified",
  "first_n": null,
  "last_n": null,
  "url": "https://hal-enpc.archives-ouvertes.fr/hal-01183663/file/14-RAC-RecitsDesTempsDHier.pdf"
}
A type of `absent` means that the text must not appear in the prediction.
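Example 2
The benchmark also does not account for text that appears in document footers.
In this document, for example, the ground truth states that `Alcoholics Anonymous®` and `www.aa.org` must not appear in the output, which is incorrect.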
{
  "pdf": "headers_footers/3754542bf828b42b268defe21db8526945928834_page_4.pdf",
  "page": 1,
  "id": "3754542bf828b42b268defe21db8526945928834_page_4_header_00",
  "type": "absent",
  "max_diffs": 0,
  "checked": "verified",
  "url": "https://www.aa.org/sites/default/files/literature/PI%20Info%20Packet%20EN.pdf",
  "text": "Alcoholics Anonymous\u00ae",
  "case_sensitive": false,
  "first_n": null,
  "last_n": null
}
{
  "pdf": "headers_footers/3754542bf828b42b268defe21db8526945928834_page_4.pdf",
  "page": 1,
  "id": "3754542bf828b42b268defe21db8526945928834_page_4_header_01",
  "type": "absent",
  "max_diffs": 0,
  "checked": "verified",
  "url": "https://www.aa.org/sites/default/files/literature/PI%20Info%20Packet%20EN.pdf",
  "text": "www.aa.org",
  "case_sensitive": false,
  "first_n": null,
  "last_n": null
}
Benchmark B: OmniDocBench
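OmniDocBench shows similar problems, but on a broader scale:
In many cases, a low score reflects flaws in the benchmark itself rather than model errors.
Example 1
In the example above, an OCR2 model predicted `5 g silica + 3 g Al$_2$O$_3$`, while the ground truth expects `5g \mathrm{\ s i l i c a}+3g \mathrm{\ A l}_{2} \mathrm{O_{3}}`. The prediction is marked wrong even though both are correct.
The full ground truth, prediction, and test case are shown below: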
'pred': 'The collected eluant was concentrated by rotary evaporator to 1 ml. The extracts were finally passed through a final column filled with 5 g silica + 3 g Al$_2$O$_3$ to remove any co-extractive compounds that may cause instrumental interferences durin the analysis. The extract was eluted with 120 ml of DCM:n-hexane (1:1), the first 18 ml of eluent was discarded and the rest were collected, which contains the analytes of interest. The extract was exchanged into n-hexane, concentrated to 1 ml to which 1 μg/ml of internal standard was added.'
'gt': 'The collected eluant was concentrated by rotary evaporator to 1 ml .The extracts were finally passed through a final column filled with $ 5g \\mathrm{\\ s i l i c a}+3g \\mathrm{\\ A l}_{2} \\mathrm{O_{3}} $ to remove any co-extractive compounds that may cause instrumental interferences during the analysis. The extract was eluted with 120 ml of DCM:n-hexane (1:1), the first 18 ml of eluent was discarded and the rest were collected, which contains the analytes of interest. The extract was exchanged into n - hexane, concentrated to 1 ml to which $ \\mu\\mathrm{g / ml} $ of internal standard was added.'
Example 2
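Considerably more annotation errors were found in OmniDocBench. In the ground truth annotation below, the `1` in `1 μg/ml` is missing: the prediction reads `1 μg/ml`, while the annotation contains only the unit.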
'text': 'The collected eluant was concentrated by rotary evaporator to 1 ml .The extracts were finally passed through a final column filled with $ 5g \\mathrm{\\ s i l i c a}+3g \\mathrm{\\ A l}_{2} \\mathrm{O_{3}} $ to remove any co-extractive compounds that may cause instrumental interferences during the analysis. The extract was eluted with 120 ml of DCM:n-hexane (1:1), the first 18 ml of eluent was discarded and the rest were collected, which contains the analytes of interest. The extract was exchanged into n - hexane, concentrated to 1 ml to which $ \\mu\\mathrm{g / ml} $ of internal standard was added.'
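Original-content notice: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, contact cloudcommunity@tencent.com to request removal.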