首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >使用PDF OCR技术消除转换后文本中的缺陷

使用PDF OCR技术消除转换后文本中的缺陷
EN

Unix & Linux用户
提问于 2018-10-05 17:55:46
回答 2查看 66关注 0票数 1

我转换PDF文件使用OCR PDF阅读器。最初的文本是PDF文件中的图像,而PDF则使用OCR将其转换为文本,转换后的问题是文本没有正确对齐,似乎所有的字和行都发生了移位。样本文本

代码语言:javascript
复制
  biochemistry can be divided in three fields; molecular genetics, protein science and metabolism. Over the last decades 
of the 20th century, biochem
istry has through these three disciplines becom
e successful at explaining living processes. Almost all areas o
f the life sciences are being uncovered and developed by biochemical methodology and research.[2] Biochemistry focuses on unde
rstanding how biolog
ical molecules give 
rise to the processes that occur within living cells and
 between cells,[3] which
 in turn relates greatly to the study and understanding of 
, organs, and organism structure and function[4]

Biochemistry is closely related to mol
ecular biology, the study of the molecular mechanisms by which geneti
c information encoded in DNA is able to result in the processes of life.[5]

Much of biochemistry deals with the structu
res, 
 an
d interactions of biological macromolecules, such as proteins, nucleic acids, carbohydrates and lipids, which provide the structure of cells and perform many of the functions associated with life.[6] The chemistry of the cell also depends on the 
 of smaller molecules and ions. Th
ese can be inorganic, for example water and metal ions, or organic, for example the amino acids, which are used to synthesi
ze proteins.[7]
 The mechanisms by which cells harness energy from their environment via chemical reactions are known as metabolism. The findings of biochemistry are applied primarily in medicine, nutrition, and agriculture. In medicine, b
iochemists investigate the causes and cures of diseases.[8] In nutrition, they study how to maintain health wellness and study the effects of nutritional deficiencies.[9] In agriculture, biochemists investigate soil and fertilizers, and try to discover ways to improve crop cultivation, crop storage and pest control.

这个问题还有几个字被切成两半。我能做些什么来修复文本,使其可读性好吗?

EN

回答 2

Unix & Linux用户

发布于 2018-10-05 18:00:57

您可以使用awk线性来删除额外的回车,如下所示:

代码语言:javascript
复制
awk '{gsub(/\n/,""); gsub(/\r/,""); print}' RS='' file

biochemistry can be divided in three fields; molecular genetics, protein science and metabolism. Over the last decades of the 20th century, biochemistry has through these three disciplines become successful at explaining living processes. Almost all areas of the life sciences are being uncovered and developed by biochemical methodology and research.[2] Biochemistry focuses on understanding how biological molecules give rise to the processes that occur within living cells and between cells,[3] which in turn relates greatly to the study and understanding of , organs, and organism structure and function[4]
Biochemistry is closely related to molecular biology, the study of the molecular mechanisms by which genetic information encoded in DNA is able to result in the processes of life.[5]
Much of biochemistry deals with the structures,  and interactions of biological macromolecules, such as proteins, nucleic acids, carbohydrates and lipids, which provide the structure of cells and perform many of the functions associated with life.[6] The chemistry of the cell also depends on the  of smaller molecules and ions. These can be inorganic, for example water and metal ions, or organic, for example the amino acids, which are used to synthesize proteins.[7] The mechanisms by which cells harness energy from their environment via chemical reactions are known as metabolism. The findings of biochemistry are applied primarily in medicine, nutrition, and agriculture. In medicine, biochemists investigate the causes and cures of diseases.[8] In nutrition, they study how to maintain health wellness and study the effects of nutritional deficiencies.[9] In agriculture, biochemists investigate soil and fertilizers, and try to discover ways to improve crop cultivation, crop storage and pest control.

gsub函数有以下形式:

代码语言:javascript
复制
gsub(regexp, replacement [, target])

这与子函数类似,但gsub替换了它能找到的所有最长、最左边、不重叠的匹配子字符串。gsub中的“g”代表“全局”,意思是到处替换。

代码语言:javascript
复制
gsub(/\n/,"") replaces all newline occurrences within a string with non for all input text. 

gsub(/\r/,"") replace all carriage return (ASCII code 13) occurrences with non for all input text. 
票数 1
EN

Unix & Linux用户

发布于 2018-10-05 18:09:41

也许还有改进的余地,但这里有一个开始:

代码语言:javascript
复制
perl -0777 -ne 's/([^ ])$\\n/\1/g; s/\\n/ /g; print' < input | fmt

它使用perl组合换行符--如果行的结尾为空白,则为连续行,否则将换行符全部剥离,然后通过fmt将输出数据输送到管道中,以便将长行中断。

票数 1
EN
页面原文内容由Unix & Linux提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://unix.stackexchange.com/questions/473522

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档