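I have PDF files from which I need to extract certain paragraphs. I use the pdf-reader gem to convert the PDF to a text file, and now I am trying to extract the paragraph from that text with a regular expression.
After conversion, my text looks like this: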
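48 - Pin TSOP I (12 x 20 / 0.5 mm pitch)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGENERAL DESCRIPTION\n Offered in 1G x 8bit, the K9K8 is a 8G-bit NAND Flash Memo ry with spare 256M-bit. Its NAND cell provides the most cost-\n\n effective solution for the solid state application marke t. A program operation can be performed in typical 200 µs on the (2K+64)Byte\n page and an erase operation can be performed in typical 1.5ms on a (128K+4K)Byte block. Data in the data register can be read out\n at 25ns(K9NBG) cycle time per Byte. The I/O pins serve as the ports for address and data input/output as well as com-\n\n mand input. The on-chip write controller aut omates all program and erase functions in cluding pulse repetition, where required, and\n internal verification and margining of data. Even the writ e-intensive systems can take advantage of the K9K8G08U0M ′s extended\n reliability of 100K program/eras e cycles by providing ECC(Error Correc) with real time mapping-out algorithm. The\n\n K9K8G08U0M is an optimum solution for large nonvolatile storage appl ications such as solid state file storage and other portabl e\n applications requiring non-volatility.\n An ultra high density solution having two 8Gb stacked with twochip selects is also available in standard TSOPI package and another\n\n ultra high density solution having two 16Gb TSOPI package stacked with four chip selects is also available in TSOPI-DSP.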
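I want to extract the text from GENERAL DESCRIPTION to the end of the paragraph, which is marked by multiple newlines (at least 3 \n). I implemented the following method, but it only extracts the first line of the paragraph.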
def extract_desc(page_text)
  desc_headings = ['GENERAL DESCRIPTION', 'General Description',
                   'DESCRIPTION', 'Description']
  desc_regex_str = "(#{desc_headings.join('|')})[\s\n]{0,4}?((.+)\n\n)"
  desc_regex = RE2::Regexp.new(desc_regex_str)
  matched_desc_text = desc_regex.match(page_text)
  desc = matched_desc_text[0] if matched_desc_text && matched_desc_text.size > 1
  desc
end
How can I extract this paragraph? The text above is only a snippet; there is other text both above and below the description. How do I extract the text between 'GENERAL DESCRIPTION', 'General Description', 'DESCRIPTION' or 'Description' and a run of at least three \n?
Posted on 2020-10-22 11:19:11
You can use
/\b(?:General Description|Description).*?(?=\n{3}|\z)/im
See the regex demo. Or, a faster but more verbose alternative:
/\b(?:General Description|Description).*(?:\n(?!\n{2}).*)*/i
Details
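(?im) (or the trailing /im flags) - case-insensitive (i) and multiline (m, the dot now also matches line breaks) modes
\b - a word boundary
(?:General Description|Description) - a non-capturing group that matches either of the two strings, General Description or Description
.*? - any zero or more characters, as few as possible
(?=\n{3}|\z) - a positive lookahead that matches a position immediately followed by three newlines or the end of the string
The second pattern ends with .*(?:\n(?!\n{2}).*)*, which matches:
.* - any zero or more characters other than line break characters (there is no m modifier here), as many as possible
(?:\n(?!\n{2}).*)* - zero or more sequences of a newline that is not followed by two more newlines, then the rest of that line
See the Ruby demo.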
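As a quick sanity check of the pattern details above, here is a minimal sketch that runs both patterns on a short made-up string. The names SAMPLE, LAZY and UNROLLED are chosen for illustration only and do not come from the answer; the sample assumes the paragraph contains single and double newlines and ends at a run of four.

```ruby
# Hypothetical sample: heading, a paragraph with single and double newlines,
# then a run of four newlines followed by unrelated text.
SAMPLE = "Intro line\n\n\n\nGENERAL DESCRIPTION\n first line\n\n second line\n third line\n\n\n\nother text"

# Pattern 1: lazy dot with the m flag (dot matches newlines), stopped by a
# lookahead for three newlines or the end of the string.
LAZY = /\b(?:General Description|Description).*?(?=\n{3}|\z)/im

# Pattern 2: unrolled form without the m flag, so each .* stays on one line
# and a newline is only consumed when it does not start a run of three.
UNROLLED = /\b(?:General Description|Description).*(?:\n(?!\n{2}).*)*/i

lazy_match = SAMPLE[LAZY]
unrolled_match = SAMPLE[UNROLLED]

p lazy_match
# => "GENERAL DESCRIPTION\n first line\n\n second line\n third line"
p lazy_match == unrolled_match
# => true
```

Both patterns keep the double newline inside the paragraph and stop before the longer run, which is the behaviour the question asks for.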
def extract_desc(page_text)
  desc_headings = ['General Description', 'Description']
  desc_regex = /(?im)\b(?:#{desc_headings.join('|')}).*?(?=\n{3}|\z)/
  page_text[desc_regex]
end
text = "48 - Pin TSOP I (12 x 20 / 0.5 mm pitch)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGENERAL DESCRIPTION\n Offered in 1G x 8bit, the K9K8 is a 8G-bit NAND Flash Memo ry with spare 256M-bit. Its NAND cell provides the most cost-\n\n effective solution for the solid state application marke t. A program operation can be performed in typical 200 µs on the (2K+64)Byte\n page and an erase operation can be performed in typical 1.5ms on a (128K+4K)Byte block. Data in the data register can be read out\n at 25ns(K9NBG) cycle time per Byte. The I/O pins serve as the ports for address and data input/output as well as com-\n\n mand input. The on-chip write controller aut omates all program and erase functions in cluding pulse repetition, where required, and\n internal verification and margining of data. Even the writ e-intensive systems can take advantage of the K9K8G08U0M ′s extended\n reliability of 100K program/eras e cycles by providing ECC(Error Correc) with real time mapping-out algorithm. The\n\n K9K8G08U0M is an optimum solution for large nonvolatile storage appl ications such as solid state file storage and other portabl e\n applications requiring non-volatility.\n An ultra high density solution having two 8Gb stacked with twochip selects is also available in standard TSOPI package and another\n\n ultra high density solution having two 16Gb TSOPI package stacked with four chip selects is also available in TSOPI-DSP.\n\n\n\n\n\n\n\ntsopi dhf ghghgfhggfg hhhdhdggdj...."
p extract_desc(text)
Posted on 2020-10-22 11:00:01
Maybe split is enough:
> text = "48 - Pin TSOP I (12 x 20 / 0.5 mm pitch)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGENERAL DESCRIPTION\n Offered in 1G x 8bit, the K9K8 is a 8G-bit NAND Flash Memo ry with spare 256M-bit. Its NAND cell provides the most cost-\n\n effective solution for the solid state application marke t. A program operation can be performed in typical 200 µs on the (2K+64)Byte\n page and an erase operation can be performed in typical 1.5ms on a (128K+4K)Byte block. Data in the data register can be read out\n at 25ns(K9NBG) cycle time per Byte. The I/O pins serve as the ports for address and data input/output as well as com-\n\n mand input. The on-chip write controller aut omates all program and erase functions in cluding pulse repetition, where required, and\n internal verification and margining of data. Even the writ e-intensive systems can take advantage of the K9K8G08U0M ′s extended\n reliability of 100K program/eras e cycles by providing ECC(Error Correc) with real time mapping-out algorithm. The\n\n K9K8G08U0M is an optimum solution for large nonvolatile storage appl ications such as solid state file storage and other portabl e\n applications requiring non-volatility.\n An ultra high density solution having two 8Gb stacked with twochip selects is also available in standard TSOPI package and another\n\n ultra high density solution having two 16Gb TSOPI package stacked with four chip selects is also available in TSOPI-DSP.\n\n\n\n\n\n\n\ntsopi dhf ghghgfhggfg hhhdhdggdj...."
> text.split(/\n{3,}/)[1]
=> "GENERAL DESCRIPTION\n Offered in 1G x 8bit, the K9K8 is a 8G-bit NAND Flash Memo ry with spare 256M-bit. Its NAND cell provides the most cost-\n\n effective solution for the solid state application marke t. A program operation can be performed in typical 200 µs on the (2K+64)Byte\n page and an erase operation can be performed in typical 1.5ms on a (128K+4K)Byte block. Data in the data register can be read out\n at 25ns(K9NBG) cycle time per Byte. The I/O pins serve as the ports for address and data input/output as well as com-\n\n mand input. The on-chip write controller aut omates all program and erase functions in cluding pulse repetition, where required, and\n internal verification and margining of data. Even the writ e-intensive systems can take advantage of the K9K8G08U0M ′s extended\n reliability of 100K program/eras e cycles by providing ECC(Error Correc) with real time mapping-out algorithm. The\n\n K9K8G08U0M is an optimum solution for large nonvolatile storage appl ications such as solid state file storage and other portabl e\n applications requiring non-volatility.\n An ultra high density solution having two 8Gb stacked with twochip selects is also available in standard TSOPI package and another\n\n ultra high density solution having two 16Gb TSOPI package stacked with four chip selects is also available in TSOPI-DSP."
https://stackoverflow.com/questions/64480008
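The [1] index in the split approach only works when the description happens to be the second block on the page. A minimal sketch of a variant that instead picks the first block starting with a description heading might look like this; the names DESC_HEADING, SAMPLE_PAGE and extract_desc_by_split are hypothetical, and the sample text is shortened for illustration.

```ruby
# Hypothetical heading check: the block must start (after optional whitespace)
# with one of the headings, matched case-insensitively.
DESC_HEADING = /\A\s*(?:General Description|Description)\b/i

# Shortened, made-up page text: blocks are separated by 3+ newlines.
SAMPLE_PAGE = "48 - Pin TSOP I\n\n\n\nGENERAL DESCRIPTION\n Offered in 1G x 8bit.\n\n More text.\n\n\n\ntrailing text"

# Split the page into blocks on runs of three or more newlines and return the
# first block that begins with a description heading (nil if there is none).
def extract_desc_by_split(page_text)
  page_text.split(/\n{3,}/).find { |block| block.match?(DESC_HEADING) }
end

p extract_desc_by_split(SAMPLE_PAGE)
# => "GENERAL DESCRIPTION\n Offered in 1G x 8bit.\n\n More text."
```

This keeps the simplicity of split while not depending on where the description sits in the page; a page without a matching heading yields nil.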