Regular expression to extract paragraphs from the page below

Stack Overflow user
Asked 2015-03-13 07:40:10
Answers: 2 · Views: 962 · Followers: 0 · Votes: 1

I used iText to extract the following text from a PDF and stored it in a String variable:

(1) A a, — al'-fah; of Hebrew origin; the first letter of the alphabet;
figurative only (from its use as a numeral) the first: — Alpha.
Often used (usually ajn an, before a vowel) also in composition
(as a contraction from (427) (a]neu,)) in the sense of privation;
so in many words beginning with this letter; occasionally in the
sense of union (as a contraction of (260) (a[ma)).
(2) ÆAarw>n, — ah-ar-ohn'; of Hebrew origin [Hebrew {175}
('Aharown)]; Aaron, the brother of Moses: — Aaron.
(3) ÆAbaddw>n, — ab-ad-dohn'; of Hebrew origin [Hebrew {11}
('abaddown)]; a destroying angel: — Abaddon.
(4) ajbarh>v, — ab-ar-ace'; from (1) (a) (as a negative particle) and (922)
(ba>rov); weightless, i.e. (figurative) not burdensome: — from
being burdensome.
(5) ÆAbba~, — ab-bah'; of Chaldee origin [Hebrew {2} ('ab (Chaldee))];
father (as a vocative): — Abba.
(6) &Abel, — ab'-el; of Hebrew origin [Hebrew {1893} (Hebel)]; Abel,
the son of Adam: — Abel.
(7) ÆAbia>, — ab-ee-ah'; of Hebrew origin [Hebrew {29} ('Abiyah)];
Abijah, the name of two Israelites: — Abia.
(8) ÆAbia>qar, — ab-ee-ath'-ar; of Hebrew origin [Hebrew {54}
('Ebyathar)]; Abiathar, an Israelite: — Abiathar.
(9) ÆAbilhnh>, — ab-ee-lay-nay'; of foreign origin [compare Hebrew {58}
('abel)]; Abilene, a region of Syria: — Abilene.
(10) ÆAbiou>d, — ab-ee-ood'; of Hebrew origin [Hebrew {31}
('Abiyhuwd)]; Abihud, an Israelite: — Abiud.

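Each paragraph in the string starts with a number in parentheses, i.e. ([0-9]), as in (9) or (5). I would like to use pagestring.split("regex") to extract every paragraph that starts with this character sequence. Can anyone help?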

2 Answers

Stack Overflow user

Accepted answer

Answered 2015-03-13 08:38:56

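This avoids splitting on a "(999)" embedded in the text. It relies on the assumption that a parenthesized number immediately preceded by a line end marks the start of a paragraph. Also note that, with the sample text, the empty string before the first parenthesized number produces an empty "paragraph", hence the if statement.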

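  String text = ...;
  // Split on "(digits)" only when it sits at the start of the input or right after a newline.
  String[] paras = text.split( "(?<=(^|\\n))\\(\\d+\\)" );
  for( String para: paras ){
      // Skip the empty string that precedes the first "(1)".
      if( para.length() > 0 ){
          System.out.println( "Para: " + para );
      }
  }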
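
If the paragraph numbers should stay attached to their paragraphs, one possible variant (not part of the answer above) is to split at a zero-width position instead of on the marker itself. The sketch below assumes Java 8 or later, where a zero-width match at the start of the input does not produce an empty leading element. The class name `ParagraphSplitter`, the method `splitParagraphs`, and the sample string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
class ParagraphSplitter {

    // (?m) lets ^ match at the start of every line; the lookahead requires
    // a parenthesized number there but consumes nothing, so "(1)" is kept.
    private static final Pattern PARAGRAPH_START =
            Pattern.compile("(?m)^(?=\\(\\d+\\))");

    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String part : PARAGRAPH_START.split(text)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {   // guard against empty pieces
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Shortened, made-up sample in the shape of the extracted text.
        String text = "(1) A a, first letter\n"
                + "(as a contraction from (427)) privation.\n"
                + "(2) Aaron, the brother\n"
                + "of Moses.\n"
                + "(3) Abaddon.\n";
        for (String para : splitParagraphs(text)) {
            System.out.println("Para: " + para);
        }
    }
}
```

A wrapped line that happens to begin with a cross-reference such as (427) would still be treated as a new paragraph, which is the same limitation as the line-based assumption in the answer.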
Votes: 0

Stack Overflow user

Answered 2015-03-13 09:15:27

You can use the following regex in the split method: "[\n|.]\\([0-9]{1,2}\\)". It extracts all the paragraphs from the text (covering numbers from 0 to 99):

String[] parts=st.split("[\n|.]\\([0-9]{1,2}\\)");

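[\n|.]: restricts matches to the start of a new paragraph, so a (n) in the middle of a line is ignored. Inside a character class, | and . are literal characters, so this class matches a newline, a pipe, or a period.
\\([0-9]{1,2}\\): matches one or two digits enclosed in ().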
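
To see how this pattern behaves at the edges, it can be run on a small input. The sketch below is only an illustration: the class name `CharClassSplitDemo`, the method `splitOnMarkers`, and the sample string are made up, and only the regex itself comes from the answer.

```java
import java.util.Arrays;

// Hypothetical demo class; only the regex is taken from the answer.
class CharClassSplitDemo {

    static String[] splitOnMarkers(String st) {
        // [\n|.] is a character class: newline, literal '|', or literal '.'.
        // [0-9]{1,2} accepts one or two digits, so "(100)" never matches.
        return st.split("[\n|.]\\([0-9]{1,2}\\)");
    }

    public static void main(String[] args) {
        // Made-up input with a marker after a period and a three-digit marker.
        String st = "(1) first.\n(2) see.(42) more\n(100) third";
        System.out.println(Arrays.toString(splitOnMarkers(st)));
    }
}
```

On this input the split yields three pieces. The leading (1) stays in the first piece because no character precedes it, the period before (42) triggers a split in the middle of a line, and (100) is left alone because {1,2} allows at most two digits.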

Here is a working demo, which produces an array containing all the paragraphs.

For more information on using regex, see Java Regex Pattern.

Votes: 0
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud 小微 IT-domain engine.
Original link: https://stackoverflow.com/questions/29027063
