I have read this thread: Regex to find all sentences of text?, but I can't seem to apply it to my exact situation. Here is the text I'm working with:
import regex as re
sentence = re.compile(r"[A-Z].*?[\.!?] ", re.MULTILINE | re.DOTALL)
phrase = """For necessary expenses of the Office of Inspector
General, including employment pursuant to the Inspector
General Act of 1978 (Public Law 95–452; 5 U.S.C. App.),
$99,912,000, including such sums as may be necessary for
contracting and other arrangements with public agencies
and private persons pursuant to section 6(a)(9) of the Inspector General Act of 1978 (Public Law 95–452; 5
U.S.C. App.), and including not to exceed $125,000 for
certain confidential operational expenses, including the
payment of informants, to be expended under the direction
of the Inspector General pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.) and
section 1337 of the Agriculture and Food Act of 1981. For necessary expenses of the Office of the General
23 Counsel, $45,390,000."""
phrase = phrase.replace("\n", " ")  # join lines with a space so words on adjacent lines don't run together
sentence.findall(phrase)
# outputs:
['For necessary expenses of the Office of Inspector General, including employment pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. ',
'App.), $99,912,000, including such sums as may be necessary for contracting and other arrangements with public agencies and private persons pursuant to section 6(a)(9) of the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. ',
'App.), and including not to exceed $125,000 for certain confidential operational expenses, including the payment of informants, to be expended under the direction of the Inspector General pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. ',
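'App.) and section 1337 of the Agriculture and Food Act of 1981. ']
In this case, there are really only two sentences in this long passage. The first is:
For necessary expenses of the Office of Inspector General, including employment pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.), $99,912,000, including such sums as may be necessary for contracting and other arrangements with public agencies and private persons pursuant to section 6(a)(9) of the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.), and including not to exceed $125,000 for certain confidential operational expenses, including the payment of informants, to be expended under the direction of the Inspector General pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.) and section 1337 of the Agriculture and Food Act of 1981.
The second is:
For necessary expenses of the Office of the General Counsel, $45,390,000.
Is there a way, via regex or some other method, to extract what I want? The end goal is to extract all of the complete sentences and then search them for specific things (in case that affects the solution).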
Posted on 2021-01-18 06:38:45
Try this:
regex = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"
re.split(regex, phrase)
Posted on 2021-01-18 06:41:48
import re
# s is the text to split (phrase in the question)
print([x for x in re.split(r"([A-Z].+(\(.+\)){0,1}.+)\.\s", s.replace("\n", " ")) if x])
Output:
['For necessary expenses of the Office of Inspector General, including employment pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.), $99,912,000, including such sums as may be necessary for contracting and other arrangements with public agencies and private persons pursuant to section 6(a)(9) of the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.), and including not to exceed $125,000 for certain confidential operational expenses, including the payment of informants, to be expended under the direction of the Inspector General pursuant to the Inspector General Act of 1978 (Public Law 95–452; 5 U.S.C. App.) and section 1337 of the Agriculture and Food Act of 1981', 'For necessary expenses of the Office of the General 23 Counsel, $45,390,000.']
The regex is:
regex = r"([A-Z].+(\(.+\)){0,1}.+)\.\s"
re.split(r"([A-Z].+(\(.+\)){0,1}.+)\.\s", s.replace("\n", " "))
https://stackoverflow.com/questions/65769689
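To tie the thread back to the stated goal (extract the complete sentences, then search them), here is a minimal sketch built on the lookbehind split from the first answer. The helper names `split_sentences` and `find_sentences` and the shortened sample text are assumptions made for illustration; they do not come from either answer.

```python
import re

# Lookbehind-based boundary from the first answer, written as a raw string.
# It splits on whitespace that follows "." or "?", unless the preceding
# characters look like "U.S." style initials or a two-letter title like "Mr.".
SENTENCE_BOUNDARY = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")


def split_sentences(text):
    """Collapse newlines and whitespace runs to single spaces, then split."""
    flat = " ".join(text.split())
    return [part for part in SENTENCE_BOUNDARY.split(flat) if part]


def find_sentences(sentences, keyword):
    """Return the sentences that contain keyword (case-insensitive)."""
    needle = keyword.lower()
    return [s for s in sentences if needle in s.lower()]


# Shortened, hypothetical sample in the style of the question's text.
sample = """For necessary expenses of the Office of Inspector
General (Public Law 95-452; 5 U.S.C. App.), $99,912,000, and
section 1337 of the Agriculture and Food Act of 1981. For necessary
expenses of the Office of the General Counsel, $45,390,000."""

sentences = split_sentences(sample)
print(len(sentences))
print(find_sentences(sentences, "General Counsel"))
```

On this sample the split happens only after "1981.", because "U.S.C. " is protected by the first lookbehind and "App.)" is not followed directly by whitespace. The pattern is a heuristic: a longer abbreviation followed by a space, such as "Sec. ", would still be treated as a sentence end.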