How to capture the last sentence with a regex pattern in Python

Stack Overflow user
Asked 2020-05-13 08:20:14
1 answer, 58 views, 0 followers, 0 votes

I have the following text:

"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===
Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible spectrum]] is absorbed through [[photosynthesis]])
"

"TITULO: Albedo SUBTITULO Y PARRAFO: ==Human activities==
Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas around 
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ===U.S. House of Representatives, 1847–1849===
[[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb|upright|alt=Middle 

I want to build a DataFrame from it in Python using regex:

Title                  Head                          TEXT
Albedo                 Trees                         Because forests generally have a low  ...([[photosynthesis]])
Albedo                 Human activities              Human activities (e.g., de...areas around 
Abraham Lincoln        U.S. House of..1849           [[File:Abraham Lincoln by... line Whig,

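I have the code below. It works for the first and second columns, but for the third column I don't know how to capture everything that comes after the last `==`, `===`, or `====`. That is: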

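Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible spectrum]] is absorbed through [[photosynthesis]])

Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas around

[[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb|upright|alt=Middle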

import re
from collections import defaultdict
import pandas as pd

pandas_dict = defaultdict(list)

with open("datos_titulos.csv", "r") as f:
    for line in f:


        pat = r"TITULO: (.*) SUBTITULO Y PARRAFO: ==(.*?)==|rTITULO: (.*) SUBTITULO Y PARRAFO: ===(.*?)==="
        pat2 = r"TITULO: (.*) SUBTITULO Y PARRAFO: ==(.*?)==$|rTITULO: (.*) SUBTITULO Y PARRAFO: ===(.*?)===$"

        if re.search(pat, line) :

            pandas_dict["title"].append(re.search(pat, line).group(1))
            pandas_dict["head"].append(re.search(pat, line).group(2))
        if re.search(pat2, line) :

            pandas_dict["text"].append(re.search(pat2, line).group(2))
df = pd.DataFrame(pandas_dict) 
1 Answer

Stack Overflow user

Answered 2020-05-19 09:25:46

import re
from collections import defaultdict
import pandas as pd

pandas_dict = defaultdict(list)

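# Title: everything between "TITULO: " and " SUBTITULO"
regx_title = r"TITULO: (.*) SUBTITULO"
# Head: the text wrapped in two or more "=" signs
regx_head = r"={2,}(.*?)={2,}"
# Text: any non-empty line that does not start with a double quote
regx_text = r"^(?!\")(.+)"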

regex_list = [regx_title, regx_head, regx_text]

with open("datos_titulos.csv", "r") as f:
    for line in f:
        for i, regx in enumerate(regex_list):
            r = re.findall(regx, line)
            if r:
                pandas_dict[i].append(r[0])

df = pd.DataFrame(pandas_dict)
df = df.rename(columns={0:"Title", 1:"Head", 2:"TEXT"})

with pd.option_context('display.max_colwidth', 25):
    print(df)

Details

  • "TITULO: (.*) SUBTITULO"

- `(.*)` - Capturing group: any 0 or more chars other than line break chars

  • ={2,}(.*?)={2,}

- `={2,}` - two or more character `=` literally
- `(.*?)` - any 0 or more chars other than line break chars (as few times as possible)
- `={2,}` - two or more character `=` literally

  • ^(?!\")(.+)

- `^` - start of a line
- `(?!` - Negative Lookahead (assert that the regex below does not match)  
    - `\"` - matches the character `"` literally

- `)` - close Negative Lookahead
- `(.+)` - Capturing group: any 1 or more chars other than line break chars
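As a quick check of the three patterns described above, here is a minimal sketch that applies them to two in-memory lines instead of the CSV file. The strings `header_line` and `text_line` are shortened copies of the sample in the question, and the variable names are only illustrative.

```python
import re

regx_title = r"TITULO: (.*) SUBTITULO"
regx_head = r"={2,}(.*?)={2,}"
regx_text = r"^(?!\")(.+)"

# Shortened copies of two lines from the question's sample (assumed input)
header_line = '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees==='
text_line = "Because forests generally have a low albedo"

title = re.findall(regx_title, header_line)            # text between the two keywords
head = re.findall(regx_head, header_line)              # text between the "=" runs
text_on_header = re.findall(regx_text, header_line)    # header starts with ", so no match
text = re.findall(regx_text, text_line)                # whole line is captured

print(title, head, text_on_header, text)
```

The header line yields a title and a head but no text, while the plain line yields only text, which is why the loop can append each result to its own column.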

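Note that the last regex captures every line that does not start with `"`, so make sure the only lines without a leading `"` are the ones holding your text field. It works well on your sample text, as the output shows, but if for any reason your text field is on the same line as the header, you can use the following regex instead: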

TITULO: (.*) SUBTITULO|={2,}(.*?)={2,}|(?<===)(.+)

With the following code:

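import pandas as pd
import re

regx = r"TITULO: (.*) SUBTITULO|={2,}(.*?)={2,}|(?<===)(.+)"

rows = []
with open("datos_titulos_inline.csv", "r") as f:
    for line in f:
        # findall returns one 3-tuple per match; only one group is filled in each tuple
        r = re.findall(regx, line)
        # Skip lines that do not contain all three parts (title, head, text)
        if len(r) >= 3:
            rows.append([r[0][0], r[1][1], r[2][2]])

# Build the frame once from the collected rows (DataFrame.append is not available in pandas 2.0+)
df1 = pd.DataFrame(rows, columns=["Title", "Head", "TEXT"])

with pd.option_context('display.max_colwidth', 25):
    print(df1)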
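To see why the code above indexes the result as `r[0][0]`, `r[1][1]`, `r[2][2]`, the sketch below runs the combined regex on a single in-memory line. The line is an assumed inline variant of the question's sample, with the text placed directly after the closing `===`.

```python
import re

regx = r"TITULO: (.*) SUBTITULO|={2,}(.*?)={2,}|(?<===)(.+)"

# Assumed inline layout: header and text on the same line
line = '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===Because forests generally have a low albedo'

# With three capturing groups, findall returns a list of 3-tuples,
# one tuple per match, with the non-participating groups left empty.
matches = re.findall(regx, line)
for m in matches:
    print(m)

title, head, text = matches[0][0], matches[1][1], matches[2][2]
```

Each alternative of the pattern fills a different group, so the title sits in position 0 of the first tuple, the head in position 1 of the second, and the text in position 2 of the third.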

See the regex demo for the inline text field.

Output DataFrame:

             Title                      Head                      TEXT
0           Albedo                     Trees   Because forests gene...
1           Albedo          Human activities   Human activities (e....
2  Abraham Lincoln  U.S. House of Represe...   [[File:Abraham Linco...
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/61764240