Question: How to use a regex pattern in Python to take the last sentence

Stack Overflow user

Asked 2020-05-13 08:20:14

I have the following text:

"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===
Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible spectrum]] is absorbed through [[photosynthesis]])
"

"TITULO: Albedo SUBTITULO Y PARRAFO: ==Human activities==
Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas around 
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ===U.S. House of Representatives, 1847–1849===
[[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb|upright|alt=Middle

I want to build a DataFrame from it in Python using regex:

Title                  Head                          TEXT
Albedo                 Trees                         Because forests generally have a low  ...([[photosynthesis]])
Albedo                 Human activities              Human activities (e.g., de...areas around 
Abraham Lincoln        U.S. House of..1849           [[File:Abraham Lincoln by... line Whig,

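I have this code. It works for the first and second columns, but for the third column I don't know how to capture everything that follows the last ==, === or ====. That is:

Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible spectrum]] is absorbed through [[photosynthesis]])

Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas around

[[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb|upright|alt=Middle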

import re
from collections import defaultdict
import pandas as pd

pandas_dict = defaultdict(list)

with open("datos_titulos.csv", "r") as f:
    for line in f:


        pat = r"TITULO: (.*) SUBTITULO Y PARRAFO: ==(.*?)==|rTITULO: (.*) SUBTITULO Y PARRAFO: ===(.*?)==="
        pat2 = r"TITULO: (.*) SUBTITULO Y PARRAFO: ==(.*?)==$|rTITULO: (.*) SUBTITULO Y PARRAFO: ===(.*?)===$"

        if re.search(pat, line) :

            pandas_dict["title"].append(re.search(pat, line).group(1))
            pandas_dict["head"].append(re.search(pat, line).group(2))
        if re.search(pat2, line) :

            pandas_dict["text"].append(re.search(pat2, line).group(2))
df = pd.DataFrame(pandas_dict)

Tags: python, regex, pandas, dataframe

1 Answer

Stack Overflow user

Answered 2020-05-19 09:25:46

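import re
from collections import defaultdict
import pandas as pd

pandas_dict = defaultdict(list)

# One pattern per output column: 0 = title, 1 = head, 2 = text.
regx_title = r"TITULO: (.*) SUBTITULO"
regx_head = r"={2,}(.*?)={2,}"
regx_text = r"^(?!\")(.+)"

regex_list = [regx_title, regx_head, regx_text]

with open("datos_titulos.csv", "r") as f:
    for line in f:
        # Try every pattern on every line: the header line feeds columns 0 and 1,
        # the text line that follows it feeds column 2.
        for i, regx in enumerate(regex_list):
            r = re.findall(regx, line)
            if r:
                pandas_dict[i].append(r[0])

# The three lists must have the same length, otherwise the constructor raises ValueError.
df = pd.DataFrame(pandas_dict)
df = df.rename(columns={0:"Title", 1:"Head", 2:"TEXT"})

with pd.option_context('display.max_colwidth', 25):
    print(df)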

Details

"TITULO: (.*) SUBTITULO"

- `(.*)` - Capturing group: any 0 or more chars other than line break chars
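As a quick check of the title pattern, here is a minimal sketch that applies it to the first header line from the question. The variable names `line` and `title` are illustrative only.

```python
import re

regx_title = r"TITULO: (.*) SUBTITULO"

# First header line from the question's sample text.
line = '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees==='

# (.*) is greedy, but the literal " SUBTITULO" that must follow it
# forces the group to stop right before that word.
match = re.search(regx_title, line)
title = match.group(1) if match else None
print(title)  # Albedo
```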

={2,}(.*?)={2,}

- `={2,}` - two or more literal `=` characters
- `(.*?)` - Capturing group: any 0 or more chars other than line break chars (as few times as possible)
- `={2,}` - two or more literal `=` characters
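The sketch below shows the heading pattern coping with different heading depths. The first two samples come from the question; the four-sign heading is a made-up example to cover the `====` case the asker mentions.

```python
import re

regx_head = r"={2,}(.*?)={2,}"

samples = [
    "SUBTITULO Y PARRAFO: ==Human activities==",
    "SUBTITULO Y PARRAFO: ===Trees===",
    "SUBTITULO Y PARRAFO: ====Deeper level====",  # hypothetical heading
]

# ={2,} greedily consumes the whole run of '=' on each side,
# so the lazy group keeps only the heading text.
heads = [re.findall(regx_head, s)[0] for s in samples]
print(heads)  # ['Human activities', 'Trees', 'Deeper level']
```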

^(?!\")(.+)

- `^` - start of a line
- `(?!` - negative lookahead (asserts that the regex inside does not match at this position)
    - `\"` - matches the character `"` literally
- `)` - closes the negative lookahead
- `(.+)` - Capturing group: any 1 or more chars other than line break chars
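To see which lines the text pattern keeps, the sketch below runs it over a few lines shaped like the question's sample (the text line is shortened). It assumes, as the answer's loop does, that each line is matched on its own.

```python
import re

regx_text = r"^(?!\")(.+)"

lines = [
    '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===\n',  # starts with a quote: skipped
    'Because forests generally have a low albedo\n',       # kept
    '"\n',                                                  # lone closing quote: skipped
    '\n',                                                   # empty line: .+ needs one char
]

# Without re.MULTILINE, ^ only matches at the start of each string.
texts = [m.group(1) for m in (re.search(regx_text, ln) for ln in lines) if m]
print(texts)  # ['Because forests generally have a low albedo']
```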

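Also note that the last regex captures every line that does not start with `"`, so make sure the only lines that do not start with `"` are the ones holding the text field you want. It works well with your sample text, as the output shows, but if for any reason your text field is on the same line as the header, you can use the following regex: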

TITULO: (.*) SUBTITULO|={2,}(.*?)={2,}|(?<===)(.+)

With the following code:

import pandas as pd
import re

regx = r"TITULO: (.*) SUBTITULO|={2,}(.*?)={2,}|(?<===)(.+)"

rows = []
with open("datos_titulos_inline.csv", "r") as f:
    for line in f:
        r = re.findall(regx, line)
        # Each alternative yields one tuple: title, then head, then text.
        # Skip lines that do not contain all three parts.
        if len(r) >= 3:
            rows.append([r[0][0], r[1][1], r[2][2]])

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows.
df1 = pd.DataFrame(rows, columns=["Title", "Head", "TEXT"])

with pd.option_context('display.max_colwidth', 25):
    print(df1)

See the regex demo for the inline text field.

Output DataFrame:

             Title                      Head                      TEXT
0           Albedo                     Trees   Because forests gene...
1           Albedo          Human activities   Human activities (e....
2  Abraham Lincoln  U.S. House of Represe...   [[File:Abraham Linco...
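If you do not know in advance whether the text sits on the header line or on the lines after it, one alternative is to match whole records against the full file contents instead of going line by line. This is only a sketch, not part of the original answer: the `record` pattern and the abridged `raw` sample are assumptions, and it treats everything up to the next `TITULO: ` (or the end of input) as the text.

```python
import re
import pandas as pd

# Hypothetical record pattern: title, heading wrapped in 2+ '=' signs,
# then text up to the next record or the end of the input.
record = re.compile(
    r'TITULO: (?P<Title>.*?) SUBTITULO Y PARRAFO: '
    r'={2,}(?P<Head>.*?)={2,}\s*'
    r'(?P<TEXT>.*?)\s*"?\s*(?="?TITULO: |\Z)',
    re.DOTALL,  # let the text span several lines
)

# Abridged sample mixing both layouts (text on the next line and inline).
raw = (
    '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===\n'
    'Because forests generally have a low albedo\n'
    '"\n'
    '\n'
    '"TITULO: Albedo SUBTITULO Y PARRAFO: ==Human activities== Human activities change the albedo\n'
    '"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ===U.S. House of Representatives, 1847–1849===\n'
    '[[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb\n'
)

rows = [m.groupdict() for m in record.finditer(raw)]
df2 = pd.DataFrame(rows, columns=["Title", "Head", "TEXT"])

with pd.option_context('display.max_colwidth', 25):
    print(df2)
```

With a real file you would read the contents with `f.read()` in place of `raw`. The obvious limitation is that a text field which itself contains `TITULO: ` would be cut short.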

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/61764240
