I have the following text:
"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===
Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible spectrum]] is absorbed through [[photosynthesis]])
"
"TITULO: Albedo SUBTITULO Y PARRAFO: ==Human activities==
Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas around
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ===U.S. House of Representatives, 1847–1849===
[[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb|upright|alt=Middle
I want to use regex in Python to build a DataFrame like this:
Title            Head                 TEXT
Albedo           Trees                Because forests generally have a low ...([[photosynthesis]])
Albedo           Human activities     Human activities (e.g., de...areas around
Abraham Lincoln  U.S. House of..1849  [[File:Abraham Lincoln by... line Whig,
I have the code below. It works for the first and second columns, but for the third I don't know how to capture the text that follows the last ==, === or ====. That is, I want:
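Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible spectrum]] is absorbed through [[photosynthesis]])
Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas around
[[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb|upright|alt=Mid.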
import re
from collections import defaultdict
import pandas as pd

pandas_dict = defaultdict(list)
with open("datos_titulos.csv", "r") as f:
    for line in f:
        pat = r"TITULO: (.*) SUBTITULO Y PARRAFO: ==(.*?)==|rTITULO: (.*) SUBTITULO Y PARRAFO: ===(.*?)==="
        pat2 = r"TITULO: (.*) SUBTITULO Y PARRAFO: ==(.*?)==$|rTITULO: (.*) SUBTITULO Y PARRAFO: ===(.*?)===$"
        if re.search(pat, line):
            pandas_dict["title"].append(re.search(pat, line).group(1))
            pandas_dict["head"].append(re.search(pat, line).group(2))
        if re.search(pat2, line):
            pandas_dict["text"].append(re.search(pat2, line).group(2))
df = pd.DataFrame(pandas_dict)

Posted on 2020-05-19 09:25:46
import re
from collections import defaultdict
import pandas as pd

pandas_dict = defaultdict(list)
# One pattern per output column: title, head, text.
regx_title = r"TITULO: (.*) SUBTITULO"
regx_head = r"={2,}(.*?)={2,}"
regx_text = r"^(?!\")(.+)"
regex_list = [regx_title, regx_head, regx_text]
with open("datos_titulos.csv", "r") as f:
    for line in f:
        # Try every pattern on every line; column i collects the matches of pattern i.
        for i, regx in enumerate(regex_list):
            r = re.findall(regx, line)
            if r:
                pandas_dict[i].append(r[0])
df = pd.DataFrame(pandas_dict)
df = df.rename(columns={0: "Title", 1: "Head", 2: "TEXT"})
with pd.option_context('display.max_colwidth', 25):
    print(df)

Details

"TITULO: (.*) SUBTITULO"
- `(.*)` - Capturing group: any 0 or more chars other than line break chars

={2,}(.*?)={2,}
- `={2,}` - two or more `=` characters, literally
- `(.*?)` - Capturing group: any 0 or more chars other than line break chars (as few as possible)
- `={2,}` - two or more `=` characters, literally

^(?!\")(.+)
- `^` - start of the line
- `(?!` - Negative Lookahead (assert that the regex below does not match)
- `\"` - matches the character `"` literally
- `)` - close Negative Lookahead
- `(.+)` - Capturing group: any 1 or more chars other than line break chars
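To make the breakdown above concrete, here is a small sketch that applies the three patterns to two lines shaped like the sample data. The helper `first_match` and the variable names `header_line` and `text_line` are illustrative only and do not appear in the original code.

```python
import re

regx_title = r"TITULO: (.*) SUBTITULO"
regx_head = r"={2,}(.*?)={2,}"
regx_text = r"^(?!\")(.+)"
patterns = (regx_title, regx_head, regx_text)

# Two lines shaped like the sample data: a header line and a text line.
header_line = '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===\n'
text_line = "Because forests generally have a low albedo\n"


def first_match(pattern, line):
    """Return the first captured group of pattern in line, or None (hypothetical helper)."""
    found = re.findall(pattern, line)
    return found[0] if found else None


header_result = [first_match(p, header_line) for p in patterns]
text_result = [first_match(p, text_line) for p in patterns]

print(header_result)  # ['Albedo', 'Trees', None]
print(text_result)    # [None, None, 'Because forests generally have a low albedo']
```

The header line feeds the title and head columns and is rejected by the text pattern because it starts with `"`, while the text line only feeds the text column.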
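Also, the last regex captures every line that does not start with ", so make sure that the only lines in the file not starting with " are the ones holding your text field. It works well on your sample text, as can be seen in the output provided, but if for any reason your text field is on the same line as the title and heading, you can use the following regex instead:
TITULO: (.*) SUBTITULO|={2,}(.*?)={2,}|(?<===)(.+)
with the following code: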
import pandas as pd
import re

regx = r"TITULO: (.*) SUBTITULO|={2,}(.*?)={2,}|(?<===)(.+)"
rows = []
with open("datos_titulos_inline.csv", "r") as f:
    for line in f:
        # findall yields one 3-tuple per match; only the group that matched is non-empty.
        r = re.findall(regx, line)
        if len(r) >= 3:
            rows.append([r[0][0], r[1][1], r[2][2]])
# Build the frame from the collected rows (DataFrame.append was removed in pandas 2.0).
df1 = pd.DataFrame(rows, columns=["Title", "Head", "TEXT"])
with pd.option_context('display.max_colwidth', 25):
    print(df1)

See the regex demo for the inline text field.
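As a rough illustration of why the code indexes the result as `r[0][0]`, `r[1][1]`, `r[2][2]`, the sketch below runs the combined pattern on a single inline line. The sample string `inline_line` is an assumption, built from the question's data with the text placed on the same line as the heading.

```python
import re

regx = r"TITULO: (.*) SUBTITULO|={2,}(.*?)={2,}|(?<===)(.+)"

# Assumed inline layout: title, heading and text all on one line.
inline_line = (
    '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees==='
    "Because forests generally have a low albedo\n"
)

# With three groups in the pattern, findall returns one 3-tuple per match,
# and the groups of the alternatives that did not match are empty strings.
matches = re.findall(regx, inline_line)
for m in matches:
    print(m)

# Match 0 fills group 0 (title), match 1 group 1 (head), match 2 group 2 (text).
row = [matches[0][0], matches[1][1], matches[2][2]]
print(row)  # ['Albedo', 'Trees', 'Because forests generally have a low albedo']
```

Each alternative of the pattern matches once per line, in order, which is what makes the diagonal indexing work.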
Output DataFrame:
             Title                      Head                     TEXT
0           Albedo                     Trees  Because forests gene...
1           Albedo          Human activities  Human activities (e....
2  Abraham Lincoln  U.S. House of Represe...  [[File:Abraham Linco...

https://stackoverflow.com/questions/61764240