I am trying to replace all instances of href="../directory/" with href="../directory/index.html".
In Python, this
reg = re.compile(r'<a href="../(.*?)">')
for match in re.findall(reg, input_html):
    output_html = input_html.replace(match, match+'index.html')
produces the following output:
href="../personal-autonomy/index.htmlindex.htmlindex.htmlindex.html"
href="../paternalism/index.html"
href="../principle-beneficence/index.htmlindex.htmlindex.html"
href="../decision-capacity/index.htmlindex.htmlindex.html"
Do you know why it works for the second link but not for the others?
Relevant part of the source:
<p>
<a href="../personal-autonomy/">autonomy: personal</a> |
<a href="../principle-beneficence/">beneficence, principle of</a> |
<a href="../decision-capacity/">decision-making capacity</a> |
<a href="../legal-obligation/">legal obligation and authority</a> |
<a href="../paternalism/">paternalism</a> |
<a href="../identity-personal/">personal identity</a> |
<a href="../identity-ethics/">personal identity: and ethics</a> |
<a href="../respect/">respect</a> |
<a href="../well-being/">well-being</a>
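</p>
Edit: The repeated 'index.html' is actually the result of multiple matches (for example, href="../personal-autonomy/index.htmlindex.htmlindex.htmlindex.html" appears because personal-autonomy is found four times in the original source).
As a general regex question, how would you replace all instances without adding an extra 'index.html' to every match?
Answered 2011-01-27 14:26:48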
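import re
from lxml import html

def replace_link(link):
    # Append index.html only to links of the form ../directory/
    if re.match(r"\.\./[^/]+/$", link):
        link += "index.html"
    return link

# your_html_text holds the HTML source as a string
print(html.rewrite_links(your_html_text, replace_link))
Output: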
<p>
<a href="../personal-autonomy/index.html">autonomy: personal</a> |
<a href="../principle-beneficence/index.html">beneficence, principle of</a> |
<a href="../decision-capacity/index.html">decision-making capacity</a> |
<a href="../legal-obligation/index.html">legal obligation and authority</a> |
<a href="../paternalism/index.html">paternalism</a> |
<a href="../identity-personal/index.html">personal identity</a> |
<a href="../identity-ethics/index.html">personal identity: and ethics</a> |
<a href="../respect/index.html">respect</a> |
<a href="../well-being/index.html">well-being</a>
</p>
Answered 2011-01-27 14:09:07
I think I found the problem:
reg = re.compile(r'<a href="../(.*?)">')
for match in re.findall(reg, input_html):
    output_html = input_html.replace(match, match+'index.html')
Here 'input_html' is modified inside the for loop, and then the same 'input_html' is searched again for the regex, which is the bug :)
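A minimal sketch of one way around this, assuming the goal is to touch only links of the form ../directory/: do the substitution in a single re.sub pass instead of calling str.replace once per match. The sample input, the pattern, and the helper name add_index are illustrative assumptions, not code from the question.

```python
import re

# Hypothetical sample input; the duplicated link mirrors the repeated
# matches described in the question.
input_html = (
    '<a href="../personal-autonomy/">autonomy: personal</a> | '
    '<a href="../paternalism/">paternalism</a> | '
    '<a href="../personal-autonomy/">autonomy again</a>'
)

# Group 1: the tag up to and including the trailing slash of ../directory/
# Group 2: the closing quote and bracket
link_re = re.compile(r'(<a href="\.\./[^"/]+/)(">)')


def add_index(html_text):
    """Insert index.html between the directory slash and the closing quote."""
    # Each link is rewritten exactly once, however often its target repeats.
    return link_re.sub(r'\1index.html\2', html_text)


output_html = add_index(input_html)
print(output_html)
```

Because the pattern requires the trailing slash to be followed directly by the closing quote, running it a second time leaves already rewritten links alone.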
Answered 2011-01-27 12:44:45
Escape the first two dots in your pattern:
reg = re.compile(r'<a[ ]href="[.][.]/(.*?)">')
But I would try using lxml instead.
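To illustrate why the escaping matters, here is a small sketch comparing the two patterns. The sample string is made up for the demonstration; an unescaped dot matches any character, so the original pattern also accepts links that do not start with "../".

```python
import re

# Pattern from the question: each "." matches any single character.
unescaped = re.compile(r'<a href="../(.*?)">')
# Pattern from this answer: "[.]" matches only a literal dot.
escaped = re.compile(r'<a[ ]href="[.][.]/(.*?)">')

# Hypothetical input: the first link does not start with "../".
sample = '<a href="ab/other/">x</a> <a href="../respect/">respect</a>'

unescaped_hits = unescaped.findall(sample)
escaped_hits = escaped.findall(sample)

print(unescaped_hits)  # also picks up the "ab/other/" link
print(escaped_hits)    # only the "../respect/" link
```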
https://stackoverflow.com/questions/4816377