I'm trying to write a regular expression in Python that finds all img tags whose src attribute equals a specific value. I tried the following:
# where thm equals /public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82
p = re.compile(r'<img.*?%s.*?>' % thm)
print(p.pattern)
print(p.sub(linked_image, c))
Below is my output:
<img.*?/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82.*?>
<p><img src="/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82" alt=""></p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf </p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
Answered 2013-12-17 08:57:56
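-- Solution with LXML
To make the regex and LXML solutions easy to compare, I have posted this as a separate answer.
An easier and more robust solution is to use lxml with etree. With this approach you access specific DOM elements and edit them.
Parse the HTML and query it with a suitable XPath, e.g. .//img. The xpath call returns a list of all matching elements, whose src attribute you can get and set. The function etree.tostring(tree) returns the edited document as a string: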
from lxml import etree
tree = etree.HTML('''<html>
<body>
<h1>Title</h1>
<img src="/media/old/another_logo.png" alt="" />
<p>Lorem Ipsum</p>
<p><img src="/media/old/logo.png" alt=""/></p>
</body>
</html>''')
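# Select every img element in the parsed document
imgs = tree.xpath('.//img')
for img in imgs:
    print('OLD_SOURCE', img.get('src'))
    img.set('src', '/media/new/python.jpg')
# encoding='unicode' makes tostring return text instead of bytes
print(etree.tostring(tree, encoding='unicode'))
Output: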
OLD_SOURCE /media/old/another_logo.png
OLD_SOURCE /media/old/logo.png
<html>
<body>
<h1>Title</h1>
<img src="/media/new/python.jpg" alt=""/>
<p>Lorem Ipsum</p>
<p><img src="/media/new/python.jpg" alt=""/></p>
</body>
</html>
Answered 2013-12-15 15:10:40
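Solution with a regular expression
I realized that the string inserted from thm is not escaped. Before adding it to the regular expression, you need to escape every character that has a special meaning in regex syntax, here ? and . (dot).
I replaced ? with [?]{1} and . with \. and the resulting regular expression now matches the test string.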
import re
thm = '/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82'
all_html_code = '''<img.*?/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82.*?>
<p><img src="/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82" alt=""></p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf </p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf '''
# Escape the regex metacharacters in thm ('.' and '?') before interpolating it
escaped_thm = thm.replace('.', r'\.').replace('?', '[?]{1}')
p = re.compile(r'<img.*?src="(%s)".*?>' % escaped_thm)
test_img = '''<img src="/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82" alt="">'''
print(p.match(test_img))
new_img_tag = '<img src="/python/logo.jpg" alt=""/>'
print(p.sub(new_img_tag, all_html_code))
By the way, why are you searching for <img src=""...>? You can replace the src attribute directly:
escaped_thm = thm.replace('.', r'\.').replace('?', '[?]{1}')
p = re.compile(r'src="(%s)"' % escaped_thm)
replacement = '''src="/python/logo.jpg"'''
print(p.sub(replacement, all_html_code))
Output 1:
<_sre.SRE_Match object at 0x... >
<img.*?/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82.*?>
<p><img src="/python/logo.jpg" alt=""/></p><p>lksj lksdfj ... lksdjf
Output 2:
<p><img src="/python/logo.jpg" alt=""></p><p>lksj lksdfj ... lksdjf
After asking how to escape regex symbols correctly (Regular expression to escape regular expressions), I can recommend re.escape instead of the two replace calls.
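As a minimal sketch of the re.escape suggestion: the snippet below uses a shortened, one-line HTML string (the name html and its content are illustrative, not the full test input). It first shows why the unescaped pattern finds nothing, then builds the pattern with re.escape and substitutes the src attribute.

```python
import re

# URL from the question; the short HTML string is an illustrative stand-in.
thm = '/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82'
html = '<p><img src="%s" alt=""></p><p>some text</p>' % thm

# Unescaped, "g?" means "an optional g", so the literal "?" in the URL
# can never be matched and the search finds nothing.
unescaped = re.compile(r'src="%s"' % thm)
print(unescaped.search(html))

# re.escape backslash-escapes the regex metacharacters in thm,
# so the URL is matched literally.
escaped = re.compile(r'src="%s"' % re.escape(thm))
result = escaped.sub('src="/python/logo.jpg"', html)
print(result)
```

The replacement string here contains no backslashes; if it did, they would need escaping as well, because re.sub interprets backslash sequences in the replacement.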
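Using LXML
Do you need to use a regular expression? Regular expressions on HTML can be quite problematic. See this excellent post for more information (RegEx match open tags except XHTML self-contained tags).
I would rather use XPath, as shown here.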
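To illustrate the XPath route for the original question (only img tags whose src equals one specific value), here is a small sketch. It assumes lxml is installed; the sample HTML and the replacement path /python/logo.jpg are illustrative. The value is passed as an XPath variable ($src), so it needs no escaping.

```python
from lxml import etree

# URL from the question; the HTML snippet is an illustrative stand-in.
thm = '/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82'
tree = etree.HTML('<html><body>'
                  '<img src="/media/old/logo.png" alt=""/>'
                  '<p><img src="%s" alt=""/></p>'
                  '</body></html>' % thm)

# $src is an XPath variable; lxml fills it from the keyword argument,
# so only img elements whose src equals thm exactly are selected.
for img in tree.xpath('.//img[@src = $src]', src=thm):
    img.set('src', '/python/logo.jpg')

# Collect the src values afterwards to see which tags changed.
sources = [img.get('src') for img in tree.xpath('.//img')]
print(sources)
```

Only the second image is rewritten; the first keeps its original src because it does not match the predicate.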
https://stackoverflow.com/questions/20595735