Python regex to find and replace an HTML tag with a specific attribute value

Stack Overflow user
Asked 2013-12-15 14:43:52
Answers: 2 · Views: 2.2K · Followers: 0 · Votes: 3

I am trying to write a regular expression in Python that finds all img tags whose src attribute equals a specific value. I tried the following:

# thm is '/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82'
p = re.compile(r'<img.*?%s.*?>' % thm)
print p.pattern
print p.sub(linked_image, c)

Below is my output (the pattern, followed by the unchanged HTML):

<img.*?/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82.*?>

<p><img src="/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82" alt=""></p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf </p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf 

2 Answers

Stack Overflow user

Accepted answer

Posted 2013-12-17 08:57:56

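Solution with lxml

To make the regular-expression and lxml solutions easier to compare, I have written this one up as a separate answer.

A simpler and more robust solution is to use lxml together with its etree module. With this approach you access specific DOM elements and edit them directly.

Parse the HTML and run a suitable XPath expression over it, for example .//img. The xpath method returns a list of all matching elements, and on each of them you can get and set the src attribute. The function etree.tostring(tree) then returns the edited document as a string: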

from lxml import etree
tree = etree.HTML('''<html>
                     <body>
                        <h1>Title</h1>
                        <img src="/media/old/another_logo.png" alt="" />
                        <p>Lorem Ipsum</p>
                        <p><img src="/media/old/logo.png" alt=""/></p>
                     </body>
                  </html>''')

imgs = tree.xpath('.//img')

for img in imgs:
    print 'OLD_SOURCE', img.get('src')
    img.set('src', '/media/new/python.jpg')

print etree.tostring(tree)

Output

OLD_SOURCE /media/old/another_logo.png
OLD_SOURCE /media/old/logo.png

<html>
    <body>
        <h1>Title</h1>
            <img src="/media/new/python.jpg" alt=""/>
            <p>Lorem Ipsum</p>
            <p><img src="/media/new/python.jpg" alt=""/></p>
    </body>
</html>
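
The example above rewrites every img element. To touch only the tag whose src equals one specific value, as the question asks, the XPath can carry a predicate. The following is a minimal Python 3 sketch, not the answer author's code: the sample markup and the names old_src and new_src are assumptions made for illustration.

```python
from lxml import etree

# Hypothetical sample values, chosen only for illustration.
old_src = '/media/old/logo.png'
new_src = '/media/new/python.jpg'

tree = etree.HTML(
    '<body>'
    '<img src="/media/old/another_logo.png" alt="">'
    '<p><img src="/media/old/logo.png" alt=""></p>'
    '</body>'
)

# $src is an XPath variable, so the value is compared literally and
# needs no manual quoting or escaping.
matches = tree.xpath('.//img[@src=$src]', src=old_src)
for img in matches:
    img.set('src', new_src)

# encoding='unicode' makes tostring return a str instead of bytes.
result = etree.tostring(tree, encoding='unicode', method='html')
print(result)
```

Passing the value as an XPath variable sidesteps the escaping problem that the regular-expression approach runs into with characters such as ? and . in the URL.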
Votes: 2

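Stack Overflow user

Posted 2013-12-15 15:10:40

Solution with regular expressions

I noticed that the string in thm is inserted into the pattern without being escaped. Before adding it to the regular expression, you need to escape every character that has a special meaning in regex syntax. Here those characters are ? and .

I replaced ? with [?]{1} and . with \. and the resulting regular expression now matches the test string.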
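
To see why the unescaped pattern never matched, here is a small Python 3 sketch. It reuses the thm value from the question; the variable names tag, unescaped and escaped are made up for this illustration.

```python
import re

thm = '/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82'
tag = '<img src="%s" alt="">' % thm

# Unescaped: "g?" means "an optional g", so nothing in the pattern can
# consume the literal "?" in the URL and the search fails.
unescaped = re.compile(r'<img.*?%s.*?>' % thm)
print(unescaped.search(tag))  # None

# Escaped by hand, as described above: "." and "?" are now literals.
escaped_thm = thm.replace('.', r'\.').replace('?', '[?]{1}')
escaped = re.compile(r'<img.*?%s.*?>' % escaped_thm)
print(escaped.search(tag) is not None)  # True
```

The unescaped . does no harm here, because it still matches a literal dot. It is the ? that breaks the match.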

import re
thm = '/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82'

all_html_code = '''<img.*?/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82.*?>

<p><img src="/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82" alt=""></p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf </p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf
</p><p>lksj lksdfj lsdkfj sldkfj sldkfj lskdfj lsjf lksjf lksj flksdjf klsj flk dkj sdlkfj sdlkfj sldkjf sldkfj lsdkjf lskjflsjfsl lksdjf '''

# Escape the regex metacharacters in thm (raw string avoids an invalid '\.' escape)
escaped_thm = thm.replace('.', r'\.').replace('?', '[?]{1}')
p = re.compile(r'<img.*?src="(%s)".*?>' % escaped_thm)

test_img = '''<img src="/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82" alt="">'''
print p.match(test_img)

new_img_tag = '<img src="/python/logo.jpg" alt=""/>'
print p.sub(new_img_tag, all_html_code)

By the way, why match the whole <img src="..."> tag? You can replace just the src attribute directly:

escaped_thm = thm.replace('.', r'\.').replace('?', '[?]{1}')
p = re.compile(r'src="(%s)"' % escaped_thm)

replacement = '''src="/python/logo.jpg"'''
print p.sub(replacement, all_html_code)

Output 1

<_sre.SRE_Match object at 0x... >
<img.*?/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82.*?>

<p><img src="/python/logo.jpg" alt=""/></p><p>lksj lksdfj ... lksdjf 

Output 2

<p><img src="/python/logo.jpg" alt=""></p><p>lksj lksdfj ... lksdjf 

After asking how to escape regex metacharacters properly (Regular expression to escape regular expressions), I can recommend re.escape instead of the two replace calls.
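
A minimal Python 3 sketch of the re.escape variant follows. The shortened html_code sample is an assumption for illustration; thm and the replacement are taken from the code above.

```python
import re

thm = '/public_media/cache/84/b5/84b59e293cbdb7041b68a84977d62cf3.jpg?image_pk=82'

# Shortened stand-in for the HTML from the question.
html_code = '<p><img src="%s" alt=""></p><p>lksj lksdfj</p>' % thm

# re.escape backslash-escapes the regex metacharacters in thm,
# so "." and "?" are matched literally.
p = re.compile(r'src="%s"' % re.escape(thm))

replacement = 'src="/python/logo.jpg"'
result = p.sub(replacement, html_code)
print(result)
# <p><img src="/python/logo.jpg" alt=""></p><p>lksj lksdfj</p>
```

Unlike the hand-written replace chain, re.escape also covers metacharacters that a different URL might contain, such as + or parentheses.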

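Using lxml

Do you need to use regular expressions at all? Regular expressions on HTML can cause serious problems. For more information, see this excellent post (RegEx match open tags except XHTML self-contained tags).

I would rather use XPath, like here.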

Votes: 1
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.
Original link:

https://stackoverflow.com/questions/20595735
