Q: How to use a regular expression to find quoted attribute values in an OPML (XML) file

Stack Overflow user

Asked on 2013-04-25 04:26:28

2 answers · 998 views · 0 followers · 3 votes

I am searching an OPML file that looks like this. I want to extract the outline text and the xmlUrl.

  <outline text="lol">
  <outline text="Discourse on the Otter" xmlUrl="http://discourseontheotter.tumblr.com/rss" htmlUrl="http://discourseontheotter.tumblr.com/"/>
  <outline text="fedoras of okc" xmlUrl="http://fedorasofokc.tumblr.com/rss" htmlUrl="http://fedorasofokc.tumblr.com/"/>
  </outline>

My function:

 import re
 rssName = 'outline text="(.*?)"'
 rssUrl =  'xmlUrl="(.*?)"'

 def rssSearch():
     doc = open('ttrss.txt')
     for line in doc:
        if "xmlUrl" in line:
            mName = re.search(rssName, line)
            mUrl = re.search(rssUrl, line)
            if mName is not None:
                print mName.group()
                print mUrl.group()

However, the output looks like this:

 outline text="fedoras of okc"
 xmlUrl="http://fedorasofokc.tumblr.com/rss"

What are the correct regular expressions for rssName and rssUrl so that only the strings between the quotes are returned?

python

xml

regex

opml

2 Answers

Stack Overflow user

Accepted answer

Posted on 2013-04-25 04:38:48

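Don't use regular expressions to parse XML. The code is messy, and there are too many things that can go wrong.

For example, what if your OPML provider happens to reformat their output like this: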

<outline text="lol">
  <outline
      htmlUrl="http://discourseontheotter.tumblr.com/"
      xmlUrl="http://discourseontheotter.tumblr.com/rss"
      text="Discourse on the Otter"
  />
  <outline
      htmlUrl="http://fedorasofokc.tumblr.com/"
      xmlUrl="http://fedorasofokc.tumblr.com/rss"
      text="fedoras of okc"
  />
</outline>

This is perfectly valid, and it means exactly the same thing. But a line-oriented search and a regular expression like 'outline text="(.*?)"' will break.
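To make the failure concrete, here is a minimal sketch that runs the question's line-by-line approach against the reformatted layout. It is written for Python 3; the names `RSS_NAME`, `REFORMATTED`, and `names_next_to_urls` are made up for this illustration, and the sample is trimmed to a single feed.

```python
import re

# Pattern from the question: it assumes text= directly follows "outline ".
RSS_NAME = r'outline text="(.*?)"'

# Illustrative input: same data, attributes reordered and split across lines.
REFORMATTED = '''<outline text="lol">
  <outline
      htmlUrl="http://fedorasofokc.tumblr.com/"
      xmlUrl="http://fedorasofokc.tumblr.com/rss"
      text="fedoras of okc"
  />
</outline>'''


def names_next_to_urls(opml):
    """Mimic the question's loop: only inspect lines that mention xmlUrl."""
    found = []
    for line in opml.splitlines():
        if "xmlUrl" in line:
            match = re.search(RSS_NAME, line)
            if match is not None:
                found.append(match.group(1))
    return found


print(names_next_to_urls(REFORMATTED))  # prints []
```

The only line containing xmlUrl no longer contains `outline text=`, so nothing is extracted even though the document carries the same information.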

Instead, use an XML parser. Your code will be cleaner, simpler, and more reliable:

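# Python 2 code, as in the question. cElementTree was removed in Python 3.9;
# on Python 3, import xml.etree.ElementTree instead and call print() as a function.
import xml.etree.cElementTree as ET

# Parse the whole file, then visit every <outline> element at any depth.
root = ET.parse('ttrss.txt').getroot()
for outline in root.iter('outline'):
    # get() returns None when the attribute is absent, e.g. on folder outlines.
    text = outline.get('text')
    xmlUrl = outline.get('xmlUrl')
    if text and xmlUrl:
        print text
        print xmlUrl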

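It handles both your OPML snippet and similar OPML files I found on the web, such as this political science list. It is very simple, with nothing tricky about it. (I'm not bragging; that's just the benefit you get from using an XML parser instead of regular expressions.)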
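For readers on Python 3, the same idea can be sketched with `xml.etree.ElementTree`. This is an adaptation under stated assumptions rather than the answer's original code: it parses in-memory strings with `ET.fromstring` instead of reading `ttrss.txt`, and the names `ONE_LINE`, `MULTI_LINE`, and `feeds` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative samples: the same feed in the question's layout and the reformatted one.
ONE_LINE = '''<outline text="lol">
  <outline text="fedoras of okc" xmlUrl="http://fedorasofokc.tumblr.com/rss" htmlUrl="http://fedorasofokc.tumblr.com/"/>
</outline>'''

MULTI_LINE = '''<outline text="lol">
  <outline
      htmlUrl="http://fedorasofokc.tumblr.com/"
      xmlUrl="http://fedorasofokc.tumblr.com/rss"
      text="fedoras of okc"
  />
</outline>'''


def feeds(opml):
    """Return (text, xmlUrl) pairs for every outline that has both attributes."""
    root = ET.fromstring(opml)
    pairs = []
    # iter() also yields the root element; it has no xmlUrl, so it is skipped.
    for outline in root.iter('outline'):
        text = outline.get('text')
        xml_url = outline.get('xmlUrl')
        if text and xml_url:
            pairs.append((text, xml_url))
    return pairs


print(feeds(ONE_LINE) == feeds(MULTI_LINE))  # prints True
```

Because the parser works on elements and attributes rather than on lines, attribute order and line breaks do not affect the result.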

3 votes

Stack Overflow user

Posted on 2013-04-25 04:29:35

Try

print mName.group(1)
print mUrl.group(1)

http://docs.python.org/2/library/re.html#re.MatchObject.group

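If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group.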
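The difference between `group()` and `group(1)` can be seen on one line of the question's sample data. The sketch below uses Python 3 `print()`; the variable names `line`, `m_name`, and `m_url` are chosen for this illustration.

```python
import re

# One outline line taken from the question's sample data.
line = ('<outline text="fedoras of okc" '
        'xmlUrl="http://fedorasofokc.tumblr.com/rss" '
        'htmlUrl="http://fedorasofokc.tumblr.com/"/>')

m_name = re.search(r'outline text="(.*?)"', line)
m_url = re.search(r'xmlUrl="(.*?)"', line)

# group() with no argument is group(0): the entire matched text.
print(m_name.group())    # prints: outline text="fedoras of okc"

# group(1) is only what the first parenthesized group captured.
print(m_name.group(1))   # prints: fedoras of okc
print(m_url.group(1))    # prints: http://fedorasofokc.tumblr.com/rss
```

So the patterns in the question were already usable; only the call that reads the match needed to change.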

Or

rssName = 'outline text="(?P<text>.*?)"'

Then

print mName.group('text')
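A short sketch of the named-group variant applied to both patterns, again in Python 3. The group name `url` and the constants `RSS_NAME` and `RSS_URL` are assumptions made for this example; only the `text` group name comes from the answer.

```python
import re

# Named groups: (?P<name>...) captures like a numbered group but can be read by name.
RSS_NAME = re.compile(r'outline text="(?P<text>.*?)"')
RSS_URL = re.compile(r'xmlUrl="(?P<url>.*?)"')  # "url" is a name picked for this sketch

line = ('<outline text="Discourse on the Otter" '
        'xmlUrl="http://discourseontheotter.tumblr.com/rss" '
        'htmlUrl="http://discourseontheotter.tumblr.com/"/>')

name = RSS_NAME.search(line)
url = RSS_URL.search(line)

print(name.group('text'))  # prints: Discourse on the Otter
print(url.group('url'))    # prints: http://discourseontheotter.tumblr.com/rss

# A named group keeps its number, so both lookups return the same string.
print(name.group('text') == name.group(1))  # prints: True
```

Named groups mainly help readability when a pattern grows to several captures; for a single capture, `group(1)` is equivalent.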

2 votes

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/16201513
