I'm trying to load URLs like this one (http://musicbrainz.org/ws/2/artist/72c536dc-7137-4477-a521-567eeb840fa8) into Python and extract the value of "gender".
import urllib2
import codecs
import sys
import os
from xml.dom import minidom
import xml.etree.cElementTree as ET
#urlbob = urllib2.urlopen('http://musicbrainz.org/ws/2/artist/72c536dc-7137-4477-a521-567eeb840fa8')
url = 'dylan.xml'
#attempt 1 - using minidom
xmldoc = minidom.parse(url)
itemlist = xmldoc.getElementsByTagName('artist')
#attempt 2 - using ET
tree = ET.parse('dylan.xml')
root = tree.getroot()
for child in root:
    print child.tag, child.attrib
I can't seem to get at the gender with either minidom or etree. In its current form, the script returns:
{http://musicbrainz.org/ns/mmd-2.0#}artist {'type': 'Person', 'id': '72c536dc-7137-4477-a521-567eeb840fa8'}
Posted 2014-11-17 07:51:18
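That is because you are looping over root itself, which is only the root of the tree. Iterating over an element directly yields just its immediate children and stops there.
You need to loop over the iterator instead, so that it keeps returning the next node throughout the whole tree. See the following: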
tree = ET.parse('dylan.xml')
root = tree.getroot()
# loop the root iterable which will keep returning next node
for node in root.iter(): # or root.getiterator() if < Python 2.7
    print node.tag, node.attrib, node.text
Result:
{http://musicbrainz.org/ns/mmd-2.0#}metadata {} None
{http://musicbrainz.org/ns/mmd-2.0#}artist {'type': 'Person', 'id': '72c536dc-7137-4477-a521-567eeb840fa8'} None
{http://musicbrainz.org/ns/mmd-2.0#}name {} Bob Dylan
{http://musicbrainz.org/ns/mmd-2.0#}sort-name {} Dylan, Bob
{http://musicbrainz.org/ns/mmd-2.0#}ipi {} 00008955074
{http://musicbrainz.org/ns/mmd-2.0#}ipi-list {} None
{http://musicbrainz.org/ns/mmd-2.0#}ipi {} 00008955074
{http://musicbrainz.org/ns/mmd-2.0#}ipi {} 00008955172
{http://musicbrainz.org/ns/mmd-2.0#}isni-list {} None
{http://musicbrainz.org/ns/mmd-2.0#}isni {} 0000000121479733
{http://musicbrainz.org/ns/mmd-2.0#}gender {} Male
{http://musicbrainz.org/ns/mmd-2.0#}country {} US
{http://musicbrainz.org/ns/mmd-2.0#}area {'id': '489ce91b-6658-3307-9877-795b68554c98'} None
{http://musicbrainz.org/ns/mmd-2.0#}name {} United States
{http://musicbrainz.org/ns/mmd-2.0#}sort-name {} United States
{http://musicbrainz.org/ns/mmd-2.0#}iso-3166-1-code-list {} None
{http://musicbrainz.org/ns/mmd-2.0#}iso-3166-1-code {} US
{http://musicbrainz.org/ns/mmd-2.0#}begin-area {'id': '04e60741-b1ae-4078-80bb-ffe8ae643ea7'} None
{http://musicbrainz.org/ns/mmd-2.0#}name {} Duluth
{http://musicbrainz.org/ns/mmd-2.0#}sort-name {} Duluth
{http://musicbrainz.org/ns/mmd-2.0#}life-span {} None
{http://musicbrainz.org/ns/mmd-2.0#}begin {} 1941-05-24
Posted 2014-11-17 08:28:34
## This prints out the tree as the xml lib sees it
## (I found it made debugging a little easier)
#def print_xml(node, depth = 0):
#    for child in node:
#        print "\t"*depth + str(child)
#        print_xml(child, depth = depth + 1)
#print_xml(root)
# attempt 1
xmldoc = minidom.parse(url)
genders = xmldoc.getElementsByTagName('gender') # <== you want gender not artist
for gender in genders:
    print gender.firstChild.nodeValue
# attempt 2
ns = "{http://musicbrainz.org/ns/mmd-2.0#}"
xlpath = "./" + ns + "artist/" + ns + "gender"
genders = root.findall(xlpath) # <== xpath was made for this..
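for gender in genders:
    print gender.text
So, the problem with the first attempt is that you are looking at a list of all the artist elements rather than the gender elements (gender is a child of the only artist element in that list).
The problem with the second attempt is that you are looking at the list of the root element's children. The root is the metadata element, so that list contains only the single artist element.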
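To make the difference between iterating an element and calling iter() concrete, here is a minimal sketch. It is written for Python 3 (the snippets in this thread use Python 2 print statements), and SAMPLE is an abbreviated stand-in for the MusicBrainz response, not the real document. The local_name helper is a hypothetical convenience added only for readability.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the MusicBrainz response (an assumption, not the full file).
SAMPLE = """<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">
  <artist type="Person" id="72c536dc-7137-4477-a521-567eeb840fa8">
    <name>Bob Dylan</name>
    <gender>Male</gender>
  </artist>
</metadata>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags (hypothetical helper)."""
    return tag.split("}", 1)[-1]


root = ET.fromstring(SAMPLE)

# Iterating an element yields only its direct children.
direct_children = [local_name(child.tag) for child in root]

# iter() walks the element itself and every descendant in document order.
all_nodes = [local_name(node.tag) for node in root.iter()]

print(direct_children)
print(all_nodes)
```

With this sample, only the second list contains gender, which is why the original loop never reached it.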
The underlying structure is:
<artist>
    <name>
    <sort-name>
    <ipi>
    <ipi-list>
        <ipi>
        <ipi>
    <isni-list>
        <isni>
    <gender>
    <country>
    <area>
        <name>
        <sort-name>
        <iso-3166-1-code-list>
            <iso-3166-1-code>
    <begin-area>
        <name>
        <sort-name>
    <life-span>
        <begin>
So you need to go root -> artist -> gender, or simply search for the node you actually want (gender in this case).
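Both options can be sketched with ElementTree's namespace mapping, which avoids concatenating the "{...}" prefix by hand. This is a Python 3 sketch under stated assumptions: SAMPLE is an abbreviated stand-in for the real response, the "mb" prefix is an arbitrary label chosen here (only the URI has to match the document), and the two function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the MusicBrainz response (an assumption, not the full file).
SAMPLE = """<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">
  <artist type="Person" id="72c536dc-7137-4477-a521-567eeb840fa8">
    <name>Bob Dylan</name>
    <gender>Male</gender>
  </artist>
</metadata>"""

# The "mb" prefix is arbitrary; the URI is the one shown in the tags above.
NAMESPACES = {"mb": "http://musicbrainz.org/ns/mmd-2.0#"}


def gender_by_path(root):
    """Follow the explicit path root -> artist -> gender."""
    node = root.find("mb:artist/mb:gender", NAMESPACES)
    return node.text if node is not None else None


def genders_anywhere(root):
    """Search every descendant for gender elements, whatever their depth."""
    return [node.text for node in root.iterfind(".//mb:gender", NAMESPACES)]


root = ET.fromstring(SAMPLE)
print(gender_by_path(root))
print(genders_anywhere(root))
```

The explicit path fails fast if the layout changes, while the descendant search keeps working but would also pick up any gender element nested elsewhere in the document.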
https://stackoverflow.com/questions/26963286