I have a large XML file. Here is a sample:
<?xml version="1.0" encoding="UTF-8"?>
<hmdb xmlns="http://www.hmdb.ca">
<metabolite>
<normal_concentrations>
<concentration>
<biospecimen>Blood</biospecimen>
<concentration_value>2.8 +/- 8.8</concentration_value>
</concentration>
<concentration>
<biospecimen>Feces</biospecimen>
<concentration_value/>
</concentration>
<concentration>
<biospecimen>Salvia</biospecimen>
<concentration_value>5.2</concentration_value>
</concentration>
</normal_concentrations>
</metabolite>
<metabolite>
<normal_concentrations>
<concentration>
<biospecimen>Blood</biospecimen>
<concentration_value>5</concentration_value>
</concentration>
<concentration>
<biospecimen>Feces</biospecimen>
<concentration_value/>
</concentration>
<concentration>
<biospecimen>Salvia</biospecimen>
<concentration_value>3-7</concentration_value>
</concentration>
</normal_concentrations>
</metabolite>
</hmdb>
I now want to extract every biospecimen and concentration_value and be able to tell which value belongs to which specimen. This is what I tried:
from io import StringIO
from lxml import etree
import csv

def hmdbextract(name, file):
    ns = {'hmdb': 'http://www.hmdb.ca'}
    # Stream the file one <metabolite> element at a time
    context = etree.iterparse(name, tag='{http://www.hmdb.ca}metabolite')
    csvfile = open(file, 'w')
    fieldnames = ['normal_concentration_spec',
                  'normal_concentration_conc']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for event, elem in context:
        try:
            tl = elem.xpath('hmdb:normal_concentrations/hmdb:concentration/hmdb:biospecimen/text()', namespaces=ns)
            normal_concentration_spec = '; '.join(str(e) for e in tl)
        except:
            normal_concentration_spec = 'NA'
        try:
            # text() yields nothing for an empty element, so empty values are silently dropped here
            tl = elem.xpath('hmdb:normal_concentrations/hmdb:concentration/hmdb:concentration_value/text()', namespaces=ns)
            normal_concentration_conc = '; '.join(str(e) for e in tl)
        except:
            normal_concentration_conc = 'NA'
        writer.writerow({'normal_concentration_spec': normal_concentration_spec,
                         'normal_concentration_conc': normal_concentration_conc})
        # Free memory held by elements that have already been processed
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context
    return

hmdbextract('hmdb_file.xml', 'hmmdb_file.csv')
The output CSV should look like this:
normal_concentration_spec,normal_concentration_conc
Blood; Feces; Salvia,2.8 +/- 8.8; NA; 5.2
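Blood; Feces; Salvia,5; NA; 3-7
In practice I also extract many other fields that have only one value per metabolite, which is why I prefer this CSV format. However, because some concentration_value elements are empty, I end up with a different number of specimens and values and cannot tell which value belongs to which specimen.
How can I get a placeholder such as NA for every missing concentration_value? (Ideally while keeping the general structure of the code and the lxml package, since I have many other extractions already set up this way.)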
Posted on 2022-08-29 17:23:54
For an empty element, the text() XPath returns a zero-length list. That can be used to output NA instead.
>>> from lxml import etree
>>> ns = {'hmdb': 'http://www.hmdb.ca'}
>>> context = etree.iterparse('tmp.xml', tag='{http://www.hmdb.ca}concentration_value')
>>> for event, elem in context:
...     tlc = elem.xpath('text()', namespaces=ns)
...     print(len(tlc), tlc)
...
1 ['2.8 +/- 8.8']
0 []
1 ['5.2']
Using the OP's code:
from lxml import etree

ns = {'hmdb': 'http://www.hmdb.ca'}
context = etree.iterparse('/home/luis/tmp/tmp.xml', tag='{http://www.hmdb.ca}metabolite')
for event, elem in context:
    try:
        # Select the elements themselves rather than their text() nodes
        tl = elem.xpath('hmdb:normal_concentrations/hmdb:concentration/hmdb:biospecimen', namespaces=ns)
        normal_concentration_spec = '; '.join(str(e.text) for e in tl)
    except Exception as ex:
        print(ex)
        normal_concentration_spec = 'NA'
    try:
        tl = elem.xpath('hmdb:normal_concentrations/hmdb:concentration/hmdb:concentration_value', namespaces=ns)
        # An empty element has .text set to None; substitute 'NA' so positions stay aligned
        normal_concentration_conc = '; '.join(str(e.text if e.text is not None else 'NA') for e in tl)
    except Exception as ex:
        normal_concentration_conc = 'NA'
    print(normal_concentration_spec, normal_concentration_conc)
Result:
Blood; Feces; Salvia 2.8 +/- 8.8; NA; 5.2
Blood; Feces; Salvia 5; NA; 3-7
https://stackoverflow.com/questions/73531640
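A possible variation, not part of the original answer: instead of running two separate queries per metabolite, you could loop over each concentration element and read both children from it, so the two columns stay aligned even if a biospecimen were missing as well. This is only a sketch. The names NS, text_or_na and concentration_columns are made up for illustration, and the fallback to the standard library's xml.etree.ElementTree is an assumption added so the snippet runs without lxml installed.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: fall back to the stdlib parser so the sketch still runs
    import xml.etree.ElementTree as etree

NS = {'hmdb': 'http://www.hmdb.ca'}  # hypothetical constant name


def text_or_na(parent, path):
    """Return the stripped text of the first match, or 'NA' if it is missing or empty."""
    value = parent.findtext(path, namespaces=NS)
    return (value or '').strip() or 'NA'


def concentration_columns(metabolite):
    """Build the two '; '-joined columns for one <metabolite> element."""
    specs, concs = [], []
    path = 'hmdb:normal_concentrations/hmdb:concentration'
    for conc in metabolite.findall(path, namespaces=NS):
        # Both values come from the same <concentration>, so they always pair up
        specs.append(text_or_na(conc, 'hmdb:biospecimen'))
        concs.append(text_or_na(conc, 'hmdb:concentration_value'))
    # A metabolite with no <concentration> entries yields 'NA' in both columns
    return '; '.join(specs) or 'NA', '; '.join(concs) or 'NA'


sample = b"""<hmdb xmlns="http://www.hmdb.ca">
<metabolite>
<normal_concentrations>
<concentration>
<biospecimen>Blood</biospecimen>
<concentration_value>2.8 +/- 8.8</concentration_value>
</concentration>
<concentration>
<biospecimen>Feces</biospecimen>
<concentration_value/>
</concentration>
<concentration>
<biospecimen>Salvia</biospecimen>
<concentration_value>5.2</concentration_value>
</concentration>
</normal_concentrations>
</metabolite>
</hmdb>"""

root = etree.fromstring(sample)
for metabolite in root.findall('hmdb:metabolite', namespaces=NS):
    print(concentration_columns(metabolite))
# prints ('Blood; Feces; Salvia', '2.8 +/- 8.8; NA; 5.2')
```

Inside the OP's iterparse loop, the two try/except blocks could then be replaced by a single call to concentration_columns(elem), with the rest of the structure left as it is.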