
How to extract very large XML data and store it in a dictionary in Python

Stack Overflow user
Asked on 2020-08-18 17:15:36
2 answers · 359 views · 0 followers · 0 votes

I have the following XML file, named Comments.xml, which is 15 GB in size. I want a dictionary built from two attributes, UserId and Text. Note that many rows in the file have missing values for UserId or Text. I tried the code below, but because the file is so large my RAM (13 GB) crashed. Is there an efficient way to get the data out of the XML file for data analysis?

A portion of the XML file Comments.xml

<comments>
<row Id = '1' UserId = '143' Text = 'Hello World' />
<row Id = '2' UserId = '183' Text = 'Trigonometry is important.' />
<row Id = '3' UserId = '5645' Text = 'Mathematics is best.' />
<row Id = '4' UserId = '143' Text = 'Hello stack overflow' />
<row Id = '5' UserId = '143' Text = 'Hello' />

import xml.etree.cElementTree as ET

tree = ET.iterparse('Comments.xml')

comments = {}  # Dictionary to store the required data

for event, root in tree:
    if ('Text' in root.attrib) and ('UserId' in root.attrib):  # Check for missing values
        Text = root.attrib['Text']
        UserId = root.attrib['UserId']
        comments.update({UserId: Text})  # Add data to the dictionary
        root.clear()

Expected output

{'143':'Hello World','183':'Trigonometry is important.','5645':'Mathematics is best.','143':'Hello stack overflow','143':'Hello'}

OR

{'UserId':['143','183','5645','143','143'],'Text':['Hello World','Trigonometry is important.','Mathematics is best.','Hello stack overflow','Hello']}

2 Answers

Stack Overflow user

Answered on 2020-08-19 03:19:27

Here is another approach.

import io
from simplified_scrapy import SimplifiedDoc

def getComments(fileName):
    comments = {'UserId': [], 'Text': []}
    with io.open(fileName, "r", encoding='utf-8') as file:
        line = file.readline()  # Read data line by line
        while line != '':
            doc = SimplifiedDoc(line)  # Instantiate a doc
            row = doc.getElement('row')  # Get row
            if row:
                comments['UserId'].append(row['UserId'])
                comments['Text'].append(row['Text'])
            line = file.readline()
    return comments
comments = getComments('Comments.xml')  # This dictionary will be very large, too

More examples are available here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
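As the comment in the code above notes, the resulting dictionary would itself be very large. One option is to stream each row straight into a CSV file instead, so the data never has to fit in memory at all. A stdlib-only sketch of that idea (the `rows_to_csv` name is illustrative, and it assumes one well-formed, self-closing `<row ... />` per line, as in the sample):

```python
import csv
import xml.etree.ElementTree as ET

def rows_to_csv(xml_path, csv_path):
    """Read the XML file line by line and append each row to a CSV file."""
    with open(xml_path, encoding='utf-8') as xml_file, \
         open(csv_path, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['UserId', 'Text'])  # header row
        for line in xml_file:
            line = line.strip()
            if not line.startswith('<row'):
                continue  # skip <comments>, </comments>, and blank lines
            row = ET.fromstring(line)  # parses one self-closing <row ... />
            user_id = row.attrib.get('UserId')
            text = row.attrib.get('Text')
            if user_id is not None and text is not None:
                writer.writerow([user_id, text])
```

The resulting CSV can then be processed in chunks with any tabular tool rather than held as one giant dictionary.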

Votes: 0

Stack Overflow user

Answered on 2020-08-19 06:24:54

See below.

This will not, however, solve your memory problem. To solve the RAM issue you need to use SAX:

Simple API for XML (SAX): here you register callbacks for the events you care about and then let the parser proceed through the document. This is useful when your documents are large or you have memory limitations; the parser processes the file as it reads it from disk, so the whole file is never stored in memory at once.

import xml.etree.ElementTree as ET
from collections import defaultdict

data = defaultdict(list)

xml = '''<comments>
<row Id = "1" UserId = "143" Text = "Hello World"/>
<row Id = "2" UserId = "183" Text = "Trigonometry is important."/>
<row Id = "3" UserId = "5645" Text = "Mathematics is best."/>
<row Id = "4" UserId = "143" Text = "Hello stack overflow"/>
<row Id = "5" UserId = "143" Text = "Hello"/></comments>'''

root = ET.fromstring(xml)
for row in root.findall('.//row'):
    user_id = row.attrib.get('UserId')
    text = row.attrib.get('Text')
    if user_id is not None and text is not None:
        data[user_id].append(text)
print(data)

Output

defaultdict(<class 'list'>, {'143': ['Hello World', 'Hello stack overflow', 'Hello'], '183': ['Trigonometry is important.'], '5645': ['Mathematics is best.']})
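The snippet above demonstrates the grouping on a small in-memory string; for the actual 15 GB file, the event-driven SAX approach described at the start of this answer could be sketched as follows (assuming well-formed, self-closing rows; the `parse_comments` helper is illustrative):

```python
import xml.sax
from collections import defaultdict

class CommentHandler(xml.sax.ContentHandler):
    """Collects the Text attribute of every <row>, grouped by UserId."""
    def __init__(self):
        super().__init__()
        self.data = defaultdict(list)

    def startElement(self, name, attrs):
        # Called once per opening tag; only attribute data is ever held.
        if name == 'row':
            user_id = attrs.get('UserId')
            text = attrs.get('Text')
            if user_id is not None and text is not None:
                self.data[user_id].append(text)

def parse_comments(source):
    """source may be a file name or a binary file-like object."""
    handler = CommentHandler()
    xml.sax.parse(source, handler)  # streams the input; never loads it whole
    return handler.data

# For the real file: comments = parse_comments('Comments.xml')
```

Only the accumulated dictionary grows; the parser itself keeps a constant memory footprint regardless of file size.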
Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/63473548
