Q: Parsing a very large CSV file. Need to split one field into many smaller rows and keep the ID on every row.

Stack Overflow user

Asked 2018-12-19 12:17:35

2 answers, 209 views, 0 followers, 3 votes

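I have a large CSV made up of an "ID" column and a "History" column.

The ID is simple, just an integer.

The History, however, is a single cell containing up to several hundred entries, separated by *** NOTE *** within a text area.

I want to parse this data with Python and the csv module, and export it as a new CSV, as shown below.

Existing data structure: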

ID,History

56457827, "*** NOTE ***
2014-02-25
Long note here.  This is just a stand in to give you an idea
*** NOTE ***
2014-02-20
Another example.
This one has carriage returns.

Demonstrates they're all a bit different, though are really just text based"
56457896, "*** NOTE ***
2015-03-26
Another example of a note here.  This is the text portion.
*** NOTE ***
2015-05-24
Another example yet again."

Desired data structure:

ID, Date, History

56457827, 2014-02-25, "Long note here.  This is just a stand in to give you an idea"
56457827, 2014-02-20, "Another example.
This one has carriage returns.

Demonstrates they're all a bit different, though are really just text based"
56457896, 2015-03-26, "Another example of a note here.  This is the text portion."
56457896, 2015-05-24, "Another example yet again."

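So I need to get to grips with a few commands. I'm guessing a loop will get the data into a form I can manage, but I still need to parse it.

I think I need to:

1. Start a loop through the CSV structure
2. Note the first ID
3. Search the History field for *** NOTE ***
4. Somehow grab the date string and note it
5. Add all the string data found after the date string to a variable (let's call it "historyShaper") until...
6. ...until I find the next *** NOTE ***
7. Remove all instances of *** NOTE *** from the new variable
8. Write the ID and "historyShaper" to a new row in a new CSV file
9. Repeat steps 2-8 until the end of the History field

The file is about 5MB. Is this the best approach? I'm fairly new to programming and data manipulation, so I'm open to any constructive criticism before I open my laptop tonight and dig in.

Thanks very much, all feedback is greatly appreciated.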
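Steps 3 to 7 above could be sketched roughly as follows. This is only an illustration, assuming every note starts with the marker followed by a date on its own line, as in the sample data; the names `NOTE_MARKER` and `split_history` are made up for this sketch and are not from the original post.

```python
NOTE_MARKER = '*** NOTE ***'  # separator text as it appears in the sample data


def split_history(history):
    """Split one History cell into a list of (date, text) tuples."""
    entries = []
    for chunk in history.split(NOTE_MARKER):
        chunk = chunk.strip()
        if not chunk:          # nothing precedes the first marker
            continue
        # Assumption: the first line of each note is its date
        date, _, text = chunk.partition('\n')
        entries.append((date.strip(), text.strip()))
    return entries


sample = (
    "*** NOTE ***\n2015-03-26\n"
    "Another example of a note here.  This is the text portion.\n"
    "*** NOTE ***\n2015-05-24\nAnother example yet again."
)
for date, text in split_history(sample):
    print(date, '|', text)
```

Splitting on the marker removes it in the same pass, so no separate cleanup step is needed for step 7.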

python

python-3.x

csv

2 Answers

Stack Overflow user

Accepted answer

Posted 2018-12-19 13:55:45

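OK, you can easily parse the input file with the csv module, but you need to set skipinitialspace because your file has spaces after the commas. I also assume the blank line after the header should not be there.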
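To see why skipinitialspace matters here, the small sketch below parses the same line twice. The sample line is invented for illustration. With the default settings, the space after the comma becomes the start of the field, so the following quote is treated as ordinary text and the quoted value is split at its inner comma.

```python
import csv
import io

# Made-up line in the same shape as the question's data: a space follows the comma
line = '56457827, "first part, second part"\n'

# Default dialect: the leading space starts the field, so the quote is not special
plain = next(csv.reader(io.StringIO(line)))

# skipinitialspace=True drops the space, so the quoted field is recognised
spaced = next(csv.reader(io.StringIO(line), skipinitialspace=True))

print(len(plain), plain)
print(len(spaced), spaced)
```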

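Then you should split the History column on '*** NOTE ***'. The first line of each note should be the date, and the rest is the actual history. The code could be: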

import csv

input_file_name = 'data.csv'      # assumed file names, adjust as needed
output_file_name = 'output.csv'

with open(input_file_name, newline='') as fd, \
     open(output_file_name, 'w', newline='') as fdout:
    rd = csv.reader(fd, skipinitialspace=True)
    ID, Hist = next(rd)    # read the header line
    wr = csv.writer(fdout)
    wr.writerow((ID, 'Date', Hist))  # write header of output file
    for row in rd:
        if not row:         # tolerate blank lines, e.g. one after the header
            continue
        # print(row)      # uncomment for debug traces
        hists = row[1].split('*** NOTE ***')
        for h in hists:
            h = h.strip()
            if len(h) == 0:     # skip initial empty note
                continue
            # each note should begin with a date line
            date, _, h2 = h.partition('\n')   # h2 is '' if the note has no body
            wr.writerow((row[0], date.strip(), h2.strip()))

Score 1

Stack Overflow user

Posted 2018-12-19 12:59:41

Enjoy

with open('data.csv') as f:
    header = f.readline()        # skip the header line
    blank_line = f.readline()    # skip the blank line after the header

    current_record = None
    s = f.readline()             # first line of the first record
    while s:
        if not current_record:
            current_record = s
        else:
            current_record += s
            # a line ending with a double quote closes the current record
            if s.rstrip().endswith('"'):
                # Replace line breaks with spaces
                current_record = current_record.replace('\r', ' ').replace('\n', ' ')
                # Get ID and history
                ID, history = current_record.split(',', 1)
                # dequote history
                history = history.strip(' "')
                # split history into [date, message] items
                items = [note.strip().split(' ', 1) for note in history.split('*** NOTE ***') if note.strip()]
                for date, message in items:
                    print('{}, {}, {}'.format(ID, date, message))

                current_record = None

        s = f.readline()
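One thing to watch with plain string formatting is that a message containing a comma or a line break is no longer a single CSV field. As a hedged sketch (the row values below are invented, and writing to an in-memory buffer is only for demonstration), csv.writer adds the quoting automatically, so the row reads back unchanged:

```python
import csv
import io

# Invented example row: the message contains both a comma and a line break
row = ['56457827', '2014-02-20', 'Another example.\nThis one, has a comma']

buf = io.StringIO(newline='')
csv.writer(buf).writerow(row)   # fields that need quotes are quoted automatically
written = buf.getvalue()
print(written)

# Reading the text back yields the original three fields
round_trip = next(csv.reader(io.StringIO(written, newline='')))
print(round_trip == row)
```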

Score 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/53851122
