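I have a large CSV made up of an "ID" column and a "History" column.
The ID is simple: just an integer.
The History, however, is a single cell holding up to several hundred entries in a text area, separated by *** NOTE ***.
I want to parse this data with Python and the csv module and export it as a new CSV, as shown below.
Existing data structure: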
ID,History
56457827, "*** NOTE ***
2014-02-25
Long note here. This is just a stand in to give you an idea
*** NOTE ***
2014-02-20
Another example.
This one has carriage returns.
Demonstrates they're all a bit different, though are really just text based"
56457896, "*** NOTE ***
2015-03-26
Another example of a note here. This is the text portion.
*** NOTE ***
2015-05-24
Another example yet again."
Desired data structure:
ID, Date, History
56457827, 2014-02-25, "Long note here. This is just a stand in to give you an idea"
56457827, 2014-02-20, "Another example.
This one has carriage returns.
Demonstrates they're all a bit different, though are really just text based"
56457896, 2015-03-26, "Another example of a note here. This is the text portion."
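56457896, 2015-05-24, "Another example yet again."
So I need to get to grips with a few commands. I'm guessing a loop will bring in the data in a form I can manage, but then I need to parse it.
I think I need to: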
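Answered 2018-12-19 13:55:45
OK, you can easily parse the input file with the csv module, but you need to set skipinitialspace because your file has a space after each comma. I also assume that the blank line after the header should not be there.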
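As a quick illustration of why skipinitialspace matters here, the following is a minimal sketch on a made-up in-memory sample (the `sample` string and the `read_rows` helper are hypothetical, not part of the question's file). Without the flag, the space after the comma means the opening quote is not recognised, so the multi-line History cell is broken into separate rows; with it, the cell comes back as one field.

```python
import csv
import io

# Hypothetical sample mimicking the question's layout: a space follows
# the comma, and the quoted History field spans several lines.
sample = 'ID,History\n1, "*** NOTE ***\n2014-02-25\nfirst note"\n'


def read_rows(text, skipinitialspace):
    """Parse text with csv.reader and return all rows as a list."""
    # newline='' leaves embedded line breaks for the csv module to handle
    stream = io.StringIO(text, newline='')
    return list(csv.reader(stream, skipinitialspace=skipinitialspace))


default_rows = read_rows(sample, skipinitialspace=False)
fixed_rows = read_rows(sample, skipinitialspace=True)

print(len(default_rows))  # header plus three fragments of the broken cell
print(len(fixed_rows))    # header plus one intact record
print(fixed_rows[1])
```

With the flag set, the second element of the data row contains the whole History text, including its line breaks, ready to be split into notes.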
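Then you should split the History column on '*** NOTE ***'. The first line of each note should be the date, and the rest is the actual history. The code could be: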
import csv

# input_file_name and output_file_name are paths you supply
with open(input_file_name, newline='') as fd, \
        open(output_file_name, "w", newline='') as fdout:
    rd = csv.reader(fd, skipinitialspace=True)
    ID, Hist = next(rd)  # read the header line
    wr = csv.writer(fdout)
    _ = wr.writerow((ID, 'Date', Hist))  # write header of output file
    for row in rd:
        # print(row)  # uncomment for debug traces
        hists = row[1].split('*** NOTE ***')
        for h in hists:
            h = h.strip()
            if len(h) == 0:  # skip initial empty note
                continue
            # should begin with a date line
            date, h2 = h.split('\n', 1)
            _ = wr.writerow((row[0], date.strip(), h2.strip()))

Answered 2018-12-19 12:59:41
Enjoy
with open('data.csv') as f:
    header = f.readline()  # skip headers line
    blank_line = f.readline()  # blank line
    current_record = None
    s = f.readline()  # first data line
    while s:
        # accumulate physical lines until the record's closing quote
        if not current_record:
            current_record = s
        else:
            current_record += s
        if s.rstrip().endswith('"'):
            # Remove line breaks
            current_record = current_record.replace('\r', ' ').replace('\n', ' ')
            # Get ID and history
            ID, history = current_record.split(',', 1)
            # dequote history
            history = history.strip(' "')
            # split history into items
            items = [note.strip().split(' ', 1) for note in history.split('*** NOTE ***') if note]
            for date, message in items:
                print('{}, {}, {}'.format(ID, date, message))
            current_record = None
        s = f.readline()

https://stackoverflow.com/questions/53851122