I have a .txt file with 683,500 lines, where every 7 lines describe a different person.
I want to read the .txt and produce output (JSON, CSV, TXT, or even rows in a database) with these columns:
ID, Name, Work position, Date 1, Date 2, Gross payment, Service time
Example from the .txt:
00000000886
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-08
2021-09
30,556.04
15.7
00000000086
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-01
2021-09
30,556.04
15.7
00100000086
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-01
2021-09
30,556.04
15.7
import csv

# opening file
file = open(r"C:\Users\Redford\Documents\Proyecto automatizacion\data1.txt")  # open file
counter = 0
total_lines = len(file.readlines())  # count lines
# print('Total lines:', x)

# reading from file
content = file.read()
colist = content.split()
print(colist)

# read data from data1.txt and write in data2.txt
lines = open(r"C:\Users\Redford\Documents\Proyecto automatizacion\data1.txt")
arr = []
with open('data2.txt', 'w') as f:
    for line in lines:
        # arr.append(line)
        f.write(line)

I am very new to programming and I don't know how to turn my logic into code.
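Posted on 2022-04-07 14:23:24
Your code does not collect several lines in order to write them out as a single row.
Use this approach instead.
Create the data file: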
with open("t.txt", "w") as f:
    f.write("""00000000886\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-08\n2021-09\n30,556.04\n15.7
00000000086\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-01\n2021-09\n30,556.04\n15.7
00100000086\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-01\n2021-09\n30,556.04\n15.7""")

Program:
import csv

with open("t.csv", "w", newline="") as wr, open("t.txt") as r:
    # create a csv writer
    writer = csv.writer(wr)

    # uncomment if you want a header over your data
    # h = ["ID", "Name", "Work position", "Date 1", "Date 2",
    #      "Gross payment", "Service time"]
    # writer.writerow(h)

    person = []
    for line in r:  # could use enumerate as well, this works ok
        # collect line data minus the \n into list
        person.append(line.strip())

        # this person is finished, write, clear list
        if len(person) == 7:
            # leveraged the csv module writer, look it up if you need
            # to customize it further regarding quoting etc
            writer.writerow(person)
            person = []  # reset list for next person

    # something went wrong, your file is inconsistent, write remainder
    if person:
        writer.writerow(person)

print(open("t.csv").read())

Output:
00000000886,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-08,2021-09,"30,556.04",15.7
00000000086,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-01,2021-09,"30,556.04",15.7
00100000086,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-01,2021-09,"30,556.04",15.7

Read up on: the csv module's writer.
The "Gross payment" value needs to be quoted because it contains a ',', which is the CSV delimiter. The module does this automatically.
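To see the quoting rule in action without touching any files, here is a minimal sketch that writes one record to an in-memory buffer and reads it back. The use of io.StringIO and the final conversion of the amount to a float are illustrative assumptions, not part of the answer above.

```python
import csv
import io

# One record shaped like the sample data; the amount contains a comma.
row = ["00000000886", "MANUEL DE JESUS SUBERVI PEÑA", "MAESTRO MEDIA GENERAL",
       "2006-08", "2021-09", "30,556.04", "15.7"]

# Write to an in-memory buffer instead of a file (assumption, for brevity).
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)
csv_text = buffer.getvalue()

# The writer wrapped only the field containing the delimiter in quotes.
print(csv_text)

# Reading the text back restores the original seven fields.
parsed = next(csv.reader(io.StringIO(csv_text, newline="")))

# Turning the amount into a number is a separate step; stripping the
# thousands separator first is an assumption about the number format.
gross_payment = float(parsed[5].replace(",", ""))
```

Because the reader undoes the quoting, downstream code sees the amount as a single field again and can decide for itself how to parse it.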
Posted on 2022-04-07 15:21:07
On top of @PatrickArtner's excellent answer, I would like to propose an itertools-based solution:
import csv
import itertools


def file_grouper_itertools(
        in_filepath="t.txt",
        out_filepath="t.csv",
        size=7):
    with open(in_filepath, 'r') as in_file, \
            open(out_filepath, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        # the same iterator repeated `size` times yields consecutive lines
        args = [iter(in_file)] * size
        # a short final block is padded with empty strings
        for block in itertools.zip_longest(*args, fillvalue=''):
            # equivalent, for the given input, to:
            # block = [x.rstrip('\n') for x in block]
            block = ''.join(block).rstrip('\n').split('\n')
            writer.writerow(block)

The idea here is to loop over blocks of the desired size. This gets faster as the group size grows, because the main loop runs fewer iterations.
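The trick that makes this grouping work is that the same iterator object appears size times in the argument list, so each tuple produced by zip_longest consumes size consecutive items. A small sketch on an in-memory list (the sample values and the empty-string padding are assumptions chosen for illustration):

```python
import itertools

# Five "lines" grouped in blocks of two, so the last block is incomplete.
lines = ["a\n", "b\n", "c\n", "d\n", "e\n"]
size = 2

# One iterator, referenced `size` times: every position in the zip pulls
# the next item from the same underlying stream.
args = [iter(lines)] * size

blocks = [
    [item.rstrip("\n") for item in block]
    for block in itertools.zip_longest(*args, fillvalue="")
]

print(blocks)
# [['a', 'b'], ['c', 'd'], ['e', '']]
```

Creating size separate iterators instead (for example with a list comprehension calling iter each time) would not group anything: every tuple would then hold size copies of the same position.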
Some micro-benchmarks indicate that this approach would benefit your use case compared to the manual loop (adapted here into a function):
import csv


def file_grouper_manual(
        in_filepath="t.txt",
        out_filepath="t.csv",
        size=7):
    with open(in_filepath, 'r') as in_file, \
            open(out_filepath, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        block = []
        for line in in_file:
            block.append(line.rstrip('\n'))
            if len(block) == size:
                writer.writerow(block)
                block = []
        # write any incomplete final block
        if block:
            writer.writerow(block)

Benchmarks:
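n = 100_000
k = 7
with open("t.txt", "w") as f:
    for i in range(n):
        # end each record with a newline so that records do not run together
        f.write("\n".join(["0123456"] * k) + "\n")

%timeit file_grouper_manual()
# 1 loop, best of 5: 325 ms per loop
%timeit file_grouper_itertools()
# 1 loop, best of 5: 230 ms per loop

Alternatively, you could use Pandas, which is very convenient but requires the whole input to fit into available memory (this should not be a problem in your case, but it could be for larger inputs):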
import numpy as np
import pandas as pd


def file_grouper_pandas(in_filepath="t.txt", out_filepath="t.csv", size=7):
    with open(in_filepath) as in_file:
        data = [x.rstrip('\n') for x in in_file.readlines()]
    # reshape requires the number of lines to be a multiple of `size`
    df = pd.DataFrame(np.array(data).reshape((-1, size)), columns=list(range(size)))
    # consistent with the other solutions
    df.to_csv(out_filepath, header=False, index=False)


%timeit file_grouper_pandas()
# 1 loop, best of 5: 666 ms per loop

Posted on 2022-04-07 14:46:19
If you work with tables and data a lot, NumPy and Pandas are very useful libraries.
import numpy as np
import pandas as pd

columns = ['ID', 'Name', 'Work position', 'Date 1 (year - month)', 'Date 2 (year - month)',
           'Gross payment', 'Service time']

with open('oldfile.txt', 'r') as stream:
    # read file into a list of lines
    lines = stream.readlines()

# remove newline character from each element of the list.
lines = [line.strip('\n') for line in lines]

# Figure out how many rows there will be in the table
number_of_people = len(lines) // 7

# Split data into rows
data = np.array_split(lines, number_of_people)

# Convert data to pandas dataframe
df = pd.DataFrame(data, columns=columns)

Once you have converted the data to a Pandas DataFrame, you can easily output it to any of the formats you listed. For example, to output to CSV you can do:
df.to_csv('newfile.csv')

Or for JSON:

df.to_json('newfile.json')

https://stackoverflow.com/questions/71783782