我有一个非常大的CSV文件,如下所示:
rownum, id, first, last, age, ADDRESS, weight, hair, pet, food
1, 123450, John, Bingo, 47, 123 Odd St., Waverly Place Apts, PO Box 12345, Apt#5E, Upper-Ontario, Eastern Province A12-E765, Not Puerto Rico, US, 299, red, cat, lasagna
2, 125379, Joe, Durante, 61, 19345 S. 1st Ave., Seattle, WA, 16748, 180, blonde, dog, hotdogs
3, 197572, James, Gringo, 39, 123 Maypole St., Northside Castle, upper east side, NY, NY 30594, 202, brown, dog, lo-mein
4, 129358, Jim, Dingus, 22, 0985 Martyr Ave, Fancytown, MA 49436, USA, 163, brown, goldfish, hamburgers
5, 987543, Dwayne 'The Rock', Johnson, 42, 555 Fitness Ln, Los Angeles, CA, 90210, 260, black, dog, steak
6, 048573, Jean, Grey, 33, 987 X-Men Rd., Rm. 3F, outside boston?, MA 34972, 130, red, <null>, salad
7, 756432, Jose, Cuervo, 59, 444 Jalisco Rd., agave_town, Mexico, not, sure, what, their zipcode system is?, 145, black, dog, margaritas
8, 845384, Junebug, Messerschmit, 2, 22nd Ave N, Boston, MA 45678, 130, blonde, turtle, lollipops
9, 634839, Jimbo, Humboldt, 99, 111 1st Street Kansas City KS 84638, 220, brown, ferrets, tacos
10, 483629, Julius, Caesar, 30, Emperors Estate in Ancient Rome, 145, brown, servants, grapes由于额外的逗号,我在解析ADDRESS列时遇到了困难。我想要的输出如下所示:
rownum| id| first| last| age| ADDRESS| weight| hair| pet| food
1| 123450| John| Bingo| 47| 123 Odd St., Waverly Place Apts, PO Box 12345, Apt#5E, Upper-Ontario, Eastern Province A12-E765, Not Puerto Rico, US| 299| red| cat| lasagna
2| 125379| Joe| Durante| 61| 19345 S. 1st Ave., Seattle, WA, 16748 180| blonde| dog| hotdogs
3| 197572| James| Gringo| 39| 123 Maypole St., Northside Castle, upper east side, NY, NY 30594| 202| brown| dog| lo-mein
4| 129358| Jim| Dingus| 22| 0985 Martyr Ave, Fancytown, MA 49436, USA| 163| brown| goldfish| hamburgers
5| 987543| Dwayne 'The Rock'| Johnson| 42| 555 Fitness Ln, Los Angeles, CA, 90210| 260| black| dog| steak
6| 048573| Jean| Grey| 33| 987 X-Men Rd., Rm. 3F, outside boston?,. MA 34972| 130| red| <null>| salad
7| 756432| Jose| Cuervo| 59| 444 Jalisco Rd., agave_town, Mexico, not, sure, what| their zipcode system is?| 145| black| dog| margaritas
8| 845384| Junebug| Messerschmit| 2| 22nd Ave N, Boston, MA 45678| 130| blonde| turtle| lollipops
9| 634839| Jimbo| Humboldt| 99| 111 1st Street Kansas City KS 84638| 220| brown| ferrets| tacos
10| 483629| Julius| Caesar| 30| Emperors Estate in Ancient Rome| 145| brown| servants| grapes它不需要管道分隔,我只需要它的格式,Excel可以正确读取。我无法用文本导入向导在Excel中完成它,但也许我遗漏了什么?我是不是忽略了一个简单的解决方案?
首先,我想我可以使用Notepad++简单地做一个regex查找和替换(例如,替换第一个",“”使用一个“,转到下一行,重复。跑5次。然后从每一行的末尾向后执行regex,并运行4次。我没能把这个弄到手。
现在我试着用蟒蛇和熊猫来做,但是我并没有走多远,因为我对蟒蛇还是很陌生的。我对python、csv/文本文件读/写、每一行的regex迭代等方面没有多少经验。
我确信有不同的方法来解决这个问题(例如,从开始到结束,在第5个逗号之后添加dbl引号转义字符,在第4个逗号之前添加)。
到目前为止,我的笔记本上有一堆东西。
## Idea01a: The user defines the delimiter character and proper qty per line
delimiter=input("Type delimiter example here, then press enter: ")
l=input("paste 1st line of csv here (e.g. column headers only), then press enter: ")
d={}
## print(l)
for i in l:
if i not in d:
d[i]=l.count(i)
else:
pass
qty_proper_delimiters_total = (d[delimiter])
print("Delimiter character chosen:")
print(delimiter)
print("Proper number of delimiters:")
print(qty_proper_delimiters_total)
## Idea01b: User defines problem column
bad_column=input("""Enter afflicted column number, then press enter:
(e.g., A=1, B=2, C=3, D=4, E=5, F=6, G=7, etc.)""")
print(bad_column)
qty_proper_delimiters_before_bad_column = (int(bad_column)-1)
print("Proper qty commas BEFORE bad column:")
print(qty_proper_delimiters_before_bad_column)
qty_proper_delimiters_after_bad_column = (qty_proper_delimiters_total-(int(bad_column)-1))
print("Proper qty commas AFTER bad column:")
print(qty_proper_delimiters_after_bad_column)
## Idea02: Insert escape character just right/left of flanking commas
## (iterate n1 times from line start, then n2 times backwards from line end)
n1 = qty_proper_delimiters_before_bad_column
n2 = qty_proper_delimiters_after_bad_column
txt = input("Copy/Paste entire CSV here:")
## need to figure out how to iterate line by line谢谢你的帮助。
发布于 2022-08-23 20:09:12
使用新的数据,我将根据我在评论部分中所写的想法发布一个答案:您可能可以通过删除第一个和最后(一些固定数量的)字段来识别地址部分。因此,尝试以下步骤,并仔细检查它是否实现了您想要的目标。
import csv
with open('foo.csv') as f:
for record in csv.reader(f):
print(*record[:5], ', '.join(record[5:-4]), *record[-4:], sep='|')输出:
rownum| id| first| last| age| ADDRESS| weight| hair| pet| food
1| 123450| John| Bingo| 47| 123 Odd St., Waverly Place Apts, PO Box 12345, Apt#5E, Upper-Ontario, Eastern Province A12-E765, Not Puerto Rico, US| 29
9| red| cat| lasagna
2| 125379| Joe| Durante| 61| 19345 S. 1st Ave., Seattle, WA, 16748| 180| blonde| dog| hotdogs
3| 197572| James| Gringo| 39| 123 Maypole St., Northside Castle, upper east side, NY, NY 30594| 202| brown| dog| lo-mein
4| 129358| Jim| Dingus| 22| 0985 Martyr Ave, Fancytown, MA 49436, USA| 163| brown| goldfish| hamburgers
5| 987543| Dwayne 'The Rock'| Johnson| 42| 555 Fitness Ln, Los Angeles, CA, 90210| 260| black| dog| steak
6| 048573| Jean| Grey| 33| 987 X-Men Rd., Rm. 3F, outside boston?, MA 34972| 130| red| <null>| salad
7| 756432| Jose| Cuervo| 59| 444 Jalisco Rd., agave_town, Mexico, not, sure, what, their zipcode system is?| 145| black| dog| margaritas
8| 845384| Junebug| Messerschmit| 2| 22nd Ave N, Boston, MA 45678| 130| blonde| turtle| lollipops
9| 634839| Jimbo| Humboldt| 99| 111 1st Street Kansas City KS 84638| 220| brown| ferrets| tacos
10| 483629| Julius| Caesar| 30| Emperors Estate in Ancient Rome| 145| brown| servants| grapes发布于 2022-08-23 21:11:17
我这么做纯粹是为了joy的挑战。我不知道它是否有效。如果有很大帮助--否则,J1-Lee的答案可能会很好。
import os
if os.path.exists("new_csv.csv"):
os.remove("new_csv.csv")
with open("my_csv.csv", "r") as f:
contents = f.readlines()
headers = [e.strip().lower() for e in contents[0].split(",")]
address = headers.index("address")
with open("new_csv.csv", "a") as f:
for e in contents[1:]:
c = e.split(",")
f.writelines(f"""{"|".join(c[:address])}|{", ".join(c[address:(len(c) - len(headers) + address + 1)])}|{"|".join(c[(len(c) - len(headers) + address + 1):])}""")发布于 2022-08-23 22:16:29
只要address字段是唯一一个有额外逗号的字段,这将清理并编写一个CSV文件,该文件将在Excel中很好地读取:
import csv
with (open('input.csv', newline='') as fin,
open('output.csv', 'w', newline='') as fout):
r = csv.reader(fin, skipinitialspace=True) # handles non-standard comma-space separator
w = csv.writer(fout)
# header doesn't need extra processing
header = next(r)
w.writerow(header)
# *address will capture any extra columns due to commas in address.
for rownum, _id, first, last, age, *address, weight, hair, pet, food in r:
# the extra address columns will be joined into one column.
w.writerow([rownum, _id, first, last, age, ', '.join(address), weight, hair, pet, food])输出:
outrownum,id,first,last,age,ADDRESS,weight,hair,pet,food
1,123450,John,Bingo,47,"123 Odd St., Waverly Place Apts, PO Box 12345, Apt#5E, Upper-Ontario, Eastern Province A12-E765, Not Puerto Rico, US",299,red,cat,lasagna
2,125379,Joe,Durante,61,"19345 S. 1st Ave., Seattle, WA, 16748",180,blonde,dog,hotdogs
3,197572,James,Gringo,39,"123 Maypole St., Northside Castle, upper east side, NY, NY 30594",202,brown,dog,lo-mein
4,129358,Jim,Dingus,22,"0985 Martyr Ave, Fancytown, MA 49436, USA",163,brown,goldfish,hamburgers
5,987543,Dwayne 'The Rock',Johnson,42,"555 Fitness Ln, Los Angeles, CA, 90210",260,black,dog,steak
6,048573,Jean,Grey,33,"987 X-Men Rd., Rm. 3F, outside boston?, MA 34972",130,red,<null>,salad
7,756432,Jose,Cuervo,59,"444 Jalisco Rd., agave_town, Mexico, not, sure, what, their zipcode system is?",145,black,dog,margaritas
8,845384,Junebug,Messerschmit,2,"22nd Ave N, Boston, MA 45678",130,blonde,turtle,lollipops
9,634839,Jimbo,Humboldt,99,111 1st Street Kansas City KS 84638,220,brown,ferrets,tacos
10,483629,Julius,Caesar,30,Emperors Estate in Ancient Rome,145,brown,servants,grapeshttps://stackoverflow.com/questions/73464196
复制相似问题