首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >在csv文件的一列中修复额外逗号的最简单方法

在csv文件的一列中修复额外逗号的最简单方法
EN

Stack Overflow用户
提问于 2022-08-23 19:29:33
回答 3查看 104关注 0票数 2

我有一个非常大的CSV文件,如下所示:

代码语言:javascript
复制
rownum, id, first, last, age, ADDRESS, weight, hair, pet, food
1, 123450, John, Bingo, 47, 123 Odd St., Waverly Place Apts, PO Box 12345, Apt#5E, Upper-Ontario, Eastern Province A12-E765, Not Puerto Rico, US, 299, red, cat, lasagna
2, 125379, Joe, Durante, 61, 19345 S. 1st Ave., Seattle, WA, 16748, 180, blonde, dog, hotdogs
3, 197572, James, Gringo, 39, 123 Maypole St., Northside Castle, upper east side, NY, NY 30594, 202, brown, dog, lo-mein
4, 129358, Jim, Dingus, 22, 0985 Martyr Ave, Fancytown, MA 49436, USA, 163, brown, goldfish, hamburgers
5, 987543, Dwayne 'The Rock', Johnson, 42, 555 Fitness Ln, Los Angeles, CA, 90210, 260, black, dog, steak
6, 048573, Jean, Grey, 33, 987 X-Men Rd., Rm. 3F, outside boston?, MA 34972, 130, red, <null>, salad
7, 756432, Jose, Cuervo, 59, 444 Jalisco Rd., agave_town, Mexico, not, sure, what, their zipcode system is?, 145, black, dog, margaritas
8, 845384, Junebug, Messerschmit, 2, 22nd Ave N, Boston, MA 45678, 130, blonde, turtle, lollipops
9, 634839, Jimbo, Humboldt, 99, 111 1st Street Kansas City KS 84638, 220, brown, ferrets, tacos
10, 483629, Julius, Caesar, 30, Emperors Estate in Ancient Rome, 145, brown, servants, grapes

由于额外的逗号,我在解析ADDRESS列时遇到了困难。我想要的输出如下所示:

代码语言:javascript
复制
rownum| id| first| last| age| ADDRESS| weight| hair| pet| food
1| 123450| John| Bingo| 47| 123 Odd St., Waverly Place Apts, PO Box 12345, Apt#5E, Upper-Ontario, Eastern Province A12-E765, Not Puerto Rico, US| 299| red| cat| lasagna
2| 125379| Joe| Durante| 61| 19345 S. 1st Ave., Seattle, WA, 16748 180| blonde| dog| hotdogs
3| 197572| James| Gringo| 39| 123 Maypole St., Northside Castle, upper east side, NY, NY 30594| 202| brown| dog| lo-mein
4| 129358| Jim| Dingus| 22| 0985 Martyr Ave, Fancytown, MA 49436, USA| 163| brown| goldfish| hamburgers
5| 987543| Dwayne 'The Rock'| Johnson| 42| 555 Fitness Ln, Los Angeles, CA, 90210| 260| black| dog| steak
6| 048573| Jean| Grey| 33| 987 X-Men Rd., Rm. 3F, outside boston?,. MA 34972| 130| red| <null>| salad
7| 756432| Jose| Cuervo| 59| 444 Jalisco Rd., agave_town, Mexico, not, sure, what| their zipcode system is?| 145| black| dog| margaritas
8| 845384| Junebug| Messerschmit| 2| 22nd Ave N, Boston, MA 45678| 130| blonde| turtle| lollipops
9| 634839| Jimbo| Humboldt| 99| 111 1st Street Kansas City KS 84638| 220| brown| ferrets| tacos
10| 483629| Julius| Caesar| 30| Emperors Estate in Ancient Rome| 145| brown| servants| grapes

它不需要管道分隔,我只需要它的格式,Excel可以正确读取。我无法用文本导入向导在Excel中完成它,但也许我遗漏了什么?我是不是忽略了一个简单的解决方案?

首先,我想我可以使用Notepad++简单地做一个regex查找和替换(例如,替换第一个",“”使用一个“,转到下一行,重复。跑5次。然后从每一行的末尾向后执行regex,并运行4次。我没能把这个弄到手。

现在我试着用蟒蛇和熊猫来做,但是我并没有走多远,因为我对蟒蛇还是很陌生的。我对python、csv/文本文件读/写、每一行的regex迭代等方面没有多少经验。

我确信有不同的方法来解决这个问题(例如,从开始到结束,在第5个逗号之后添加dbl引号转义字符,在第4个逗号之前添加)。

到目前为止,我的笔记本上有一堆东西。

代码语言:javascript
复制
## Idea01a: The user defines the delimiter character and proper qty per line

delimiter=input("Type delimiter example here, then press enter: ")
l=input("paste 1st line of csv here (e.g. column headers only), then press enter: ")
d={}
## print(l)
for i in l:
 if i not in d:
  d[i]=l.count(i)
 else:
  pass

qty_proper_delimiters_total = (d[delimiter])
print("Delimiter character chosen:")
print(delimiter)
print("Proper number of delimiters:")
print(qty_proper_delimiters_total)

## Idea01b: User defines problem column

bad_column=input("""Enter afflicted column number, then press enter: 
(e.g., A=1, B=2, C=3, D=4, E=5, F=6, G=7, etc.)""")
print(bad_column)
qty_proper_delimiters_before_bad_column = (int(bad_column)-1)
print("Proper qty commas BEFORE bad column:")
print(qty_proper_delimiters_before_bad_column)
qty_proper_delimiters_after_bad_column = (qty_proper_delimiters_total-(int(bad_column)-1))
print("Proper qty commas AFTER bad column:")
print(qty_proper_delimiters_after_bad_column)

## Idea02: Insert escape character just right/left of flanking commas 

## (iterate n1 times from line start, then n2 times backwards from line end)
n1 = qty_proper_delimiters_before_bad_column
n2 = qty_proper_delimiters_after_bad_column

txt = input("Copy/Paste entire CSV here:")
## need to figure out how to iterate line by line

谢谢你的帮助。

EN

回答 3

Stack Overflow用户

发布于 2022-08-23 20:09:12

使用新的数据,我将根据我在评论部分中所写的想法发布一个答案:您可能可以通过删除第一个和最后(一些固定数量的)字段来识别地址部分。因此,尝试以下步骤,并仔细检查它是否实现了您想要的目标。

代码语言:javascript
复制
import csv

with open('foo.csv') as f:
    for record in csv.reader(f):
        print(*record[:5], ', '.join(record[5:-4]), *record[-4:], sep='|')

输出:

代码语言:javascript
复制
rownum| id| first| last| age| ADDRESS| weight| hair| pet| food
1| 123450| John| Bingo| 47| 123 Odd St.,  Waverly Place Apts,  PO Box 12345,  Apt#5E,  Upper-Ontario,  Eastern Province A12-E765,  Not Puerto Rico,  US| 29
9| red| cat| lasagna
2| 125379| Joe| Durante| 61| 19345 S. 1st Ave.,  Seattle,  WA,  16748| 180| blonde| dog| hotdogs
3| 197572| James| Gringo| 39| 123 Maypole St.,  Northside Castle,  upper east side,  NY,  NY 30594| 202| brown| dog| lo-mein
4| 129358| Jim| Dingus| 22| 0985 Martyr Ave,  Fancytown,  MA 49436,  USA| 163| brown| goldfish| hamburgers
5| 987543| Dwayne 'The Rock'| Johnson| 42| 555 Fitness Ln,  Los Angeles,  CA,  90210| 260| black| dog| steak
6| 048573| Jean| Grey| 33| 987 X-Men Rd.,  Rm. 3F,  outside boston?,  MA 34972| 130| red| <null>| salad
7| 756432| Jose| Cuervo| 59| 444 Jalisco Rd.,  agave_town,  Mexico,  not,  sure,  what,  their zipcode system is?| 145| black| dog| margaritas
8| 845384| Junebug| Messerschmit| 2| 22nd Ave N,  Boston,  MA 45678| 130| blonde| turtle| lollipops
9| 634839| Jimbo| Humboldt| 99| 111 1st Street Kansas City KS 84638| 220| brown| ferrets| tacos
10| 483629| Julius| Caesar| 30| Emperors Estate in Ancient Rome| 145| brown| servants| grapes
票数 2
EN

Stack Overflow用户

发布于 2022-08-23 21:11:17

我这么做纯粹是为了joy的挑战。我不知道它是否有效。如果有很大帮助--否则,J1-Lee的答案可能会很好。

代码语言:javascript
复制
import os

if os.path.exists("new_csv.csv"):
    os.remove("new_csv.csv")

with open("my_csv.csv", "r") as f:
    contents = f.readlines()

headers = [e.strip().lower() for e in contents[0].split(",")]
address = headers.index("address")

with open("new_csv.csv", "a") as f:
    for e in contents[1:]:
        c = e.split(",")
        f.writelines(f"""{"|".join(c[:address])}|{", ".join(c[address:(len(c) - len(headers) + address + 1)])}|{"|".join(c[(len(c) - len(headers) + address + 1):])}""")
票数 1
EN

Stack Overflow用户

发布于 2022-08-23 22:16:29

只要address字段是唯一一个有额外逗号的字段,这将清理并编写一个CSV文件,该文件将在Excel中很好地读取:

代码语言:javascript
复制
import csv

with (open('input.csv', newline='') as fin,
      open('output.csv', 'w', newline='') as fout):

    r = csv.reader(fin, skipinitialspace=True) # handles non-standard comma-space separator
    w = csv.writer(fout)

    # header doesn't need extra processing
    header = next(r)
    w.writerow(header)

    # *address will capture any extra columns due to commas in address.
    for rownum, _id, first, last, age, *address, weight, hair, pet, food in r:
        # the extra address columns will be joined into one column.
        w.writerow([rownum, _id, first, last, age, ', '.join(address), weight, hair, pet, food])

输出:

代码语言:javascript
复制
outrownum,id,first,last,age,ADDRESS,weight,hair,pet,food
1,123450,John,Bingo,47,"123 Odd St., Waverly Place Apts, PO Box 12345, Apt#5E, Upper-Ontario, Eastern Province A12-E765, Not Puerto Rico, US",299,red,cat,lasagna
2,125379,Joe,Durante,61,"19345 S. 1st Ave., Seattle, WA, 16748",180,blonde,dog,hotdogs
3,197572,James,Gringo,39,"123 Maypole St., Northside Castle, upper east side, NY, NY 30594",202,brown,dog,lo-mein
4,129358,Jim,Dingus,22,"0985 Martyr Ave, Fancytown, MA 49436, USA",163,brown,goldfish,hamburgers
5,987543,Dwayne 'The Rock',Johnson,42,"555 Fitness Ln, Los Angeles, CA, 90210",260,black,dog,steak
6,048573,Jean,Grey,33,"987 X-Men Rd., Rm. 3F, outside boston?, MA 34972",130,red,<null>,salad
7,756432,Jose,Cuervo,59,"444 Jalisco Rd., agave_town, Mexico, not, sure, what, their zipcode system is?",145,black,dog,margaritas
8,845384,Junebug,Messerschmit,2,"22nd Ave N, Boston, MA 45678",130,blonde,turtle,lollipops
9,634839,Jimbo,Humboldt,99,111 1st Street Kansas City KS 84638,220,brown,ferrets,tacos
10,483629,Julius,Caesar,30,Emperors Estate in Ancient Rome,145,brown,servants,grapes
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/73464196

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档