I have the following data, which uses \x01 as the field separator and \x02\n as the row separator. Here is a sample of the data:
#export_date\x01artist_id\x01name\x01is_actual_artist\x01view_url\x01artist_type_id\x02\n#primaryKey:artist_id\x02\n
#dbTypes:BIGINT\x01INTEGER\x01VARCHAR(1000)\x01BOOLEAN\x01VARCHAR(1000)\x01INTEGER\x02\n#exportMode:INCREMENTAL\x02\n
1475226000146\x011120695691\x01Kinitic SA\x011\x01http://itunes.apple.com/artist/kinitic-sa/id1120695691?uo=5\x017\x02\n
However, when I try to parse it with the csv module, I get the following result:
import csv

with open('myfile', 'r') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    print(dialect.__dict__)
mappingproxy({'__module__': 'csv', '_name': 'sniffed', 'lineterminator': '\r\n', 'quoting': 0, '__doc__': None, 'doublequote': False, 'delimiter': ' ', 'quotechar': "'", 'skipinitialspace': False})
Unfortunately, this is wrong: it decides the delimiter is a space (and it is still wrong even if I increase the buffer size).
Is there a more accurate way to determine the delimiter and line terminator than using this module?
Posted on 2018-12-17 22:52:59
It's clunky, but you can count the occurrences of each candidate separator character in the input stream. For example:
import collections

SEPARATORS = ['\x00', '\x01', '\x02', '^', ':', ',', '\t', ';', '|', '~']

def count_separator(filename, separators=SEPARATORS):
    with open(filename, 'r') as f:
        text = f.read(1024 * 1024)
    counts = collections.Counter(c for c in text if c in separators)
    print(counts)
    return counts.most_common()[0][0]
>>> count_separator('/Users/david/Desktop/validate_headers/artist')
Counter({'\x01': 48549, ':': 9752, '\x02': 9741, ',': 295, ';': 3})
'\x01'
Another option, suggested by badger0053 above, is to feed only the first line of data to the sniffer. That seems to work much better:
SEPARATORS = ['\x00', '\x01', '^', ':', ',', '\t', ';', '|', '~', ' ']
LINE_TERMINATORS_IN_ORDER = ['\x02\n', '\r\n', '\n', '\r']

with open('/Users/david/Desktop/validate_headers/artist', 'r') as csvfile:
    line = next(csvfile)
    dialect = csv.Sniffer().sniff(line, SEPARATORS)
    for _terminator in LINE_TERMINATORS_IN_ORDER:
        if line.endswith(_terminator):
            terminator = _terminator
            break
    print(repr(dialect.delimiter), repr(terminator))
'\x01' '\x02\n'
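Once the delimiter and terminator are detected, note that `csv.reader` cannot honor a multi-character line terminator such as `'\x02\n'` when reading (the `lineterminator` attribute is only used on write). A minimal sketch of actually parsing such a file, assuming the separators found above and a small inline sample rather than the real file:

```python
# Minimal sketch: parse the data manually once the separators are known,
# since csv.reader ignores a custom multi-character line terminator on read.
FIELD_SEP = '\x01'
ROW_SEP = '\x02\n'

def parse_rows(text, field_sep=FIELD_SEP, row_sep=ROW_SEP):
    # Split into rows; drop the empty trailing piece left after the final row_sep.
    rows = text.split(row_sep)
    if rows and rows[-1] == '':
        rows.pop()
    return [row.split(field_sep) for row in rows]

sample = 'a\x01b\x01c\x02\n1\x012\x013\x02\n'
print(parse_rows(sample))
# [['a', 'b', 'c'], ['1', '2', '3']]
```

For very large files you would read and split incrementally instead of loading the whole text, but the splitting logic stays the same.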
https://stackoverflow.com/questions/53823563