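I want to read a text file as CSV, but the file begins with plain text (the tab-separated values start around line 20). Because the file may be updated in the future, I would like pandas to start reading at the correct line automatically (at the moment, I obviously get a parser error). Is there a way to configure the read so that it finds the number of lines to skip at the start of the file? (Perhaps by fixing the number of columns?) Thanks!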
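Edit: Below is the beginning of the text file. As you can see, the first 30 lines are not delimited. For now I just skip them with a fixed value, but I worry that the file may be updated and that number may change.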
#
# Unihan_IRGSources.txt
# Date: 2021-08-06 16:32:36 GMT [JHJ]
# Unicode version: 14.0.0
#
# Unicode Character Database
# © 2021 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use, see http://www.unicode.org/terms_of_use.html
# For documentation, see http://www.unicode.org/reports/tr38/
#
# This file contains data on the following fields from the Unihan database:
# kCompatibilityVariant
# kIICore
# kIRG_GSource
# kIRG_HSource
# kIRG_JSource
# kIRG_KPSource
# kIRG_KSource
# kIRG_MSource
# kIRG_SSource
# kIRG_TSource
# kIRG_UKSource
# kIRG_USource
# kIRG_VSource
# kRSUnicode
# kTotalStrokes
#
# For details on the file format, see http://www.unicode.org/reports/tr38/
#
U+3400 kIRG_GSource GKX-0078.01
U+3400 kIRG_JSource JA-2121
U+3400 kIRG_TSource T6-222C
U+3400 kRSUnicode 1.4
U+3400 kTotalStrokes 5
U+3401 kIRG_GSource G5-3024
U+3401 kIRG_KSource K3-2121
U+3401 kIRG_TSource T4-2224
U+3401 kRSUnicode 1.5
U+3401 kTotalStrokes 6
U+3402 kIRG_JSource JA3-2E23
U+3402 kRSUnicode 1.5
U+3402 kTotalStrokes 6
U+3403 kIRG_KSource K3-2122
U+3403 kRSUnicode 2.2
U+3403 kTotalStrokes 3
U+3404 kIRG_GSource GKX-0079.02
U+3404 kIRG_JSource JA-2123
U+3404 kIRG_TSource T6-2130
U+3404 kRSUnicode 2.2
U+3404 kTotalStrokes 3
U+3405 kIRG_GSource GKX-0081.18
U+3405 kIRG_JSource JA-2124
U+3405 kIRG_TSource T6-2123
U+3405 kRSUnicode 4.1
U+3405 kTotalStrokes 2
Posted on 2022-07-16 07:46:04
If only the lines that contain a tab character are valid, then this will work:
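import pandas as pd

# Keep only lines that contain a tab (assumes the preamble lines contain none).
with open('file.csv') as f:
    df = pd.DataFrame(line.strip().split('\t') for line in f if '\t' in line)

https://stackoverflow.com/questions/73002377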
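The question also asks whether the number of lines to skip could be found by fixing the number of columns. A minimal sketch of that idea follows. The helper name `count_preamble_lines` and the default of three fields are assumptions taken from the sample above, not part of pandas or of the answer.

```python
from itertools import takewhile


def count_preamble_lines(lines, sep="\t", n_fields=3):
    """Count the leading lines that do not split into exactly n_fields fields.

    Hypothetical helper: n_fields=3 is an assumption based on the sample rows.
    """
    def is_preamble(line):
        return len(line.rstrip("\n").split(sep)) != n_fields

    # takewhile stops at the first line that looks like a data row.
    return sum(1 for _ in takewhile(is_preamble, lines))


# Small in-memory stand-in for the real file.
sample = [
    "#\n",
    "# Unihan_IRGSources.txt\n",
    "#\n",
    "U+3400\tkIRG_GSource\tGKX-0078.01\n",
    "U+3400\tkTotalStrokes\t5\n",
]
skip = count_preamble_lines(sample)
```

With a real file, the open file object can be passed in place of `sample`, and the result handed to pandas as `pd.read_csv(path, sep="\t", skiprows=skip, header=None)`. Since every preamble line in the sample starts with `#`, `pd.read_csv(path, sep="\t", comment="#", header=None)` may be simpler, provided no data field contains a `#`.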