我正在用Python编写一些自动脚本来处理Excel文件,有些是XLS格式的。下面是我尝试使用Pandas的代码片段:
df = pd.read_excel(contents, engine='xlrd', skiprows=5, names=['some', 'column', 'headers'])contents是从AWS S3桶中提取的文件内容。当这一行运行时,我得到了[ERROR] ValueError: File is not a recognized excel file。
在排除此问题时,我尝试使用xlrd直接访问电子表格:
book = xlrd.open_workbook(file_contents=contents)
print("Number of worksheets is {}".format(book.nsheets))
print("Worksheet names: {}".format(book.sheet_names()))这是没有错误的,因此xlrd似乎将它识别为Excel文件,只是当Pandas要求它这样做时就不会这样做了。
有人知道为什么Pandas不以xlrd作为引擎来读取文件吗?或者有人能帮我把xlrd的工作表转换成Pandas dataframe吗?
发布于 2021-08-28 22:05:59
或者有人能帮我把xlrd的工作表转换成Pandas吗?
pd.read_excel可以拿一本书..。
import xlrd
book = xlrd.open_workbook(filename='./file_check/file.xls')
df = pd.read_excel(book, skiprows=5)
print(df)
some column headers
0 1 some foo
1 2 strings bar
2 3 here yes
3 4 too no如果您想检查/处理Excel文件类型,我将包括下面的代码。也许你可以根据你的需要来调整它。
代码循环遍历一个本地文件夹,并显示文件和扩展名,然后使用python-magic钻入其中。它也有一栏显示来自mimetypes的猜测,但这并不是很好。一定要放大帧的图像,并看到一些.xls并不是扩展程序说的那样。另外,.txt实际上是一个Excel文件。
import pandas as pd
import glob
import mimetypes
import os
# https://pypi.org/project/python-magic/
import magic
path = r'./file_check' # use your path
all_files = glob.glob(path + "/*.*")
data = []
for file in all_files:
name, extension = os.path.splitext(file)
data.append([file, extension, magic.from_file(file, mime=True), mimetypes.guess_type(file)[0]])
df = pd.DataFrame(data, columns=['Path', 'Extension', 'magic.from_file(file, mime=True)', 'mimetypes.guess_type'])
# del df['magic.from_file(file, mime=True)']
df

在那里,您可以根据文件的类型筛选文件:
xlsx_file_format = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
xls_file_format = 'application/vnd.ms-excel'
for file in all_files:
if magic.from_file(file, mime=True) == xlsx_file_format:
print('xlsx')
# DO SOMETHING SPECIAL WITH XLSX FILES
elif magic.from_file(file, mime=True) == xls_file_format:
print('xls')
# DO SOMETHING SPECIAL WITH XLS FILES
else:
continue
dfs = []
for file in all_files:
if (magic.from_file(file, mime=True) == xlsx_file_format) or \
(magic.from_file(file, mime=True) == xls_file_format):
# who cares, it all works with this for the demo...
df = pd.read_excel(file, skiprows=5, names=['some', 'column', 'headers'])
dfs.append(df)
print('\nHow many frames did we get from seven files? ', len(dfs))输出:
xlsx
xls
xls
xlsx
How many frames did we get from seven files? 4https://stackoverflow.com/questions/68967386
复制相似问题