Q: Read files from a folder and append them to an xlsx file

Stack Overflow user

Asked 2020-03-04 09:54:38

Answers 1 · Views 19 · Followers 0 · Votes 0

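I have a folder containing several hundred files, and it grows every hour. I am trying to merge all of the data into a single file for analysis. However, the script I wrote is not efficient: on every run it reads everything in the folder and appends it to one xlsx file, so processing takes too long. I would like to improve the script so that it can: 1) read and extract data only from new files that have not been read before, and 2) append that data to update the xlsx file.

I just need some pointers to help me improve the script.

Part of my code is shown below: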

import glob
import time

import pandas as pd
import schedule


def read_one(file):
    # Read one pipe-delimited CSV, transpose it, and tag every row with
    # metadata parsed from the file path (regexes kept from the original).
    df = pd.read_csv(file, delimiter="|", index_col=False).T
    df['Path'] = file
    df['Machine No'] = df["Path"].str.extract(r'(\D{3}\d{2})', expand=False)
    df['Process'] = df["Path"].str.extract(r'(\w{8})', expand=False)
    df['Before/After'] = df["Path"].str.extract(r'([B][E][F][O][R][E])', expand=False)
    df['Wafer Size'] = df["Path"].str.extract(r'(\d\_\D{6})', expand=False)
    df['Employee ID'] = df["Path"].str.extract(r'(?<!\d)(\d{6})(?!\d)', expand=False)
    df['Date'] = df["Path"].str.extract(r'(\d{4}_\d{2}_\d{2})', expand=False)
    df['Lot Number'] = df["Path"].str.extract(r'(\d{7}\D\d)', expand=False)
    df['Part Number'] = df["Path"].str.extract(r'([A-Z]{2,3}\d{3,4}[A-Z][A-Z]\d{2,4}[A-Z])', expand=False)
    # Paths without a part number are labelled with the original placeholder
    df['Part Number'] = df['Part Number'].fillna("ENGINNERING SAMPLE")
    return df


def job():
    # Folder that receives the CSV files
    path = r'V:\DB\ABCD\BEFORE\8_INCHES'
    files = glob.glob(path + "/*.csv")
    if not files:
        return  # nothing to export yet

    # Apply the same extraction to every file, then combine the frames once
    # (replaces the first-file special case and the removed DataFrame.append)
    df = pd.concat([read_one(file) for file in files])

    # Write the workbook once, after all files have been read
    # (the original wrote inside the loop and referenced an undefined name `rf`)
    df.to_excel(r'C:\Users\hoosk\Documents\Python Scripts\hoosk\test26_feb_2020.xlsx')


# Schedule the job to run every hour
schedule.every(1).hour.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)

python

data-manipulation

1 Answer

Stack Overflow user

Answered 2020-03-04 10:09:57

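In general terms, you will want to do the following:

1. At the beginning of the script, read the existing xlsx file.
2. Extract the set of all file names that have already been processed (the Path column).
3. For each file you iterate over, check whether it is in that set of processed files, and skip it if it is.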
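The steps above can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the helper names `load_processed_paths` and `select_new_files` are hypothetical, and it assumes the workbook keeps the `Path` column that the question's script writes and that an Excel engine such as openpyxl is installed for `pd.read_excel`.

```python
import os


def load_processed_paths(xlsx_path, column="Path"):
    """Return the set of file paths already stored in the workbook."""
    if not os.path.exists(xlsx_path):
        return set()  # first run: nothing has been processed yet

    # Imported here so the filtering helper below works without pandas.
    import pandas as pd

    # Only the path column is needed to know which files were seen.
    existing = pd.read_excel(xlsx_path, usecols=[column])
    return set(existing[column].dropna().astype(str))


def select_new_files(candidates, processed):
    """Keep only the files whose path is not in the processed set."""
    return [f for f in candidates if f not in processed]
```

Inside `job()`, one would then parse only `select_new_files(files, load_processed_paths(output_path))`, concatenate those rows with the existing sheet, and write the result back. The paths must be built the same way on every run (same folder string and glob pattern) for the string comparison to match.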

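This assumes that the contents of existing files are never updated. If that can happen, you may want to track an indicator such as the last-modified date (a checksum is the most reliable, but may be too expensive to compute).

Votes 0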
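If existing files can change, a hedged sketch of that idea is to store a fingerprint per path and reprocess a file whenever its fingerprint differs. The names `fingerprint` and `changed_files` and the `seen` mapping are assumptions made for illustration; where the mapping is persisted (an extra column in the workbook, a side file) is left open.

```python
import hashlib
import os


def fingerprint(path, use_checksum=False):
    """Return a string that changes when the file changes."""
    if not use_checksum:
        # Cheap: last-modified time in nanoseconds.
        return str(os.stat(path).st_mtime_ns)
    # More reliable but slower: hash the file contents in 1 MiB chunks.
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(paths, seen, use_checksum=False):
    """Return {path: fingerprint} for files that are new or modified.

    `seen` maps previously processed paths to their stored fingerprints.
    """
    changed = {}
    for path in paths:
        fp = fingerprint(path, use_checksum)
        if seen.get(path) != fp:
            changed[path] = fp
    return changed
```

After a successful run, merging the returned dictionary into `seen` records the new state. The modification time is enough when files are only ever rewritten by normal saves; the checksum variant guards against tools that preserve timestamps, at the cost of reading every file on every run.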

Original page content provided by Stack Overflow; translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/60518016
