Records lost at the line data_in.read().replace("<*", "<").replace("*\n", "")

Stack Overflow user
Asked 2015-10-10 23:44:41
Answers: 1 · Views: 46 · Followers: 0 · Votes: 0

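After running the code below, I have been trying to figure out why 47 of the 700+ records are missing from the database. Please help me see whether this is a coding error or a memory limitation in Python.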

import csv
import re

from bs4 import BeautifulSoup


def create_csv_file():
    source_html = open(r'C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\EMA - Electricians (Raw).txt', 'r')
    bs_object = BeautifulSoup(source_html, "html.parser")

    data_out = open(r'C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\temp.csv', 'w+')
    data_in = open(r'C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\temp.csv', 'r')
    csv_file1 = open(r'C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\EMA - Electricians (Processed).csv', 'w+')
    csv_file2 = open(r'C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\EMA - Electricians (Processed).csv', 'r')
    csv_file3 = open(r'C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\EMA - Electricians (Processed).csv', 'w+')

    writer1 = csv.writer(data_out, delimiter='<', skipinitialspace=True)

    table = bs_object.find("table", {"id":"gasOfferSearch"})
    rows = table.findAll("tr")

    try:
        # Iterates through the list, but skips the first record (i.e. the table header)
        for row in rows[1:]:
            csvRow = []
            for cell in row.findAll(['td','th']):
                # Replace "\n" with a whitespace; replace <br> tags with 5 whitespaces
                line = str(cell).replace('\n', ' ').replace('<br>', '     ')
                # Replace runs of 2 or more whitespace characters with "*"
                line = re.sub(r'\s{2,}', '*', line)
                # Converts results to a BeautifulSoup object
                line_bsObj = BeautifulSoup(line, "html.parser")
                # Strips: Removes all tags and trailing and leading whitespaces
                # Replace: Removes all quotation marks
                csvRow.append(line_bsObj.get_text().strip().replace('"',''))

            # Converts the string into a csv file
            writer1.writerow(csvRow)

        # Reads from the temp file and replaces all "<*" with "<"
        # TODO: Issue - 47 records missing with replacement
        temp_string = data_in.read().replace("<*", "<").replace("*\n", "")
        csv_file1.write(temp_string)

        # Clear the temp_string variable
        temp_string = ""
        for line in csv_file2.readlines():
            temp_string += line.replace("*", "<", 1)

        csv_file3.write(temp_string)

    finally:
        source_html.close()
        csv_file1.close()
        csv_file2.close()
        data_out.close()
        data_in.close()

        # Remove the temp file
        # os.remove('C:\\Users\\Admin\\SkyDrive\\eCommerce\\Servi-fied\\Raw Data\\temp.csv')

    return None

1 Answer

Stack Overflow user

Answered 2015-10-11 08:17:34

I don't know exactly what is going wrong, but here is some general advice:

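  • Don't open the same file three times at once (csv_file[1,2,3] all point to the same file)
  • Add print statements to double-check what is happening:
    • put one before for row in rows that prints the total number of rows
    • put some around temp_string = data_in... to make sure the numbers are correct

  • If none of this reveals the problem, please post some sample records for us to look at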
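The counting suggested above can be tried without touching the real files. The following is only a minimal sketch under stated assumptions: the sample rows are made up, `write_rows` and `count_records` are hypothetical helpers that are not part of the question's code, and an in-memory buffer stands in for temp.csv. It counts records before and after the replacement. With these sample rows, one cell ends in the "*" placeholder, so `.replace("*\n", "")` also removes the line break and two records merge into one. Whether this is what happens with the real data has not been confirmed.

```python
import csv
import io


def write_rows(rows):
    """Write rows with the '<' delimiter into an in-memory buffer; return the text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="<", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def count_records(text):
    """Count the lines (records) in a block of CSV text."""
    return len(text.splitlines())


# Hypothetical sample rows; the last cell of the second row ends with "*"
rows = [
    ["*Alpha Electric", "*123 Main St"],
    ["*Beta Power", "*45 Side Rd*"],
    ["*Gamma Volt", "*9 Hill Ave"],
]

raw = write_rows(rows)
print("rows written:", count_records(raw))

# The replacement as posted in the question: "*\n" is replaced with nothing,
# so the newline disappears together with the trailing "*"
as_posted = raw.replace("<*", "<").replace("*\n", "")
print("after replace as posted:", count_records(as_posted))

# Variant that drops only the trailing "*" and keeps the line break
keeps_newline = raw.replace("<*", "<").replace("*\n", "\n")
print("after replace keeping newline:", count_records(keeps_newline))
```

If the counts on the real data already differ before the replacement, it may also be worth closing or flushing data_out before reading through data_in, because rows still held in the write buffer are not yet visible to a second handle on the same file.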
Votes: 0
Page content originally provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/33055782
