Goal: write a screen scraper that loops through a set of web pages containing old and new prices, reads the prices, and writes them to a CSV file.
Approach: a configuration file, urls.txt, contains the list of pages. I open that file and loop through the URLs. For each URL, I use Beautiful Soup to extract the contents of any div with class 'current-price' or 'old-price'. Not every page has an old price, so I made that part conditional.
Problem: everything works, with one odd exception. Where the price is in dollars, both the price and the dollar sign come through. Where the price is in euros or pounds, the currency marks £ and € are being stripped out. I want the currency mark to come through in every case. I suspect this is an encoding problem. (The lstrip calls below remove some stray spaces and tabs that were coming through.)
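The stripping happens inside getPage(): the page bytes are decoded as UTF-8 and then re-encoded to ASCII with errors='ignore', which silently drops every non-ASCII character, including £ and €, while '$' survives because it is ASCII. A minimal demonstration of that round-trip (the price strings are invented for illustration):

```python
# Demonstrates why encode('ascii', 'ignore') drops currency signs.
price_usd = "$49.99"
price_gbp = "£39.99"
price_eur = "€44.99"

def ascii_squeeze(s):
    # Same round-trip as the question's getPage(): any character outside
    # ASCII is silently discarded by errors='ignore'.
    return s.encode('ascii', 'ignore').decode('ascii')

print(ascii_squeeze(price_usd))  # $49.99 -- '$' is ASCII, so it survives
print(ascii_squeeze(price_gbp))  # 39.99  -- '£' is not ASCII, so it is dropped
print(ascii_squeeze(price_eur))  # 44.99  -- '€' is not ASCII, so it is dropped
```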
Contents of urls.txt:
http://uk.norton.com/norton-security-for-one-device
http://uk.norton.com/norton-security-antivirus
http://uk.norton.com/norton-security-with-backup
http://us.norton.com/norton-security-for-one-device
http://us.norton.com/norton-security-antivirus
http://us.norton.com/norton-security-with-backup
http://ie.norton.com/norton-security-for-one-device
http://ie.norton.com/norton-security-antivirus
http://ie.norton.com/norton-security-with-backup
Python code:
###############################################
#
# PRICESCRAPE
# Screenscraper that checks prices on PD pages
#
###############################################
# Import the modules we need
import urllib.request
import re
import lxml
from lxml import etree
from lxml.html.soupparser import fromstring
from lxml.etree import tostring
from lxml.cssselect import CSSSelector
from bs4 import BeautifulSoup, NavigableString
# Open the files we need
out = open('out.csv', 'w')
urls=open('urls.txt','r')
# function to take a URL, open the HTML, and return it
def getPage(url):
    return urllib.request.urlopen(url).read().decode(encoding='UTF-8', errors='strict').encode('ascii', 'ignore')
out.write('URL,Current Price,Strikethrough Price\n')
#Loop through the URLs
for url in urls:
    print('\nExamining ' + url)
    url = url.rstrip('\n')
    html = getPage(url)
    soup = BeautifulSoup(html, 'lxml')
    currentPrice = soup.find('div', {'class': 'current-price'}).contents[0].lstrip('\n').lstrip(' ').lstrip('\t')
    oldPrice = soup.find('div', {'class': 'old-price'}).contents[0].lstrip(' ')
    out.write(url)
    out.write(',')
    out.write(str(currentPrice))
    out.write(',')
    if oldPrice:
        out.write(str(oldPrice))
    else:
        out.write('No strikethrough price')
    out.write('\n')
    if html == '':
        print('Problem reading page')
print('Done. See out.csv for output')
out.close()
urls.close()
Posted on 2016-01-10 20:23:13
I would use two modules to make this work and to make the code simpler:
- csv to export the results into the CSV output file
- requests to make the encoding part transparent to you
If you add import requests and replace the getPage implementation with:
def getPage(url):
    return requests.get(url).content
you will also get the prices with the euro and pound signs.
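Putting the answer's two suggestions together, here is a sketch of the whole scraper using requests for fetching and the stdlib csv module for output. It keeps the question's own selectors ('current-price' / 'old-price') but uses the stdlib 'html.parser' so the example does not depend on lxml; treat it as an illustration, not the poster's exact code.

```python
import csv
import requests
from bs4 import BeautifulSoup

def get_page(url):
    # requests decodes the body using the page's declared charset,
    # so '£' and '€' survive intact.
    return requests.get(url).text

def extract_prices(html):
    # Pull the two price divs the question targets; the old price may be absent.
    soup = BeautifulSoup(html, 'html.parser')
    current = soup.find('div', {'class': 'current-price'})
    old = soup.find('div', {'class': 'old-price'})
    return (
        current.get_text(strip=True) if current else '',
        old.get_text(strip=True) if old else 'No strikethrough price',
    )

def main():
    with open('urls.txt') as urls, open('out.csv', 'w', newline='') as out:
        writer = csv.writer(out)  # csv handles quoting and commas for us
        writer.writerow(['URL', 'Current Price', 'Strikethrough Price'])
        for url in urls:
            url = url.strip()
            cur, old = extract_prices(get_page(url))
            writer.writerow([url, cur, old])

if __name__ == '__main__':
    main()
```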
https://stackoverflow.com/questions/34710320