
Web scraping HTML ->

Stack Overflow user
Asked 2019-07-27 04:50:43
Answers: 1 · Views: 142 · Followers: 0 · Votes: 0

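I'm new to Python myself. For practice, I'm trying to scrape some data from a website. Digging into the site's HTML/CSS showed me that this isn't so simple, because most of the divs and other elements have no class or ID.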

<table class="trade-list-table max-width">
<thead>
</thead>

<tbody>

<tr class="cursor-pointer" data-on-click-link="/pc/Trade/Detail/313809613" data-on-click-link-action="NewWindow" data-toggle="tooltip" data-original-title="" title="">
<td>
<img class="trade-item-icon item-quality-legendary" alt="Icon" src="./Search Result - Tamriel Trade Centre_files/crafting_outfitter_potion_014.png" data-original-title="" title="">
<div class="item-quality-legendary">
XXSTRING1XX
</div>
<div>
Level:
<img class="small-icon" src="./Search Result - Tamriel Trade Centre_files/nonvet.png">
XXSTRING2XX
</div>
</td>

<td class="hidden-xs">
<div class="text-small-width                     text-danger">
XXSTRING3XX
</div>
</td>

<td class="hidden-xs">
<div>
XXSTRING4XX
</div>
<div>
XXSTRING5XX
</div>
</td>

<td class="gold-amount bold">
<img class="small-icon" src="./Search Result - Tamriel Trade Centre_files/gold.png">
XXSTRING6XX
<div class="text-danger">
X
</div>
<img class="small-icon" src="./Search Result - Tamriel Trade Centre_files/amount.png">
XXSTRING7XX
<div class="text-danger">
=
</div>
<img class="small-icon" src="./Search Result - Tamriel Trade Centre_files/gold.png">
54,999
</td>

<td class="bold hidden-xs" data-mins-elapsed="2">Now</td>
</tr>

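I've tried many approaches and have been struggling with this for the past 7 days. I need to print XXSTRING1XX through XXSTRING7XX so that I can write them to a .csv file or something similar.

The difficulty I keep running into is that most of the divs have no specific class, so in most cases I can't get the strings back.

I've been using Python with requests and BeautifulSoup from bs4.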

import requests
from bs4 import BeautifulSoup

page = requests.get('https://eu.tamrieltradecentre.com/pc/Trade/SearchResult?ItemID=211&SearchType=Sell&ItemNamePattern=Dreugh+Wax&ItemCategory1ID=&ItemCategory2ID=&ItemCategory3ID=&ItemTraitID=&ItemQualityID=&IsChampionPoint=false&LevelMin=&LevelMax=&MasterWritVoucherMin=&MasterWritVoucherMax=&AmountMin=&AmountMax=&PriceMin=&PriceMax=')
soup = BeautifulSoup(page.content, 'html.parser')

container = soup.find(class_="trade-list-table max-width")

itembox = container.find_all(class_="cursor-pointer")

item = itembox[0]

# Select all table rows and first TD
tr = container.find_all(class_="cursor-pointer")
tr1 = tr[0].find_all('td')

# Itemname
itemname = item.find('div', class_="item-quality-legendary").get_text()
print (itemname)

# Itemlevel + level type
# Tradername
# Location
# Guild name
# Unit price
# Quantity
# Total price
# Timestamp?

1 Answer

Stack Overflow user

Answered 2019-07-27 08:45:36

Edit: since you want to look up specific strings taken from some data source, for example a text file containing the unknown strings, then:

file.txt

some
unknown
strings
to
look
for
...

bs.py

import re

import requests
from bs4 import BeautifulSoup

filename = 'file.txt'  # file containing unknown strings
with open(filename, 'r') as f:  # open file
    # one target per line, blank lines skipped:
    # ['some','unknown','strings','to','look','for',...]
    data = [line.strip() for line in f if line.strip()]

src = requests.get(...)  # fetch the page (URL elided here)
soup = BeautifulSoup(src.content, 'html.parser')  # parse the response body, not the Response object
results = []

for target in data:
    # re.escape treats each target as literal text rather than as a regex pattern
    result = soup.find_all(string=re.compile(re.escape(target)))  # look at documentation for other functionalities!
    if result:  # if any results are found
        for string in result:
            string = string.split()  # cleanup: split into whitespace-separated tokens
            results.append(string)
    else:  # no results found
        results.append(result)  # keeps an empty list as a marker for this target
print(results)  # do something

This should give you a rough idea of what to do. If you're still unsure, check the BS4 documentation.
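If the target strings are not known in advance, another option is to pick the values out by position inside each row rather than by class. The following is only a minimal sketch: it assumes the HTML returned by requests has the same layout as the row posted in the question (the sample below is a trimmed copy of it), and the field names in `FIELDS` are hypothetical labels guessed from the comments in the question's code, not names used by the site.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed copy of the row from the question; the placeholders stand in for real values.
SAMPLE_HTML = """
<table class="trade-list-table max-width">
<tbody>
<tr class="cursor-pointer">
<td>
<img class="trade-item-icon item-quality-legendary" alt="Icon" src="icon.png">
<div class="item-quality-legendary">
XXSTRING1XX
</div>
<div>
Level:
<img class="small-icon" src="nonvet.png">
XXSTRING2XX
</div>
</td>
<td class="hidden-xs">
<div class="text-small-width text-danger">
XXSTRING3XX
</div>
</td>
<td class="hidden-xs">
<div>
XXSTRING4XX
</div>
<div>
XXSTRING5XX
</div>
</td>
<td class="gold-amount bold">
<img class="small-icon" src="gold.png">
XXSTRING6XX
<div class="text-danger">
X
</div>
<img class="small-icon" src="amount.png">
XXSTRING7XX
<div class="text-danger">
=
</div>
<img class="small-icon" src="gold.png">
54,999
</td>
<td class="bold hidden-xs" data-mins-elapsed="2">Now</td>
</tr>
</tbody>
</table>
"""

# Hypothetical column names, inferred from the comments in the question's code.
FIELDS = ["item", "level", "trader", "location", "guild",
          "unit_price", "quantity", "total_price", "last_seen"]


def parse_row(tr):
    """Return the text values of one result row as a dict keyed by FIELDS."""
    cells = tr.find_all("td")
    # stripped_strings yields each non-empty text node with whitespace trimmed,
    # so it works even when the enclosing divs have no class or id.
    item_texts = list(cells[0].stripped_strings)   # name, 'Level:', level
    place_texts = list(cells[2].stripped_strings)  # two unlabelled divs
    price_texts = list(cells[3].stripped_strings)  # unit, 'X', quantity, '=', total
    values = [
        item_texts[0], item_texts[-1],
        cells[1].get_text(strip=True),
        place_texts[0], place_texts[1],
        price_texts[0], price_texts[2], price_texts[4],
        cells[4].get_text(strip=True),
    ]
    return dict(zip(FIELDS, values))


def write_csv(rows, fh):
    """Write the parsed rows to an open text file handle as CSV with a header."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [parse_row(tr) for tr in soup.find_all("tr", class_="cursor-pointer")]

buffer = io.StringIO()  # use open("trades.csv", "w", newline="") to write a real file
write_csv(rows, buffer)
print(buffer.getvalue())
```

To run this against the live page, replace `SAMPLE_HTML` with `page.content` from the question's `requests.get(...)` call. Positional indexing breaks if a row has a different number of cells or text nodes, so it is worth checking the length of `cells` before trusting the result.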

Votes: 0
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/57226861
