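Scraping a table from the web gets complicated when a single cell holds two or more values. To preserve the table structure, I devised a way to count the row index in its xpath and build a nested list for as long as the row number stays the same.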
def get_structured_elements(name):
    """For target data that is nested and structured,
    such as a table with multiple values in a single cell."""
    driver = self.driver
    i = 2  # keep track of 'i' to retain the document structure.
    number_of_items = number_of_items_found()
    elements = [None] * number_of_items  # len(elements) will exceed number_of_items.
    target_data = driver.find_elements("//table/tbody/tr[" + i + "]/td[2]/a")
    while i - 2 < number_of_items:
        for item in target_data:
            # print(item.text, i-1)
            if elements[i - 2] == None:
                elements[i - 2] = item.text  # set to item.text value if position is empty.
            else:
                elements[i - 2] = [elements[i - 2]]
                elements[i - 2].append(item.text)  # make nested list and append new value if position is occupied.
        i += 1
    return elements

This simple logic worked fine until I tried to manage all locator variables in one place to make the code more reusable: how do I store the expression "//table/tbody/tr[" + i + "]/td[2]/a" in a list or dictionary so that it still works when it is plugged in?
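The solution (i.e. hack) I came up with is a function that takes the front and back halves of the iterated xpath as arguments and returns front_half + str(i) + back_half if i is among the local variables of the parent (iterator) function.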
def split_xpath_at_i(front_half, back_half):
    """Splits xpath string at its counter index.
    The 'else' part is to avoid errors
    when this function is called outside an indexed environment. """
    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half
    return string

xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"),
         "//table/tbody/tr/td[3]/a[1]"
         ]

def xpath_index_iterator():
    for i in range(10):
        print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))

xpath_index_iterator()
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
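# //table/tbody/tr[SPLIT_i]/td[2]/a

The problem is that split_xpath_at_i is blind to variables in its immediate environment. What I eventually came up with is to use an attribute of the iterator function to hold the counter i, so that the variable can be made available to split_xpath_at_i like this: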
def split_xpath_at_i(front_half, back_half):
    """Splits xpath string at its counter index.
    The 'else' part is to avoid errors
    when this function is called outside an indexed environment. """
    try:
        i = xpath_index_iterator.i
    except:
        pass
    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half
    return string

xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"),
         "//table/tbody/tr/td[3]/a[1]"
         ]

def xpath_index_iterator():
    xpath_index_iterator.i = 0
    lst = []
    for xpath_index_iterator.i in range(10):
        print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))

xpath_index_iterator()
# //table/tbody/tr[0]/td[2]/a
# //table/tbody/tr[1]/td[2]/a
# //table/tbody/tr[2]/td[2]/a
# //table/tbody/tr[3]/td[2]/a
# //table/tbody/tr[4]/td[2]/a
# //table/tbody/tr[5]/td[2]/a
# //table/tbody/tr[6]/td[2]/a
# //table/tbody/tr[7]/td[2]/a
# //table/tbody/tr[8]/td[2]/a
# //table/tbody/tr[9]/td[2]/a

The problem got more complicated when I tried to call split_xpath_at_i through a list of locators:
def split_xpath_at_i(front_half, back_half):
    """Splits xpath string at its counter index.
    The 'else' part is to avoid errors
    when this function is called outside an indexed environment. """
    try:
        i = xpath_index_iterator.i
    except:
        pass
    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half
    return string

xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"),
         "//table/tbody/tr/td[3]/a[1]"
         ]

def xpath_index_iterator():
    xpath_index_iterator.i = 0
    lst = []
    for xpath_index_iterator.i in range(10):
        # print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))
        lst.append(xpath[0])
    return lst

xpath_index_iterator()
# ['//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a']

What would a professional way of solving this problem look like?
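The code below is adapted from the Selenium documentation.
I asked a related question here, which concerns the general approach to page object design.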
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from query import Input
import page
cnki = Input()
driver = cnki.webpage('http://big5.oversea.cnki.net/kns55/')
current_page = page.MainPage(driver)
current_page.submit_search('禮學')
current_page.switch_to_frame()
result = page.SearchResults(driver)
structured = result.get_structured_elements('titles') # I couldn't get this to work.
simple = result.simple_get_structured_elements()  # but this works fine.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium import webdriver
class Input:
    """This class provides a wrapper around actual working code."""
    # CONSTANTS
    URL = None

    def __init__(self):
        self.driver = webdriver.Chrome

    def webpage(self, url):
        driver = self.driver()
        driver.get(url)
        return driver

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from element import BasePageElement
from locators import InputLocators, OutputLocators
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
class SearchTextElement(BasePageElement):
    """This class gets the search text from the specified locator"""
    # The locator for search box where search string is entered
    locator = None

class BasePage:
    """Base class to initialize the base page that will be called from all
    pages"""
    def __init__(self, driver):
        self.driver = driver
class MainPage(BasePage):
    """Home page action methods come here. I.e. Python.org"""
    search_keyword = SearchTextElement()

    def submit_search(self, keyword):
        """Submits keyword and triggers the search"""
        SearchTextElement.locator = InputLocators.SEARCH_FIELD
        self.search_keyword = keyword

    def select_dropdown_item(self, item):
        driver = self.driver
        by, val = InputLocators.SEARCH_ATTR
        driver.find_element(by, val + "/option[text()='" + item + "']").click()

    def click_search_button(self):
        driver = self.driver
        element = driver.find_element(*InputLocators.SEARCH_BUTTON)
        element.click()

    def switch_to_frame(self):
        """Use this function to get access to hidden elements. """
        driver = self.driver
        driver.switch_to.default_content()
        driver.switch_to.frame('iframeResult')

    # Maximize the number of items on display in the search results.
    def max_content(self):
        driver = self.driver
        max_content = driver.find_element_by_css_selector('#id_grid_display_num > a:nth-child(3)')
        max_content.click()

    def stop_loading_page_when_element_is_present(self, locator):
        driver = self.driver
        ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
        wait = WebDriverWait(driver, 30, ignored_exceptions=ignored_exceptions)
        wait.until(
            EC.presence_of_element_located(locator))
        driver.execute_script("window.stop();")

    def next_page(self):
        driver = self.driver
        self.stop_loading_page_when_element_is_present(InputLocators.NEXT_PAGE)
        driver.execute_script("window.stop();")
        try:
            driver.find_element(*InputLocators.NEXT_PAGE).click()
            print("Navigating to Next Page")
        except (TimeoutException, WebDriverException):
            print("Last page reached")
class SearchResults(BasePage):
    """Search results page action methods come here"""
    def __init__(self, driver):
        self.driver = driver
        i = None  # get_structured_element counter

    def wait_for_page_to_load(self):
        driver = self.driver
        wait = WebDriverWait(driver, 100)
        wait.until(
            EC.presence_of_element_located(*InputLocators.MAIN_BODY))

    def get_single_element(self, name):
        """Returns a single value as target data."""
        driver = self.driver
        target_data = driver.find_element(*OutputLocators.CNKI[str(name.upper())])
        # SearchTextElement.locator = OutputLocators.CNKI[str(name.upper())]
        # target_data = SearchTextElement()
        return target_data

    def number_of_items_found(self):
        """Return the number of items found on a single page."""
        driver = self.driver
        target_data = driver.find_elements(*OutputLocators.CNKI['INDEX'])
        return len(target_data)

    def get_elements(self, name):
        """Returns simple list of values in specific data field in a table."""
        driver = self.driver
        target_data = driver.find_elements(*OutputLocators.CNKI[str(name.upper())])
        elements = []
        for item in target_data:
            elements.append(item.text)
        return elements

    def get_structured_elements(self, name):
        """For target data that is nested and structured,
        such as a table with multiple values in a single cell."""
        driver = self.driver
        i = 2  # keep track of 'i' to retain the document structure.
        number_of_items = self.number_of_items_found()
        elements = [None] * number_of_items
        while i - 2 < number_of_items:
            target_data = driver.find_elements(*OutputLocators.CNKI[str(name.upper())])
            for item in target_data:
                print(item.text, i - 1)
                if elements[i - 2] == None:
                    elements[i - 2] = item.text
                elif isinstance(elements[i - 2], list):
                    elements[i - 2].append(item.text)
                else:
                    elements[i - 2] = [elements[i - 2]]
                    elements[i - 2].append(item.text)
            i += 1
        return elements

    def simple_get_structured_elements(self):
        """Simple structured elements code with fixed xpath."""
        driver = self.driver
        i = 2  # keep track of 'i' to retain the document structure.
        number_of_items = self.number_of_items_found()
        elements = [None] * number_of_items
        while i - 2 < number_of_items:
            target_data = driver.find_elements_by_xpath\
                ('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr['\
                + str(i) + ']/td[2]/a')
            for item in target_data:
                print(item.text, i-1)
                if elements[i - 2] == None:
                    elements[i - 2] = item.text
                elif isinstance(elements[i - 2], list):
                    elements[i - 2].append(item.text)
                else:
                    elements[i - 2] = [elements[i - 2]]
                    elements[i - 2].append(item.text)
            i += 1
        return elements

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium.webdriver.support.ui import WebDriverWait
class BasePageElement():
    """Base page class that is initialized on every page object class."""

    def __set__(self, obj, value):
        """Sets the text to the value supplied"""
        driver = obj.driver
        text_field = WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element(*self.locator))
        text_field.clear()
        text_field.send_keys(value)
        text_field.submit()

    def __get__(self, obj, owner):
        """Gets the text of the specified object"""
        driver = obj.driver
        WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element(*self.locator))
        element = driver.find_element(*self.locator)
        return element.get_attribute("value")

This is where split_xpath_at_i lives.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
# import page
class InputLocators():
    """A class for main page locators. All main page locators should come here"""
    def dropdown_list_xpath(attribute, value):
        string = "//select[@" + attribute + "='" + value + "']"
        return string

    MAIN_BODY = (By.XPATH, '//GridTableContent/tbody')
    SEARCH_FIELD = (By.NAME, 'txt_1_value1')  # (By.ID, 'search-content-box')
    SEARCH_ATTR = (By.XPATH, dropdown_list_xpath('name', 'txt_1_sel'))
    SEARCH_BUTTON = (By.ID, 'btnSearch')
    NEXT_PAGE = (By.LINK_TEXT, "下頁")

class OutputLocators():
    """A class for search results locators. All search results locators should
    come here"""
    def split_xpath_at_i(front_half, back_half):
        # try:
        #     i = page.SearchResults.g_s_elem
        # except:
        #     pass
        if 'i' in locals():
            string = front_half + str(i) + back_half
        else:
            string = front_half+"SPLIT_i"+back_half
        return string

    CNKI = {
        "TITLES": (By.XPATH, split_xpath_at_i('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr[', ']/td[2]/a')),
        "AUTHORS": (By.XPATH, split_xpath_at_i('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr[', ']/td[3]/a')),
        "JOURNALS": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[4]/a',
        "YEAR_ISSUE": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[5]/a',
        "DOWNLOAD_PATHS": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[1]',
        "INDEX": (By.XPATH, '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[2]')
    }
# # Interim Data
# CAPTIONS =
# LINKS =
# Target Data
# TITLES =
# AUTHORS =
# JOURNALS =
# VOL =
# ISSUE =
# DATES =
# DOWNLOAD_PATHS =

Posted on 2021-06-08 01:17:27
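First: I would usually recommend replacing the use of Selenium with direct requests calls. Where that is possible, it is far more efficient than Selenium. As a very rough start, it would look like the following: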
from time import time
from typing import Iterable
from urllib.parse import quote

from requests import Session


def js_encode(u: str) -> Iterable[str]:
    for char in u:
        code = ord(char)
        if code < 128:
            yield quote(char).lower()
        else:
            yield f'%u{code:04x}'


def search(query: str):
    topic = '主题'

    # China Academic Literature Online Publishing Database
    catalog = '中国学术文献网络出版总库'

    databases = (
        '中国期刊全文数据库,'  # China Academic Journals Full-text Database
        '中国博士学位论文全文数据库,'  # China Doctoral Dissertation Full-text Database
        '中国优秀硕士学位论文全文数据库,'  # China Master's Thesis Full-text Database
        '中国重要会议论文全文数据库,'  # China Proceedings of Conference Full-text Database
        '国际会议论文全文数据库,'  # International Proceedings of Conference Full-text Database
        '中国重要报纸全文数据库,'  # China Core Newspapers Full-text Database
        '中国年鉴网络出版总库'  # China Yearbook Full-text Database
    )

    with Session() as session:
        session.headers = {
            'Accept':
                'text/html,'
                'application/xhtml+xml,'
                'application/xml;q=0.9,'
                'image/webp,'
                '*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-CA,en-GB;q=0.8,en;q=0.5,en-US;q=0.3',
            'Cache-Control': 'no-cache',
            'Connection': 'keep-alive',
            'DNT': '1',
            'Host': 'big5.oversea.cnki.net',
            'Pragma': 'no-cache',
            'Sec-GPC': '1',
            'User-Agent':
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) '
                'Gecko/20100101 '
                'Firefox/89.0',
            'Upgrade-Insecure-Requests': '1',
        }

        with session.get(
            'https://big5.oversea.cnki.net/kns55/brief/result.aspx',
            params={
                'txt_1_value1': query,
                'txt_1_sel': topic,
                'dbPrefix': 'SCDB',
                'db_opt': catalog,
                'db_value': databases,
                'search-action': 'brief/result.aspx',
            },
        ) as response:
            response.raise_for_status()
            search_url = response.url
            search_page = response.text

        encoded_query = ''.join(js_encode(',' + query))

        # epoch milliseconds
        timestamp = round(time()*1000)

        # page_params = {
        #     'curpage': 1,
        #     'RecordsPerPage': 20,
        #     'QueryID': 0,
        #     'ID': '',
        #     'turnpage': 1,
        #     'tpagemode': 'L',
        #     'Fields': '',
        #     'DisplayMode': 'listmode',
        #     'sKuaKuID': 0,
        # }

        with session.get(
            'http://big5.oversea.cnki.net/kns55/brief/brief.aspx',
            params={
                'pagename': 'ASP.brief_result_aspx',
                'dbPrefix': 'SCDB',
                'dbCatalog': catalog,
                'ConfigFile': 'SCDB.xml',
                'research': 'off',
                't': timestamp,
            },
            cookies={
                'FileNameS': quote('cnki:'),
                'KNS_DisplayModel': '',
                'CurTop10KeyWord': encoded_query,
                'RsPerPage': '20',
            },
            headers={
                'Referer': search_url,
            }
        ) as response:
            response.raise_for_status()
            results_iframe = response.text


def main():
    etiquette = '禮學'
    search(query=etiquette)


if __name__ == '__main__':
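    main()

Unfortunately, this website is very badly designed. State is passed around using a mix of query parameters, cookies and server-only context that you cannot see and that depends on request history in a non-trivial way. So even though, as far as I can tell, the parameters, headers and cookies produced above are identical to those seen in real traffic on the website, the request still fails: several dynamically generated sections are missing from the brief.aspx response. I am therefore abandoning this suggestion.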
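Shifting gears:
The suggestions below cover scope and class usage, and should get you closer to sanity:
- Move the code in test.py into functions.
- test.py should have a shebang and no other file should, because only test.py is a meaningful entry point.
- Is Input.URL ever used? It can probably be deleted.
- Input.webpage shouldn't return anything; driver is already a member of the class.
- Input as a whole is suspect. It offers a very thin wrapper around driver that is basically useless on its own. I would expect driver.get() to be moved to MainPage.__init__.
- InputLocators doesn't deserve to be a class either. Those constants can basically be distributed to their points of use, i.e.
    wait.until(
        EC.presence_of_element_located(
            (By.XPATH, '//GridTableContent/tbody')
        )
    )
- search_keyword is strange: it is first initialized as a static, and then used as an instance variable in submit_search. Why?
- Also, what is keyword? You would benefit from using PEP484 type hints.
- switch_to_frame had timing issues until I added two waits:
    WebDriverWait(driver, 100).until(
        lambda driver: driver.find_element(By.XPATH, '//iframe'))
    driver.switch_to.frame('iframeResult')
    WebDriverWait(driver, 100).until(
        lambda driver: driver.find_element(By.XPATH, '//table'))
- OutputLocators.CNKI is a dictionary. Why? get_single_element indexes into it, but get_single_element itself is never called.
- This code: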
elements = []
for item in target_data:
    elements.append(item.text)
return elements
can be replaced with a generator:
for item in target_data:
    yield item.text
This code:
i = None  # get_structured_element counter
does nothing, because all local variables are discarded at the end of their scope.
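A minimal sketch of the difference, using two hypothetical classes that are not part of the original code: a bare assignment inside __init__ creates a local name that vanishes when the method returns, whereas assigning through self stores the value on the instance.

```python
class LocalCounter:
    def __init__(self):
        i = None  # local name only; gone once __init__ returns


class InstanceCounter:
    def __init__(self):
        self.i = None  # stored on the instance and visible to other methods


print(hasattr(LocalCounter(), 'i'))     # False
print(hasattr(InstanceCounter(), 'i'))  # True
```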
This code:
if 'i' in locals():
    string = front_half + str(i) + back_half
else:
    string = front_half+"SPLIT_i"+back_half
will never see its first branch evaluated, because i is not defined locally. I really don't know what you were trying to do here.
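To illustrate why that branch is unreachable, here is a small sketch with hypothetical function names (helper, caller, helper_with_param): locals() only reports names bound inside the function that calls it, so a variable in the calling function is invisible, while a parameter counts as a local.

```python
def helper():
    # locals() here only contains names bound inside helper itself
    return 'i' in locals()


def caller():
    i = 5  # local to caller; not part of helper's locals()
    return helper()


def helper_with_param(i):
    # a parameter is a local name, so the check succeeds
    return 'i' in locals()


print(caller())              # False
print(helper_with_param(5))  # True
```

Passing the index as an argument, as in helper_with_param, is the plain way to make it available to the callee.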
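These long xpath tree traversals, such as
'//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[2]'
are both fragile and difficult to read. In most cases you should be able to condense them through a mix of inner // to omit sections of the path, and judicious references to known attributes.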
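As a rough illustration of that kind of condensing, the sketch below runs a long and a short path against a hypothetical, heavily simplified stand-in for the results page. It uses the limited XPath subset of the standard library's xml.etree.ElementTree rather than Selenium, which is why the paths start with .// instead of //. The class="results" attribute is an assumption made for the example; the real page may not offer such a hook.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is far larger.
PAGE = """
<html><body>
  <form id="Form1">
    <table><tbody>
      <tr><td>header</td></tr>
      <tr><td>
        <table class="results"><tbody>
          <tr><td>1</td><td><a>Title A</a></td></tr>
          <tr><td>2</td><td><a>Title B</a><a>Title C</a></td></tr>
        </tbody></table>
      </td></tr>
    </tbody></table>
  </form>
</body></html>
"""
root = ET.fromstring(PAGE.strip())

# Every intermediate step spelled out, as in the original locators.
long_path = ".//*[@id='Form1']/table/tbody/tr[2]/td/table/tbody/tr/td[2]/a"
# Anchored on a known attribute, with an inner // skipping tbody/tr.
short_path = ".//table[@class='results']//td[2]/a"

long_hits = [a.text for a in root.findall(long_path)]
short_hits = [a.text for a in root.findall(short_path)]
print(long_hits)
print(short_hits)
```

Both paths select the same three links here, but the short one keeps working if wrapper rows or tables are added above the results table.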
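You specifically ask:
split_xpath_at_i is blind to variables in its immediate environment.
If by "its immediate environment" you mean CNKI (etc.), that is because its immediate environment, the class static scope, has not been initialized yet. CNKI can get a reference to it, but not the reverse. If you want it to have some state such as a counter, you need to promote it to an instance method with a self parameter. I don't know how g_s_elem factors into this, since it is not defined anywhere.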
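One way to sidestep the scoping question altogether, sketched under assumptions rather than taken from the original code, is to store each xpath as a template with a named placeholder and fill in the row number at the point of use. The names ROW_TEMPLATES and row_locator are hypothetical, and the string "xpath" stands in for Selenium's By.XPATH so that the sketch runs without a browser.

```python
from typing import Tuple

XPATH = "xpath"  # stand-in for selenium.webdriver.common.by.By.XPATH

TABLE = '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody'

# Hypothetical templates: {row} is filled in only when a locator is needed.
ROW_TEMPLATES = {
    "TITLES": TABLE + "/tr[{row}]/td[2]/a",
    "AUTHORS": TABLE + "/tr[{row}]/td[3]/a",
}


def row_locator(name: str, row: int) -> Tuple[str, str]:
    """Build a (by, value) locator for one table row."""
    return XPATH, ROW_TEMPLATES[name.upper()].format(row=row)


for row in range(2, 5):
    print(row_locator("titles", row))
```

With a real driver, this could be called as driver.find_elements(*row_locator('titles', i)) inside the loop, so the row index is an ordinary argument instead of hidden state, and nothing is baked into the string when the dictionary is first built.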
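You ask:
The SearchTextElement class with only one hard-coded locator variable: is this a good approach?
Not really. For one, you are again conflating static and instance variables, since you first initialize a static variable to None and then write an instance variable after construction. And why construct a class at all if it contains only one member and no methods?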
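If the descriptor is kept at all, one possible arrangement (a sketch, not necessarily what this review has in mind) is to hand the locator to the descriptor's constructor so that it never has to be patched onto the class later. TextField, FakeElement and FakeDriver are hypothetical names; the fakes only imitate the few Selenium calls used here, and the waits and submit() from the original are omitted for brevity.

```python
class TextField:
    """Descriptor that receives its locator at construction time."""

    def __init__(self, locator):
        self.locator = locator  # fixed per field, never reassigned later

    def __get__(self, obj, owner=None):
        if obj is None:
            return self  # accessed on the class itself
        return obj.driver.find_element(*self.locator).get_attribute("value")

    def __set__(self, obj, value):
        field = obj.driver.find_element(*self.locator)
        field.clear()
        field.send_keys(value)


class FakeElement:
    """Stand-in for a Selenium WebElement, for demonstration only."""

    def __init__(self):
        self.value = ""

    def clear(self):
        self.value = ""

    def send_keys(self, text):
        self.value += text

    def get_attribute(self, name):
        return self.value if name == "value" else None


class FakeDriver:
    """Stand-in for a WebDriver; returns one element per locator."""

    def __init__(self):
        self.elements = {}

    def find_element(self, by, value):
        return self.elements.setdefault((by, value), FakeElement())


class MainPage:
    # "name" is the string value behind Selenium's By.NAME
    search_keyword = TextField(("name", "txt_1_value1"))

    def __init__(self, driver):
        self.driver = driver


page = MainPage(FakeDriver())
page.search_keyword = "禮學"
print(page.search_keyword)
```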
https://codereview.stackexchange.com/questions/262772