Making an XPath index count forward to preserve table structure

Code Review user
Asked 2021-06-07 16:53:03
Answers: 1, Views: 116, Followers: 0, Votes: 3

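Scraping a table from the web gets complicated when a single cell holds two or more values. To preserve the table structure, I devised a way to count up the row-number index in the XPath, building a nested list whenever the row number stays the same.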

    def get_structured_elements(name):
        """For target data that is nested and structured,
        such as a table with multiple values in a single cell."""
        driver = self.driver
    
        i = 2 # keep track of 'i' to retain the document structure.
        number_of_items = number_of_items_found()
        elements = [None] * number_of_items # len(elements) will exceed number_of_items.

        target_data = driver.find_elements("//table/tbody/tr[" + i + "]/td[2]/a")
    
        while i - 2 < number_of_items:
            for item in target_data:
                # print(item.text, i-1)
                if elements[i - 2] == None:
                    elements[i - 2] = item.text # set to item.text value if position is empty. 
                else:
                    elements[i - 2] = [elements[i - 2]] 
                    elements[i - 2].append(item.text) # make nested list and append new value if position is occupied.
            i += 1
    
        return elements

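This simple logic worked fine until I tried to manage all the locator variables in one place to make the code more reusable: how do I store the expression "//table/tbody/tr[" + i + "]/td[2]/a" in a list or dictionary so that it still works when it is plugged in?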

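The solution (that is, hack) I came up with is a function that takes the front and back halves of the iterated XPath as arguments and returns front_half + str(i) + back_half if i is among the local variables of the parent (iterator) function.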

def split_xpath_at_i(front_half, back_half):
    """Splits xpath string at its counter index. 
    The 'else' part is to avoid errors 
    when this function is called outside an indexed environment. """
    
    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half

    return string

xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"), 
         "//table/tbody/tr/td[3]/a[1]"
        ]

def xpath_index_iterator():
    for i in range(10):
        print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))


xpath_index_iterator()

# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a

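The problem is that split_xpath_at_i is blind to variables in its immediate environment. What I eventually came up with is to use an attribute of the iterator function to define the counter i, so that the variable can be made available to split_xpath_at_i like this: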

def split_xpath_at_i(front_half, back_half):
    """Splits xpath string at its counter index. 
    The 'else' part is to avoid errors 
    when this function is called outside an indexed environment. """
    try:
        i = xpath_index_iterator.i
    except:
        pass
    
    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half

    return string

xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"), 
         "//table/tbody/tr/td[3]/a[1]"
        ]
    
def xpath_index_iterator():
    xpath_index_iterator.i = 0
    lst = []
    for xpath_index_iterator.i in range(10):
        print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))

xpath_index_iterator()

# //table/tbody/tr[0]/td[2]/a
# //table/tbody/tr[1]/td[2]/a
# //table/tbody/tr[2]/td[2]/a
# //table/tbody/tr[3]/td[2]/a
# //table/tbody/tr[4]/td[2]/a
# //table/tbody/tr[5]/td[2]/a
# //table/tbody/tr[6]/td[2]/a
# //table/tbody/tr[7]/td[2]/a
# //table/tbody/tr[8]/td[2]/a
# //table/tbody/tr[9]/td[2]/a

The problem gets more complicated when I try to call split_xpath_at_i through a list of locators:

def split_xpath_at_i(front_half, back_half):
    """Splits xpath string at its counter index. 
    The 'else' part is to avoid errors 
    when this function is called outside an indexed environment. """
    try:
        i = xpath_index_iterator.i
    except:
        pass
    
    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half

    return string

xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"), 
         "//table/tbody/tr/td[3]/a[1]"
        ]
    
def xpath_index_iterator():
    xpath_index_iterator.i = 0
    lst = []
    for xpath_index_iterator.i in range(10):
#         print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))
        lst.append(xpath[0])
    return lst

xpath_index_iterator()

# ['//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a']

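What would a professional solution to this problem look like?

The whole code:

The code below is adapted from the Selenium manual.

I asked a related question here that deals with the general approach to page object design.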

test.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from query import Input
import page

cnki = Input()
driver = cnki.webpage('http://big5.oversea.cnki.net/kns55/')

current_page = page.MainPage(driver)
current_page.submit_search('禮學')
current_page.switch_to_frame()
result = page.SearchResults(driver)

structured = result.get_structured_elements('titles') # I couldn't get this to work.
simple = result.simple_get_structured_elements() # but this works fine.

query.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium import webdriver

class Input:
    """This class provides a wrapper around actual working code."""
    
    # CONSTANTS
    
    URL = None
        
    def __init__(self):
        self.driver = webdriver.Chrome
    
    def webpage(self, url):
        driver = self.driver()
        driver.get(url)
        
        return driver

page.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from element import BasePageElement
from locators import InputLocators, OutputLocators
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait


class SearchTextElement(BasePageElement):
    """This class gets the search text from the specified locator"""

    #The locator for search box where search string is entered
    locator = None


class BasePage:
    """Base class to initialize the base page that will be called from all
    pages"""

    def __init__(self, driver):
        self.driver = driver

class MainPage(BasePage):
    """Home page action methods come here. I.e. Python.org"""

    search_keyword = SearchTextElement()
    
    def submit_search(self, keyword):
        """Submits keyword and triggers the search"""
        SearchTextElement.locator = InputLocators.SEARCH_FIELD
        self.search_keyword = keyword

    def select_dropdown_item(self, item):
        driver = self.driver
        by, val = InputLocators.SEARCH_ATTR
        driver.find_element(by, val + "/option[text()='" + item + "']").click()

    def click_search_button(self):
        driver = self.driver
        element = driver.find_element(*InputLocators.SEARCH_BUTTON)
        element.click()
        
    def switch_to_frame(self):
        """Use this function to get access to hidden elements. """
        driver = self.driver
        driver.switch_to.default_content()
        driver.switch_to.frame('iframeResult')

    # Maximize the number of items on display in the search results.
    def max_content(self):
        driver = self.driver
        max_content = driver.find_element_by_css_selector('#id_grid_display_num > a:nth-child(3)')
        max_content.click()
    
    
    def stop_loading_page_when_element_is_present(self, locator):
        driver = self.driver
        
        ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
        wait = WebDriverWait(driver, 30, ignored_exceptions=ignored_exceptions)
    
        wait.until(
            EC.presence_of_element_located(locator))
        driver.execute_script("window.stop();")


    def next_page(self):
        driver = self.driver

        self.stop_loading_page_when_element_is_present(InputLocators.NEXT_PAGE)
        driver.execute_script("window.stop();")
    
        try:
            driver.find_element(*InputLocators.NEXT_PAGE).click()
            print("Navigating to Next Page")
        except (TimeoutException, WebDriverException):
            print("Last page reached")
        
        
  
        
class SearchResults(BasePage):
    """Search results page action methods come here"""

    def __init__(self, driver):
        self.driver = driver
        i = None # get_structured_element counter
        
    def wait_for_page_to_load(self):
        driver = self.driver
        wait = WebDriverWait(driver, 100)
        wait.until(
            EC.presence_of_element_located(*InputLocators.MAIN_BODY))
    
    def get_single_element(self, name):
        """Returns a single value as target data."""
        driver = self.driver
        target_data = driver.find_element(*OutputLocators.CNKI[str(name.upper())])
        # SearchTextElement.locator = OutputLocators.CNKI[str(name.upper())]
        # target_data = SearchTextElement()
        return target_data
    
    def number_of_items_found(self):
        """Return the number of items found on a single page."""
        driver = self.driver
        target_data = driver.find_elements(*OutputLocators.CNKI['INDEX'])
        
        return len(target_data)
    
    def get_elements(self, name):
        """Returns simple list of values in specific data field in a table."""
        driver = self.driver
        target_data = driver.find_elements(*OutputLocators.CNKI[str(name.upper())])
        
        elements = []
        for item in target_data:
            elements.append(item.text)
        
        return elements


    def get_structured_elements(self, name):
        """For target data that is nested and structured,
        such as a table with multiple values in a single cell."""
        driver = self.driver

        i = 2 # keep track of 'i' to retain the document structure.
        number_of_items = self.number_of_items_found()
        elements = [None] * number_of_items

        while i - 2 < number_of_items:
            
            target_data = driver.find_elements(*OutputLocators.CNKI[str(name.upper())])

            for item in target_data:
                print(item.text, i - 1)
                if elements[i - 2] == None:
                    elements[i - 2] = item.text
                elif isinstance(elements[i - 2], list):
                    elements[i - 2].append(item.text)
                else:
                    elements[i - 2] = [elements[i - 2]]
                    elements[i - 2].append(item.text)
            i += 1
    
        return elements
    
    def simple_get_structured_elements(self):
        """Simple structured elements code with fixed xpath."""
        driver = self.driver

        i = 2 # keep track of 'i' to retain the document structure.
        number_of_items = self.number_of_items_found()
        elements = [None] * number_of_items
        
        while i - 2 < number_of_items:
            target_data = driver.find_elements_by_xpath\
            ('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr['\
                 + str(i) + ']/td[2]/a')

            for item in target_data:
                print(item.text, i-1)
                if elements[i - 2] == None:
                    elements[i - 2] = item.text
                elif isinstance(elements[i - 2], list):
                    elements[i - 2].append(item.text)
                else:
                    elements[i - 2] = [elements[i - 2]]
                    elements[i - 2].append(item.text)
            i += 1

        return elements

element.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium.webdriver.support.ui import WebDriverWait


class BasePageElement():
    """Base page class that is initialized on every page object class."""
    
    def __set__(self, obj, value):
        """Sets the text to the value supplied"""
        driver = obj.driver
        
        text_field = WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element(*self.locator))
        text_field.clear()
        text_field.send_keys(value)
        text_field.submit()

    def __get__(self, obj, owner):
        """Gets the text of the specified object"""
        driver = obj.driver
        
        WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element(*self.locator))
        element = driver.find_element(*self.locator)
        return element.get_attribute("value")

locators.py

This is where split_xpath_at_i lives.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
# import page

class InputLocators():
    """A class for main page locators. All main page locators should come here"""
        
    def dropdown_list_xpath(attribute, value):
        string = "//select[@" + attribute + "='" + value + "']"
        
        return string
    
    MAIN_BODY = (By.XPATH, '//GridTableContent/tbody')
    SEARCH_FIELD = (By.NAME, 'txt_1_value1') # (By.ID, 'search-content-box')
    SEARCH_ATTR = (By.XPATH, dropdown_list_xpath('name', 'txt_1_sel'))
    SEARCH_BUTTON = (By.ID, 'btnSearch')
    NEXT_PAGE = (By.LINK_TEXT, "下頁")

class OutputLocators():
    """A class for search results locators. All search results locators should
    come here"""
    
    def split_xpath_at_i(front_half, back_half):
        # try:
        #     i = page.SearchResults.g_s_elem
        # except:
        #     pass

        if 'i' in locals():
            string = front_half + str(i) + back_half
        else:
            string = front_half+"SPLIT_i"+back_half
    
        return string

    CNKI = {
        "TITLES": (By.XPATH, split_xpath_at_i('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr[', ']/td[2]/a')),
        "AUTHORS": (By.XPATH, split_xpath_at_i('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr[', ']/td[3]/a')),
        "JOURNALS": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[4]/a',
        "YEAR_ISSUE": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[5]/a',
        "DOWNLOAD_PATHS": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[1]', 
        "INDEX": (By.XPATH, '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[2]')
    }


    # # Interim Data
    # CAPTIONS = 
    # LINKS = 
    
    # Target Data
    # TITLES = 
    # AUTHORS = 
    # JOURNALS = 
    # VOL = 
    # ISSUE = 
    # DATES = 
    # DOWNLOAD_PATHS = 

1 Answer

Code Review user

Accepted answer

Posted 2021-06-08 01:17:27

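First: I would generally suggest replacing your use of Selenium with direct requests calls. Where that is possible, it is far more efficient than Selenium. As a very rough start, it would look like this: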

from time import time
from typing import Iterable
from urllib.parse import quote
from requests import Session

def js_encode(u: str) -> Iterable[str]:
    for char in u:
        code = ord(char)
        if code < 128:
            yield quote(char).lower()
        else:
            yield f'%u{code:04x}'


def search(query: str):
    topic = '主题'
    # China Academic Literature Online Publishing Database
    catalog = '中国学术文献网络出版总库'
    databases = (
        '中国期刊全文数据库,'            # China Academic Journals Full-text Database
        '中国博士学位论文全文数据库,'     # China Doctoral Dissertation Full-text Database
        '中国优秀硕士学位论文全文数据库,'  # China Master's Thesis Full-text Database
        '中国重要会议论文全文数据库,'     # China Proceedings of Conference Full-text Database
        '国际会议论文全文数据库,'        # International Proceedings of Conference Full-text Database
        '中国重要报纸全文数据库,'        # China Core Newspapers Full-text Database
        '中国年鉴网络出版总库'          # China Yearbook Full-text Database
    )

    with Session() as session:
        session.headers = {
            'Accept':
                'text/html,'
                'application/xhtml+xml,'
                'application/xml;q=0.9,'
                'image/webp,'
                '*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-CA,en-GB;q=0.8,en;q=0.5,en-US;q=0.3',
            'Cache-Control': 'no-cache',
            'Connection': 'keep-alive',
            'DNT': '1',
            'Host': 'big5.oversea.cnki.net',
            'Pragma': 'no-cache',
            'Sec-GPC': '1',
            'User-Agent':
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) '
                'Gecko/20100101 '
                'Firefox/89.0',
            'Upgrade-Insecure-Requests': '1',
        }

        with session.get(
            'https://big5.oversea.cnki.net/kns55/brief/result.aspx',
            params={
                'txt_1_value1': query,
                'txt_1_sel': topic,
                'dbPrefix': 'SCDB',
                'db_opt': catalog,
                'db_value': databases,
                'search-action': 'brief/result.aspx',
            },
        ) as response:
            response.raise_for_status()
            search_url = response.url
            search_page = response.text

        encoded_query = ''.join(js_encode(',' + query))
        # epoch milliseconds
        timestamp = round(time()*1000)

        # page_params = {
        #     'curpage': 1,
        #     'RecordsPerPage': 20,
        #     'QueryID': 0,
        #     'ID': '',
        #     'turnpage': 1,
        #     'tpagemode': 'L',
        #     'Fields': '',
        #     'DisplayMode': 'listmode',
        #     'sKuaKuID': 0,
        # }

        with session.get(
            'http://big5.oversea.cnki.net/kns55/brief/brief.aspx',
            params={
                'pagename': 'ASP.brief_result_aspx',
                'dbPrefix': 'SCDB',
                'dbCatalog': catalog,
                'ConfigFile': 'SCDB.xml',
                'research': 'off',
                't': timestamp,
            },
            cookies={
                'FileNameS': quote('cnki:'),
                'KNS_DisplayModel': '',
                'CurTop10KeyWord': encoded_query,
                'RsPerPage': '20',
            },
            headers={
                'Referer': search_url,
            }
        ) as response:
            response.raise_for_status()
            results_iframe = response.text


def main():
    etiquette = '禮學'
    search(query=etiquette)


if __name__ == '__main__':
    main()
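
To make the CurTop10KeyWord cookie value more concrete, here is a small standalone sketch that repeats js_encode exactly as defined above and applies it to the sample query. The %uXXXX form resembles what JavaScript's legacy escape() produces for non-ASCII characters; that resemblance is an assumption about why the site expects this shape, not something the site documents.

```python
from typing import Iterable
from urllib.parse import quote


def js_encode(u: str) -> Iterable[str]:
    """Same helper as above: ASCII is percent-quoted, everything else becomes %uXXXX."""
    for char in u:
        code = ord(char)
        if code < 128:
            yield quote(char).lower()
        else:
            yield f'%u{code:04x}'


# The leading comma mirrors how search() builds encoded_query.
encoded = ''.join(js_encode(',' + '禮學'))
print(encoded)
```

The comma is percent-encoded in lowercase, and each Chinese character becomes a %u escape holding its four-digit hexadecimal code point.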

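Unfortunately, this website is very poorly designed. State is passed around using a mix of query parameters, cookies and server-only context that you cannot see and that depends on request history in a non-trivial way. So even though, as far as I can tell, the parameters, headers and cookies produced above are identical to what you see in real traffic on the site, the request still fails: several dynamically generated sections are left out of brief.aspx. So I have given up on this suggestion.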

Switching gears:

The following suggestions cover scope and class usage, and should get you closer to sanity:

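  • The code in test.py needs to be moved into functions.
  • Only test.py should have a shebang, and no other file, since only test.py is a meaningful entry point.
  • Is Input.URL ever used? It can probably be deleted.
  • Input.webpage should not return anything; driver is already a member of the class.
  • Input as a whole is suspect. It provides a very thin wrapper around driver and is basically useless on its own. I would expect driver.get() to be moved to MainPage.__init__.
  • InputLocators likewise does not deserve to be a class. These constants can basically be distributed to their points of use, i.e. wait.until(EC.presence_of_element_located((By.XPATH, '//GridTableContent/tbody')))
  • Your search_keyword is strange: it is first initialized as a static, and then submit_search uses it as an instance variable instead. Why? Also, what is keyword? You would benefit from using PEP 484 type hints.
  • switch_to_frame had timing issues until I added two waits:
      WebDriverWait(driver, 100).until(lambda driver: driver.find_element(By.XPATH, '//iframe'))
      driver.switch_to.frame('iframeResult')
      WebDriverWait(driver, 100).until(lambda driver: driver.find_element(By.XPATH, '//table'))
  • The () at the end of the base classes can be dropped.
  • OutputLocators.CNKI is a dictionary. Why? get_single_element indexes into it, but get_single_element itself is never called.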
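
Several of these points (code moved into functions, a single entry point, driver.get() inside MainPage.__init__, type hints) can be sketched together. This is only an illustration: the Driver protocol and FakeDriver class are hypothetical stand-ins so that the sketch runs without Selenium or a browser, and in the real script the driver would presumably be webdriver.Chrome().

```python
from typing import List, Protocol


class Driver(Protocol):
    """Hypothetical minimal view of the driver: only get() is needed here."""

    def get(self, url: str) -> None: ...


class MainPage:
    URL = 'http://big5.oversea.cnki.net/kns55/'

    def __init__(self, driver: Driver) -> None:
        self.driver = driver
        # Navigation happens here instead of in a separate Input wrapper.
        self.driver.get(self.URL)


class FakeDriver:
    """Stand-in that records visited URLs instead of opening a browser."""

    def __init__(self) -> None:
        self.visited: List[str] = []

    def get(self, url: str) -> None:
        self.visited.append(url)


def main() -> None:
    driver = FakeDriver()  # assumption: replace with a real Selenium driver
    MainPage(driver)
    print(driver.visited)


if __name__ == '__main__':
    main()
```

With this layout the module can be imported without side effects, and the Input class is no longer needed.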

This code:

    elements = []
    for item in target_data:
        elements.append(item.text)
    return elements

can be replaced with a generator:

for item in target_data:
    yield item.text
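
As a runnable illustration of the generator form, the sketch below wraps the loop in a function and feeds it stand-in objects. FakeElement and element_texts are hypothetical names; FakeElement only mimics the .text attribute of a Selenium element.

```python
from typing import Iterable, Iterator, NamedTuple


class FakeElement(NamedTuple):
    """Hypothetical stand-in for a Selenium WebElement; only .text is used."""
    text: str


def element_texts(target_data: Iterable[FakeElement]) -> Iterator[str]:
    # Yield each text lazily instead of building a list first.
    for item in target_data:
        yield item.text


cells = [FakeElement('title A'), FakeElement('title B')]
print(list(element_texts(cells)))
```

A caller that really needs a list can wrap the call in list(), as shown. Note that a generator can only be consumed once.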

This code:

    i = None # get_structured_element counter

does nothing, because every local variable is discarded when its scope ends.
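
The difference between a local and an instance attribute can be checked directly. In this minimal sketch, the class names WithLocal and WithAttribute are made up for the demonstration.

```python
class WithLocal:
    def __init__(self):
        i = None  # plain local: gone as soon as __init__ returns


class WithAttribute:
    def __init__(self):
        self.i = None  # instance attribute: stays on the object


print(hasattr(WithLocal(), 'i'))      # False
print(hasattr(WithAttribute(), 'i'))  # True
```

If the counter is meant to persist between method calls, it has to be assigned through self.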

This code:

    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half

will never see its first branch evaluated, since i is not defined locally. I really don't know what you're trying to do here.
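
A short sketch shows why the first branch is unreachable: locals() only reports names bound in the function's own frame, never those of its caller. The functions sees_i and caller and the ROW_TEMPLATE constant are hypothetical; the template is one possible alternative, not the only one.

```python
def sees_i() -> bool:
    # locals() covers this function's own frame only.
    return 'i' in locals()


def caller() -> bool:
    i = 5  # local to caller, invisible to sees_i
    return sees_i()


print(caller())  # False

# Hypothetical alternative: store a template and pass the row explicitly.
ROW_TEMPLATE = '//table/tbody/tr[{row}]/td[2]/a'
print(ROW_TEMPLATE.format(row=3))  # //table/tbody/tr[3]/td[2]/a
```

Passing the row number as an argument keeps the dependency visible at the call site, so no frame inspection or function attribute is needed.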

These long XPath tree traversals, such as

'//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[2]'

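are both fragile and hard to read. In most cases you should be able to condense them with a mix of inner // to omit sections of the path, and judicious references to known attributes.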
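
The effect of condensing a path can be tried offline with the standard library. The sketch below uses xml.etree.ElementTree, which supports only a subset of XPath, on a made-up document. The markup and the class="results" attribute are assumptions for illustration; the attributes actually available on the real page would need to be checked in the browser.

```python
import xml.etree.ElementTree as ET

# Made-up markup that imitates a results table nested inside a layout table.
SAMPLE = """
<html><body>
  <form id="Form1">
    <table><tbody>
      <tr><td>header</td></tr>
      <tr><td>
        <table class="results"><tbody>
          <tr><td>1</td><td><a>Title one</a></td></tr>
          <tr><td>2</td><td><a>Title two</a><a>Subtitle</a></td></tr>
        </tbody></table>
      </td></tr>
    </tbody></table>
  </form>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Every intermediate element spelled out, as in the original locators.
long_path = ".//*[@id='Form1']/table/tbody/tr[2]/td/table/tbody/tr/td[2]/a"
# Anchored on a (hypothetical) attribute, with // skipping the layout levels.
short_path = ".//table[@class='results']//td[2]/a"

long_hits = [a.text for a in root.findall(long_path)]
short_hits = [a.text for a in root.findall(short_path)]
print(long_hits)
print(short_hits)
```

Both expressions select the same links here, but the short one keeps working if a wrapper element is added or removed between the form and the results table.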

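You ask specifically:

split_xpath_at_i is blind to variables in its immediate environment.

If by "its immediate environment" you mean CNKI (and so on), that is because its immediate environment, the class static scope, has not been initialized yet. CNKI can get a reference to it, but not the other way around. If you want it to have some state, such as a counter, it needs to be promoted to an instance method with a self parameter. I don't know how g_s_elem factors into this, since it is not defined anywhere.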
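
One way to read "promote it to an instance method" is sketched below. The names ROW_TEMPLATES, first_row and for_row are hypothetical, and XPATH is a plain string standing in for selenium's By.XPATH so that the sketch runs without Selenium installed. The shortened XPaths are taken from the simplified examples in the question.

```python
from typing import Dict, Tuple

XPATH = 'xpath'  # assumption: stand-in for selenium's By.XPATH

Locator = Tuple[str, str]


class OutputLocators:
    """Hypothetical rewrite: templates are data, the row number is an argument."""

    ROW_TEMPLATES: Dict[str, str] = {
        'TITLES': '//table/tbody/tr[{row}]/td[2]/a',
        'AUTHORS': '//table/tbody/tr[{row}]/td[3]/a',
    }

    def __init__(self, first_row: int = 2) -> None:
        self.first_row = first_row  # per-instance state lives on self

    def for_row(self, name: str, index: int) -> Locator:
        """Build the locator for the index-th result (0-based)."""
        template = self.ROW_TEMPLATES[name.upper()]
        return XPATH, template.format(row=self.first_row + index)


locators = OutputLocators()
for index in range(3):
    print(locators.for_row('titles', index))
```

Because the template is formatted at call time rather than when the dictionary is built, each call reflects the row it was asked for, and the resulting tuple can be unpacked into driver.find_elements(*locator).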

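You ask:

The SearchTextElement class with only one hard-coded locator variable: is this a good approach?

Not really. For one thing, you are again conflating static and instance variables, since you first initialize a static variable to None and then write an instance variable after construction. Why construct a class at all if it holds only one member and no methods?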
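
The static-versus-instance confusion is easy to reproduce in isolation. In this sketch the class is a bare stand-in without the descriptor methods of BasePageElement, and the locator tuples are plain strings used only for illustration.

```python
class SearchTextElement:
    locator = None  # class attribute, shared by every instance


first = SearchTextElement()
second = SearchTextElement()

# Assigning through the class changes what every instance sees.
SearchTextElement.locator = ('name', 'txt_1_value1')
print(first.locator, second.locator)

# Assigning through an instance creates a separate attribute
# that shadows the class attribute for that instance only.
first.locator = ('id', 'search-content-box')
print(first.locator, second.locator)
print(SearchTextElement.locator)
```

Passing the locator to __init__ and storing it on self would avoid the shared mutable class state altogether.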

Votes: 3
Original page content provided by Code Review. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://codereview.stackexchange.com/questions/262772
