Q: Making an xpath index count forward to preserve table structure

Code Review user

Asked 2021-06-07 16:53:03

1 answer, 116 views, 0 followers, 3 votes

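Scraping a table from the web gets complicated when a single cell holds two or more values. To preserve the table structure, I devised a way to count the row-number index of its xpath and to build a nested list whenever the row number stays the same.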

    def get_structured_elements(self, name):
        """For target data that is nested and structured,
        such as a table with multiple values in a single cell."""
        # Assumes `from selenium.webdriver.common.by import By` at module level.
        driver = self.driver

        i = 2  # xpath row index; keep track of 'i' to retain the document structure.
        number_of_items = self.number_of_items_found()
        elements = [None] * number_of_items  # one slot per table row

        while i - 2 < number_of_items:
            # Re-query on every pass so the xpath reflects the current row.
            target_data = driver.find_elements(
                By.XPATH, "//table/tbody/tr[" + str(i) + "]/td[2]/a")
            for item in target_data:
                # print(item.text, i-1)
                if elements[i - 2] is None:
                    elements[i - 2] = item.text  # first value found in this cell
                elif isinstance(elements[i - 2], list):
                    elements[i - 2].append(item.text)  # third and later values
                else:
                    # second value: promote the single string to a nested list
                    elements[i - 2] = [elements[i - 2], item.text]
            i += 1

        return elements

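This simple logic worked fine until I tried to manage all locator variables in one place to make the code more reusable: how can I store the expression "//table/tbody/tr[" + str(i) + "]/td[2]/a" in a list or dictionary so that it still works when it is plugged in?

The solution (i.e. hack) I came up with is a function that takes the front and back halves of the iterating xpath as arguments and returns front_half + str(i) + back_half if i is among the local variables of the parent (iterator) function.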

def split_xpath_at_i(front_half, back_half):
    """Splits xpath string at its counter index. 
    The 'else' part is to avoid errors
    when this function is called outside an indexed environment. """
    
    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half

    return string

xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"), 
         "//table/tbody/tr/td[3]/a[1]"
        ]

def xpath_index_iterator():
    for i in range(10):
        print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))


xpath_index_iterator()

# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a

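The problem is that split_xpath_at_i is blind to variables in its immediate environment. What I eventually came up with is to use an attribute of the iterator function to define the counter i, so that the variable can be made available to split_xpath_at_i like this: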

def split_xpath_at_i(front_half, back_half):
    """Splits xpath string at its counter index. 
    The 'else' part is to avoid errors
    when this function is called outside an indexed environment. """
    try:
        i = xpath_index_iterator.i
    except:
        pass
    
    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half

    return string

xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"), 
         "//table/tbody/tr/td[3]/a[1]"
        ]
    
def xpath_index_iterator():
    xpath_index_iterator.i = 0
    lst = []
    for xpath_index_iterator.i in range(10):
        print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))

xpath_index_iterator()

# //table/tbody/tr[0]/td[2]/a
# //table/tbody/tr[1]/td[2]/a
# //table/tbody/tr[2]/td[2]/a
# //table/tbody/tr[3]/td[2]/a
# //table/tbody/tr[4]/td[2]/a
# //table/tbody/tr[5]/td[2]/a
# //table/tbody/tr[6]/td[2]/a
# //table/tbody/tr[7]/td[2]/a
# //table/tbody/tr[8]/td[2]/a
# //table/tbody/tr[9]/td[2]/a

The problem gets more complicated when I try to call split_xpath_at_i through a list of locators:

def split_xpath_at_i(front_half, back_half):
    """Splits xpath string at its counter index. 
    The 'else' part is to avoid errors
    when this function is called outside an indexed environment. """
    try:
        i = xpath_index_iterator.i
    except:
        pass
    
    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half

    return string

xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"), 
         "//table/tbody/tr/td[3]/a[1]"
        ]
    
def xpath_index_iterator():
    xpath_index_iterator.i = 0
    lst = []
    for xpath_index_iterator.i in range(10):
#         print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))
        lst.append(xpath[0])
    return lst

xpath_index_iterator()

# ['//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a',
#  '//table/tbody/tr[9]/td[2]/a']

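What would solving this in a professional way look like?

The whole code:

The code below is adapted from the Selenium manual.

I asked a related question here, which concerns the general approach to page object design.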

test.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from query import Input
import page

cnki = Input()
driver = cnki.webpage('http://big5.oversea.cnki.net/kns55/')

current_page = page.MainPage(driver)
current_page.submit_search('禮學')
current_page.switch_to_frame()
result = page.SearchResults(driver)

structured = result.get_structured_elements('titles') # I couldn't get this to work.
simple = result.simple_get_structured_elements() # but this works fine.

query.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium import webdriver

class Input:
    """This class provides a wrapper around actual working code."""
    
    # CONSTANTS
    
    URL = None
        
    def __init__(self):
        self.driver = webdriver.Chrome
    
    def webpage(self, url):
        driver = self.driver()
        driver.get(url)
        
        return driver

page.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from element import BasePageElement
from locators import InputLocators, OutputLocators
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait


class SearchTextElement(BasePageElement):
    """This class gets the search text from the specified locator"""

    #The locator for search box where search string is entered
    locator = None


class BasePage:
    """Base class to initialize the base page that will be called from all
    pages"""

    def __init__(self, driver):
        self.driver = driver

class MainPage(BasePage):
    """Home page action methods come here. I.e. Python.org"""

    search_keyword = SearchTextElement()
    
    def submit_search(self, keyword):
        """Submits keyword and triggers the search"""
        SearchTextElement.locator = InputLocators.SEARCH_FIELD
        self.search_keyword = keyword

    def select_dropdown_item(self, item):
        driver = self.driver
        by, val = InputLocators.SEARCH_ATTR
        driver.find_element(by, val + "/option[text()='" + item + "']").click()

    def click_search_button(self):
        driver = self.driver
        element = driver.find_element(*InputLocators.SEARCH_BUTTON)
        element.click()
        
    def switch_to_frame(self):
        """Use this function to get access to hidden elements. """
        driver = self.driver
        driver.switch_to.default_content()
        driver.switch_to.frame('iframeResult')

    # Maximize the number of items on display in the search results.
    def max_content(self):
        driver = self.driver
        max_content = driver.find_element_by_css_selector('#id_grid_display_num > a:nth-child(3)')
        max_content.click()
    
    
    def stop_loading_page_when_element_is_present(self, locator):
        driver = self.driver
        
        ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
        wait = WebDriverWait(driver, 30, ignored_exceptions=ignored_exceptions)
    
        wait.until(
            EC.presence_of_element_located(locator))
        driver.execute_script("window.stop();")


    def next_page(self):
        driver = self.driver

        self.stop_loading_page_when_element_is_present(InputLocators.NEXT_PAGE)
        driver.execute_script("window.stop();")
    
        try:
            driver.find_element(*InputLocators.NEXT_PAGE).click()
            print("Navigating to Next Page")
        except (TimeoutException, WebDriverException):
            print("Last page reached")
        
        
  
        
class SearchResults(BasePage):
    """Search results page action methods come here"""

    def __init__(self, driver):
        self.driver = driver
        i = None # get_structured_element counter
        
    def wait_for_page_to_load(self):
        driver = self.driver
        wait = WebDriverWait(driver, 100)
        wait.until(
            EC.presence_of_element_located(*InputLocators.MAIN_BODY))
    
    def get_single_element(self, name):
        """Returns a single value as target data."""
        driver = self.driver
        target_data = driver.find_element(*OutputLocators.CNKI[str(name.upper())])
        # SearchTextElement.locator = OutputLocators.CNKI[str(name.upper())]
        # target_data = SearchTextElement()
        return target_data
    
    def number_of_items_found(self):
        """Return the number of items found on a single page."""
        driver = self.driver
        target_data = driver.find_elements(*OutputLocators.CNKI['INDEX'])
        
        return len(target_data)
    
    def get_elements(self, name):
        """Returns simple list of values in specific data field in a table."""
        driver = self.driver
        target_data = driver.find_elements(*OutputLocators.CNKI[str(name.upper())])
        
        elements = []
        for item in target_data:
            elements.append(item.text)
        
        return elements


    def get_structured_elements(self, name):
        """For target data that is nested and structured,
        such as a table with multiple values in a single cell."""
        driver = self.driver

        i = 2 # keep track of 'i' to retain the document structure.
        number_of_items = self.number_of_items_found()
        elements = [None] * number_of_items

        while i - 2 < number_of_items:
            
            target_data = driver.find_elements(*OutputLocators.CNKI[str(name.upper())])

            for item in target_data:
                print(item.text, i - 1)
                if elements[i - 2] == None:
                    elements[i - 2] = item.text
                elif isinstance(elements[i - 2], list):
                    elements[i - 2].append(item.text)
                else:
                    elements[i - 2] = [elements[i - 2]]
                    elements[i - 2].append(item.text)
            i += 1
    
        return elements
    
    def simple_get_structured_elements(self):
        """Simple structured elements code with fixed xpath."""
        driver = self.driver

        i = 2 # keep track of 'i' to retain the document structure.
        number_of_items = self.number_of_items_found()
        elements = [None] * number_of_items
        
        while i - 2 < number_of_items:
            target_data = driver.find_elements_by_xpath\
            ('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr['\
                 + str(i) + ']/td[2]/a')

            for item in target_data:
                print(item.text, i-1)
                if elements[i - 2] == None:
                    elements[i - 2] = item.text
                elif isinstance(elements[i - 2], list):
                    elements[i - 2].append(item.text)
                else:
                    elements[i - 2] = [elements[i - 2]]
                    elements[i - 2].append(item.text)
            i += 1

        return elements

element.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium.webdriver.support.ui import WebDriverWait


class BasePageElement():
    """Base page class that is initialized on every page object class."""
    
    def __set__(self, obj, value):
        """Sets the text to the value supplied"""
        driver = obj.driver
        
        text_field = WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element(*self.locator))
        text_field.clear()
        text_field.send_keys(value)
        text_field.submit()

    def __get__(self, obj, owner):
        """Gets the text of the specified object"""
        driver = obj.driver
        
        WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element(*self.locator))
        element = driver.find_element(*self.locator)
        return element.get_attribute("value")

locators.py

This is where split_xpath_at_i lives.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
# import page

class InputLocators():
    """A class for main page locators. All main page locators should come here"""
        
    def dropdown_list_xpath(attribute, value):
        string = "//select[@" + attribute + "='" + value + "']"
        
        return string
    
    MAIN_BODY = (By.XPATH, '//GridTableContent/tbody')
    SEARCH_FIELD = (By.NAME, 'txt_1_value1') # (By.ID, 'search-content-box')
    SEARCH_ATTR = (By.XPATH, dropdown_list_xpath('name', 'txt_1_sel'))
    SEARCH_BUTTON = (By.ID, 'btnSearch')
    NEXT_PAGE = (By.LINK_TEXT, "下頁")

class OutputLocators():
    """A class for search results locators. All search results locators should
    come here"""
    
    def split_xpath_at_i(front_half, back_half):
        # try:
        #     i = page.SearchResults.g_s_elem
        # except:
        #     pass

        if 'i' in locals():
            string = front_half + str(i) + back_half
        else:
            string = front_half+"SPLIT_i"+back_half
    
        return string

    CNKI = {
        "TITLES": (By.XPATH, split_xpath_at_i('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr[', ']/td[2]/a')),
        "AUTHORS": (By.XPATH, split_xpath_at_i('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr[', ']/td[3]/a')),
        "JOURNALS": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[4]/a',
        "YEAR_ISSUE": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[5]/a',
        "DOWNLOAD_PATHS": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[1]', 
        "INDEX": (By.XPATH, '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[2]')
    }


    # # Interim Data
    # CAPTIONS = 
    # LINKS = 
    
    # Target Data
    # TITLES = 
    # AUTHORS = 
    # JOURNALS = 
    # VOL = 
    # ISSUE = 
    # DATES = 
    # DOWNLOAD_PATHS =

xpath

python

selenium

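1 Answer

Code Review user

Accepted answer

Answered 2021-06-08 01:17:27

First: I would generally suggest replacing your use of Selenium with direct requests calls. Where that is possible, it is far more efficient than Selenium. As a very rough start, it looks like the following: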

from time import time
from typing import Iterable
from urllib.parse import quote
from requests import Session

def js_encode(u: str) -> Iterable[str]:
    for char in u:
        code = ord(char)
        if code < 128:
            yield quote(char).lower()
        else:
            yield f'%u{code:04x}'


def search(query: str):
    topic = '主题'
    # China Academic Literature Online Publishing Database
    catalog = '中国学术文献网络出版总库'
    databases = (
        '中国期刊全文数据库,'            # China Academic Journals Full-text Database
        '中国博士学位论文全文数据库,'     # China Doctoral Dissertation Full-text Database
        '中国优秀硕士学位论文全文数据库,'  # China Master's Thesis Full-text Database
        '中国重要会议论文全文数据库,'     # China Proceedings of Conference Full-text Database
        '国际会议论文全文数据库,'        # International Proceedings of Conference Full-text Database
        '中国重要报纸全文数据库,'        # China Core Newspapers Full-text Database
        '中国年鉴网络出版总库'          # China Yearbook Full-text Database
    )

    with Session() as session:
        session.headers = {
            'Accept':
                'text/html,'
                'application/xhtml+xml,'
                'application/xml;q=0.9,'
                'image/webp,'
                '*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-CA,en-GB;q=0.8,en;q=0.5,en-US;q=0.3',
            'Cache-Control': 'no-cache',
            'Connection': 'keep-alive',
            'DNT': '1',
            'Host': 'big5.oversea.cnki.net',
            'Pragma': 'no-cache',
            'Sec-GPC': '1',
            'User-Agent':
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) '
                'Gecko/20100101 '
                'Firefox/89.0',
            'Upgrade-Insecure-Requests': '1',
        }

        with session.get(
            'https://big5.oversea.cnki.net/kns55/brief/result.aspx',
            params={
                'txt_1_value1': query,
                'txt_1_sel': topic,
                'dbPrefix': 'SCDB',
                'db_opt': catalog,
                'db_value': databases,
                'search-action': 'brief/result.aspx',
            },
        ) as response:
            response.raise_for_status()
            search_url = response.url
            search_page = response.text

        encoded_query = ''.join(js_encode(',' + query))
        # epoch milliseconds
        timestamp = round(time()*1000)

        # page_params = {
        #     'curpage': 1,
        #     'RecordsPerPage': 20,
        #     'QueryID': 0,
        #     'ID': '',
        #     'turnpage': 1,
        #     'tpagemode': 'L',
        #     'Fields': '',
        #     'DisplayMode': 'listmode',
        #     'sKuaKuID': 0,
        # }

        with session.get(
            'http://big5.oversea.cnki.net/kns55/brief/brief.aspx',
            params={
                'pagename': 'ASP.brief_result_aspx',
                'dbPrefix': 'SCDB',
                'dbCatalog': catalog,
                'ConfigFile': 'SCDB.xml',
                'research': 'off',
                't': timestamp,
            },
            cookies={
                'FileNameS': quote('cnki:'),
                'KNS_DisplayModel': '',
                'CurTop10KeyWord': encoded_query,
                'RsPerPage': '20',
            },
            headers={
                'Referer': search_url,
            }
        ) as response:
            response.raise_for_status()
            results_iframe = response.text


def main():
    etiquette = '禮學'
    search(query=etiquette)


if __name__ == '__main__':
    main()

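Unfortunately, this website is very poorly designed. State is passed using a mix of query parameters, cookies and server-only context that you cannot see and that depends on request history in a non-trivial way. So even though the parameters, headers and cookies produced above are, to my knowledge, identical to those you would see in real-life traffic on the website, the request fails in that several dynamically generated D3 sections are omitted from brief.aspx. So I have given up on this recommendation.

Switching gears:

The following suggestions cover scope and class usage, and should get you closer to sanity: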

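The code in test.py needs to be moved into functions.
Only test.py should have a shebang, and no other file, because only test.py is a meaningful entry point.
Is Input.URL ever used? It can probably be deleted.
Input.webpage should not return anything; driver is already a member of the class.
Input as a whole is suspect. It provides a very thin wrapper around driver and is basically useless by itself. I would expect driver.get() to be moved to MainPage.__init__.
InputLocators does not deserve to be a class either. These constants can basically be distributed to their points of use, i.e.

    wait.until(
        EC.presence_of_element_located(
            (By.XPATH, '//GridTableContent/tbody')))

Your search_keyword is odd: you first initialize it as a static, and then use it as an instance variable in submit_search. Why? Also, what is keyword? You would benefit from using PEP484 type hints.
switch_to_frame had timing issues until I added two waits:

    WebDriverWait(driver, 100).until(
        lambda driver: driver.find_element(By.XPATH, '//iframe'))
    driver.switch_to.frame('iframeResult')
    WebDriverWait(driver, 100).until(
        lambda driver: driver.find_element(By.XPATH, '//table'))

The () at the end of base class declarations can be deleted.
OutputLocators.CNKI is a dictionary. Why? get_single_element indexes it, but get_single_element itself is never called.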

This code:

    elements = []
    for item in target_data:
        elements.append(item.text)
    return elements

can be replaced with a generator:

for item in target_data:
    yield item.text
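As a minimal sketch of what the generator form looks like in a complete function: FakeElement and iter_texts below are hypothetical names, and FakeElement only stands in for a Selenium element by exposing a .text attribute.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeElement:
    """Hypothetical stand-in for a Selenium element; only .text is modelled."""
    text: str


def iter_texts(target_data: Iterable[FakeElement]) -> Iterator[str]:
    """Yield each element's text lazily instead of building a list first."""
    for item in target_data:
        yield item.text


cells = [FakeElement("Title A"), FakeElement("Title B")]
texts = iter_texts(cells)
first = next(texts)   # values are produced one at a time, on demand
rest = list(texts)    # the caller decides whether to materialize a list
```

One caveat with a real browser: the texts are read only when the generator is consumed, so it is probably safest to consume it before navigating away from the page.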

This code:

    i = None # get_structured_element counter

does nothing, because all local variables are discarded at the end of the scope.
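A small sketch of the difference between a local name and an instance attribute; WithLocal and WithAttribute are hypothetical class names used only for illustration.

```python
class WithLocal:
    def __init__(self):
        i = None  # local name: it is gone as soon as __init__ returns


class WithAttribute:
    def __init__(self):
        self.i = None  # instance attribute: it stays on the object


local_kept = hasattr(WithLocal(), "i")
attr_kept = hasattr(WithAttribute(), "i")
```

If the counter is meant to be shared between methods, binding it to self is the form that actually persists.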

This code:

    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half+"SPLIT_i"+back_half

will never see its first branch evaluated, because i is not defined locally. I really don't know what you were trying to do here.
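To illustrate why that branch is never taken: locals() only reports names bound inside the function that calls it, not names in the caller. The sketch below uses hypothetical function names; build_xpath shows the plain alternative of passing the index in as an argument.

```python
def sees_caller_i() -> bool:
    # locals() lists only the names bound inside *this* call,
    # so the caller's loop variable is invisible here.
    return 'i' in locals()


def caller() -> list:
    results = []
    for i in range(3):
        results.append(sees_caller_i())
    return results


def build_xpath(front_half: str, back_half: str, i: int) -> str:
    """Receive the row index explicitly instead of searching for it."""
    return f"{front_half}{i}{back_half}"


seen = caller()
row_two = build_xpath("//table/tbody/tr[", "]/td[2]/a", 2)
```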

These long xpath tree traversals, such as

'//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[2]'

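are both fragile and hard to read. In most cases you should be able to condense them through a mix of inner // to omit sections of the path, and judicious references to known attributes.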
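A rough sketch of that condensing idea, run against a made-up document with the standard library's xml.etree.ElementTree, which supports only a subset of XPath. The sample markup and the class name "results" are assumptions for illustration; the real page would need its own known attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the results page.
SAMPLE = """
<html><body>
<form id="Form1">
  <table><tbody>
    <tr><td>header</td></tr>
    <tr><td>
      <table class="results"><tbody>
        <tr><td><a>1</a></td><td><a>Title A</a></td></tr>
        <tr><td><a>2</a></td><td><a>Title B1</a><a>Title B2</a></td></tr>
      </tbody></table>
    </td></tr>
  </tbody></table>
</form>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Full traversal: every intermediate node is spelled out.
long_path = ".//*[@id='Form1']/table/tbody/tr[2]/td/table/tbody/tr/td[2]/a"
# Condensed: anchor on a known attribute and skip the middle with //.
short_path = ".//table[@class='results']//td[2]/a"

long_hits = [a.text for a in root.findall(long_path)]
short_hits = [a.text for a in root.findall(short_path)]
```

Both expressions select the same links here, but the short one keeps working if a wrapper element is added or removed between the form and the inner table.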

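You specifically ask:

split_xpath_at_i is blind to variables in its immediate environment.

If by "its immediate environment" you mean CNKI (etc.), that is because its immediate environment, the class static scope, has not been initialized yet. CNKI can get a reference to it, but not the other way around. If you want it to have some state, such as a counter, it needs to be promoted to an instance method with a self parameter. I don't know how g_s_elem factors into this, because it isn't defined anywhere.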
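One possible way to keep all locators in one place without a hidden counter is to store xpath templates and fill in the row number only when it is known. This is a sketch under assumptions, not the approach from the original code: the TEMPLATES name, the {row} placeholder and the xpath method are all hypothetical, and the shortened paths are only illustrative.

```python
from typing import Dict


class OutputLocators:
    """Hypothetical rewrite: each template carries a {row} placeholder
    that is filled in at the point of use."""

    TEMPLATES: Dict[str, str] = {
        "TITLES": "//table/tbody/tr[{row}]/td[2]/a",
        "AUTHORS": "//table/tbody/tr[{row}]/td[3]/a",
    }

    def xpath(self, name: str, row: int) -> str:
        """Return the xpath for one field in one table row."""
        return self.TEMPLATES[name.upper()].format(row=row)


locators = OutputLocators()
paths = [locators.xpath("titles", row) for row in range(2, 5)]
```

With Selenium, the loop body could then call something like driver.find_elements(By.XPATH, locators.xpath(name, i)), so the row index is passed explicitly rather than discovered through locals() or a function attribute.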

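You ask:

A SearchTextElement class with just one hard-coded locator variable: is this a good approach?

Not really. First, you are again conflating static and instance variables, since you first initialize a static variable to None and then write an instance variable after construction. And why construct a class at all if it contains only one member and no methods?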
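A plain-Python sketch of how class-level ("static") and instance attributes interact. It deliberately ignores the descriptor methods of BasePageElement and reuses the SearchTextElement name only for illustration; the locator tuples are placeholder values.

```python
class SearchTextElement:
    locator = None  # class-level attribute, shared by every instance


first = SearchTextElement()
second = SearchTextElement()

# Assigning through an instance creates an attribute on that instance only.
first.locator = ("name", "txt_1_value1")
second_untouched = second.locator is None

# Rebinding on the class changes the shared default that other instances see.
SearchTextElement.locator = ("id", "btnSearch")
shared_value = second.locator
shadowed_value = first.locator  # the instance attribute still wins here
```

Mixing the two forms makes it hard to tell which value a given lookup will return, which is one reason to pick either a constructor argument or a module-level constant.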

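3 votes

Original page content provided by Code Review. Translation support for the Chinese page was provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://codereview.stackexchange.com/questions/262772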
