文章/答案/技术大牛

发布

社区首页 >问答首页 >在执行以下步骤后，无法从网页中获取动态填充的数字

问在执行以下步骤后，无法从网页中获取动态填充的数字
EN

Stack Overflow用户

提问于 2020-10-14 15:00:43

回答 2查看 284关注 0票数 2

我创建了一个脚本，使用请求模块和BeautifulSoup库从网页中获取一些表格内容。要生成表，必须手动执行我在所附图像中显示的步骤。下面粘贴的代码是一个工作代码，但我试图解决的主要问题是以编程方式获取title编号，在本例中，628086906是附加到我在这里硬编码的table_link的。

在步骤6中，单击工具按钮后，当您将光标悬停在地图上时，您可以看到这个选项Multiple，当您单击该选项时，可以看到包含标题号的url。

起始页

这正是脚本所遵循的台阶。

这是linc编号0030278592，它需要在步骤6中输入输入框。

我尝试过(在table_link中使用硬编码的标题编号)：

import requests
from bs4 import BeautifulSoup

link = 'https://alta.registries.gov.ab.ca/spinii/logon.aspx'
lnotice = 'https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx'
search_page = 'https://alta.registries.gov.ab.ca/SpinII/SearchSelectType.aspx'
map_page = 'http://alta.registries.gov.ab.ca/SpinII/mapindex.aspx'
map_find = 'http://alta.registries.gov.ab.ca/SpinII/mapfinds.aspx'
table_link = 'https://alta.registries.gov.ab.ca/SpinII/popupTitleSearch.aspx?title=628086906'

def get_content(s,link):   
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['uctrlLogon:cmdLogonGuest.x'] = '80'
    payload['uctrlLogon:cmdLogonGuest.y'] = '20'

    r = s.post(link,data=payload)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['cmdYES.x'] = '52'
    payload['cmdYES.y'] = '8'

    s.post(lnotice,data=payload)
    s.headers['Referer'] = 'https://alta.registries.gov.ab.ca/spinii/welcomeguest.aspx'
    
    s.get(search_page)
    s.headers['Referer'] = 'https://alta.registries.gov.ab.ca/SpinII/SearchSelectType.aspx'
    
    s.get(map_page)
    
    r = s.get(map_find)
    s.headers['Referer'] = 'http://alta.registries.gov.ab.ca/SpinII/mapfinds.aspx'
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['__EVENTTARGET'] = 'Finds$lstFindTypes'
    payload['Finds:lstFindTypes'] = 'Linc'
    payload['Finds:ctlLincNumber:txtLincNumber'] = '0030278592'
    
    r = s.post(map_find,data=payload)
    
    r = s.get(table_link)
    print(r.text)


if __name__ == "__main__":
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        get_content(s,link)

我怎样才能从网址上获取标题号码？

或

我如何从该站点获取所有linc号码，这样我就根本不需要使用地图了？

The only problem with this site is that it is unavailable in daytime for maintenance.

python-3.x

web-scraping

python-requests

python

回答 2

Stack Overflow用户

回答已采纳

发布于 2020-10-19 16:50:02

这些数据来自：

POST http://alta.registries.gov.ab.ca/SpinII/mapserver.aspx

在被OpenLayers库使用之前，内容是以自定义格式编码的。所有的解码都位于这个JS文件中。如果您美化它，您可以查找解码它是WayTo.Wtb.Format.WTB的OpenLayers.Class。二进制按字节进行解码，如下所示：

switch(elementType){
    case 1:
        var lineColor = new WayTo.Wtb.Element.LineColor();
        byteOffset = lineColor.parse(dataReader, byteOffset);
        outputElement = lineColor;
        break;
    case 2:
        var lineStyle = new WayTo.Wtb.Element.LineStyle();
        byteOffset = lineStyle.parse(dataReader, byteOffset);
        outputElement = lineStyle;
        break;
    case 3:
        var ellipse = new WayTo.Wtb.Element.Ellipse();
        byteOffset = ellipse.parse(dataReader, byteOffset);
        outputElement = ellipse;
        break;
    ........
}

为了得到原始数据，我们必须复制这个解码算法。我们不需要对所有的对象进行解码，我们只想得到正确的偏移量并正确提取strings。下面是解码部分的python脚本，用于解码文件中的数据(卷曲的输出)：

with open("wtb.bin", mode='rb') as file:
    encodedData = file.read()
    offset = 0
    objects = []

    while offset < len(encodedData):

        elementSize = encodedData[offset]
        offset+=1
        elementType = encodedData[offset]
        offset+=1

        if elementType == 0:
            break

        curElemSize = elementSize
        curElemType = elementType

        if elementType== 114:
            largeElementSize = int.from_bytes(encodedData[offset:offset + 4], "big")
            offset+=4
            largeElementType = int.from_bytes(encodedData[offset:offset+2], "little")
            offset+=2
            curElemSize = largeElementSize
            curElemType = largeElementType

        print(f"type {curElemType} | size {curElemSize}")
        offsetInit = offset

        if curElemType == 1:
            offset+=4
        elif curElemType == 2:
            offset+=2
        elif curElemType == 3:
            offset+=20
        elif curElemType == 4:
            offset+=28
        elif curElemType == 5:
            offset+=12
        elif curElemType == 6:
            textLength = curElemSize - 3
            objects.append({
                "type": "Text",
                "x_position": int.from_bytes(encodedData[offset:offset+2], "little"),
                "y_position": int.from_bytes(encodedData[offset+2:offset+4], "little"),
                "rotation": int.from_bytes(encodedData[offset+4:offset+6], "little"),
                "text": encodedData[offset+6:offset+6+(textLength*2)].decode("utf-8").replace('\x00','')
            })
            offset+=6+(textLength*2)
        elif curElemType == 7:
            numPoint = int(curElemSize / 2)
            offset+=4*numPoint
        elif curElemType == 27:
            numPoint = int(curElemSize / 4)
            offset+=8*numPoint
        elif curElemType == 8:
            numPoint = int(curElemSize / 2)
            offset+=4*numPoint
        elif curElemType == 28:
            numPoint = int(curElemSize / 4)
            offset+=8*numPoint
        elif curElemType == 13:
            offset+=4
        elif curElemType == 14:
            offset+=2
        elif curElemType == 15:
            offset+=2
        elif curElemType == 100:
            pass
        elif curElemType == 101:
            offset+=20
        elif curElemType == 102:
            offset+=2
        elif curElemType == 103:
            pass
        elif curElemType == 104:
            highShort = int.from_bytes(encodedData[offset+2:offset+4], "little")
            lowShort = int.from_bytes(encodedData[offset+4:offset+6], "little")
            objects.append({
                "type": "StartNumericCell",
                "entity": int.from_bytes(encodedData[offset:offset+2], "little"),
                "occurrence": (highShort << 16) + lowShort
            })
            offset+=6
        elif curElemType == 105:
            #end cell
            pass
        elif curElemType == 109:
            textLength = curElemSize - 1
            objects.append({
                "type": "StartAlphanumericCell",
                "entity": int.from_bytes(encodedData[offset:offset+2], "little"),
                "occurrence":encodedData[offset+2:offset+2+(textLength*2)].decode("utf-8").replace('\x00','')
            })
            offset+=2+(textLength*2)
        elif curElemType == 111:
            offset+=40
        elif curElemType == 112:
            objects.append({
                "type": "CoordinatePlane",
                "projection_code": encodedData[offset+48:offset+52].decode("utf-8").replace('\x00','')
            })
            offset+=52
        elif curElemType == 113:
            offset+=24
        elif curElemType == 256:
            nameLength = int.from_bytes(encodedData[offset+14:offset+16], "little")
            objects.append({
                "type": "LargePolygon",
                "name": encodedData[offset+16:offset+16+nameLength].decode("utf-8").replace('\x00',''),
                "occurence": int.from_bytes(encodedData[offset+2:offset+6], "little")
            })
            if nameLength > 0:
                offset+= 16 + nameLength
                if encodedData[offset] == 0:
                    offset+=1
            else:
                offset+= 16
            numberOfPoints = int.from_bytes(encodedData[offset:offset+2], "little")
            offset+=2
            offset+=numberOfPoints*8
        elif curElemType == 257:
            pass
        else:
            offset+= curElemSize*2
        print(f"offset diff {offset-offsetInit}")
        print("--------------------------------")

    print(objects)
    print(len(encodedData))
    print(offset)

(Sidenote:注意元素大小在大端，所有其他值都以小endian表示)

运行这个repl.it来查看它是如何解码文件的

从这里开始，为了清晰起见，我将描述所有步骤(甚至那些您已经弄清楚的步骤)，然后构建抓取数据的步骤：

使用以下方式登录该网站：

GET https://alta.registries.gov.ab.ca/spinii/logon.aspx

刮掉输入名称/值，然后添加uctrlLogon:cmdLogonGuest.x和uctrlLogon:cmdLogonGuest.y，然后调用

POST https://alta.registries.gov.ab.ca/spinii/logon.aspx

法律通知

合法的通知呼叫不是获取地图值所必需的，而是获取项目信息(最后一步)所必需的。

GET https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx

刮除input标记名称/值，并设置cmdYES.x和cmdYES.y，然后调用

POST https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx

地图数据

调用服务器映射API：

POST http://alta.registries.gov.ab.ca/SpinII/mapserver.aspx

有以下数据：

{
    "mt":"titleresults",
    "qt":"lincNo",
    "LINCNumber": lincNumber,
    "rights": "B", #not required
    "cx": 1920, #screen definition
    "cy": 1080,
}

cx/xy是画布大小

使用上述方法解码编码的数据。你会得到：

[{'type': 'LargePolygon', 'name': '0010495134 8722524;1;162', 'entity': 23, 'occurence': 628079167, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012170859 8022146;8;99', 'entity': 23, 'occurence': 628048595, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010691822 8722524;1;163', 'entity': 23, 'occurence': 628222354, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012169736 8022146;8;89', 'entity': 23, 'occurence': 628021327, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694454 8722524;1;179', 'entity': 23, 'occurence': 628191678, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694362 8722524;1;178', 'entity': 23, 'occurence': 628307403, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010433381 8722524;1;177', 'entity': 23, 'occurence': 628209696, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012169710 8022146;8;88A', 'entity': 23, 'occurence': 628021328, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694355 8722524;1;176', 'entity': 23, 'occurence': 628315826, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012170866 8022146;8;100', 'entity': 23, 'occurence': 628163431, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694347 8722524;1;175', 'entity': 23, 'occurence': 628132810, 'line_color_green': 0, 'line_color_red': 129,

提取信息

如果您想要针对特定的lincNumber，您将需要查找多边形的样式，因为对于“多个”值(例如带有多个项的值)，没有提到响应的lincNumber id，只是一个链接引用。以下内容将得到所选的项：

selectedZone = [
    t 
    for t in objects 
    if t.get("fill_color_green", 255) < 255 and t.get("line_color_red") == 255
][0]
print(selectedZone)

调用您在文章中提到的url以获取数据并提取表：

GET https://alta.registries.gov.ab.ca/SpinII/popupTitleSearch.aspx?title={selectedZone["occurence"]}

完整的代码：

import requests
from bs4 import BeautifulSoup
import pandas as pd

lincNumber = "0030278592"
#lincNumber = "0010661156"

s = requests.Session()

# 1) login
r = s.get("https://alta.registries.gov.ab.ca/spinii/logon.aspx")
soup = BeautifulSoup(r.text, "html.parser")

payload = dict([
    (t["name"], t.get("value", ""))
    for t in soup.findAll("input")
])
payload["uctrlLogon:cmdLogonGuest.x"] = 76
payload["uctrlLogon:cmdLogonGuest.y"] = 25
s.post("https://alta.registries.gov.ab.ca/spinii/logon.aspx",data=payload)

# 2) legal notice
r = s.get("https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx")
soup = BeautifulSoup(r.text, "html.parser")
payload = dict([
    (t["name"], t.get("value", ""))
    for t in soup.findAll("input")
])
payload["cmdYES.x"] = 82
payload["cmdYES.y"] = 3
s.post("https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx", data = payload)

# 3) map data
r = s.post("http://alta.registries.gov.ab.ca/SpinII/mapserver.aspx",
    data= {
        "mt":"titleresults",
        "qt":"lincNo",
        "LINCNumber": lincNumber,
        "rights": "B", #not required
        "cx": 1920, #screen definition
        "cy": 1080,
    })

def decodeWtb(encodedData):
    offset = 0

    objects = []
    iteration = 0

    while offset < len(encodedData):

        elementSize = encodedData[offset]
        offset+=1
        elementType = encodedData[offset]
        offset+=1

        if elementType == 0:
            break

        curElemSize = elementSize
        curElemType = elementType

        if elementType== 114:
            largeElementSize = int.from_bytes(encodedData[offset:offset + 4], "big")
            offset+=4
            largeElementType = int.from_bytes(encodedData[offset:offset+2], "little")
            offset+=2
            curElemSize = largeElementSize
            curElemType = largeElementType

        offsetInit = offset

        if curElemType == 1:
            offset+=4
        elif curElemType == 2:
            offset+=2
        elif curElemType == 3:
            offset+=20
        elif curElemType == 4:
            offset+=28
        elif curElemType == 5:
            offset+=12
        elif curElemType == 6:
            textLength = curElemSize - 3
            offset+=6+(textLength*2)
        elif curElemType == 7:
            numPoint = int(curElemSize / 2)
            offset+=4*numPoint
        elif curElemType == 27:
            numPoint = int(curElemSize / 4)
            offset+=8*numPoint
        elif curElemType == 8:
            numPoint = int(curElemSize / 2)
            offset+=4*numPoint
        elif curElemType == 28:
            numPoint = int(curElemSize / 4)
            offset+=8*numPoint
        elif curElemType == 13:
            offset+=4
        elif curElemType == 14:
            offset+=2
        elif curElemType == 15:
            offset+=2
        elif curElemType == 100:
            pass
        elif curElemType == 101:
            offset+=20
        elif curElemType == 102:
            offset+=2
        elif curElemType == 103:
            pass
        elif curElemType == 104:
            offset+=6
        elif curElemType == 105:
            pass
        elif curElemType == 109:
            textLength = curElemSize - 1
            offset+=2+(textLength*2)
        elif curElemType == 111:
            offset+=40
        elif curElemType == 112:
            offset+=52
        elif curElemType == 113:
            offset+=24
        elif curElemType == 256:
            nameLength = int.from_bytes(encodedData[offset+14:offset+16], "little")
            objects.append({
                "type": "LargePolygon",
                "name": encodedData[offset+16:offset+16+nameLength].decode("utf-8").replace('\x00',''),
                "entity": int.from_bytes(encodedData[offset:offset+2], "little"),
                "occurence": int.from_bytes(encodedData[offset+2:offset+6], "little"),
                "line_color_green": encodedData[offset + 8],
                "line_color_red": encodedData[offset + 7],
                "line_color_blue": encodedData[offset + 9],
                "fill_color_green": encodedData[offset + 10],
                "fill_color_red": encodedData[offset + 11],
                "fill_color_blue": encodedData[offset + 13]
            })
            if nameLength > 0:
                offset+= 16 + nameLength
                if encodedData[offset] == 0:
                    offset+=1
            else:
                offset+= 16
            numberOfPoints = int.from_bytes(encodedData[offset:offset+2], "little")
            offset+=2
            offset+=numberOfPoints*8
        elif curElemType == 257:
            pass
        else:
            offset+= curElemSize*2

    return objects

# 4) decode custom format
objects = decodeWtb(r.content)

# 5) get the selected area
selectedZone = [
    t 
    for t in objects 
    if t.get("fill_color_green", 255) < 255 and t.get("line_color_red") == 255
][0]
print(selectedZone)

# 6) get the info about item
r = s.get(f'https://alta.registries.gov.ab.ca/SpinII/popupTitleSearch.aspx?title={selectedZone["occurence"]}')
df = pd.read_html(r.content, attrs = {'class': 'bodyText'}, header =0)[0]
del df['Add to Cart']
del df['View']
print(df[:-1])

在repl.it上运行这个

输出

  Title Number           Type LINC Number Short Legal   Rights Registration Date Change/Cancel Date
0    052400228  Current Title  0030278592  0420091;16  Surface        19/09/2005         13/11/2019
1    072294084  Current Title  0030278551  0420091;12  Surface        22/05/2007         21/08/2007
2    072400529  Current Title  0030278469   0420091;3  Surface        05/07/2007         28/08/2007
3    072498228  Current Title  0030278501   0420091;7  Surface        18/08/2007         08/02/2008
4    072508699  Current Title  0030278535  0420091;10  Surface        23/08/2007         13/12/2007
5    072559500  Current Title  0030278477   0420091;4  Surface        17/09/2007         19/11/2007
6    072559508  Current Title  0030278576  0420091;14  Surface        17/09/2007         09/01/2009
7    072559521  Current Title  0030278519   0420091;8  Surface        17/09/2007         07/11/2007
8    072559530  Current Title  0030278493   0420091;6  Surface        17/09/2007         25/08/2008
9    072559605  Current Title  0030278485   0420091;5  Surface        17/09/2007         23/12/2008

如果您想获得更多的条目，可以查看objects字段。如果你想得到更多关于物品的信息，比如坐标等等，你可以改进解码器。

还可以通过查看包含lincNumber的name字段来匹配目标周围的其他lincNumber，除非其中有“多重”名称。

有趣的事实：

在此流中不需要设置http头。

票数 6

Stack Overflow用户

发布于 2020-10-17 15:22:09

有两个选项可以获取您正在寻找的信息，一个是selenium，您可能已经知道了。

打开netwok选项卡，当您在地图上悬停时，监视通过浏览器传递的请求(如果请求是否发送到服务器)。对于请求和BS4，您最好的选择是，如果数据已经加载，那么下面的解决方案可能会起作用。

import re 
print(re.findall(r’628086906’, r.text) )

如果打印出数据号，就意味着数据可用在json中，并加载了页面，您可以加载json，也可以使用regex查找。否则，您唯一的选择是selenium。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/64356140

复制

相似问题

问在执行以下步骤后，无法从网页中获取动态填充的数字
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问在执行以下步骤后，无法从网页中获取动态填充的数字EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问在执行以下步骤后，无法从网页中获取动态填充的数字
EN