
Scrapy XPath: selecting a specific element from file-extension.net

Stack Overflow user
Asked 2016-05-03 17:28:50
Answers: 1 · Views: 154 · Followers: 0 · Votes: 1

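I am having a hard time understanding XPath.

I am trying to scrape all the magic numbers from http://file-extension.net

Let's take this link as an example: c10

Part of its source code: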

<table border=4 RULES=ROWS FRAME=HSIDES width=728>
            <tr class="tabhead">
              <td></td>
              <td><b>Website</b></td>
              <td><b>&nbsp;EXT&nbsp;</b></td>
              <td><b>&nbsp;Filetype description</b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</td>
            </tr>
            
<tr class="rre"><td>&nbsp;<img src="images/icon-filext.png" width="16" height="16"> &nbsp;</td><td><a href="http://filext.com/file-extension/C10">FILExt</a></td><td>&nbsp;<a class='fesl' href='file_extension_c10'>C10</a>&nbsp;</td><td>&nbsp;<a class='fesl' href='program_extension_irig'>IRIG</a> 106 <a class='fesl' href='program_extension_original'>Original</a> <a class='fesl' href='program_extension_recording'>Recording</a> <a class='fesl' href='program_extension_file'>File</a> (<a class='fesl' href='program_extension_range'>Range</a> <a class='fesl' href='program_extension_commanders'>Commanders</a> <a class='fesl' href='program_extension_council'>Council</a>)</td></tr>

<tr class="rro"><td>&nbsp;<img src="images/icon-fsorg.png" width="16" height="16"> &nbsp;</td><td><a href="http://www.file-extensions.org/c10-file-extension">File Extensions</a></td><td>&nbsp;<a class='fesl' href='file_extension_c10'>C10</a>&nbsp;</td><td>&nbsp;<a class='fesl' href='program_extension_irig'>IRIG</a> 106 <a class='fesl' href='program_extension_original'>original</a> <a class='fesl' href='program_extension_recording'>recording</a> <a class='fesl' href='program_extension_file'>file</a></td></tr>

<tr class="rre"><td>&nbsp;<img src="images/icon-dotwhat.png" width="16" height="16"> &nbsp;</td><td><a href="http://dotwhat.net/c10/9166/">DotWhat</a></td><td>&nbsp;<a class='fesl' href='file_extension_c10'>C10</a>&nbsp;</td><td>&nbsp;<a class='fesl' href='program_extension_split'>Split</a> <a class='fesl' href='program_extension_compressed'>Compressed</a> <a class='fesl' href='program_extension_archive'>Archive</a> <a class='fesl' href='program_extension_file'>File</a> <a class='fesl' href='program_extension_part'>Part</a> 10</td></tr>

<tr class="rro"><td>&nbsp;<img src="images/icon-fsorg.png" width="16" height="16"> &nbsp;</td><td><a href="http://www.file-extensions.org/c10-file-extension">File Extensions</a></td><td>&nbsp;<a class='fesl' href='file_extension_c10'>C10</a>&nbsp;</td><td>&nbsp;<a class='fesl' href='program_extension_split'>Split</a> <a class='fesl' href='program_extension_multi'>Multi</a>-<a class='fesl' href='program_extension_volume'>volume</a> ACE <a class='fesl' href='program_extension_compressed'>compressed</a> <a class='fesl' href='program_extension_file'>file</a> <a class='fesl' href='program_extension_archive'>archive</a></td></tr>

<tr class="rre"><td>&nbsp;<img src="images/icon-trid.png" width="16" height="16"> &nbsp;</td><td><a href="http://mark0.net/soft-trid-e.html">TrID</a></td><td>&nbsp;<a class='fesl' href='file_extension_c10'>C10</a>&nbsp;</td><td>&nbsp;<a class='fesl' href='program_extension_virtual'>Virtual</a> MC-10 <a class='fesl' href='program_extension_tape'>tape</a> <a class='fesl' href='program_extension_image'>image</a><br>&nbsp;<b><small>Header Hexdump</b>: <span class='hexdump'>&nbsp;55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 &nbsp;</span></small></td></tr>

<tr class="rro"><td>&nbsp;<img src="images/icon-filext.png" width="16" height="16"> &nbsp;</td><td><a href="http://filext.com/file-extension/C10">FILExt</a></td><td>&nbsp;<a class='fesl' href='file_extension_c10'>C10</a>&nbsp;</td><td>&nbsp;<a class='fesl' href='program_extension_winace'>WinAce</a> <a class='fesl' href='program_extension_compressed'>Compressed</a> <a class='fesl' href='program_extension_file'>File</a> <a class='fesl' href='program_extension_split'>Split</a> <a class='fesl' href='program_extension_portion'>Portion</a> of <a class='fesl' href='program_extension_compressed'>Compressed</a> <a class='fesl' href='program_extension_file'>File</a> (e-<a class='fesl' href='program_extension_merge'>merge</a> <a class='fesl' href='program_extension_gmbh'>GmbH</a>)</td></tr>

<tr class="rre"><td>&nbsp;<img src="images/icon-fileinfo.png" width="16" height="16"> &nbsp;</td><td><a href="http://www.fileinfo.com/extension/c10">FileInfo</a></td><td>&nbsp;<a class='fesl' href='file_extension_c10'>C10</a>&nbsp;</td><td>&nbsp;<a class='fesl' href='program_extension_winace'>WinAce</a> <a class='fesl' href='program_extension_split'>Split</a> <a class='fesl' href='program_extension_archive'>Archive</a> <a class='fesl' href='program_extension_part'>Part</a> 10</td></tr>

          </table>

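I only want the filetype description from the TrID row (the one whose description includes a hex value).

The problem is that, for reasons I don't understand, every word of the filetype description is a link.

Here is my code: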

for sel in response.xpath('//table[@border=4]'):
    hex = sel.xpath('//span[@class="hexdump"]/text()').extract_first(default='Rien t nul')
    if len(hex) > 7:
        ext = sel.xpath('//a[text()="TrID"]/@href.a[@class="fesl"]/text()').extract()
        print "Nom : %s Hex %s " % (ext,hex)

Of course, //a[text()="TrID"]/@href.a[@class="fesl"] does not work, but it shows what I am after:

If you find a link whose name contains "TrID", give me its filetype description

Any ideas?

1 Answer

Stack Overflow user

Accepted answer

Answered 2016-05-03 17:42:41

'//td[./a[contains(text(), "TrID")]]/following-sibling::td[2]//text()'

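This selects the cell holding the TrID link, steps two cells to the right (the filetype description column), and returns every text node inside it. To target a different row, just replace TrID with the link text of that row.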
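To try the expression without running a crawl, here is a minimal sketch that evaluates it with lxml, the library Scrapy's selectors are built on. `SAMPLE` is a simplified two-row excerpt of the table from the question (the mis-nested `<small>` tag is dropped), and `descriptions_for` is a hypothetical helper name, not part of Scrapy or lxml. In a spider, the same expression string could be passed to `response.xpath(...)` instead.

```python
from lxml import html

# Simplified excerpt of the question's table: one FILExt row and the TrID row.
SAMPLE = """
<html><body>
<table border=4>
<tr class="rro"><td>&nbsp;</td><td><a href="http://filext.com/file-extension/C10">FILExt</a></td><td>&nbsp;<a class='fesl' href='file_extension_c10'>C10</a>&nbsp;</td><td>&nbsp;<a class='fesl' href='program_extension_winace'>WinAce</a> <a class='fesl' href='program_extension_compressed'>Compressed</a> <a class='fesl' href='program_extension_file'>File</a></td></tr>
<tr class="rre"><td>&nbsp;</td><td><a href="http://mark0.net/soft-trid-e.html">TrID</a></td><td>&nbsp;<a class='fesl' href='file_extension_c10'>C10</a>&nbsp;</td><td>&nbsp;<a class='fesl' href='program_extension_virtual'>Virtual</a> MC-10 <a class='fesl' href='program_extension_tape'>tape</a> <a class='fesl' href='program_extension_image'>image</a><br>&nbsp;<b>Header Hexdump</b>: <span class='hexdump'>&nbsp;55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 55 &nbsp;</span></td></tr>
</table>
</body></html>
"""


def descriptions_for(page_source, site="TrID"):
    """Return (description text, hexdump) pairs for rows whose link text contains `site`.

    Hypothetical helper for illustration only.
    """
    tree = html.fromstring(page_source)
    # Same expression as in the answer, with the link text passed as an XPath variable.
    cells = tree.xpath(
        '//td[./a[contains(text(), $site)]]/following-sibling::td[2]',
        site=site,
    )
    results = []
    for cell in cells:
        # Join all text nodes, then collapse whitespace (including non-breaking spaces).
        text = " ".join("".join(cell.xpath(".//text()")).split())
        # The hexdump span is absent on most rows, so this may be an empty string.
        hexdump = " ".join("".join(cell.xpath('.//span[@class="hexdump"]/text()')).split())
        results.append((text, hexdump))
    return results


if __name__ == "__main__":
    for text, hexdump in descriptions_for(SAMPLE):
        print("Description:", text)
        print("Hexdump:", hexdump)
```

Selecting the `td` element first and then running relative queries (`.//text()`, `.//span[...]`) keeps each lookup scoped to that one cell; a query starting with `//` would search the whole document again regardless of the element it is called on.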

Votes: 0
Original page content provided by Stack Overflow; translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/37010621
