Q: Parsing HTML table columns using bash

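Stack Overflow user

Asked 2015-09-09 21:29:06

2 answers, 2.5K views, 0 followers, 3 votes

I am trying to extract 3 columns from a table in HTML format. I need the hostname, product + region, and date added, which are columns 1, 3, and 4.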

<div class="table sectionedit2">
  <table class="inline">
    <tr class="row0">
      <th class="col0 centeralign">hostname</th>
      <th class="col1 centeralign">AKA (Client hostname)</th>
      <th class="col2 leftalign">Product + Region</th>
      <th class="col3 centeralign">date added</th>
      <th class="col4 centeralign">  decom. date  </th>
      <th class="col5 centeralign">           builder           </th>
      <th class="col6 centeralign">  build cross-checker  </th>
      <th class="col7 leftalign"> <strong>decommissioner</strong></th>
      <th class="col8 centeralign">customer managed filesystems</th>
      <th class="col9 centeralign">  only company has root?  </th>
    </tr>
    <tr class="row1">
      <th class="col0 centeralign">HostName01</th>
      <td class="col1 leftalign">Host01</td>
      <td class="col2 leftalign">EU</td>
      <td class="col3 centeralign">2007-01-01</td>
      <td class="col4 leftalign"></td>
      <td class="col5 centeralign">Me</td>
      <td class="col6 centeralign">You</td>
      <td class="col7 leftalign">Builder01</td>
      <td class="col8 leftalign">xChecker01</td>
      <td class="col9 centeralign">yes</td>
    </tr>
   <tr class="row2">
     <th class="col0 centeralign">HostName02</th>
     <td class="col1 leftalign">Host02</td>
     <td class="col2 leftalign">U.S</td>
     <td class="col3 centeralign">2008-09-29</td>
     <td class="col4 leftalign"></td>
     <td class="col5 leftalign">Me01</td>
     <td class="col6 leftalign">You01</td>
     <td class="col7 leftalign">Builder02</td>
     <td class="col8 leftalign">xChecker02</td>
     <td class="col9 centeralign">yes</td>
   </tr>
  </table>
</div>

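I would like to get:

Hostname     Product + Region   Date added
HostName01   EU                 2007-01-01
HostName02   U.S                2008-09-29

Previously, I tried stripping the HTML tags and using awk, but some of the columns in the table are empty, so I do not get columns 1, 3, and 4 for every row.

I am now trying: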

xmllint --html --shell --format table.log <<< "cat //table/tr/th/td[1]/text()"

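This gives me the second column. I tried another expression, but it does not work, and I am not sure how to get multiple columns at once.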

Tags: xmllint, xml, bash, xpath, html-parsing

2 Answers

Stack Overflow user

Answered 2015-09-10 03:57:47

You can do the following:

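Run xmllint --xpath with an XPath expression that uses position()= to select only columns 1, 3, and 4: //table/tr/*[position()=1 or position()=3 or position()=4]
Pipe to perl -pe "s/<th class=\"col0/\n<th class=\"col0/g", etc., to strip the markup and split the output into separate lines
Pipe through grep -v '^\s*$' to strip blank lines
Pipe through column -t -s '|' at the end to pretty-print the table

Like this: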

xmllint --html \
  --xpath "//table/tr/*[position()=1 or position()=3 or position()=4]" \
    table.log \
    | perl -pe "s/<th class=\"col0/\n<th class=\"col0/g" \
    | perl -pe 's/<tr[^>]+>//' \
    | perl -pe 's/<\/tr>//' \
    | perl -pe 's/<t[dh][^>]*>//' \
    | perl -pe 's/<\/t[dh]><t[dh][^>]*>/|/g' \
    | perl -pe 's/<\/t[dh]>//' \
    | grep -v '^\s*$' \
    | column -t -s '|'

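The above assumes the HTML document is in the file table.log (which seems like an odd name for an HTML file, but it appears to be the name used in the question). If the document is actually in some other *.html file, just substitute the actual filename.

That gives you output like this: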

hostname    Product + Region  date added
HostName01  EU                2007-01-01
HostName02  U.S               2008-09-29
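
The pipeline above relies on how xmllint serializes the matched nodes, which can differ between libxml2 versions. As a rough alternative sketch (not part of the original answer), each cell can be requested separately as a string, so empty cells cannot shift the columns. The function name extract_columns is hypothetical, and the loop runs xmllint once per cell, so it is only reasonable for small tables.

```shell
#!/usr/bin/env bash
# Sketch only: one xmllint call per cell, so empty cells cannot shift columns.
# Assumption: the table rows are direct children of <table>, as in the question.

extract_columns() {
  local file=$1 rows i col line
  # count() yields a number, which xmllint prints as plain text.
  rows=$(xmllint --html --xpath 'count(//table/tr)' "$file" 2>/dev/null)
  for ((i = 1; i <= ${rows:-0}; i++)); do
    line=
    for col in 1 3 4; do
      # normalize-space() returns the trimmed text of the cell, or an
      # empty string when the cell is empty or missing.
      line+=$(xmllint --html \
        --xpath "normalize-space((//table/tr)[$i]/*[$col])" "$file" 2>/dev/null)
      line+='|'
    done
    # Drop the trailing separator and emit one row per line.
    printf '%s\n' "${line%|}"
  done
}

# Usage: ./extract.sh table.log
if [[ $# -gt 0 ]]; then
  extract_columns "$1" | column -t -s '|'
fi
```

Using string-valued XPath expressions means xmllint prints plain text rather than markup, so no perl clean-up step is needed.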

Votes: 3

Stack Overflow user

Answered 2015-09-09 23:39:01

Assuming your HTML is well-formed XML, xmlstarlet can do it:

xmlstarlet sel -t -m '//table/tr' -v '*[contains(@class,"col0")]' -o $'\t' \
                                  -v '*[contains(@class,"col2")]' -o $'\t' \
                                  -v '*[contains(@class,"col3")]' -n       \
    file.html

hostname    Product + Region    date added
HostName01  EU  2007-01-01
HostName02  U.S 2008-09-29
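
If no XML tooling is available, the same select-by-class idea can be approximated with awk alone. This is only a sketch under a strong assumption: every th/td element sits on its own line and carries a colN class, as in the question's sample. The function name extract_by_class is hypothetical, and HTML entities are not decoded.

```shell
#!/usr/bin/env bash
# Sketch only: pick cells by their colN class, so empty cells keep their slot.
# Assumption: one <th>/<td> element per input line, each with class="colN ...".

extract_by_class() {
  awk '
    # Print the collected row (columns col0, col2, col3) and reset the state.
    function flush() {
      if (seen) print cell[0] "\t" cell[2] "\t" cell[3]
      seen = 0
      split("", cell)
    }
    /<tr[ >]/ { flush() }                       # a new row starts
    /<t[hd] class="col[0-9]/ {
      cls = $0
      sub(/.*class="col/, "", cls)              # keep what follows "col"
      sub(/[^0-9].*/, "", cls)                  # keep only the column number
      txt = $0
      gsub(/<[^>]*>/, "", txt)                  # drop all tags
      gsub(/^[ \t]+|[ \t]+$/, "", txt)          # trim surrounding blanks
      cell[cls + 0] = txt
      seen = 1
    }
    END { flush() }                             # last row, even without </tr>
  ' "$@"
}

# Usage: ./by_class.sh table.log
if [[ $# -gt 0 ]]; then
  extract_by_class "$1" | column -t -s $'\t'
fi
```

Because cells are stored under their class number rather than their position in the stripped text, an empty cell simply produces an empty field instead of shifting the later columns.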

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/32480931
