
Q: Problem with XPath and links

Stack Overflow user

Asked 2013-03-05 18:16:24

Answers: 3 | Views: 101 | Followers: 0 | Votes: 1

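My first post here!

The problem I'm having is that I'm using XPath and TagSoup to parse web pages and read their data. Since these are news articles, they sometimes have links embedded in the content, and those are what is tripping up my program.

The XPath I'm using is storyPath = "//html:article//html:p//text()";, where the page is structured like this: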

<article ...>
   <p>Some text from the story.</p>
   <p>More of the story, which proves <a href="">what a great story this is</a>!</p>
   <p>More of the story without links!</p>
</article>

My code for the XPath evaluation:

NodeList nL = XPathAPI.selectNodeList(doc,storyPath);

LinkedList<String> story = new LinkedList<String>();
for (int i=0; i<nL.getLength(); i++) {
    Node n = nL.item(i);

    String tmp = n.toString();
    tmp = tmp.replace("[#text:", "");
    tmp = tmp.replace("]", "");
    tmp = tmp.replaceAll("‚Äô", "'");
    tmp = tmp.replaceAll("‚Äò", "'");
    tmp = tmp.replaceAll("‚Äì", "-");
    tmp = tmp.replaceAll("¬", "");
    tmp = tmp.trim();

    story.add(tmp);
}

this.setStory(story);
...

private void setStory(LinkedList<String> story) {
    String tmp = "";
    for (String p : story) {
        tmp = tmp + p + "\n\n";
    }

    this.story = tmp.trim();
}

This gives me the following output:

Some text from the story.

More of the story, which proves 

what a great story this is

!

More of the story without links!

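Does anyone have a way to eliminate this error? Am I taking the wrong approach somewhere? (I understand I could probably handle it in the setStory code, but I don't see another way.)

Without the tmp.replace() code, all the results look like [#text: what a great story this is] and so on.

EDIT:

I'm still having trouble, though possibly of a different kind. What is giving me a headache here is again a link, but because of the way the BBC lays out its pages, the link sits on a separate line, so the text still comes out with the same problem described above (note that the original problem was fixed by the example given). The relevant part of the BBC page is: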

    <p>    Former Queens Park Rangers trainee Sterling, who 

    <a  href="http://news.bbc.co.uk/sport1/hi/football/teams/l/liverpool/8541174.stm" >moved to the Merseyside club in February 2010 aged 15,</a> 

    had not started a senior match for the Reds before this season.
    </p>

It shows up in my output as:

    Former Queens Park Rangers trainee Sterling, who 

    moved to the Merseyside club in February 2010 aged 15, 

         had not started a senior match for the Reds before this season.

tag-soup

java

xpath

3 Answers

Stack Overflow user

Accepted answer

Answered 2013-03-05 23:07:18

Use System.out.println(text.trim().replaceAll("[ \t\r\n]+", " ")); instead of System.out.print(text.trim());
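As a rough illustration of that whitespace normalization, here is a minimal sketch applied to the BBC paragraph from the question. The class name WhitespaceNormalizer and the method normalize are hypothetical names chosen for this example; only the regular expression comes from the answer.

```java
import java.util.regex.Pattern;

class WhitespaceNormalizer {
    // Same character class as in the answer: runs of spaces, tabs, CRs and LFs.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("[ \\t\\r\\n]+");

    // Trims the text and collapses every internal whitespace run to one space.
    static String normalize(String text) {
        return WHITESPACE_RUN.matcher(text.trim()).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Text content of the BBC <p> element, including its line breaks.
        String raw = "    Former Queens Park Rangers trainee Sterling, who \n\n"
                + "    moved to the Merseyside club in February 2010 aged 15, \n\n"
                + "    had not started a senior match for the Reds before this season.\n    ";
        System.out.println(normalize(raw));
    }
}
```

Precompiling the pattern is only a convenience when many paragraphs are processed; calling replaceAll on the string directly, as in the answer, behaves the same way.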

Votes: 1

Stack Overflow user

Answered 2013-03-05 20:03:44

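First find the paragraphs: storyPath = "//html:article//html:p", then for each paragraph use another XPath query to extract all of its text, concatenate the pieces without new lines, and add two new lines at the end of the paragraph.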
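The two-step approach could look roughly like the sketch below. It rests on several assumptions: it uses the standard javax.xml.xpath API instead of XPathAPI, it parses a small well-formed string instead of TagSoup output, and it drops the html: namespace prefix because the sample markup has no namespace. ParagraphExtractor and extractStory are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ParagraphExtractor {
    // Returns the story text with one blank line between paragraphs.
    static String extractStory(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Step 1: select the paragraph elements, not their text nodes.
            NodeList paragraphs = (NodeList) xpath.evaluate(
                    "//article//p", doc, XPathConstants.NODESET);

            List<String> story = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Node p = paragraphs.item(i);
                // Step 2: select all text nodes below this one paragraph.
                NodeList texts = (NodeList) xpath.evaluate(
                        ".//text()", p, XPathConstants.NODESET);
                StringBuilder paragraph = new StringBuilder();
                for (int j = 0; j < texts.getLength(); j++) {
                    // Concatenate without inserting any line breaks.
                    paragraph.append(texts.item(j).getNodeValue());
                }
                story.add(paragraph.toString().trim());
            }
            return String.join("\n\n", story);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<article>"
                + "<p>Some text from the story.</p>"
                + "<p>More of the story, which proves "
                + "<a href=\"\">what a great story this is</a>!</p>"
                + "<p>More of the story without links!</p>"
                + "</article>";
        System.out.println(extractStory(xml));
    }
}
```

With the real page, the XPath expressions would need the html: prefix and whatever namespace handling the existing XPathAPI code already relies on.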

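Also, you shouldn't be using replaceAll("‚Äô", "'"). It is a clear sign that you are not opening your file correctly. When you open the file, you need to pass a Reader to TagSoup. You should initialize the reader like this: Reader r = new BufferedReader(new InputStreamReader(new FileInputStream("myfilename.html"),"Cp1252"));, where you specify the correct character set for the file. Here is a list of character sets: http://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html My guess is that it is Windows Latin 1.

Votes: 1

Stack Overflow user

Answered 2013-03-05 20:24:43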

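[#text: is simply the toString() representation of a DOM text node. The toString() method is meant for cases where you need a string representation of a node for debugging. Use getTextContent(), which returns the actual text, instead of toString().

If you don't want the link content to appear on separate lines, you can remove //text() from the XPath and take the textContent of the element nodes directly (getTextContent() on an element returns the concatenation of all its descendant text nodes):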

String storyPath = "//html:article//html:p";
NodeList nL = XPathAPI.selectNodeList(doc,storyPath);

LinkedList<String> story = new LinkedList<String>();
for (int i=0; i<nL.getLength(); i++) {
    Node n = nL.item(i);
    story.add(n.getTextContent().trim());
}

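The fact that you have to manually fix things like "‚Äô" suggests that your HTML is actually encoded in UTF-8 but you are reading it with a single-byte character set such as Windows-1252. Rather than trying to patch it up after the fact, you should first work out how to read the data in the correct encoding.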
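To make the encoding point concrete, the sketch below decodes the UTF-8 bytes of a right single quotation mark (U+2019) once correctly and once with windows-1252. EncodingDemo and decode are hypothetical names. The garbled result here will not necessarily match the "‚Äô" from the question, since the exact garbage depends on which single-byte charset was actually used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Decodes raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // U+2019 is three bytes in UTF-8: E2 80 99.
        byte[] utf8Bytes = "\u2019".getBytes(StandardCharsets.UTF_8);

        // Correct charset: one character comes back.
        String right = decode(utf8Bytes, StandardCharsets.UTF_8);

        // Wrong single-byte charset: each byte becomes its own character.
        String wrong = decode(utf8Bytes, Charset.forName("windows-1252"));

        System.out.println("UTF-8 decoding yields " + right.length() + " character(s)");
        System.out.println("windows-1252 decoding yields " + wrong.length() + " character(s)");
    }
}
```

The same reasoning applies to the Reader handed to the parser: if it is built with the charset the page was really written in, the replaceAll workarounds should no longer be needed.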

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/15221214
