I have HTML source that I fetched from a website with the Jsoup.connect() method. Below is an excerpt from that HTML source (link: https://learn.microsoft.com/en-us/visualstudio/install/workload-component-id-vs-community)
.....
<p>When you set dependencies in your VSIX manifest, you must specify Component IDs
only. Use the tables on this page to determine our minimum component dependencies.
In some scenarios, this might mean that you specify only one component from a workload.
In other scenarios, it might mean that you specify multiple components from a single
workload or multiple components from multiple workloads. For more information, see
the
<a href="../extensibility/how-to-migrate-extensibility-projects-to-visual-studio-2017" data-linktype="relative-path">How to: Migrate Extensibility Projects to Visual Studio 2017</a> page.</p>
.....
<h2 id="visual-studio-core-editor-included-with-visual-studio-community-2017">Visual Studio core editor (included with Visual Studio Community 2017)</h2>
.....
<h2 id="see-also">See also</h2>
.....
What I want to do with jsoup is remove every piece of HTML before <h2 id="visual-studio-core-editor-included-with-visual-studio-community-2017">Visual Studio core editor (included with Visual Studio Community 2017)</h2>, as well as everything from <h2 id="see-also">See also</h2> onward (inclusive).
I have the following solution, but it barely works for me:
try {
document = Jsoup.connect(Constants.URL).get();
}
catch (IOException iex) {
iex.printStackTrace();
}
document = Parser.parse(document.toString().replaceAll(".*?<a href=\"workload-and-component-ids\" data-linktype=\"relative-path\">Visual Studio 2017 Workload and Component IDs</a> page.</p>", "") , Constants.URL);
document = Parser.parse(document.toString().replaceAll("<h2 id=\"see-also\">See also</h2>?.*", "") , Constants.URL);
return null;
Any help would be greatly appreciated.
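One likely reason the replaceAll calls above have little effect: in Java regular expressions, `.` does not match line terminators unless DOTALL mode is enabled, so `.*?` and `.*` stop at the end of a single line of the serialized HTML. The sketch below illustrates this with the standard library only. The class name `DotAllDemo`, the helper names, and the sample HTML are hypothetical and not part of the original code.

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    // Removes everything from the start of html through the first occurrence of marker.
    static String stripThrough(String html, String marker) {
        // (?s) enables DOTALL so '.' also matches newlines;
        // Pattern.quote makes the marker match as literal text.
        return html.replaceFirst("(?s).*?" + Pattern.quote(marker), "");
    }

    // Removes marker and everything after it.
    static String stripFrom(String html, String marker) {
        return html.replaceFirst("(?s)" + Pattern.quote(marker) + ".*", "");
    }

    public static void main(String[] args) {
        // Hypothetical multi-line sample standing in for the real page.
        String html = "<p>intro</p>\n<h2 id=\"a\">A</h2>\n<p>body</p>\n"
                + "<h2 id=\"see-also\">See also</h2>\n<p>links</p>";

        // Without DOTALL, only text on the marker's own line is removed.
        System.out.println("a\nb<x>".replaceAll(".*?<x>", "").equals("a\n"));

        // With DOTALL, the match can span lines.
        System.out.println(stripFrom(html, "<h2 id=\"see-also\">"));
    }
}
```

The first regex in the question also targets link text that does not appear in the excerpt shown, so it may not match at all regardless of the flag.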
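Answered 2017-05-12 14:34:56
A simple approach: take the page's entire HTML as a string, create a substring containing only the part you need, and parse that substring with jsoup again.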
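Document doc = Jsoup.connect("https://learn.microsoft.com/en-us/visualstudio/install/workload-component-id-vs-community").get();
// Serialize once instead of calling doc.html() repeatedly.
String fullHtml = doc.html();
// The id value starts 8 characters after the tag opens (the length of <h2 id="), so subtract 8 to land on "<".
int start = fullHtml.indexOf("visual-studio-core-editor-included-with-visual-studio-community-2017") - 8;
int end = fullHtml.indexOf("unaffiliated-components") - 8;
Document doc2 = Jsoup.parse(fullHtml.substring(start, end));
System.out.println(doc2);
Answered 2017-05-12 16:35:31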
I made a small change to @Eritrean's answer above. A few minor modifications were needed to get the output I wanted.
// Serialize once, then cut from 26 characters before the start id to 8 characters before the "see-also" id.
String html = document.html();
document = Jsoup.parse(html.substring(
        html.indexOf("visual-studio-core-editor-included-with-visual-studio-community-2017") - 26,
        html.indexOf("see-also") - 8));
System.out.println(document);
https://stackoverflow.com/questions/43935780
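The fixed offsets in both answers (-8 and -26) depend on the exact markup in front of each id. As a rough sketch of a less brittle variant, the helper below searches backwards from the id attribute to the opening `<h2` instead of subtracting a constant. It uses only the standard library. The names `HtmlSlice`, `sliceBetweenHeadings`, and `tagStart` are hypothetical, and it assumes both ids sit on h2 elements written as `id="..."`.

```java
public class HtmlSlice {
    /**
     * Returns html from the opening "<h2" of the heading with id startId
     * up to, but not including, the opening "<h2" of the heading with id endId.
     */
    static String sliceBetweenHeadings(String html, String startId, String endId) {
        int start = tagStart(html, startId);
        int end = tagStart(html, endId);
        if (start < 0 || end < 0 || end < start) {
            throw new IllegalArgumentException("headings not found in the expected order");
        }
        return html.substring(start, end);
    }

    // Index of the nearest "<h2" before the attribute id="<id>", or -1 if either is absent.
    static int tagStart(String html, String id) {
        int attr = html.indexOf("id=\"" + id + "\"");
        return attr < 0 ? -1 : html.lastIndexOf("<h2", attr);
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the real page source.
        String html = "<p>intro</p><h2 id=\"core\">Core</h2><p>body</p>"
                + "<h2 id=\"see-also\">See also</h2><p>links</p>";
        System.out.println(sliceBetweenHeadings(html, "core", "see-also"));
    }
}
```

The returned string can then be handed to Jsoup.parse(...) as in the answers above. Failing loudly when a heading is missing avoids the StringIndexOutOfBoundsException that substring would throw on a negative index.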