文章/答案/技术大牛

发布

社区首页 >问答首页 >Jsoup WhiteList允许注释

问Jsoup WhiteList允许注释
EN

Stack Overflow用户

提问于 2014-06-05 14:46:30

回答 2查看 2.4K关注 0票数 6

我使用的是带有Whitelist自定义配置的jsoup 1.7.3。

显然，它清除了文档中的所有HTML ()。

它还对<!DOCTYPE ...>元素进行消毒。

如何才能让jsoup Whitelist按原样允许评论呢？
如何将!DOCTYPE元素定义为具有任何属性的允许元素？

java

jsoup

回答 2

Stack Overflow用户

回答已采纳

发布于 2014-06-05 17:13:11

这在标准JSoup类中是不可能的，它不依赖于WhiteList。这是org.jsoup.safety.Cleaner。清洁器使用节点遍历器，它只允许元素和文本节点。也只有身体被解析。因此，头部和doctype被完全忽略。因此，要实现这一点，您必须创建一个自定义清洁剂。例如，如果您有类似的html

<!DOCTYPE html>
<html>
    <head>
    <!-- This is a script -->
    <script type="text/javascript">
        function newFun() {
            alert(1);
        }
    </script>
    </head>
    <body>
        <map name="diagram_map">
            <area id="area1" />
            <area id="area2" />
        </map>
        <!-- This is another comment. -->
        <div>Test</div>
    </body>
</html>

您将首先创建一个自定义清洁剂，复制原版清洁剂。但是，请注意包应该是org.jsoup.safety的，因为清洁剂使用了一些受保护的白名单方法。此外，扩展Cleaner没有意义，因为几乎所有方法都是私有的，内部节点遍历器是最终的。

package org.jsoup.safety;

import org.jsoup.helper.Validate;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.DocumentType;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Tag;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class CustomCleaner {
    private Whitelist whitelist;

    public CustomCleaner(Whitelist whitelist) {
    Validate.notNull(whitelist);
    this.whitelist = whitelist;
    }

    public Document clean(Document dirtyDocument) {
    Validate.notNull(dirtyDocument);

    Document clean = Document.createShell(dirtyDocument.baseUri());
    copyDocType(dirtyDocument, clean);
    if (dirtyDocument.head() != null) 
        copySafeNodes(dirtyDocument.head(), clean.head());
    if (dirtyDocument.body() != null) // frameset documents won't have a body. the clean doc will have empty body.
        copySafeNodes(dirtyDocument.body(), clean.body());

    return clean;
    }

    private void copyDocType(Document dirtyDocument, Document clean) {
    dirtyDocument.traverse(new NodeVisitor() {
        public void head(Node node, int depth) {
        if (node instanceof DocumentType) {
            clean.prependChild(node);
        }
        }
        public void tail(Node node, int depth) { }
    });
    }

    public boolean isValid(Document dirtyDocument) {
    Validate.notNull(dirtyDocument);

    Document clean = Document.createShell(dirtyDocument.baseUri());
    int numDiscarded = copySafeNodes(dirtyDocument.body(), clean.body());
    return numDiscarded == 0;
    }

    private final class CleaningVisitor implements NodeVisitor {
    private int numDiscarded = 0;
    private final Element root;
    private Element destination; // current element to append nodes to

    private CleaningVisitor(Element root, Element destination) {
        this.root = root;
        this.destination = destination;
    }

    public void head(Node source, int depth) {
        if (source instanceof Element) {
        Element sourceEl = (Element) source;

        if (whitelist.isSafeTag(sourceEl.tagName())) { // safe, clone and copy safe attrs
            ElementMeta meta = createSafeElement(sourceEl);
            Element destChild = meta.el;
            destination.appendChild(destChild);

            numDiscarded += meta.numAttribsDiscarded;
            destination = destChild;
        } else if (source != root) { // not a safe tag, so don't add. don't count root against discarded.
            numDiscarded++;
        }
        } else if (source instanceof TextNode) {
        TextNode sourceText = (TextNode) source;
        TextNode destText = new TextNode(sourceText.getWholeText(), source.baseUri());
        destination.appendChild(destText);
        } else if (source instanceof Comment) {
        Comment sourceComment = (Comment) source;
        Comment destComment = new Comment(sourceComment.getData(), source.baseUri());
        destination.appendChild(destComment);
        } else if (source instanceof DataNode) {
        DataNode sourceData = (DataNode) source;
        DataNode destData = new DataNode(sourceData.getWholeData(), source.baseUri());
        destination.appendChild(destData);
        } else { // else, we don't care about comments, xml proc instructions, etc
        numDiscarded++;
        }
    }

    public void tail(Node source, int depth) {
        if (source instanceof Element && whitelist.isSafeTag(source.nodeName())) {
        destination = destination.parent(); // would have descended, so pop destination stack
        }
    }
    }

    private int copySafeNodes(Element source, Element dest) {
    CleaningVisitor cleaningVisitor = new CleaningVisitor(source, dest);
    NodeTraversor traversor = new NodeTraversor(cleaningVisitor);
    traversor.traverse(source);
    return cleaningVisitor.numDiscarded;
    }

    private ElementMeta createSafeElement(Element sourceEl) {
    String sourceTag = sourceEl.tagName();
    Attributes destAttrs = new Attributes();
    Element dest = new Element(Tag.valueOf(sourceTag), sourceEl.baseUri(), destAttrs);
    int numDiscarded = 0;

    Attributes sourceAttrs = sourceEl.attributes();
    for (Attribute sourceAttr : sourceAttrs) {
        if (whitelist.isSafeAttribute(sourceTag, sourceEl, sourceAttr))
        destAttrs.put(sourceAttr);
        else
        numDiscarded++;
    }
    Attributes enforcedAttrs = whitelist.getEnforcedAttributes(sourceTag);
    destAttrs.addAll(enforcedAttrs);

    return new ElementMeta(dest, numDiscarded);
    }

    private static class ElementMeta {
    Element el;
    int numAttribsDiscarded;

    ElementMeta(Element el, int numAttribsDiscarded) {
        this.el = el;
        this.numAttribsDiscarded = numAttribsDiscarded;
    }
    }

}

一旦你两者兼得，你就可以像往常一样做清洁了。喜欢

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.CustomCleaner;
import org.jsoup.safety.Whitelist;

public class CustomJsoupSanitizer {

    public static void main(String[] args) {
        try {
            Document doc = Jsoup.parse(new File("t2.html"), "UTF-8");
            CustomCleaner cleaner = new CustomCleaner(Whitelist.relaxed().addTags("script"));
            Document doc2 = cleaner.clean(doc);
            System.out.println(doc2.html());
        } catch (IOException e) {
            e.printStackTrace();
        }

    }

}

这将为上面的html提供经过消毒的输出，如

<!DOCTYPE html>
<html>
 <head> 
  <!-- This is a script --> 
  <script>
        function newFun() {
            alert(1);
        }
  </script> 
 </head>
 <body>     
  <!-- This is another comment. --> 
  <div>
   Test
  </div>   
 </body>
</html>

您可以自定义清洁剂以满足您的要求。即为了避免头节点或脚本标签等.

票数 4

Stack Overflow用户

发布于 2014-06-05 16:57:51

更干净的在这里没有给你任何机会(1.100)：

} else { // else, we don't care about comments, xml proc instructions, etc
    numDiscarded++;
}

只有Element和TextNode的实例可以保留在清理后的HTML中。

您唯一的机会可能是一些可怕的事情，比如解析文档，用一个特殊的白色标记替换注释和doctype，清理文档，然后再解析和替换特殊的标记。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/24063299

复制

相似问题

问Jsoup WhiteList允许注释
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Jsoup WhiteList允许注释EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Jsoup WhiteList允许注释
EN