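I have read a few posts, such as "FileReader reads the file as a character stream" and "can be treated as whitespace if the document is handed as a stream of characters", whose answers say that the input source is actually a character stream rather than a byte stream.
However, the solution suggested in the first post does not seem to work for UTF-16LE. I am using the following code: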
// Open the file as a raw byte stream and let the parser detect the encoding.
try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(new InputSource(is));
    return parser.getDocument();
} catch (final SAXParseException saxEx) {
    LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}
Even so, I still get org.xml.sax.SAXParseException: Content is not allowed in prolog.
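I looked at Files.newInputStream, and it does use a ChannelInputStream, which passes along bytes, not characters. I also tried setting the encoding on the InputSource object, without success. I also checked that there are no extra characters before the <?xml part (apart from the BOM).
I would also like to mention that this code works fine with UTF-8.
// Edit: I also tried DocumentBuilderFactory.newInstance().newDocumentBuilder().parse() and XmlInputStreamReader.next(), with the same result.
// Edit 2: Tried using a BufferedReader. Same result: Unexpected character '뿯' (code 49135 / 0xbfef) in prolog; expected '<'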
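The character code in that last message may itself be a clue worth checking. 0xBFEF is what the two bytes EF BF become when read as one little-endian UTF-16 code unit, and EF BF BD is the UTF-8 encoding of U+FFFD, the replacement character. That could mean the file no longer starts with a real UTF-16LE BOM (FF FE), though that is only a guess until the raw bytes are inspected. A minimal sketch for doing so follows; the class and method names (BomSniffer, describeBom, hexPrefix) are hypothetical and not part of the original code.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomSniffer {

    /** Names the byte-order mark at the start of the data, or returns "none". */
    static String describeBom(byte[] b) {
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        // Note: a UTF-32LE BOM (FF FE 00 00) would also match the next line.
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";
        return "none";
    }

    /** True if the data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    /** Hex dump of at most the first n bytes, separated by spaces. */
    static String hexPrefix(byte[] b, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(n, b.length); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", b[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // With a path argument, inspect that file; otherwise use an in-memory sample.
        byte[] bytes = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "\uFEFF<?xml version=\"1.0\"?><a/>".getBytes(StandardCharsets.UTF_16LE);
        System.out.println("BOM: " + describeBom(bytes));
        System.out.println("First bytes: " + hexPrefix(bytes, 8));
    }
}
```

If the dump starts with EF BF BD instead of FF FE, the BOM was probably replaced somewhere upstream, for example by a tool that decoded the file as UTF-8 and wrote it back.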
Thanks in advance.
Posted on 2019-09-10 21:21:56
To gather some more in-depth information:
// Needs java.util.regex.Matcher and java.util.regex.Pattern in addition to the imports already used.
byte[] bytes = Files.readAllBytes(filename.toPath());
String xml = new String(bytes, StandardCharsets.UTF_16LE);
if (xml.startsWith("\uFEFF")) {
    LOG.info("Has BOM and is evidently UTF_16LE");
    xml = xml.substring(1); // drop the BOM character
}
if (!xml.contains("<?xml")) {
    LOG.info("Has no XML declaration");
}
// Extract the encoding named in the XML declaration; fall back to UTF-8 if none is declared.
Matcher m = Pattern.compile("<\\?xml[^>]*encoding=[\"']([^\"']+)[\"']").matcher(xml);
String declaredEncoding = m.find() ? m.group(1) : "UTF-8";
LOG.info("Declared as " + declaredEncoding);
// Re-encode the BOM-free text in the declared encoding and parse that.
try (final InputStream is = new ByteArrayInputStream(xml.getBytes(declaredEncoding))) {
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(new InputSource(is));
    return parser.getDocument();
} catch (final SAXParseException saxEx) {
    LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}
https://stackoverflow.com/questions/57870212
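If the diagnostics confirm that the bytes really are UTF-16LE, one option is to skip encoding detection altogether: decode the bytes yourself, drop a leading U+FEFF, and hand the parser a character stream. The sketch below is only an illustration under assumptions. It uses the JDK's built-in DocumentBuilder rather than the Xerces DOMParser so that it runs standalone, and the class name ExplicitDecodeParse and its parse helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ExplicitDecodeParse {

    /** Decodes the stream with the given charset, drops a leading U+FEFF, and parses the characters. */
    static Document parse(InputStream in, Charset charset) throws Exception {
        PushbackReader reader = new PushbackReader(new InputStreamReader(in, charset), 1);
        int first = reader.read();
        if (first != 0xFEFF && first != -1) {
            reader.unread(first); // not a BOM: put the character back
        }
        // An InputSource built from a Reader is a character stream, so the parser
        // does not have to guess the byte encoding.
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        // In-memory stand-in for a UTF-16LE file that starts with a BOM.
        String xml = "\uFEFF<?xml version=\"1.0\" encoding=\"UTF-16\"?><root>ok</root>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_16LE);
        Document doc = parse(new ByteArrayInputStream(bytes), StandardCharsets.UTF_16LE);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

The trade-off is that the charset is now hard-coded by the caller, so this only helps when the encoding is known in advance or has been established by a check like the one above.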