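The MS IFilter supplied for HTML files, nlhtml.dll (file version 2008.0.9600.17415), sometimes fails to extract only the text outside the tags and returns tag content as well, although it works correctly on other HTML files. The IFilter is called from the C# text extractor at https://github.com/Sicos1977/IFilterTextReader.
The IFilter is initialized with these flags: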
const NativeMethods.IFILTER_INIT iflags = NativeMethods.IFILTER_INIT.CANON_HYPHENS |
NativeMethods.IFILTER_INIT.CANON_PARAGRAPHS |
NativeMethods.IFILTER_INIT.CANON_SPACES |
NativeMethods.IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
NativeMethods.IFILTER_INIT.HARD_LINE_BREAKS |
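NativeMethods.IFILTER_INIT.FILTER_OWNED_VALUE_OK;
However, the text extracted for indexed search seems fine, and I assume that uses the same IFilter. How can I use nlhtml.dll to extract only the text outside the tags, as MS apparently manages to do when building the search index?
The beginning of a file that extracts correctly looks like this: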
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-16">
<title>Electronic Activity Statement</title>
</head>
<body>
<H5>
Main Name: ALEKA CONSULTING PTY LTD<BR>
ABN: 89 160 421 821<BR><BR>
And here is the beginning of a file where the extraction includes tag content:
<!DOCTYPE html>
<!--[if IE 6]>
<html class="no-js" id="ie6" dir="ltr" lang="en-US">
<![endif]-->
<!--[if IE 7]>
<html class="no-js" id="ie7" dir="ltr" lang="en-US">
<![endif]-->
<!--[if IE 8]>
<html class="no-js" id="ie8" dir="ltr" lang="en-US">
<![endif]-->
<!--[if !(IE 6) | !(IE 7) | !(IE 8) ]><!-->
<html class="no-js" dir="ltr" lang="en-US">
<!--<![endif]-->
The text extracted from this file includes tag content and begins with:
[if IE 6]> <html class="no-js" id="ie6" dir="ltr" lang="en-US"> <![endif]
[if IE 7]> <html class="no-js" id="ie7" dir="ltr" lang="en-US"> <![endif]
[if IE 8]> <html class="no-js" id="ie8" dir="ltr" lang="en-US"> <![endif]
[if !(IE 6) | !(IE 7) | !(IE 8) ]><! <![endif] Mirrored from
nrha.org.au/12nrhc/musical-delegates-wanted/?pfstyle=wp by HTTrack
Website Copier/3.x [XR&CO'2013], Sat, 27 Jun 2015 11:36:14 GMT
Added by HTTrack /Added by HTTrack
Posted on 2015-06-29 07:31:05
Extraction of tag content from some HTML files was prevented by removing APPLY_INDEX_ATTRIBUTES from the IFilter flags, as follows:
const NativeMethods.IFILTER_INIT iflags = NativeMethods.IFILTER_INIT.CANON_HYPHENS |
NativeMethods.IFILTER_INIT.CANON_PARAGRAPHS |
NativeMethods.IFILTER_INIT.CANON_SPACES |
//NativeMethods.IFILTER_INIT.APPLY_INDEX_ATTRIBUTES |
NativeMethods.IFILTER_INIT.HARD_LINE_BREAKS |
NativeMethods.IFILTER_INIT.FILTER_OWNED_VALUE_OK;
https://stackoverflow.com/questions/31098099
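As a hedged sketch of the same fix, the flag can also be cleared from an existing flag set rather than commented out. The `IFilterInit` enum below is a local stand-in declared only for illustration; it is not the `NativeMethods.IFILTER_INIT` type from IFilterTextReader, and it is assumed to carry the numeric values of the native IFILTER_INIT enumeration in filter.h. As far as the IFILTER_INIT documentation describes it, APPLY_INDEX_ATTRIBUTES asks the filter to also emit chunks for attributes meant for indexing, which would be consistent with the extra markup-like output seen above.

```csharp
using System;

// Local stand-in for NativeMethods.IFILTER_INIT (an assumption for this sketch).
// Values follow the native IFILTER_INIT enumeration; only the flags used in
// the question are listed.
[Flags]
enum IFilterInit : uint
{
    CANON_PARAGRAPHS = 1,
    HARD_LINE_BREAKS = 2,
    CANON_HYPHENS = 4,
    CANON_SPACES = 8,
    APPLY_INDEX_ATTRIBUTES = 16,
    FILTER_OWNED_VALUE_OK = 512
}

static class Program
{
    static void Main()
    {
        // Flag set from the question, including APPLY_INDEX_ATTRIBUTES.
        IFilterInit original = IFilterInit.CANON_HYPHENS |
                               IFilterInit.CANON_PARAGRAPHS |
                               IFilterInit.CANON_SPACES |
                               IFilterInit.APPLY_INDEX_ATTRIBUTES |
                               IFilterInit.HARD_LINE_BREAKS |
                               IFilterInit.FILTER_OWNED_VALUE_OK;

        // Clear only the APPLY_INDEX_ATTRIBUTES bit; every other flag is kept.
        IFilterInit textOnly = original & ~IFilterInit.APPLY_INDEX_ATTRIBUTES;

        Console.WriteLine(original);
        Console.WriteLine(textOnly);
        // The cleared bit is no longer set, so this prints False.
        Console.WriteLine(textOnly.HasFlag(IFilterInit.APPLY_INDEX_ATTRIBUTES));
    }
}
```

Clearing the bit with `& ~` keeps the remaining flags in one place, which may be convenient if the flag set has to differ between plain-text extraction and index-style extraction.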