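I need to extract text and emojis from HTML (I have no control over the HTML I receive). Stripping the HTML tags with the function below is fairly simple; however, it also removes the emojis embedded in <img> tags. The result should be plain text plus emojis.
I don't care much about whitespace, but the cleaner the better.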
// this cleans the HTML quite well, but I need to extend it to keep the emojis
const stripTags = (html: string, ...args: string[]): string => {
  return html.replace(/<(\/?)(\w+)[^>]*\/?>/g, (_, endMark, tag) => {
    return args.includes(tag) ? "<" + endMark + tag + ">" : ""
  }).replace(/<!--.*?-->/g, "")
}

<div>
<div class="text-bold">
<span dir="auto">
<div>
<div dir="auto" style="text-align: start;">Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.</div>
</div>
<div class="">
<div dir="auto" style="text-align: start;">he's (mostly) a Beagle and Jack Russell mix.</div>
</div>
<div class="">
<div dir="auto" style="text-align: start;">
<span class=""><img height="16" width="16" alt="" src="https://someweb.com/images/emoji/bpp/2/16/1f415.png"></span> : @House…
<div class="" role="button" tabindex="0">Something else</div>
</div>
</div>
</span>
</div>
</div>

Expected output:
Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.
he's (mostly) a Beagle and Jack Russell mix. : @House… Something else.

Posted on 2022-08-19 10:32:19
If you would rather use a DOM parser than a plain regexp and get more control over the HTML, here is one way to implement this example:
const htmlString = "<div>your content...</div>";

const toRawString = (htmlString) => {
  if (!htmlString) {
    return null;
  }
  const parser = new DOMParser();
  const parsedHTML = parser.parseFromString(htmlString, "text/html");
  // Get all images and keep only the alt attribute content.
  // If you need data from other attributes, you can reuse the same loop.
  const images = parsedHTML.querySelectorAll("img");
  images.forEach((image) => {
    // Replace the image in place with a text node holding its alt text,
    // so the emoji stays where the image was instead of moving to the end of its parent.
    image.replaceWith(parsedHTML.createTextNode(image.alt));
  });
  // Collapse runs of two or more whitespace characters into a single space
  return parsedHTML.body.textContent.replace(/\s\s+/g, " ");
};

Posted on 2022-08-19 00:35:20
It is best to use a DOM parser and walk the DOM to extract the plain text.
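As a rough illustration of that DOM walk, here is a minimal sketch. The function names `collectText` and `toPlainText` are hypothetical, and the code only assumes node objects that expose the standard `nodeType`, `nodeValue`, `tagName`, `childNodes` and `getAttribute` members, so it is not tied to any particular parser.

```javascript
// Node type constants as defined by the DOM standard.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: recursively gather text from a DOM-like node.
// Text nodes contribute their value, <img> elements contribute their alt text,
// and every other node contributes whatever its children contribute.
function collectText(node) {
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue;
  }
  if (node.nodeType === ELEMENT_NODE && node.tagName.toUpperCase() === "IMG") {
    // Pad with spaces so the emoji does not fuse with neighbouring words.
    return " " + (node.getAttribute("alt") || "") + " ";
  }
  let text = "";
  for (const child of Array.from(node.childNodes || [])) {
    text += collectText(child);
  }
  return text;
}

// Hypothetical wrapper: walk the tree, then collapse all whitespace runs.
function toPlainText(root) {
  return collectText(root).replace(/\s+/g, " ").trim();
}
```

In a browser, this could presumably be called as `toPlainText(new DOMParser().parseFromString(html, "text/html").body)`. Comment nodes have no children and are not text nodes, so they contribute nothing.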
Here is a regex-based solution, if you can accept that it may fail in some corner cases.
let html = `<div>
<div class="text-bold">
<span dir="auto">
<div>
<div dir="auto" style="text-align: start;">Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.</div>
</div>
<div class="">
<div dir="auto" style="text-align: start;">he's (mostly) a Beagle and Jack Russell mix.</div>
</div>
<div class="">
<div dir="auto" style="text-align: start;">
<span class=""><img height="16" width="16" alt="" src="https://someweb.com/images/emoji/bpp/2/16/1f415.png"></span> : @House…
<div class="" role="button" tabindex="0">Something else</div>
</div>
</div>
</span>
</div>
</div>`;
let plain = html
  .replace(/<img .*?alt="([^"]+)"[^>]*>/g, ' $1 ') // extract alt text from img tag
  .replace(/<\/?[a-z][^>]*>/g, ' ') // remove all tags
  .replace(/\s+/g, ' ').trim(); // cleanup whitespace
console.log(plain);

Output:
Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin. he's (mostly) a Beagle and Jack Russell mix. : @House… Something else
Note that this is not foolproof. For example, it does not handle corner cases such as <span title="Home > Edit">, which should be written as <span title="Home &gt; Edit"> but is not always.
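To make that corner case concrete, the sketch below wraps the same regex chain in a function (the name `htmlToPlain` is hypothetical) and runs it on an escaped and an unescaped attribute value. The tag-stripping regex stops at the first `>`, so the unescaped variant leaks the rest of the attribute into the output.

```javascript
// Hypothetical wrapper around the regex chain shown above.
function htmlToPlain(html) {
  return html
    .replace(/<img .*?alt="([^"]+)"[^>]*>/g, " $1 ") // extract alt text from img tag
    .replace(/<\/?[a-z][^>]*>/g, " ")                // remove all tags
    .replace(/\s+/g, " ")                            // cleanup whitespace
    .trim();
}

// The ">" inside the attribute is escaped, so the tag is removed as a whole.
const escaped = htmlToPlain('<span title="Home &gt; Edit">Go</span>');
console.log(escaped); // Go

// A raw ">" ends the match early and the tail of the attribute survives.
const unescaped = htmlToPlain('<span title="Home > Edit">Go</span>');
console.log(unescaped); // Edit">Go
```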
Edit
If you use jQuery, it is easy to traverse the elements in memory:
let html = `<div>...</div>`; // (same as above)
let elem = $(html); // create jQuery element in memory
elem.find('img').replaceWith(function() {
  // fall back to an empty string when an image has no alt attribute
  return ' ' + ($(this).attr('alt') || '') + ' ';
});
let plain = elem.text().replace(/\s+/g, ' ').trim();

Result:
Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin. he's (mostly) a Beagle and Jack Russell mix. : @House… Something else
https://stackoverflow.com/questions/73410413