
Extract text and emoji from HTML

Stack Overflow user
Asked 2022-08-18 23:27:08
Answers: 2 | Views: 128 | Followers: 0 | Votes: 0

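I need to extract text and emoji from HTML (I have no control over the HTML I receive). Stripping the HTML tags with the function below is fairly simple; however, it also removes the emoji, which are embedded in <img> tags. The result should be plain text plus emoji.

I'm not too concerned about whitespace, but the cleaner the better.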

Code language: typescript
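// This cleans the HTML quite well, but I need to extend it to keep the emojis.
// Tags whose names are passed in `args` are kept (without attributes); all others are removed.
const stripTags = (html: string, ...args: string[]) => {
    return html.replace(/<(\/?)(\w+)[^>]*\/?>/g, (_, endMark, tag) => {
        return args.includes(tag) ? "<" + endMark + tag + ">" : ""
    }).replace(/<!--[\s\S]*?-->/g, "") // [\s\S] so that multi-line comments are removed too
}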
Code language: html
<div>
   <div class="text-bold">
      <span dir="auto">
         <div>
            <div dir="auto" style="text-align: start;">Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.</div>
         </div>
         <div class="">
            <div dir="auto" style="text-align: start;">he's (mostly) a Beagle and Jack Russell mix.</div>
         </div>
         <div class="">
            <div dir="auto" style="text-align: start;">
               <span class=""><img height="16" width="16" alt="🐕" src="https://someweb.com/images/emoji/bpp/2/16/1f415.png"></span> : @House… 
               <div class="" role="button" tabindex="0">Something else</div>
            </div>
         </div>
      </span>
   </div>
</div>

Expected output:

Code language: text
Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.
he's (mostly) a Beagle and Jack Russell mix. 🐕 : @House… Something else.

2 Answers

Stack Overflow user

Accepted answer

Posted 2022-08-19 10:32:19

If you want to use a DOM parser instead of a pure regexp, and gain more control over the HTML, here is how this example can be implemented:

Code language: javascript
const htmlString = "<div>your content...</div>";

const toRawString = (htmlString) => {
  if (!htmlString) {
    return null;
  }

  const parser = new DOMParser();
  const parsedHTML = parser.parseFromString(htmlString, "text/html");

  // Replace every image, in place, with a text node holding its alt text.
  // If you need data from other attributes, read it from `image` here.
  const images = parsedHTML.querySelectorAll("img");
  images.forEach((image) => {
    // A text node (rather than innerHTML) keeps the alt text from being re-parsed as HTML
    image.replaceWith(parsedHTML.createTextNode(image.alt));
  });

  // Collapse runs of two or more whitespace characters into a single space
  return parsedHTML.body.textContent.replace(/\s\s+/g, " ");
};
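
The comment above mentions reading data from other attributes. As one possible illustration, here is a minimal sketch for the case where an image has an empty alt but its file name looks like a hex code point, as in the `1f415.png` URL from the question. The helper name `emojiFromImageUrl` is hypothetical, and the idea that the file name encodes the emoji is an assumption about the image host, not something the page guarantees.

```javascript
// Hypothetical helper: derive an emoji from an image URL whose file name is a
// hex code point, or several joined by "-" or "_" (e.g. ".../1f415.png").
// This is a heuristic: a file such as "face.png" also looks like hex.
function emojiFromImageUrl(src) {
  const match =
    /\/([0-9a-f]{4,6}(?:[-_][0-9a-f]{4,6})*)\.(?:png|svg|gif|webp)(?:[?#].*)?$/i.exec(src);
  if (!match) {
    return "";
  }
  const codePoints = match[1].split(/[-_]/).map((hex) => parseInt(hex, 16));
  // String.fromCodePoint throws a RangeError above 0x10FFFF, so bail out first
  if (codePoints.some((cp) => cp > 0x10ffff)) {
    return "";
  }
  return String.fromCodePoint(...codePoints);
}
```

Inside the `forEach` above, this could serve as a fallback, for example `image.alt || emojiFromImageUrl(image.src)`.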
Votes: 1

Stack Overflow user

Posted 2022-08-19 00:35:20

It is best to use a DOM parser and walk the DOM to extract the plain text.
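
A rough sketch of what such a walk could look like, assuming the emoji sits in the `alt` attribute of each `<img>`. The function names `collectText` and `toPlainText` are made up for this illustration; the code only touches standard node properties (`nodeType`, `nodeValue`, `nodeName`, `childNodes`, `getAttribute`).

```javascript
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Recursively gather text pieces, substituting each <img> with its alt text.
function collectText(node, parts = []) {
  if (node.nodeType === TEXT_NODE) {
    parts.push(node.nodeValue);
  } else if (node.nodeType === ELEMENT_NODE) {
    const name = node.nodeName.toLowerCase();
    if (name === "img") {
      // Pad with spaces so the emoji does not stick to neighbouring words
      parts.push(" " + (node.getAttribute("alt") || "") + " ");
    } else if (name !== "script" && name !== "style") {
      for (const child of node.childNodes) {
        collectText(child, parts);
      }
    }
  }
  return parts;
}

// Join the pieces and collapse whitespace.
function toPlainText(root) {
  return collectText(root).join("").replace(/\s+/g, " ").trim();
}
```

In a browser this could be called as `toPlainText(new DOMParser().parseFromString(html, "text/html").body)`. Note that adjacent block elements are only separated in the output if the source has whitespace between them.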

Here is a regex-based solution, if you can accept that it may fail in some corner cases.

Code language: javascript
let html = `<div>
   <div class="text-bold">
      <span dir="auto">
         <div>
            <div dir="auto" style="text-align: start;">Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.</div>
         </div>
         <div class="">
            <div dir="auto" style="text-align: start;">he's (mostly) a Beagle and Jack Russell mix.</div>
         </div>
         <div class="">
            <div dir="auto" style="text-align: start;">
               <span class=""><img height="16" width="16" alt="🐕" src="https://someweb.com/images/emoji/bpp/2/16/1f415.png"></span> : @House… 
               <div class="" role="button" tabindex="0">Something else</div>
            </div>
         </div>
      </span>
   </div>
</div>`;

let plain = html
  // extract alt text from img tags; [^>]*? keeps the match inside a single tag
  .replace(/<img\b[^>]*?\balt="([^"]+)"[^>]*>/gi, ' $1 ')
  .replace(/<\/?[a-z][^>]*>/gi, ' ') // remove all tags
  .replace(/\s+/g, ' ').trim(); // cleanup whitespace
console.log(plain);

Output:

Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin. he's (mostly) a Beagle and Jack Russell mix. 🐕 : @House… Something else

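Note that this is not foolproof. For example, it does not handle corner cases such as <span title="Home > Edit">, which should be written as <span title="Home &gt; Edit">, but that is not always the case.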
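
If that particular corner case matters and a DOM parser is not available, one possible workaround is a small character scanner that tracks quoted attribute values, so a `>` inside quotes does not end the tag. The sketch below is only an illustration (the name `stripTagsQuoteAware` is made up); it skips comments, leaves `<script>` and `<style>` contents in place, and does not decode entities.

```javascript
// Remove tags while respecting quoted attribute values.
function stripTagsQuoteAware(html) {
  let out = "";
  let inTag = false;
  let quote = null; // quote character of the attribute value being scanned, if any
  for (let i = 0; i < html.length; i++) {
    const ch = html[i];
    if (inTag) {
      if (quote) {
        if (ch === quote) quote = null; // closing quote of the attribute value
      } else if (ch === '"' || ch === "'") {
        quote = ch;
      } else if (ch === ">") {
        inTag = false;
      }
    } else if (html.startsWith("<!--", i)) {
      // Skip the whole comment; an unterminated comment swallows the rest
      const end = html.indexOf("-->", i + 4);
      i = end === -1 ? html.length : end + 2;
      out += " ";
    } else if (ch === "<" && /[a-z\/!]/i.test(html[i + 1] || "")) {
      // Only treat "<" as a tag start when a letter, "/" or "!" follows
      inTag = true;
      out += " ";
    } else {
      out += ch;
    }
  }
  return out.replace(/\s+/g, " ").trim();
}

console.log(stripTagsQuoteAware('<span title="Home > Edit">Breadcrumb</span>'));
// prints "Breadcrumb"
```

With the same input, the regex `/<\/?[a-z][^>]*>/g` stops at the first `>` and leaves `Edit">` in the result.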

Edit

If you use jQuery, it is easy to traverse the elements in memory:

Code language: javascript
let html = `<div>...</div>`; // (same markup as in the regex example above)
let elem = $(html); // create jQuery element in memory
elem.find('img').replaceWith(function() {
  // fall back to an empty string when the image has no alt attribute
  return ' ' + ($(this).attr('alt') || '') + ' ';
});
let plain = elem.text().replace(/\s+/g, ' ').trim();

Result:

Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin. he's (mostly) a Beagle and Jack Russell mix. 🐕 : @House… Something else
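
One difference worth keeping in mind: `.text()` returns text with character references already decoded, while the regex chain leaves references such as `&amp;` or `&#x1F415;` untouched. If the regex route is used anyway, a minimal decoding step could look like the sketch below. The name `decodeEntities` is made up, and the table covers only a handful of named entities, so treat it as an assumption-laden illustration and not a complete decoder.

```javascript
// Small, deliberately incomplete table of named entities
const NAMED_ENTITIES = new Map([
  ["amp", "&"],
  ["lt", "<"],
  ["gt", ">"],
  ["quot", '"'],
  ["apos", "'"],
  ["nbsp", "\u00a0"],
]);

// Decode numeric references and the few named entities listed above.
// Unknown references are left as they are.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (whole, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return codePoint > 0 && codePoint <= 0x10ffff
        ? String.fromCodePoint(codePoint)
        : whole;
    }
    const named = NAMED_ENTITIES.get(body);
    return named === undefined ? whole : named;
  });
}
```

It would run as the last step, after the tags have been removed, so that a decoded `&lt;` is not mistaken for the start of a tag.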

Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/73410413
