Q: Extract text and emojis from HTML

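Stack Overflow user

Asked 2022-08-18 23:27:08

2 answers, 128 views, 0 followers, 0 votes

I need to extract text and emojis from HTML (I have no control over the HTML I receive). Stripping the HTML tags with the function below is fairly simple; however, it also removes the emojis embedded in <img> tags. The result should be plain text plus emojis.

I don't care much about whitespace, but the cleaner the better.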

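// this cleans the HTML quite well, but I need to extend it to keep the emojis
// tags listed in `args` are kept (without attributes); all other tags are removed
const stripTags = (html: string, ...args: string[]) => {
    return html.replace(/<(\/?)(\w+)[^>]*\/?>/g, (_, endMark, tag) => {
        return args.includes(tag) ? "<" + endMark + tag + ">" : ""
    }).replace(/<!--[\s\S]*?-->/g, "") // [\s\S] also matches comments that span lines
}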

<div>
   <div class="text-bold">
      <span dir="auto">
         <div>
            <div dir="auto" style="text-align: start;">Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.</div>
         </div>
         <div class="">
            <div dir="auto" style="text-align: start;">he's (mostly) a Beagle and Jack Russell mix.</div>
         </div>
         <div class="">
            <div dir="auto" style="text-align: start;">
               <span class=""><img height="16" width="16" alt="🐕" src="https://someweb.com/images/emoji/bpp/2/16/1f415.png"></span> : @House… 
               <div class="" role="button" tabindex="0">Something else</div>
            </div>
         </div>
      </span>
   </div>
</div>

Expected output:

Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.
he's (mostly) a Beagle and Jack Russell mix. 🐕 : @House… Something else.

regex

emoji

javascript

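2 Answers

Stack Overflow user

Accepted answer

Answered 2022-08-19 10:32:19

If you want to use a DOM parser instead of plain regular expressions and get more control over the HTML, here is how this example could be implemented: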

const htmlString = "<div>your content...</div>";

const toRawString = (htmlString) => {
  if (!htmlString) {
    return null;
  }

  // DOMParser is a browser API
  const parser = new DOMParser();
  const parsedHTML = parser.parseFromString(htmlString, "text/html");

  // Replace every image in place with a text node holding its alt text.
  // If you need data from other attributes, read them here instead.
  const images = parsedHTML.querySelectorAll("img");
  images.forEach((image) => {
    image.replaceWith(parsedHTML.createTextNode(image.alt));
  });

  // Collapse runs of whitespace into single spaces
  return parsedHTML.body.textContent.replace(/\s+/g, " ").trim();
};
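
If line breaks between blocks matter, `textContent` alone will not give them to you, because it concatenates all text nodes. As a minimal sketch, and not part of the answer above, the DOM can be walked by hand so that each closed block element ends a line. The names `collectText`, `toPlainText`, and `BLOCK_TAGS` are hypothetical, and the list of block tags is an assumption that is deliberately incomplete. The walk only relies on `nodeType`, `nodeValue`, `tagName`, `childNodes`, and `getAttribute`, so it accepts a parsed document body as well as any DOM-like tree.

```javascript
// Node-type constants as defined by the DOM standard.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Assumption: a small, incomplete set of tags treated as block-level.
const BLOCK_TAGS = new Set(["DIV", "P", "LI", "BR"]);

// Recursively collect text pieces, substituting each <img> with its alt text.
const collectText = (node, out = []) => {
  if (node.nodeType === TEXT_NODE) {
    // Collapse source whitespace so only block boundaries create line breaks.
    out.push(node.nodeValue.replace(/\s+/g, " "));
  } else if (node.nodeType === ELEMENT_NODE) {
    const tag = node.tagName.toUpperCase();
    if (tag === "IMG") {
      out.push(node.getAttribute("alt") || "");
    } else {
      node.childNodes.forEach((child) => collectText(child, out));
      // Simplification: a break is emitted only after a block element closes.
      if (BLOCK_TAGS.has(tag)) out.push("\n");
    }
  }
  return out;
};

// Join the pieces, tidy each line, and drop lines that ended up empty.
const toPlainText = (root) =>
  collectText(root)
    .join("")
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join("\n");

// In a browser, usage would look like:
//   toPlainText(new DOMParser().parseFromString(html, "text/html").body)
```

On the sample from the question this produces one line per closed block, which is close to the requested layout but not identical to it, so adjust `BLOCK_TAGS` to taste.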

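Votes: 1

Stack Overflow user

Answered 2022-08-19 00:35:20

It is best to use a DOM parser and walk the DOM to extract the plain text.

Here is a regex-based solution, if you can accept that it may fail in some corner cases.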

let html = `<div>
   <div class="text-bold">
      <span dir="auto">
         <div>
            <div dir="auto" style="text-align: start;">Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin.</div>
         </div>
         <div class="">
            <div dir="auto" style="text-align: start;">he's (mostly) a Beagle and Jack Russell mix.</div>
         </div>
         <div class="">
            <div dir="auto" style="text-align: start;">
               <span class=""><img height="16" width="16" alt="🐕" src="https://someweb.com/images/emoji/bpp/2/16/1f415.png"></span> : @House… 
               <div class="" role="button" tabindex="0">Something else</div>
            </div>
         </div>
      </span>
   </div>
</div>`;

let plain = html
  // extract alt text from img tags; [^>] keeps the match inside a single tag
  .replace(/<img\b[^>]*?\salt="([^"]*)"[^>]*>/gi, ' $1 ')
  .replace(/<\/?[a-z][^>]*>/gi, ' ') // remove all tags
  .replace(/\s+/g, ' ').trim(); // cleanup whitespace
console.log(plain);

Output:

Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin. he's (mostly) a Beagle and Jack Russell mix. 🐕 : @House… Something else

Note that this is not foolproof. For example, it does not handle corner cases such as <span title="Home > Edit">, which should be written as <span title="Home &gt; Edit"> but is not always.
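
To make that corner case concrete, here is a small sketch comparing the tag pattern used above with a quote-aware variant. `NAIVE_TAG`, `QUOTE_AWARE_TAG`, and `stripWith` are hypothetical names introduced only for this illustration. The quote-aware pattern consumes quoted attribute values as a unit, so a `>` inside them does not end the tag. It is still a regex and will miss other malformed input, so a DOM parser remains the safer choice.

```javascript
// Pattern from the answer above: a tag ends at the first ">".
const NAIVE_TAG = /<\/?[a-z][^>]*>/g;

// Quote-aware variant: quoted attribute values are matched as a whole,
// so a ">" inside double or single quotes does not terminate the tag.
const QUOTE_AWARE_TAG = /<\/?[a-z](?:"[^"]*"|'[^']*'|[^'">])*>/gi;

// Replace every match with a space, then tidy up whitespace.
const stripWith = (pattern, html) =>
  html.replace(pattern, " ").replace(/\s+/g, " ").trim();

const sample = '<span title="Home > Edit">Settings</span>';

console.log(stripWith(NAIVE_TAG, sample));       // Edit">Settings
console.log(stripWith(QUOTE_AWARE_TAG, sample)); // Settings
```

The naive pattern stops at the `>` inside the `title` value and leaks the rest of the attribute into the text, while the quote-aware pattern removes the whole tag.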

Edit

If you use jQuery, it is easy to traverse the elements in memory:

let html = `<div>...</div>`; // (same as above)
let elem = $(html); // create jQuery element in memory
elem.find('img').replaceWith(function() {
  // fall back to an empty string when the image has no alt attribute
  return ' ' + ($(this).attr('alt') || '') + ' ';
});
let plain = elem.text().replace(/\s+/g, ' ').trim();

Result:

Herman is 10 and was born in Louisiana. he now lives a wonderful life in Wisconsin. he's (mostly) a Beagle and Jack Russell mix. 🐕 : @House… Something else

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/73410413
