How to get paragraphs from HTML using Python

Stack Overflow user
Asked 2016-04-22 22:02:41
Answers: 2 · Views: 198 · Followers: 0 · Votes: 0

How can I get paragraphs out of poorly structured HTML?

I have this raw HTML text:

This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br>
<ul>
    <li>AA Early Childhood Education, or related field.  </li>
    <li>2+ years experience in a licensed childcare facility  </li>
    <li>Ability to meet state requirements, including finger print clearance.  </li>
    <li>Excellent oral and written communication skills  </li>
    <li>Strong organization and time management skills.  </li>
    <li>Creativity in expanding children's learning through play.<br>  </li>
    <li>Strong classroom management skills.<br>  </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
    <br> 
</p>

I am using Python and tried this:

soup = BeautifulSoup(html)

It returns new HTML text containing two paragraphs:

<html>

<body>
    <p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
        <br/>
    </p>
    <ul>
        <li>AA Early Childhood Education, or related field. </li>
        <li>2+ years experience in a licensed childcare facility </li>
        <li>Ability to meet state requirements, including finger print clearance. </li>
        <li>Excellent oral and written communication skills </li>
        <li>Strong organization and time management skills. </li>
        <li>Creativity in expanding children's learning through play.
            <br/> </li>
        <li>Strong classroom management skills.
            <br/> </li>
    </ul>
    <p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
        <br/> </p>
</body>

</html>

But this is not what I expected. What I want to get is this HTML text:

<html>

<body>
    <p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
        AA Early Childhood Education, or related field.
        2+ years experience in a licensed childcare facility
        Ability to meet state requirements, including finger print clearance.
        Excellent oral and written communication skills
        Strong organization and time management skills.
        Creativity in expanding children's learning through play.
        Strong classroom management skills.
    </p>
    <p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.</p>
</body>

</html>

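To get the HTML above, I think the best approach is to remove every HTML tag from the original HTML except <p> and </p>.

To do that, I tried the following regular expression: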

new_html = re.sub('<[^<]+?>', '', html)

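Obviously, this regular expression removes all HTML tags. So how can I remove all HTML tags except <p> and </p>?

If someone can help me write the regex, I will then feed new_html to BeautifulSoup() and get the HTML I expect.
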

2 Answers

Stack Overflow user

Accepted answer

Posted 2016-04-22 22:28:56

Short answer

new_html = re.sub('<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>', '', html)

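Long answer

Your original regex looks a little odd. I would have written [^>] rather than [^<]: what you want is "anything that is not the closing > of the tag".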

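Also, the +? combination deserves a comment.

+ means "repeat one or more times".

? on its own means "repeat zero or one time".

Placed directly after +, though, ? makes the quantifier lazy: +? matches as few characters as possible. With [^>] the lazy modifier is not needed, because the match already stops at the first >.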
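To see what +? does in practice, here is a minimal sketch. The sample string and variable names are made up for illustration and are not part of the question.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so the match runs to the LAST '>'.
greedy = re.findall(r"<.+>", text)

# Lazy: .+? grabs as little as possible, so each tag is matched on its own.
lazy = re.findall(r"<.+?>", text)

# A negated class cannot cross a '>', so the greedy form is already safe.
negated = re.findall(r"<[^>]+>", text)

print(greedy)
print(lazy)
print(negated)
```

The lazy pattern and the negated-class pattern find the same four tags here, while the greedy pattern returns the whole string as one match.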

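In short, what you want can be expressed like this:

"the opening <", then "anything that is neither p nor /p", then "the closing >".

Which is equivalent to:

"the opening <", then "a single character that is not p", or "a character that is not a slash, followed by one or more characters", or "a slash followed by a single character that is not p", or "a slash followed by two or more characters", then "the closing >".

Which is equivalent to: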

< then ( [^p] | [^>/][^>]+ | /[^p] | /[^>][^>]+ ) then >

Which is the regex above.
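The same idea, "any tag that is not exactly <p> or </p>", can also be written with a negative lookahead. This is an alternative sketch, not the pattern from the answer; the constant names and the sample string are assumptions made for illustration.

```python
import re

# The pattern from the answer, written out with explicit alternatives.
ALTERNATION = r"<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>"

# Assumed alternative (not from the answer): a negative lookahead that
# refuses to match when the tag is exactly <p> or </p>.
LOOKAHEAD = r"<(?!/?p>)[^>]+>"

sample = "aa <p> bb <a> cc <li> dd <pp> ee </p> gg </a> hh </li> ii </pp> jj"

by_alternation = re.sub(ALTERNATION, "", sample)
by_lookahead = re.sub(LOOKAHEAD, "", sample)

print(by_alternation)
print(by_lookahead)
```

Both patterns give the same result on this sample. Both also protect only the bare <p> and </p>: a paragraph tag that carries attributes, such as <p class="x">, would be removed by either one.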

Here is a quick test typed into the Python console:

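import re

# Only the bare <p> and </p> survive; every other tag is removed.
re.sub(
    '<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>',
    '',
    'aa <p> bb <a> cc <li> dd <pp> ee <pa> ff </p> gg </a> hh </li> ii </pp> jj </pa> ff')
# 'aa <p> bb  cc  dd  ee  ff </p> gg  hh  ii  jj  ff'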
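
Applied to HTML shaped like the one in the question, the substitution could look like the sketch below. The input is a shortened, paraphrased version of the question's HTML, and the whitespace clean-up step is an assumption added for readability, not something the answer proposes.

```python
import re

# Shortened stand-in for the HTML from the question (two list items only).
html = """The ideal candidate will have:
<br>
<ul>
    <li>AA Early Childhood Education, or related field.  </li>
    <li>Strong classroom management skills.<br>  </li>
</ul>
<p>The ideal candidate must be reliable.
    <br>
</p>"""

# Remove every tag except the bare <p> and </p>.
new_html = re.sub('<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>', '', html)

# Hypothetical clean-up: collapse the whitespace left behind by removed tags.
new_html = re.sub(r'\s+', ' ', new_html).strip()

print(new_html)
```

The leading text is still not wrapped in <p> after this step. Whether BeautifulSoup adds that first paragraph when new_html is parsed depends on the parser, as the other answer points out.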
Votes: 1

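Stack Overflow user

Posted 2016-04-22 22:22:42

This is a somewhat manual way of manipulating the document, but you can loop over the li elements, append the text of each one to the first paragraph, and then remove it. After that, remove the ul element as well: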

from bs4 import BeautifulSoup


data = """
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br>
<ul>
    <li>AA Early Childhood Education, or related field.  </li>
    <li>2+ years experience in a licensed childcare facility  </li>
    <li>Ability to meet state requirements, including finger print clearance.  </li>
    <li>Excellent oral and written communication skills  </li>
    <li>Strong organization and time management skills.  </li>
    <li>Creativity in expanding children's learning through play.<br>  </li>
    <li>Strong classroom management skills.<br>  </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
    <br>
</p>"""

soup = BeautifulSoup(data, "lxml")

p = soup.p
for li in soup.find_all("li"):
    p.append(li.get_text())
    li.extract()

soup.find("ul").extract()
print(soup.prettify())

This prints two paragraphs, as you planned:

<html>
 <body>
  <p>
   This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
   <br/>
   AA Early Childhood Education, or related field.
   2+ years experience in a licensed childcare facility
   Ability to meet state requirements, including finger print clearance.
   Excellent oral and written communication skills
   Strong organization and time management skills.
   Creativity in expanding children's learning through play.
   Strong classroom management skills.
  </p>
  <p>
   The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
   <br/>
  </p>
 </body>
</html>

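Note that there is an important difference in how lxml, html.parser and html5lib parse the input HTML you posted. html5lib and html.parser do not automatically create the first paragraph, which makes the code above specific to lxml.

A better approach would probably be to build a separate "soup" object. Example: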

from bs4 import BeautifulSoup


data = """
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br>
<ul>
    <li>AA Early Childhood Education, or related field.  </li>
    <li>2+ years experience in a licensed childcare facility  </li>
    <li>Ability to meet state requirements, including finger print clearance.  </li>
    <li>Excellent oral and written communication skills  </li>
    <li>Strong organization and time management skills.  </li>
    <li>Creativity in expanding children's learning through play.<br>  </li>
    <li>Strong classroom management skills.<br>  </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
    <br>
</p>"""

soup = BeautifulSoup(data, "lxml")

# create new soup
new_soup = BeautifulSoup("<body></body>", "lxml")
new_body = new_soup.body

# create first paragraph
first_p = new_soup.new_tag("p")
first_p.append(soup.p.get_text())

for li in soup.find_all("li"):
    first_p.append(li.get_text())

new_body.append(first_p)

# create second paragraph
second_p = soup.find_all("p")[-1]
new_body.append(second_p)

print(new_soup.prettify())

Prints:

<html>
 <body>
  <p>
   This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
   AA Early Childhood Education, or related field.
   2+ years experience in a licensed childcare facility
   Ability to meet state requirements, including finger print clearance.
   Excellent oral and written communication skills
   Strong organization and time management skills.
   Creativity in expanding children's learning through play.
   Strong classroom management skills.
  </p>
  <p>
   The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
   <br/>
  </p>
 </body>
</html>
Votes: 1

Page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/36803990
