How do I get paragraphs from poorly structured HTML?
I have a raw HTML text:
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br>
<ul>
<li>AA Early Childhood Education, or related field. </li>
<li>2+ years experience in a licensed childcare facility </li>
<li>Ability to meet state requirements, including finger print clearance. </li>
<li>Excellent oral and written communication skills </li>
<li>Strong organization and time management skills. </li>
<li>Creativity in expanding children's learning through play.<br> </li>
<li>Strong classroom management skills.<br> </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
<br>
</p>
I am using Python and tried this:
soup = BeautifulSoup(html)
It returns a new HTML text containing two short paragraphs:
<html>
<body>
<p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br/>
</p>
<ul>
<li>AA Early Childhood Education, or related field. </li>
<li>2+ years experience in a licensed childcare facility </li>
<li>Ability to meet state requirements, including finger print clearance. </li>
<li>Excellent oral and written communication skills </li>
<li>Strong organization and time management skills. </li>
<li>Creativity in expanding children's learning through play.
<br/> </li>
<li>Strong classroom management skills.
<br/> </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
<br/> </p>
</body>
</html>
But this is not what I expected. Instead, I want to get this HTML text:
<html>
<body>
<p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
AA Early Childhood Education, or related field.
2+ years experience in a licensed childcare facility
Ability to meet state requirements, including finger print clearance.
Excellent oral and written communication skills
Strong organization and time management skills.
Creativity in expanding children's learning through play.
Strong classroom management skills.
</p>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.</p>
</body>
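</html>
To get that HTML, I think the best way is to remove every HTML tag from the raw HTML except <p> and </p>.
To do that, I tried the following regular expression:
new_html = re.sub('<[^<]+?>', '', html)
Obviously, this regular expression removes all HTML tags. So how can I remove all HTML tags except <p> and </p>?
If someone can help me write the regex, I will then feed new_html to BeautifulSoup() and get the HTML I expect.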
Answered 2016-04-22 22:28:56
Short answer
new_html = re.sub('<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>', '', html)
Long answer
Your original regex looks odd. I would use [^>] instead of [^<]: what you want is "anything that is not the closing bracket".
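Also, be careful when putting + and ? together.
+ means "repeat one or more times".
? on its own means "repeat zero or one time", but placed directly after another quantifier, as in +?, it makes that quantifier lazy (match as few characters as possible).
With [^>] the class can never run past the closing >, so the lazy modifier is unnecessary.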
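A small sketch of the difference between greedy and lazy quantifiers, and of why a negated character class makes the lazy modifier redundant. The `sample` string is made up for illustration:

```python
import re

sample = "<li>one</li>"

# Greedy: .+ grabs as much as possible, so the match runs to the last '>'.
greedy = re.findall(r"<.+>", sample)
# Lazy: .+? grabs as little as possible, so each tag is matched separately.
lazy = re.findall(r"<.+?>", sample)
# A negated class cannot cross '>', so no lazy modifier is needed.
negated = re.findall(r"<[^>]+>", sample)

print(greedy)   # ['<li>one</li>']
print(lazy)     # ['<li>', '</li>']
print(negated)  # ['<li>', '</li>']
```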
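In any case, what you want can be expressed like this:
"opening bracket", then "anything that is neither p nor /p", then "closing bracket".
That is equivalent to:
"opening bracket", then either "a single character that is not p", or "a character that is neither > nor / followed by one or more characters", or "a slash followed by a single character that is not p", or "a slash followed by two or more characters", then "closing bracket".
That is equivalent to:
< then ( [^p] or [^>/][^>]+ or /[^p] or /[^>][^>]+ ) then >
Which is the regex above.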
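As a minimal sketch, the pattern can be wrapped in a helper and run on a shortened input. The names `NON_P_TAG` and `strip_tags_except_p` and the `sample` string are hypothetical, chosen only for this illustration:

```python
import re

# The pattern from the short answer: every tag except a bare <p> or </p>.
NON_P_TAG = re.compile(r"<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>")


def strip_tags_except_p(html):
    """Hypothetical helper: delete every tag matched by NON_P_TAG."""
    return NON_P_TAG.sub("", html)


sample = "Intro:<br><ul><li>First </li><li>Second<br> </li></ul><p>Last.<br></p>"
print(strip_tags_except_p(sample))  # Intro:First Second <p>Last.</p>
```

One caveat worth knowing: the pattern only spares a bare <p>. An opening tag with attributes, such as <p class="x">, is at least two characters long and does not start with a slash, so it matches the second alternative and is removed.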
Here is a quick test typed into the Python console:
import re
re.sub(
    '<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>',
    '',
    'aa <p> bb <a> cc <li> dd <pp> ee <pa> ff </p> gg </a> hh </li> ii </pp> jj </pa> ff')
Answered 2016-04-22 22:22:42
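This is a kind of manual document manipulation, but you can loop over the li elements, append the text of each one to the first paragraph, and then remove it. After that, remove the ul element as well: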
from bs4 import BeautifulSoup
data = """
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br>
<ul>
<li>AA Early Childhood Education, or related field. </li>
<li>2+ years experience in a licensed childcare facility </li>
<li>Ability to meet state requirements, including finger print clearance. </li>
<li>Excellent oral and written communication skills </li>
<li>Strong organization and time management skills. </li>
<li>Creativity in expanding children's learning through play.<br> </li>
<li>Strong classroom management skills.<br> </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
<br>
</p>"""
soup = BeautifulSoup(data, "lxml")
p = soup.p
for li in soup.find_all("li"):
    p.append(li.get_text())
    li.extract()
soup.find("ul").extract()
print(soup.prettify())
This prints two paragraphs, as you intended:
<html>
<body>
<p>
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br/>
AA Early Childhood Education, or related field.
2+ years experience in a licensed childcare facility
Ability to meet state requirements, including finger print clearance.
Excellent oral and written communication skills
Strong organization and time management skills.
Creativity in expanding children's learning through play.
Strong classroom management skills.
</p>
<p>
The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
<br/>
</p>
</body>
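</html>
Note that there is an important difference in how lxml, html.parser and html5lib parse the input HTML you posted. html5lib and html.parser do not automatically create the first paragraph, which makes the code above lxml-specific.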
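Because the wrapping of the leading text depends on the parser, one parser-independent option is to collect the text with the standard library's html.parser.HTMLParser and start a new paragraph at each <p>. This is a swapped-in technique, not the BeautifulSoup approach used in this answer, and it is only a sketch: `ParagraphCollector`, `to_paragraph_html` and `sample` are hypothetical names, and text that follows a closing </p> is simply kept in the current paragraph.

```python
from html import escape
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: groups text into paragraphs, split at each <p>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = [[]]  # each paragraph is a list of text lines

    def handle_starttag(self, tag, attrs):
        # An opening <p> starts a new paragraph; all other tags are ignored.
        if tag == "p" and self.paragraphs[-1]:
            self.paragraphs.append([])

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paragraphs[-1].append(text)


def to_paragraph_html(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()  # flush any buffered trailing text
    return "\n".join(
        "<p>" + "\n".join(escape(line) for line in lines) + "</p>"
        for lines in collector.paragraphs
        if lines
    )


sample = "Intro text:<br><ul><li>First item </li><li>Second item<br> </li></ul><p>Closing paragraph.<br></p>"
print(to_paragraph_html(sample))
```

For the shortened sample this yields one paragraph holding the intro and the two list items, followed by a second paragraph with the closing sentence.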
A better approach might be to build a separate "soup" object. Example:
from bs4 import BeautifulSoup
data = """
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br>
<ul>
<li>AA Early Childhood Education, or related field. </li>
<li>2+ years experience in a licensed childcare facility </li>
<li>Ability to meet state requirements, including finger print clearance. </li>
<li>Excellent oral and written communication skills </li>
<li>Strong organization and time management skills. </li>
<li>Creativity in expanding children's learning through play.<br> </li>
<li>Strong classroom management skills.<br> </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
<br>
</p>"""
soup = BeautifulSoup(data, "lxml")
# create new soup
new_soup = BeautifulSoup("<body></body>", "lxml")
new_body = new_soup.body
# create first paragraph
first_p = new_soup.new_tag("p")
first_p.append(soup.p.get_text())
for li in soup.find_all("li"):
    first_p.append(li.get_text())
new_body.append(first_p)
# create second paragraph
second_p = soup.find_all("p")[-1]
new_body.append(second_p)
print(new_soup.prettify())
Prints:
<html>
<body>
<p>
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
AA Early Childhood Education, or related field.
2+ years experience in a licensed childcare facility
Ability to meet state requirements, including finger print clearance.
Excellent oral and written communication skills
Strong organization and time management skills.
Creativity in expanding children's learning through play.
Strong classroom management skills.
</p>
<p>
The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
<br/>
</p>
</body>
</html>
https://stackoverflow.com/questions/36803990