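I want to read each heading <h1> and its corresponding paragraph <p> data in the example below.
I have many headings and paragraphs that are related to each other, so whenever I find a heading I need to extract the corresponding paragraph data: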
<h1>Supplementary Materials </h1>\n
<p />\n
<p>The workshop entitled “Next generation MRA (Microbiological Risk Assessment); integration of Omics data into assessment” took place in Athens, Greece, May 13-14, 2016, and resulted in four papers that are published in this issue, namely, Cocolin et al., Rantsiou et al., Den Besten et al., and Haddad et al. </p>\n
<h1>Testing data</h1>
<p>The supplementary materials, Table S1 and Table S2, are integrated parts of these four papers.</p>\n
<p />
<h1>Supplementary Materials </h1>\n
<p />\n
<p>The workshop entitled “Next generation MRA (Microbiological Risk Assessment); integration of Omics data into assessment” took place in Athens, Greece, May 13-14, 2016, and resulted in four papers that are published in this issue, namely, Cocolin et al., Rantsiou et al., Den Besten et al., and Haddad et al. </p>\n
<h1>Testing data</h1>
<p>The supplementary materials, Table S1 and Table S2, are integrated parts of these four papers.</p>\n
<p />
Posted on 2019-01-28 15:21:12
Is the HTML really repeated like this, or is that a mistake?
html = '''<h1>Supplementary Materials </h1>\n
<p />\n
<p>The workshop entitled “Next generation MRA (Microbiological Risk Assessment); integration of Omics data into assessment” took place in Athens, Greece, May 13-14, 2016, and resulted in four papers that are published in this issue, namely, Cocolin et al., Rantsiou et al., Den Besten et al., and Haddad et al. </p>\n
<h1>Testing data</h1>
<p>The supplementary materials, Table S1 and Table S2, are integrated parts of these four papers.</p>\n
<p />
<h1>Supplementary Materials </h1>\n
<p />\n
<p>The workshop entitled “Next generation MRA (Microbiological Risk Assessment); integration of Omics data into assessment” took place in Athens, Greece, May 13-14, 2016, and resulted in four papers that are published in this issue, namely, Cocolin et al., Rantsiou et al., Den Besten et al., and Haddad et al. </p>\n
<h1>Testing data</h1>
<p>The supplementary materials, Table S1 and Table S2, are integrated parts of these four papers.</p>\n
<p /> '''
import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')
heads = soup.find_all('h1')
for head in heads:
    # string=True skips the empty <p /> elements and returns the next <p> that contains text
    para = head.find_next('p', string=True).text
    print('Header: %s\nParagraph: %s\n' % (head.text, para))
Output:
Header: Supplementary Materials
Paragraph: The workshop entitled “Next generation MRA (Microbiological Risk Assessment); integration of Omics data into assessment” took place in Athens, Greece, May 13-14, 2016, and resulted in four papers that are published in this issue, namely, Cocolin et al., Rantsiou et al., Den Besten et al., and Haddad et al.
Header: Testing data
Paragraph: The supplementary materials, Table S1 and Table S2, are integrated parts of these four papers.
Header: Supplementary Materials
Paragraph: The workshop entitled “Next generation MRA (Microbiological Risk Assessment); integration of Omics data into assessment” took place in Athens, Greece, May 13-14, 2016, and resulted in four papers that are published in this issue, namely, Cocolin et al., Rantsiou et al., Den Besten et al., and Haddad et al.
Header: Testing data
Paragraph: The supplementary materials, Table S1 and Table S2, are integrated parts of these four papers.
https://stackoverflow.com/questions/54404197
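The loop in the answer keeps only the first non-empty <p> after each <h1>. If a heading can be followed by several paragraphs, one possible variation is to walk the siblings after the heading and stop at the next <h1>. The following is only a sketch: it assumes the headings and paragraphs are siblings at the same level, and `sample_html` and `paragraphs_by_heading` are hypothetical names introduced for illustration, not part of the original answer.

```python
import bs4

# Shortened, hypothetical sample in the same shape as the question's HTML.
sample_html = '''<h1>Supplementary Materials </h1>
<p />
<p>First paragraph.</p>
<p>Second paragraph.</p>
<h1>Testing data</h1>
<p>Third paragraph.</p>
<p />'''


def paragraphs_by_heading(html):
    """Return a list of (heading text, [paragraph texts]) pairs."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    result = []
    for head in soup.find_all('h1'):
        paragraphs = []
        # Walk the tags that follow this <h1> and stop at the next <h1>.
        for sibling in head.find_next_siblings():
            if sibling.name == 'h1':
                break
            text = sibling.get_text(strip=True)
            # Keep only <p> elements that actually contain text.
            if sibling.name == 'p' and text:
                paragraphs.append(text)
        result.append((head.get_text(strip=True), paragraphs))
    return result


for title, paragraphs in paragraphs_by_heading(sample_html):
    print('Header: %s' % title)
    for paragraph in paragraphs:
        print('Paragraph: %s' % paragraph)
```

Returning a list of pairs rather than a dict keeps repeated headings, such as the duplicated "Supplementary Materials" block in the question, as separate entries.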