文章/答案/技术大牛

发布

社区首页 >问答首页 >Python使用Soup将HTML转换为JSON

问Python使用Soup将HTML转换为JSON
EN

Stack Overflow用户

提问于 2017-09-29 06:53:41

回答 1查看 2.7K关注 0票数 0

这些是规则

HTML标记将以以下任何一个、<ol>或<ul>开头
当找到步骤1中的任何标记时，HTML的内容将只包含以下标记：、或
将步骤2标记映射到以下内容：将是JSON中的此项{"bold":True}，将为{"italics":True}，将为{"decoration":"underline"}
在JSON中找到的任何文本都是{"text": "this is the text"}

假设我有下面的HTML :通过使用以下内容：

soup = Soup("THIS IS THE WHOLE HTML", "html.parser")
allTags = [tag for tag in soup.find_all(recursive=False)]

它产生这个数组：

[
    <p>The name is not mine it is for the people<span style="text-decoration: underline;"><em><strong>stephen</strong></em></span><em><strong> how can</strong>name </em><strong>good</strong> <em>his name <span style="text-decoration: underline;">moneuet</span>please </em><span style="text-decoration: underline;"><strong>forever</strong></span><em>tomorrow<strong>USA</strong></em></p>,
    <p>2</p>,
    <p><strong>moment</strong><em>Africa</em> <em>China</em> <span style="text-decoration: underline;">home</span> <em>thomas</em> <strong>nothing</strong></p>,
    <ol><li>first item</li><li><em><span style="text-decoration: underline;"><strong>second item</strong></span></em></li></ol>
]

通过应用上面的规则，这将是结果：

首先，Array元素将被处理到这个JSON中：

{
    "text": [
        "The name is not mine it is for the people",
        {"text": "stephen", "decoration": "underline", "bold": True, "italics": True}, 
        {"text": "how can", "bold": True, "italics": True},
        {"text": "name", "italics": True},
        {"text": "good", "bold": True},
        {"text": "his name", "italics": True},
        {"text": "moneuet", "decoration": "underline"},
        {"text": "please ", "italics": True},
        {"text": "forever", "decoration": "underline", "bold":True},
        {"text": "tomorrow", "italics": True},
        {"text": "USA", "bold": True, "italics": True}
    ]
}

第二个Array元素将被处理到这个JSON中：

{"text": ["2"] }

第三个Array元素将被处理到这个JSON中：

{
    "text": [
        {"text": "moment", "bold": True},
        {"text": "Africa", "italics": True},
        {"text": "China", "italics": True},
        {"text": "home", "decoration": "underline"},
        {"text": "thomas", "italics": True},
        {"text": "nothing", "bold": True}
    ]
}

第四个Array元素将被处理到这个JSON中：

{
    "ol": [
        "first item", 
        {"text": "second item", "decoration": "underline", "italics": True, "bold": True}
    ]
}

，这是我的尝试，所以，我可以钻下去。但是如何处理arrayOfTextAndStyles数组是的问题。

soup = Soup("THIS IS THE WHOLE HTML", "html.parser")
allTags = [tag for tag in soup.find_all(recursive=False)]
for foundTag in allTags:
   foundTagStyles = [tag for tag in foundTag.find_all(recursive=True)]
      if len(foundTagStyles ) > 0:
         if str(foundTag.name) == "p":
              arrayOfTextAndStyles = [{"tag": tag.name, "text": 
                  foundTag.find_all(text=True, recursive=False) }] +  
                    [{"tag":tag.name, "text": foundTag.find_all(text=True, 
                    recursive=False) } for tag in foundTag.find_all()]

         elif  str(foundTag.name) == "ol":

         elif  str(foundTag .name) == "ul":

python

html

beautifulsoup

回答 1

Stack Overflow用户

发布于 2017-09-29 08:09:11

我会使用一个函数来解析每个元素，而不是使用一个巨大的循环。选择p和ol标记，并在解析中引发异常，标记与特定规则不匹配的任何内容：

from bs4 import NavigableString

def parse(elem):
    if elem.name == 'ol':
        result = []
        for li in elem.find_all('li'):
            if len(li) > 1:
                result.append([parse_text(sub) for sub in li])
            else:
                result.append(parse_text(next(iter(li))))
        return {'ol': result}
    return {'text': [parse_text(sub) for sub in elem]}

def parse_text(elem):
    if isinstance(elem, NavigableString):
        return {'text': elem}

    result = {}
    if elem.name == 'em':
        result['italics'] = True
    elif elem.name == 'strong':
        result['bold'] = True
    elif elem.name == 'span':
        try:
            # rudimentary parse into a dictionary
            styles = dict(
                s.replace(' ', '').split(':') 
                for s in elem.get('style', '').split(';')
                if s.strip()
            )
        except ValueError:
            raise ValueError('Invalid structure')
        if 'underline' not in styles.get('text-decoration', ''):
            raise ValueError('Invalid structure')
        result['decoration'] = 'underline'
    else:
        raise ValueError('Invalid structure')

    if len(elem) > 1:
        result['text'] = [parse_text(sub) for sub in elem]
    else:
        result.update(parse_text(next(iter(elem))))
    return result

然后解析您的文档：

for candidate in soup.select('ol,p'):
    try:
        result = parse(candidate)
    except ValueError:
        # invalid structure, ignore
        continue
    print(result)

使用pprint，结果是：

{'text': [{'text': 'The name is not mine it is for the people'},
          {'bold': True,
           'decoration': 'underline',
           'italics': True,
           'text': 'stephen'},
          {'italics': True,
           'text': [{'bold': True, 'text': ' how can'}, {'text': 'name '}]},
          {'bold': True, 'text': 'good'},
          {'text': ' '},
          {'italics': True,
           'text': [{'text': 'his name '},
                    {'decoration': 'underline', 'text': 'moneuet'},
                    {'text': 'please '}]},
          {'bold': True, 'decoration': 'underline', 'text': 'forever'},
          {'italics': True,
           'text': [{'text': 'tomorrow'}, {'bold': True, 'text': 'USA'}]}]}
{'text': [{'text': '2'}]}
{'text': [{'bold': True, 'text': 'moment'},
          {'italics': True, 'text': 'Africa'},
          {'text': ' '},
          {'italics': True, 'text': 'China'},
          {'text': ' '},
          {'decoration': 'underline', 'text': 'home'},
          {'text': ' '},
          {'italics': True, 'text': 'thomas'},
          {'text': ' '},
          {'bold': True, 'text': 'nothing'}]}
{'ol': [{'text': 'first item'},
        {'bold': True,
         'decoration': 'underline',
         'italics': True,
         'text': 'second item'}]}

请注意，文本节点现在是嵌套的；这使您能够始终如一地重新创建相同的结构，并具有正确的空格和嵌套的文本装饰。

结构也相当一致；'text'键将指向单个字符串或字典列表。这样的列表永远不会混合类型。您还可以对此进行改进；让'text'只指向一个字符串，并使用一个不同的键来表示嵌套数据，例如contains或nested或类似的，然后只使用其中的一个。所需的只是在'text'情况下和parse()函数中更改parse()键。

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/46483313

复制

相似问题

问Python使用Soup将HTML转换为JSON
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Python使用Soup将HTML转换为JSONEN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Python使用Soup将HTML转换为JSON
EN