Question: Parsing Twine into JSON

Code Review user

Asked 2015-11-06 11:46:07

1 answer, 2.7K views, 0 followers, 3 votes

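For those who don't know, Twine is a simple tool for making interactive fiction. It lets you easily create a series of passages that link to each other, so you can build choose-your-own-adventure style structures. It exports to HTML, but lacks any other export format if you just want to use Twine to write nodes for use elsewhere. I thought JSON would be a more valuable format, so I decided to write this parser.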

The source data is a bit messy, however, as shown here:

<tw-storydata name="Sample" startnode="1" creator="Twine" creator-version="2.0.8" ifid="1A382346-FBC1-411F-837E-BAB9EE2FB2E9" format="Harlowe" options=""><style role="stylesheet" id="twine-user-stylesheet" type="text/twine-css"></style><script role="script" id="twine-user-script" type="text/twine-javascript"></script><tw-passagedata pid="1" name="Passage_A" tags="" position="197,62">[[Passage B]]
[[Go to passage C|Passage C]]</tw-passagedata><tw-passagedata pid="2" name="Passage_B" tags="tag-2" position="114,225">This is passage B
[[Passage B]] 
[[Passage A]] </tw-passagedata><tw-passagedata pid="3" name="Passage_C" tags="tag-1 tag-2" position="314,225">This passage goes nowhere.</tw-passagedata></tw-storydata>

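In case it isn't clear (it wasn't to me at first), newlines only appear where the actual passage text contains a newline. Otherwise all the tags run together on the same line. That isn't ideal for parsing, especially if I want to read line by line. So the first step of the process is to call my reformat_html function, which puts each tag on its own line and the passage text on separate lines: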

<tw-storydata name="Sample" startnode="1" creator="Twine" creator-version="2.0.8" ifid="1A382346-FBC1-411F-837E-BAB9EE2FB2E9" format="Harlowe" options="">
<style role="stylesheet" id="twine-user-stylesheet" type="text/twine-css">
</style>
<script role="script" id="twine-user-script" type="text/twine-javascript">
</script>
<tw-passagedata pid="1" name="Passage_A" tags="" position="197,62">
[[Passage B]]
[[Go to passage C|Passage C]]
</tw-passagedata>
<tw-passagedata pid="2" name="Passage_B" tags="tag-2" position="114,225">
This is passage B
[[Passage B]] 
[[Passage A]] 
</tw-passagedata>
<tw-passagedata pid="3" name="Passage_C" tags="tag-1 tag-2" position="314,225">
This passage goes nowhere.
</tw-passagedata>
</tw-storydata>

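Now I can easily read it line by line, parsing the key-value pairs in the opening tags, parsing the passage text separately from the tags, and knowing when each tag closes. This tidied HTML can then be read into JSON with my read_as_json function, which produces the following result: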

{
    "style": {
        "type": "text/twine-css", 
        "role": "stylesheet", 
        "id": "twine-user-stylesheet"
    }, 
    "script": {
        "type": "text/twine-javascript", 
        "role": "script", 
        "id": "twine-user-script"
    }, 
    "tw-passagedata": [
        {
            "position": "197,62", 
            "text": "[[Passage B]]\n[[Go to passage C|Passage C]]\n", 
            "pid": "1", 
            "name": "Passage_A", 
            "tags": ""
        }, 
        {
            "position": "114,225", 
            "text": "This is passage B\n[[Passage B]] \n[[Passage A]] \n", 
            "pid": "2", 
            "name": "Passage_B", 
            "tags": "tag-2"
        }, 
        {
            "position": "314,225", 
            "text": "This passage goes nowhere.\n\n", 
            "pid": "3", 
            "name": "Passage_C", 
            "tags": "tag-1 tag-2"
        }
    ], 
    "tw-storydata": {
        "startnode": "1", 
        "name": "Sample", 
        "format": "Harlowe", 
        "creator": "Twine", 
        "creator-version": "2.0.8", 
        "ifid": "1A382346-FBC1-411F-837E-BAB9EE2FB2E9", 
        "options": ""
    }
}

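Obviously this is a small example, and I haven't yet done any real parsing of the passage text (i.e. hyperlinks or formatting), but I'd like some feedback on what I've done so far. Some of my parsing feels hacky, but I couldn't think of a really elegant way to check whether a character is outside quotes.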
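One way to check whether a character sits outside quotes is to walk the string once and toggle a flag on every double quote. The following is only a sketch, not part of the code under review: the helper name find_unquoted is hypothetical, and it assumes attribute values are always wrapped in double quotes and never contain a literal double quote.

```python
def find_unquoted(line, target='>'):
    """Return the index of the first `target` outside double quotes, or -1."""
    in_quotes = False
    for index, char in enumerate(line):
        if char == '"':
            # Each double quote switches between inside and outside a value.
            in_quotes = not in_quotes
        elif char == target and not in_quotes:
            return index
    return -1


# The > inside the name value is skipped; the one that closes the tag is found.
tag = '<tw-passagedata pid="1" name="A > B" tags="">passage text'
close_index = find_unquoted(tag)
print(tag[:close_index + 1])
print(tag[close_index + 1:])
```

A single pass like this avoids juggling two separate find() results, at the cost of visiting every character.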

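Also, I did previously have the <, > and " characters as constants, but names like QUOTETAG and CLOSETAG didn't feel very meaningful, especially when the comments make the context clear.

I'd particularly like to know how this does on readability and accuracy. I haven't really done parsing before, so I may have made some naive mistakes. I also don't usually write code that other programmers need to use or extend, so making that easy for them matters.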

from json import dump
from pprint import pprint


PASSAGE_TAG = "tw-passagedata"
MULTIPLE_TAG_ERROR = "Found multiple '{}' tags, not currently supported"
CLOSETAG_PARSE_ERROR = "Can't parse close tag in {}"


def write_passage(out_file, line):
    """Check how much of line is passage data and write it to out_file

    Returns what remains of the truncated line."""

    end_index = line.find('<')
    if end_index == -1:
        out_file.write(line)
        # Used up all the line as plain passage data.
        return ''
    else:
        # Need a newline so that the tag is separate from the passage data.
        out_file.write(line[:end_index] + '\n')
        return line[end_index:]


def next_quote(line, index):
    """Return the index of the next quote
    Catches a -1 result, not catching this causes infinite loops.
    Add 1 as that's needed for all future searches."""

    quote_index = line[index:].find('"')
    if quote_index == -1:
        return 0
    return index + 1 + quote_index


def find_closing_tag(line):
    """Returns the index of the closing tag in line.

    Ensures that it doesn't return a > enclosed in quotes.
    This is because that may just be a character in a string value."""

    close_index = line.find('>')
    quote_index = line.find('"')

    # We need to ensure > isn't enclosed in quotes
    if quote_index != -1:
        # Keep searching until we find a valid closing tag
        while quote_index < close_index:
            quote_index = next_quote(line, quote_index)

            if quote_index > close_index:
                # Find the next > after "
                close_index = (quote_index +
                               line[quote_index:].find('>'))

            # Find the next quote that opens a keyvalue
            quote_index = next_quote(line, quote_index)
            if close_index == -1:
                raise ValueError(CLOSETAG_PARSE_ERROR.format(line))

    return close_index


def reformat_html(filepath):
    """Read Twine2's HTML format and write it out in a tidier format.

    Writes to the same directory as filepath, just with _temp in the name.
    Returns the filepath of the resulting file."""

    output_file = filepath.replace('.html', '_temp.html')
    with open(filepath) as in_file, open(output_file, 'w') as out_file:
        for line in in_file:
            while line:
                # If it's a passage.
                if not line.startswith('<'):
                    line = write_passage(out_file, line)
                    continue

                close_index = find_closing_tag(line)
                out_file.write(line[:close_index + 1] + '\n')
                line = line[close_index + 1:]

    return output_file


def read_as_json(filepath):
    """Return a dictionary of data from the parsed file at filepath.

    Reads whether a line is a tag, tag closer or text from a passage.
    Close tags are ignored, tag data and passages are parsed into data."""

    data = {}
    with open(filepath) as f:
        for line in f:
            if line.startswith('</'):
                # Closing tag, nothing to see here.
                continue

            if line.startswith('<'):
                # New tag, parse it into data then go to the next line
                parse_tag(line, data)
                continue

            # Anything else is passage data
            # Concatenate it to the current passage node.
            data[PASSAGE_TAG][-1]['text'] += line

    return data


def separate_tags(tag):
    """Takes a tag string and returns the key name and a dict of tag values.

    Tags are strings in the format:
    <tagname key="value" key="another value">

    They're parsed by stripping the <>, then splitting off the tagname.
    Then the rest of the string is read and removed one by one.
    Space and " characters need to be checked to determine whether a space is
    a new keyvalue pair or part of the current value in quotation marks."""

    tagdata = {}
    tag = tag.strip().strip('<>')
    tagname, pairs = tag.split(' ', 1)

    # Makes each loop the same ie always seeking a space character
    pairs += ' '
    while pairs:
        # Find the second quotation mark
        quote_index = pairs.find('"')
        quote_index = 1 + pairs[quote_index + 1:].find('"')

        # If there's no quote found, just find the next space.
        if quote_index == -1:
            space_index = pairs.find(' ')
        # Otherwise find the space after the second quote
        else:
            space_index = quote_index + pairs[quote_index:].find(' ')

        # Add the key-value pair that ends at space_index
        key, value = pairs[:space_index].split('=')
        tagdata[key] = value.strip('"')

        pairs = pairs[space_index + 1:]

    return tagname, tagdata


def parse_tag(tag, data):
    """Parse Twine tag into the data dictionary which is modified in place.

    The tag name is the key, its value is a dictionary of the tag's key value
    pairs. Passage tags are stored in a list, as of now no other tag should
    be stored this way, and having multiple tags raises a ValueError."""

    tagname, tagdata = separate_tags(tag)
    if tagname == PASSAGE_TAG:
        # Create text string to be available for concatenating to later.
        tagdata['text'] = ''
        try:
            data[tagname].append(tagdata)
        except KeyError:
            data[tagname] = [tagdata]
    else:
        if tagname in data:
            raise ValueError(MULTIPLE_TAG_ERROR.format(tagname))
        data[tagname] = tagdata


if __name__ == "__main__":
    # Sample test
    inpath = r'Sample Data\TwineInput.html'
    outpath = r'Sample Data\FinalOutput.json'
    result = reformat_html(inpath)
    data = read_as_json(result)
    with open(outpath, 'w') as f:
        dump(data, f, indent=4)

Tags: python, parsing, python-2.x

1 Answer

Code Review user

Accepted answer

Answered 2015-11-06 14:37:28

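Don't reinvent the wheel. To parse HTML/XML, use an HTML/XML parser. However tricky the layout looks, as long as they are fed well-formed data they should handle it. That's their job.

Based on your sample input, I will assume that Twine generates well-formed XML files. So you can get rid of your custom tag splitting and parsing, and use the parser of your choice.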
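If a particular export turns out not to be well-formed XML (not verified here), the standard library's html.parser.HTMLParser is more forgiving. The following is a minimal Python 3 sketch of that fallback; the PassageCollector class name and the shortened sample string are made up for illustration.

```python
from html.parser import HTMLParser

PASSAGE_TAG = "tw-passagedata"


class PassageCollector(HTMLParser):
    """Collect passage attributes and text without requiring valid XML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.passages = []
        self._in_passage = False

    def handle_starttag(self, tag, attrs):
        if tag == PASSAGE_TAG:
            # attrs is a list of (name, value) pairs.
            passage = dict(attrs)
            passage['text'] = ''
            self.passages.append(passage)
            self._in_passage = True

    def handle_endtag(self, tag):
        if tag == PASSAGE_TAG:
            self._in_passage = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so concatenate it.
        if self._in_passage:
            self.passages[-1]['text'] += data


# Shortened, hypothetical sample in the same shape as the question's input.
sample = ('<tw-storydata name="Sample" startnode="1">'
          '<tw-passagedata pid="1" name="Passage_A" tags="">'
          '[[Passage B]]\n[[Go to passage C|Passage C]]'
          '</tw-passagedata></tw-storydata>')

collector = PassageCollector()
collector.feed(sample)
collector.close()
print(collector.passages)
```

This only gathers the passages; the other tags could be collected the same way in handle_starttag.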

For example, the standard library ships with xml.etree.ElementTree. You can use it to parse your file like this:

import xml.etree.ElementTree as ETree

inpath = r'Sample Data\TwineInput.html'
xml = ETree.parse(inpath)
for element in xml.getroot():
    print(element.tag, element.attrib)

Which prints:

style {'role': 'stylesheet', 'id': 'twine-user-stylesheet', 'type': 'text/twine-css'}
script {'role': 'script', 'id': 'twine-user-script', 'type': 'text/twine-javascript'}
tw-passagedata {'position': '197,62', 'name': 'Passage_A', 'pid': '1', 'tags': ''}
tw-passagedata {'position': '114,225', 'name': 'Passage_B', 'pid': '2', 'tags': 'tag-2'}
tw-passagedata {'position': '314,225', 'name': 'Passage_C', 'pid': '3', 'tags': 'tag-1 tag-2'}

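That is pretty close to what you are looking for.

What remains is to handle the multiple tw-passagedata tags, add a text attribute, handle the case of the root tw-storydata, and possibly handle duplicate tags using the MULTIPLE_TAG_ERROR message: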

import xml.etree.ElementTree as ETree
from json import dump

PASSAGE_TAG = "tw-passagedata"
MULTIPLE_TAG_ERROR = "Found multiple '{}' tags, not currently supported"

def parse_twine_tag(element, data):
    """Parse Twine tag into the data dictionary which is modified in place.

    The tag name is the key, its value is a dictionary of the tag's key value
    pairs. Passage tags are stored in a list, as of now no other tag should
    be stored this way, and having multiple tags raises a ValueError.
    """

    tagname = element.tag
    attributes = element.attrib

    if tagname == PASSAGE_TAG:
        attributes['text'] = element.text
        data.setdefault(PASSAGE_TAG, []).append(attributes)
    elif tagname in data:
        raise ValueError(MULTIPLE_TAG_ERROR.format(tagname))
    else:
        data[tagname] = attributes

    for child in element:
        parse_twine_tag(child, data)

def parse_twine_file(filepath):
    """Return a dictionary of data from the parsed file at filepath"""

    xml = ETree.parse(filepath)
    data = dict()
    parse_twine_tag(xml.getroot(), data)
    return data

if __name__ == "__main__":
    # Sample test
    inpath = r'Sample Data\TwineInput.html'
    outpath = r'Sample Data\FinalOutput.json'

    data = parse_twine_file(inpath)
    with open(outpath, 'w') as f:
        dump(data, f, indent=4)

As expected, outpath contains:

{
    "style": {
        "role": "stylesheet", 
        "id": "twine-user-stylesheet", 
        "type": "text/twine-css"
    }, 
    "tw-passagedata": [
        {
            "position": "197,62", 
            "text": "[[Passage B]]\n[[Go to passage C|Passage C]]", 
            "name": "Passage_A", 
            "pid": "1", 
            "tags": ""
        }, 
        {
            "position": "114,225", 
            "text": "This is passage B\n[[Passage B]] \n[[Passage A]] ", 
            "name": "Passage_B", 
            "pid": "2", 
            "tags": "tag-2"
        }, 
        {
            "position": "314,225", 
            "text": "This passage goes nowhere.", 
            "name": "Passage_C", 
            "pid": "3", 
            "tags": "tag-1 tag-2"
        }
    ], 
    "script": {
        "role": "script", 
        "id": "twine-user-script", 
        "type": "text/twine-javascript"
    }, 
    "tw-storydata": {
        "startnode": "1", 
        "name": "Sample", 
        "creator-version": "2.0.8", 
        "ifid": "1A382346-FBC1-411F-837E-BAB9EE2FB2E9", 
        "format": "Harlowe", 
        "options": "", 
        "creator": "Twine"
    }
}
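The text values above still hold raw link markup, which the question notes is not parsed yet. As a possible next step, here is a sketch that pulls the links out with a regular expression. It assumes only the two link forms visible in the sample, [[Target]] and [[Label|Target]]; the extract_links name is hypothetical and other link syntaxes are not handled.

```python
import re

# Matches [[Target]] and [[Label|Target]]; nothing else is recognised.
LINK_PATTERN = re.compile(r'\[\[(?:([^\]|]*)\|)?([^\]|]+)\]\]')


def extract_links(text):
    """Return a list of (label, target) pairs found in passage text."""
    links = []
    for label, target in LINK_PATTERN.findall(text):
        # A link without an explicit label uses its target as the label.
        links.append((label or target, target))
    return links


passage_text = "[[Passage B]]\n[[Go to passage C|Passage C]]"
for label, target in extract_links(passage_text):
    print(label, '->', target)
```

Each passage dictionary could then get an extra key holding these pairs, next to the unmodified text.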

2 votes

Original page content provided by Code Review.

Original link: https://codereview.stackexchange.com/questions/109988
