Parsing Twine into JSON

Code Review user
Asked 2015-11-06 11:46:07
1 answer, 2.7K views, 0 followers, 3 votes

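For those who don't know it, Twine is a simple tool for making interactive fiction. It lets you easily create a series of passages that link to each other, giving you a choose-your-own-adventure style structure. It exports to HTML, but it lacks any other export format if you just want to use Twine to write the nodes for use elsewhere. I figured JSON was a more valuable format, so I decided to write this parser.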

However, the source data is a bit messy, as you can see here:

<tw-storydata name="Sample" startnode="1" creator="Twine" creator-version="2.0.8" ifid="1A382346-FBC1-411F-837E-BAB9EE2FB2E9" format="Harlowe" options=""><style role="stylesheet" id="twine-user-stylesheet" type="text/twine-css"></style><script role="script" id="twine-user-script" type="text/twine-javascript"></script><tw-passagedata pid="1" name="Passage_A" tags="" position="197,62">[[Passage B]]
[[Go to passage C|Passage C]]</tw-passagedata><tw-passagedata pid="2" name="Passage_B" tags="tag-2" position="114,225">This is passage B
[[Passage B]] 
[[Passage A]] </tw-passagedata><tw-passagedata pid="3" name="Passage_C" tags="tag-1 tag-2" position="314,225">This passage goes nowhere.</tw-passagedata></tw-storydata>

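In case it isn't clear (it wasn't to me at first), line breaks only appear where the actual passage text contains line breaks. Otherwise all the tags run on along the same line. That is not ideal for parsing, especially if I want to read line by line. So the first step of the process is to call my reformat_html function, which puts each tag on its own line and keeps passage text on separate lines: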

<tw-storydata name="Sample" startnode="1" creator="Twine" creator-version="2.0.8" ifid="1A382346-FBC1-411F-837E-BAB9EE2FB2E9" format="Harlowe" options="">
<style role="stylesheet" id="twine-user-stylesheet" type="text/twine-css">
</style>
<script role="script" id="twine-user-script" type="text/twine-javascript">
</script>
<tw-passagedata pid="1" name="Passage_A" tags="" position="197,62">
[[Passage B]]
[[Go to passage C|Passage C]]
</tw-passagedata>
<tw-passagedata pid="2" name="Passage_B" tags="tag-2" position="114,225">
This is passage B
[[Passage B]] 
[[Passage A]] 
</tw-passagedata>
<tw-passagedata pid="3" name="Passage_C" tags="tag-1 tag-2" position="314,225">
This passage goes nowhere.
</tw-passagedata>
</tw-storydata>

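Now I can easily read it line by line, parsing the key-value pairs out of the opening tags, reading the passage text separately from the tags, and knowing when each tag closes. This tidied HTML can then be read into JSON with my read_as_json function, producing this: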

{
    "style": {
        "type": "text/twine-css", 
        "role": "stylesheet", 
        "id": "twine-user-stylesheet"
    }, 
    "script": {
        "type": "text/twine-javascript", 
        "role": "script", 
        "id": "twine-user-script"
    }, 
    "tw-passagedata": [
        {
            "position": "197,62", 
            "text": "[[Passage B]]\n[[Go to passage C|Passage C]]\n", 
            "pid": "1", 
            "name": "Passage_A", 
            "tags": ""
        }, 
        {
            "position": "114,225", 
            "text": "This is passage B\n[[Passage B]] \n[[Passage A]] \n", 
            "pid": "2", 
            "name": "Passage_B", 
            "tags": "tag-2"
        }, 
        {
            "position": "314,225", 
            "text": "This passage goes nowhere.\n\n", 
            "pid": "3", 
            "name": "Passage_C", 
            "tags": "tag-1 tag-2"
        }
    ], 
    "tw-storydata": {
        "startnode": "1", 
        "name": "Sample", 
        "format": "Harlowe", 
        "creator": "Twine", 
        "creator-version": "2.0.8", 
        "ifid": "1A382346-FBC1-411F-837E-BAB9EE2FB2E9", 
        "options": ""
    }
}

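Obviously this is a small example, and I haven't done any actual parsing of the passage text yet (i.e. hyperlinks or formatting), but I wanted to get some feedback on what I've done so far. Some of my parsing feels hacky, but I couldn't think of a really elegant way to check whether a character is outside quotes.

Also, I did previously have the <>" characters as constants, but the names QUOTE, TAG and CLOSETAG didn't feel that meaningful, especially when the comments make the context clear.

I'm particularly interested in how readable and accurate this is. I haven't really done parsing before, so I may have made some naive mistakes. I also don't usually write code that needs to be used or extended by other programmers, so making that easy for them matters.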

from json import dump
from pprint import pprint


PASSAGE_TAG = "tw-passagedata"
MULTIPLE_TAG_ERROR = "Found multiple '{}' tags, not currently supported"
CLOSETAG_PARSE_ERROR = "Can't parse close tag in {}"


def write_passage(out_file, line):
    """Check how much of line is passage data and write it to out_file

    Returns what remains of the truncated line."""

    end_index = line.find('<')
    if end_index == -1:
        out_file.write(line)
        # Used up all the line as plain passage data.
        return ''
    else:
        # Need a newline so that the tag is separate from the passage data.
        out_file.write(line[:end_index] + '\n')
        return line[end_index:]


def next_quote(line, index):
    """Return the index of the next quote
    Catches a -1 result, not catching this causes infinite loops.
    Add 1 as that's needed for all future searches."""

    quote_index = line[index:].find('"')
    if quote_index == -1:
        return 0
    return index + 1 + quote_index


def find_closing_tag(line):
    """Returns the index of the closing tag in line.

    Ensures that it doesn't return a > enclosed in quotes.
    This is because that may just be a character in a string value."""

    close_index = line.find('>')
    quote_index = line.find('"')

    # We need to ensure > isn't enclosed in quotes
    if quote_index != -1:
        # Keep searching until we find a valid closing tag
        while quote_index < close_index:
            quote_index = next_quote(line, quote_index)

            if quote_index > close_index:
                # Find the next > after "
                close_index = (quote_index +
                               line[quote_index:].find('>'))

            # Find the next quote that opens a keyvalue
            quote_index = next_quote(line, quote_index)
            if close_index == -1:
                raise ValueError(CLOSETAG_PARSE_ERROR.format(line))

    return close_index


def reformat_html(filepath):
    """Read Twine2's HTML format and write it out in a tidier format.

    Writes to the same directory as filepath, just with _temp in the name.
    Returns the filepath of the resulting file."""

    output_file = filepath.replace('.html', '_temp.html')
    with open(filepath) as in_file, open(output_file, 'w') as out_file:
        for line in in_file:
            while line:
                # If it's a passage.
                if not line.startswith('<'):
                    line = write_passage(out_file, line)
                    continue

                close_index = find_closing_tag(line)
                out_file.write(line[:close_index + 1] + '\n')
                line = line[close_index + 1:]

    return output_file


def read_as_json(filepath):
    """Return a dictionary of data from the parsed file at filepath.

    Reads whether a line is a tag, tag closer or text from a passage.
    Close tags are ignored, tag data and passages are parsed into data."""

    data = {}
    with open(filepath) as f:
        for line in f:
            if line.startswith('</'):
                # Closing tag, nothing to see here.
                continue

            if line.startswith('<'):
                # New tag, parse it into data then go to the next line
                parse_tag(line, data)
                continue

            # Anything else is passage data
            # Concatenate it to the current passage node.
            data[PASSAGE_TAG][-1]['text'] += line

    return data


def separate_tags(tag):
    """Takes a tag string and returns the key name and a dict of tag values.

    Tags are strings in the format:
    <tagname key="value" key="another value">

    They're parsed by stripping the <>, then splitting off the tagname.
    Then the rest of the string is read and removed one by one.
    Space and " characters need to be checked to determine whether a space is
    a new keyvalue pair or part of the current value in quotation marks."""

    tagdata = {}
    tag = tag.strip().strip('<>')
    tagname, pairs = tag.split(' ', 1)

    # Makes each loop the same ie always seeking a space character
    pairs += ' '
    while pairs:
        # Find the opening quote, then the closing quote after it.
        # str.find with a start argument keeps both indexes absolute.
        quote_index = pairs.find('"')
        if quote_index != -1:
            quote_index = pairs.find('"', quote_index + 1)

        # If there's no quote found, just find the next space.
        if quote_index == -1:
            space_index = pairs.find(' ')
        # Otherwise find the space after the second quote
        else:
            space_index = pairs.find(' ', quote_index)

        # Add the key-value pair that ends at that space.
        key, value = pairs[:space_index].split('=', 1)
        tagdata[key] = value.strip('"')

        pairs = pairs[space_index + 1:]

    return tagname, tagdata


def parse_tag(tag, data):
    """Parse Twine tag into the data dictionary which is modified in place.

    The tag name is the key, its value is a dictionary of the tag's key value
    pairs. Passage tags are stored in a list, as of now no other tag should
    be stored this way, and having multiple tags raises a ValueError."""

    tagname, tagdata = separate_tags(tag)
    if tagname == PASSAGE_TAG:
        # Create text string to be available for concatenating to later.
        tagdata['text'] = ''
        try:
            data[tagname].append(tagdata)
        except KeyError:
            data[tagname] = [tagdata]
    else:
        if tagname in data:
            raise ValueError(MULTIPLE_TAG_ERROR.format(tagname))
        data[tagname] = tagdata


if __name__ == "__main__":
    # Sample test
    inpath = r'Sample Data\TwineInput.html'
    outpath = r'Sample Data\FinalOutput.json'
    result = reformat_html(inpath)
    data = read_as_json(result)
    with open(outpath, 'w') as f:
        dump(data, f, indent=4)

1 Answer

Code Review user

Accepted answer

Answered 2015-11-06 14:37:28

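Stop reinventing the wheel. To parse HTML/XML, use an HTML/XML parser. No matter how tricky the layout looks, as long as they are fed well-formed data, they should handle it. That's their job.

Based on your sample input, I'll assume that Twine generates well-formed XML files. So you can get rid of your custom tag splitting and parsing and use the parser of your choice.

For instance, the standard library ships with xml.etree.ElementTree. You can use it to parse your file as follows: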

import xml.etree.ElementTree as ETree

inpath = r'Sample Data\TwineInput.html'
xml = ETree.parse(inpath)
for element in xml.getroot():
    print(element.tag, element.attrib)

Which prints:

style {'role': 'stylesheet', 'id': 'twine-user-stylesheet', 'type': 'text/twine-css'}
script {'role': 'script', 'id': 'twine-user-script', 'type': 'text/twine-javascript'}
tw-passagedata {'position': '197,62', 'name': 'Passage_A', 'pid': '1', 'tags': ''}
tw-passagedata {'position': '114,225', 'name': 'Passage_B', 'pid': '2', 'tags': 'tag-2'}
tw-passagedata {'position': '314,225', 'name': 'Passage_C', 'pid': '3', 'tags': 'tag-1 tag-2'}

That's pretty close to what you're looking for.
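
The worry in the question about a > sitting inside a quoted attribute value also goes away with a real parser, since > is legal inside an XML attribute value. A minimal sketch of that, assuming well-formed input; the SAMPLE string and its "a > b" passage name are invented here for illustration and do not come from a real Twine export:

```python
import xml.etree.ElementTree as ETree

# Hypothetical one-line input shaped like Twine's export. The name "a > b"
# is made up to exercise the "> inside quotes" case from the question.
SAMPLE = (
    '<tw-storydata name="Sample" startnode="1">'
    '<tw-passagedata pid="1" name="a > b" tags="">[[Passage B]]\n'
    '[[Go to passage C|Passage C]]</tw-passagedata>'
    '</tw-storydata>'
)

root = ETree.fromstring(SAMPLE)
passage = root.find('tw-passagedata')

# The parser splits attributes and text itself, with no reformatting pass.
print(passage.attrib['name'])
print(repr(passage.text))
```

The tags all sit on one line and the passage text spans two, yet no intermediate _temp file is needed before reading the attributes and text.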

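What's left to do is to handle the multiple tw-passagedata tags, add a text attribute, handle the case of the root tw-storydata, and possibly handle duplicate tags with the MULTIPLE_TAG_ERROR message: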

import xml.etree.ElementTree as ETree
from json import dump

PASSAGE_TAG = "tw-passagedata"
MULTIPLE_TAG_ERROR = "Found multiple '{}' tags, not currently supported"

def parse_twine_tag(element, data):
    """Parse Twine tag into the data dictionary which is modified in place.

    The tag name is the key, its value is a dictionary of the tag's key value
    pairs. Passage tags are stored in a list, as of now no other tag should
    be stored this way, and having multiple tags raises a ValueError.
    """

    tagname = element.tag
    attributes = element.attrib

    if tagname == PASSAGE_TAG:
        attributes['text'] = element.text
        data.setdefault(PASSAGE_TAG, []).append(attributes)
    elif tagname in data:
        raise ValueError(MULTIPLE_TAG_ERROR.format(tagname))
    else:
        data[tagname] = attributes

    for child in element:
        parse_twine_tag(child, data)

def parse_twine_file(filepath):
    """Return a dictionary of data from the parsed file at filepath"""

    xml = ETree.parse(filepath)
    data = dict()
    parse_twine_tag(xml.getroot(), data)
    return data

if __name__ == "__main__":
    # Sample test
    inpath = r'Sample Data\TwineInput.html'
    outpath = r'Sample Data\FinalOutput.json'

    data = parse_twine_file(inpath)
    with open(outpath, 'w') as f:
        dump(data, f, indent=4)

As expected, outpath contains:

{
    "style": {
        "role": "stylesheet", 
        "id": "twine-user-stylesheet", 
        "type": "text/twine-css"
    }, 
    "tw-passagedata": [
        {
            "position": "197,62", 
            "text": "[[Passage B]]\n[[Go to passage C|Passage C]]", 
            "name": "Passage_A", 
            "pid": "1", 
            "tags": ""
        }, 
        {
            "position": "114,225", 
            "text": "This is passage B\n[[Passage B]] \n[[Passage A]] ", 
            "name": "Passage_B", 
            "pid": "2", 
            "tags": "tag-2"
        }, 
        {
            "position": "314,225", 
            "text": "This passage goes nowhere.", 
            "name": "Passage_C", 
            "pid": "3", 
            "tags": "tag-1 tag-2"
        }
    ], 
    "script": {
        "role": "script", 
        "id": "twine-user-script", 
        "type": "text/twine-javascript"
    }, 
    "tw-storydata": {
        "startnode": "1", 
        "name": "Sample", 
        "creator-version": "2.0.8", 
        "ifid": "1A382346-FBC1-411F-837E-BAB9EE2FB2E9", 
        "format": "Harlowe", 
        "options": "", 
        "creator": "Twine"
    }
}
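
One behaviour to be aware of with this approach: ElementTree sets element.text to None for an element with no content, so an empty passage would be dumped as null rather than as the empty string that the line-based parser produced. A minimal sketch of one way to normalise that, assuming an empty string is what you want; passage_text is a hypothetical helper and the two-passage input is invented for illustration:

```python
import xml.etree.ElementTree as ETree


def passage_text(element):
    """Return the passage body, using '' when the element has no text."""
    return element.text or ''


# Hypothetical input: one empty passage and one passage with text.
root = ETree.fromstring(
    '<tw-storydata name="Sample">'
    '<tw-passagedata pid="1" name="Empty" tags=""></tw-passagedata>'
    '<tw-passagedata pid="2" name="Full" tags="">Hello</tw-passagedata>'
    '</tw-storydata>'
)

texts = [passage_text(p) for p in root.iter('tw-passagedata')]
print(texts)
```

In parse_twine_tag this would amount to assigning passage_text(element) instead of element.text to the text key.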
2 votes

Original content provided by Code Review. Original link:

https://codereview.stackexchange.com/questions/109988
