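For those who don't know, Twine is a simple tool for making interactive fiction. It lets you easily create a series of passages that link to each other, giving you a choose-your-own-adventure style structure. It exports as HTML, but offers no other export format if you just want to use Twine to author the nodes for use elsewhere. I thought JSON would be a more worthwhile format to have, so I decided to write this parser.

The source data is a bit messy, however, and looks like this: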
<tw-storydata name="Sample" startnode="1" creator="Twine" creator-version="2.0.8" ifid="1A382346-FBC1-411F-837E-BAB9EE2FB2E9" format="Harlowe" options=""><style role="stylesheet" id="twine-user-stylesheet" type="text/twine-css"></style><script role="script" id="twine-user-script" type="text/twine-javascript"></script><tw-passagedata pid="1" name="Passage_A" tags="" position="197,62">[[Passage B]]
[[Go to passage C|Passage C]]</tw-passagedata><tw-passagedata pid="2" name="Passage_B" tags="tag-2" position="114,225">This is passage B
[[Passage B]]
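[[Passage A]] </tw-passagedata><tw-passagedata pid="3" name="Passage_C" tags="tag-1 tag-2" position="314,225">This passage goes nowhere.</tw-passagedata></tw-storydata>

In case it's not clear (it wasn't to me at first), newlines only appear where the actual passage text contains newlines. Otherwise, all the tags run together on a single line. That isn't ideal for parsing, especially if I want to read the file line by line. So the first step in the process is to call my reformat_html function, which puts each tag on its own line and each line of passage text on its own line: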
<tw-storydata name="Sample" startnode="1" creator="Twine" creator-version="2.0.8" ifid="1A382346-FBC1-411F-837E-BAB9EE2FB2E9" format="Harlowe" options="">
<style role="stylesheet" id="twine-user-stylesheet" type="text/twine-css">
</style>
<script role="script" id="twine-user-script" type="text/twine-javascript">
</script>
<tw-passagedata pid="1" name="Passage_A" tags="" position="197,62">
[[Passage B]]
[[Go to passage C|Passage C]]
</tw-passagedata>
<tw-passagedata pid="2" name="Passage_B" tags="tag-2" position="114,225">
This is passage B
[[Passage B]]
[[Passage A]]
</tw-passagedata>
<tw-passagedata pid="3" name="Passage_C" tags="tag-1 tag-2" position="314,225">
This passage goes nowhere.
</tw-passagedata>
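</tw-storydata>

Now I can easily read it line by line: parse the key-value pairs in each opening tag, read the passage text separately from the tags, and know when each tag is closed. This tidied HTML can then be read into JSON with my read_as_json function, producing a result like this: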
{
    "style": {
        "type": "text/twine-css",
        "role": "stylesheet",
        "id": "twine-user-stylesheet"
    },
    "script": {
        "type": "text/twine-javascript",
        "role": "script",
        "id": "twine-user-script"
    },
    "tw-passagedata": [
        {
            "position": "197,62",
            "text": "[[Passage B]]\n[[Go to passage C|Passage C]]\n",
            "pid": "1",
            "name": "Passage_A",
            "tags": ""
        },
        {
            "position": "114,225",
            "text": "This is passage B\n[[Passage B]] \n[[Passage A]] \n",
            "pid": "2",
            "name": "Passage_B",
            "tags": "tag-2"
        },
        {
            "position": "314,225",
            "text": "This passage goes nowhere.\n\n",
            "pid": "3",
            "name": "Passage_C",
            "tags": "tag-1 tag-2"
        }
    ],
    "tw-storydata": {
        "startnode": "1",
        "name": "Sample",
        "format": "Harlowe",
        "creator": "Twine",
        "creator-version": "2.0.8",
        "ifid": "1A382346-FBC1-411F-837E-BAB9EE2FB2E9",
        "options": ""
    }
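}

Obviously this is a small example, and I haven't yet done any parsing of the actual passage text (i.e. hyperlinks or formatting), but I wanted to get some feedback on what I've done so far. Some of my parsing feels icky, but I couldn't think of a really elegant way to check whether a character is outside quotation marks.

Also, I did previously have the <, > and " characters as constants, but names like QUOTETAG and CLOSETAG didn't feel any more meaningful, especially when the comments make the context clear.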
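One common way to check whether a character sits outside quotation marks is a single left-to-right pass that toggles a flag on every quote. The sketch below illustrates that idea only; `find_unquoted` is a hypothetical helper that is not part of the code in this post, and it assumes values are delimited by double quotes with no escaped quotes inside them.

```python
def find_unquoted(line, target='>', quote='"'):
    """Return the index of the first target character outside quotes.

    Returns -1 when every occurrence is inside a quoted value
    or when the character does not occur at all.
    """
    in_quotes = False
    for index, char in enumerate(line):
        if char == quote:
            # Each quote character flips between inside and outside.
            in_quotes = not in_quotes
        elif char == target and not in_quotes:
            return index
    return -1


# The first '>' is part of the title value, so it is skipped.
print(find_unquoted('<a title="x > y" href="z">rest'))
```

Because the state is carried in one boolean, there is no need to search repeatedly for the next quote and compare indexes.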
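
I'm particularly interested in how this does on readability and accuracy. I haven't really done parsing before, so I may have made some naive mistakes. I also don't usually write code that other programmers need to use or extend, so making that easy for them is important.
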
from json import dump
from pprint import pprint
PASSAGE_TAG = "tw-passagedata"
MULTIPLE_TAG_ERROR = "Found multiple '{}' tags, not currently supported"
CLOSETAG_PARSE_ERROR = "Can't parse close tag in {}"


def write_passage(out_file, line):
    """Check how much of line is passage data and write it to out_file

    Returns what remains of the truncated line."""

    end_index = line.find('<')
    if end_index == -1:
        out_file.write(line)
        # Used up all the line as plain passage data.
        return ''
    else:
        # Need a newline so that the tag is separate from the passage data.
        out_file.write(line[:end_index] + '\n')
        return line[end_index:]


def next_quote(line, index):
    """Return the index of the next quote

    Catches a -1 result, not catching this causes infinite loops.
    Add 1 as that's needed for all future searches."""

    quote_index = line[index:].find('"')
    if quote_index == -1:
        return 0
    return index + 1 + quote_index


def find_closing_tag(line):
    """Returns the index of the closing tag in line.

    Ensures that it doesn't return a > enclosed in quotes.
    This is because that may just be a character in a string value."""

    close_index = line.find('>')
    quote_index = line.find('"')

    # We need to ensure > isn't enclosed in quotes
    if quote_index != -1:
        # Keep searching until we find a valid closing tag
        while quote_index < close_index:
            quote_index = next_quote(line, quote_index)
            if quote_index > close_index:
                # Find the next > after "
                close_index = (quote_index +
                               line[quote_index:].find('>'))
            # Find the next quote that opens a keyvalue
            quote_index = next_quote(line, quote_index)

    if close_index == -1:
        raise ValueError(CLOSETAG_PARSE_ERROR.format(line))
    return close_index


def reformat_html(filepath):
    """Read Twine2's HTML format and write it out in a tidier format.

    Writes to the same directory as filepath, just with _temp in the name.
    Returns the filepath of the resulting file."""

    output_file = filepath.replace('.html', '_temp.html')

    with open(filepath) as in_file, open(output_file, 'w') as out_file:
        for line in in_file:
            while line:
                # If it's a passage.
                if not line.startswith('<'):
                    line = write_passage(out_file, line)
                    continue

                close_index = find_closing_tag(line)
                out_file.write(line[:close_index + 1] + '\n')
                line = line[close_index + 1:]

    return output_file


def read_as_json(filepath):
    """Return a dictionary of data from the parsed file at filepath.

    Reads whether a line is a tag, tag closer or text from a passage.
    Close tags are ignored, tag data and passages are parsed into data."""

    data = {}
    with open(filepath) as f:
        for line in f:
            if line.startswith('</'):
                # Closing tag, nothing to see here.
                continue

            if line.startswith('<'):
                # New tag, parse it into data then go to the next line
                parse_tag(line, data)
                continue

            # Anything else is passage data
            # Concatenate it to the current passage node.
            data[PASSAGE_TAG][-1]['text'] += line

    return data


def separate_tags(tag):
    """Takes a tag string and returns the key name and a dict of tag values.

    Tags are strings in the format:
    <tagname key="value" key="another value">
    They're parsed by stripping the <>, then splitting off the tagname.
    Then the rest of the string is read and removed one by one.
    Space and " characters need to be checked to determine whether a space is
    a new keyvalue pair or part of the current value in quotation marks."""

    tagdata = {}
    tag = tag.strip().strip('<>')
    tagname, pairs = tag.split(' ', 1)
    # Makes each loop the same ie always seeking a space character
    pairs += ' '

    while pairs:
        # Find the second quotation mark
        quote_index = pairs.find('"')
        quote_index = 1 + pairs[quote_index + 1:].find('"')

        # If there's no quote found, just find the next space.
        if quote_index == -1:
            space_index = pairs.find(' ')
        # Otherwise find the space after the second quote
        else:
            space_index = quote_index + pairs[quote_index:].find(' ')

        # Add the keyvalue pair that's
        key, value = pairs[:space_index].split('=')
        tagdata[key] = value.strip('"')
        pairs = pairs[space_index + 1:]

    return tagname, tagdata


def parse_tag(tag, data):
    """Parse Twine tag into the data dictionary which is modified in place.

    The tag name is the key, its value is a dictionary of the tag's key value
    pairs. Passage tags are stored in a list, as of now no other tag should
    be stored this way, and having multiple tags raises a ValueError."""

    tagname, tagdata = separate_tags(tag)

    if tagname == PASSAGE_TAG:
        # Create text string to be available for concatenating to later.
        tagdata['text'] = ''
        try:
            data[tagname].append(tagdata)
        except KeyError:
            data[tagname] = [tagdata]
    else:
        if tagname in data:
            raise ValueError(MULTIPLE_TAG_ERROR.format(tagname))
        data[tagname] = tagdata


if __name__ == "__main__":
    # Sample test
    inpath = r'Sample Data\TwineInput.html'
    outpath = r'Sample Data\FinalOutput.json'
    result = reformat_html(inpath)
    data = read_as_json(result)
    with open(outpath, 'w') as f:
        dump(data, f, indent=4)

Posted on 2015-11-06 14:37:28

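Stop reinventing the wheel. To parse HTML/XML, use an HTML/XML parser. No matter how tricky the layout looks, as long as well-formed data is fed into them, they should handle it. That's their job.

Based on your example input, I'll assume Twine generates well-formed XML files. You can therefore get rid of your custom tag splitting and parsing and use a parser of your choice.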
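
If the well-formed XML assumption turns out not to hold for a particular export, a more tolerant option is the standard library's `html.parser`. The following is only a minimal sketch of that alternative: the `PassageCollector` class name and the shortened `sample` string are made up for illustration, and only the passage elements are collected.

```python
from html.parser import HTMLParser

PASSAGE_TAG = "tw-passagedata"


class PassageCollector(HTMLParser):
    """Collect the attributes and text of every tw-passagedata element."""

    def __init__(self):
        super().__init__()
        self.passages = []
        self._in_passage = False

    def handle_starttag(self, tag, attrs):
        if tag == PASSAGE_TAG:
            # attrs arrives as a list of (name, value) pairs.
            passage = dict(attrs)
            passage['text'] = ''
            self.passages.append(passage)
            self._in_passage = True

    def handle_endtag(self, tag):
        if tag == PASSAGE_TAG:
            self._in_passage = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so concatenate it.
        if self._in_passage:
            self.passages[-1]['text'] += data


# Shortened, hypothetical input in the same shape as the question's sample.
sample = (
    '<tw-storydata name="Sample" startnode="1">'
    '<tw-passagedata pid="1" name="Passage_A" tags="" position="197,62">'
    '[[Passage B]]\n[[Go to passage C|Passage C]]</tw-passagedata>'
    '<tw-passagedata pid="3" name="Passage_C" tags="tag-1 tag-2" '
    'position="314,225">This passage goes nowhere.</tw-passagedata>'
    '</tw-storydata>'
)

collector = PassageCollector()
collector.feed(sample)
collector.close()
for passage in collector.passages:
    print(passage['name'], repr(passage['text']))
```

When the input is well-formed XML, as assumed here, an XML parser needs less code than this event-based approach.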

For instance, the standard library comes with xml.etree.ElementTree. You can use it to parse your file like so:

import xml.etree.ElementTree as ETree

inpath = r'Sample Data\TwineInput.html'
xml = ETree.parse(inpath)
for element in xml.getroot():
    print(element.tag, element.attrib)

Which prints:

style {'role': 'stylesheet', 'id': 'twine-user-stylesheet', 'type': 'text/twine-css'}
script {'role': 'script', 'id': 'twine-user-script', 'type': 'text/twine-javascript'}
tw-passagedata {'position': '197,62', 'name': 'Passage_A', 'pid': '1', 'tags': ''}
tw-passagedata {'position': '114,225', 'name': 'Passage_B', 'pid': '2', 'tags': 'tag-2'}
tw-passagedata {'position': '314,225', 'name': 'Passage_C', 'pid': '3', 'tags': 'tag-1 tag-2'}

That is pretty close to what you're looking for.
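
What is left to do is handle multiple tw-passagedata tags, add a text attribute, handle the case of the root tw-storydata element, and possibly handle duplicate tags using the MULTIPLE_TAG_ERROR message:
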
import xml.etree.ElementTree as ETree
from json import dump
PASSAGE_TAG = "tw-passagedata"
MULTIPLE_TAG_ERROR = "Found multiple '{}' tags, not currently supported"


def parse_twine_tag(element, data):
    """Parse Twine tag into the data dictionary which is modified in place.

    The tag name is the key, its value is a dictionary of the tag's key value
    pairs. Passage tags are stored in a list, as of now no other tag should
    be stored this way, and having multiple tags raises a ValueError.
    """

    tagname = element.tag
    attributes = element.attrib

    if tagname == PASSAGE_TAG:
        attributes['text'] = element.text
        data.setdefault(PASSAGE_TAG, []).append(attributes)
    elif tagname in data:
        raise ValueError(MULTIPLE_TAG_ERROR.format(tagname))
    else:
        data[tagname] = attributes

    for child in element:
        parse_twine_tag(child, data)


def parse_twine_file(filepath):
    """Return a dictionary of data from the parsed file at filepath"""

    xml = ETree.parse(filepath)
    data = dict()
    parse_twine_tag(xml.getroot(), data)
    return data


if __name__ == "__main__":
    # Sample test
    inpath = r'Sample Data\TwineInput.html'
    outpath = r'Sample Data\FinalOutput.json'
    data = parse_twine_file(inpath)
    with open(outpath, 'w') as f:
        dump(data, f, indent=4)

As expected, outpath contains:

{
    "style": {
        "role": "stylesheet",
        "id": "twine-user-stylesheet",
        "type": "text/twine-css"
    },
    "tw-passagedata": [
        {
            "position": "197,62",
            "text": "[[Passage B]]\n[[Go to passage C|Passage C]]",
            "name": "Passage_A",
            "pid": "1",
            "tags": ""
        },
        {
            "position": "114,225",
            "text": "This is passage B\n[[Passage B]] \n[[Passage A]] ",
            "name": "Passage_B",
            "pid": "2",
            "tags": "tag-2"
        },
        {
            "position": "314,225",
            "text": "This passage goes nowhere.",
            "name": "Passage_C",
            "pid": "3",
            "tags": "tag-1 tag-2"
        }
    ],
    "script": {
        "role": "script",
        "id": "twine-user-script",
        "type": "text/twine-javascript"
    },
    "tw-storydata": {
        "startnode": "1",
        "name": "Sample",
        "creator-version": "2.0.8",
        "ifid": "1A382346-FBC1-411F-837E-BAB9EE2FB2E9",
        "format": "Harlowe",
        "options": "",
        "creator": "Twine"
    }
}

https://codereview.stackexchange.com/questions/109988
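
The question notes that the hyperlinks inside the passage text are not parsed yet. As a possible next step on top of the "text" values above, here is a hedged sketch using a regular expression. `LINK_PATTERN` and `extract_links` are hypothetical names, and the pattern assumes only the two link forms that appear in the sample, `[[Target]]` and `[[Label|Target]]`; any other link syntax a story format may support is not handled.

```python
import re

# Matches [[Target]] and [[Label|Target]]; the "Label|" part is optional.
LINK_PATTERN = re.compile(r'\[\[(?:([^\]|]+)\|)?([^\]|]+)\]\]')


def extract_links(text):
    """Return a list of (label, target) pairs found in passage text."""
    links = []
    for label, target in LINK_PATTERN.findall(text):
        # For [[Target]] the label group is empty, so reuse the target.
        links.append((label or target, target))
    return links


for link in extract_links('[[Passage B]]\n[[Go to passage C|Passage C]]'):
    print(link)
```

Matching each target against the passage "name" attributes is left out here, since the sample's names and link targets are not spelled identically.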