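Overview
I'm refactoring a project and separating out a bunch of parsing code. The code I'm concerned with uses pyparsing.
Even after spending a lot of time reading the official documentation, my understanding of pyparsing is still poor. I'm having trouble because (1) pyparsing takes a (deliberately) unorthodox approach to parsing, and (2) I'm working on code I didn't write, with poor comments and a set of existing grammars.
(I also can't get in touch with the original author.)
Failing Test
I'm using PyVows to test my code. One of my tests is shown below (I think it is clear even if you are unfamiliar with PyVows; let me know if it isn't):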
def test_multiline_command_ends(self, topic):
    output = parsed_input('multiline command ends\n\n',topic)
    expect(output).to_equal(
r'''['multiline', 'command ends', '\n', '\n']
- args: command ends
- multiline_command: multiline
- statement: ['multiline', 'command ends', '\n', '\n']
 - args: command ends
 - multiline_command: multiline
 - terminator: ['\n', '\n']
- terminator: ['\n', '\n']''')

But when I run the test, I get the following in the terminal:
Failed Test Result
Expected topic("['multiline', 'command ends']\n- args: command ends\n- command: multiline\n- statement: ['multiline', 'command ends']\n - args: command ends\n - command: multiline")
to equal "['multiline', 'command ends', '\\n', '\\n']\n- args: command ends\n- multiline_command: multiline\n- statement: ['multiline', 'command ends', '\\n', '\\n']\n - args: command ends\n - multiline_command: multiline\n - terminator: ['\\n', '\\n']\n- terminator: ['\\n', '\\n']"

Note:
Because the output goes to a terminal, the expected output (the second string) has extra backslashes. This is normal. The test ran without problems before this round of refactoring began.
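The doubled backslashes can be reproduced with plain Python, independent of PyVows or pyparsing. This is a minimal sketch that assumes the test runner prints the expected value with `repr()`; the post only says the output goes to the terminal, so that detail is an assumption.

```python
# The expected value is a raw string, so '\n' inside it is two characters:
# a backslash followed by the letter n (not a real newline).
expected = r"['multiline', 'command ends', '\n', '\n']"

# repr() escapes every literal backslash, which is where '\\n' comes from.
shown = repr(expected)

print(len(r'\n'))  # 2
print(shown)       # "['multiline', 'command ends', '\\n', '\\n']"
```

So the `\\n` sequences in the failure message are a display artifact, not a difference in the compared values.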
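Expected Behavior
The first line of output should match the second, but it doesn't. Specifically, it does not include the two newlines in the first list object.
So I'm getting this:
"['multiline', 'command ends']\n- args: command ends\n- command: multiline\n- statement: ['multiline', 'command ends']\n - args: command ends\n - command: multiline"

When I should be getting this:
"['multiline', 'command ends', '\\n', '\\n']\n- args: command ends\n- multiline_command: multiline\n- statement: ['multiline', 'command ends', '\\n', '\\n']\n - args: command ends\n - multiline_command: multiline\n - terminator: ['\\n', '\\n']\n- terminator: ['\\n', '\\n']"

Earlier in the code, there is also this statement:
pyparsing.ParserElement.setDefaultWhitespaceChars(' \t')

...which I think should prevent exactly this kind of error. But I'm not sure.
Even if the problem can't be identified with certainty, simply narrowing down where it lies would help a lot.
Please let me know how I might take a step or two towards fixing this.
Edit: So, uh, I should post the parser code, shouldn't I? (Thanks for the tip, andrew cooke!)
Parser code
Here is the __init__ for my parser object.
I know, it's a nightmare. That's why I'm refactoring the project. ☺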
# Module-level imports this excerpt relies on
# (PYP is the alias the code below uses for pyparsing).
import re
import pyparsing as PYP

def __init__(self, Cmd_object=None, *args, **kwargs):
    #   @NOTE
    #   This is one of the biggest pain points of the existing code.
    #   To aid in readability, I CAPITALIZED all variables that are
    #   not set on `self`.
    #
    #   That means that CAPITALIZED variables aren't
    #   used outside of this method.
    #
    #   Doing this has allowed me to more easily read what
    #   variables become a part of other variables during the
    #   building-up of the various parsers.
    #
    #   I realize the capitalized variables is unorthodox
    #   and potentially anti-convention. But after reaching out
    #   to the project's creator several times over roughly 5
    #   months, I'm still working on this project alone...
    #   And without help, this is the only way I can move forward.
    #
    #   I have a very poor understanding of the parser's
    #   control flow when the user types a command and hits ENTER,
    #   and until the author (or another pyparsing expert)
    #   explains what's happening to me, I have to do silly
    #   things like this. :-|
    #
    #   Of course, if the impossible happens and this code
    #   gets cleaned up, then the variables will be restored to
    #   proper capitalization.
    #
    #   —Zearin
    #   http://github.com/zearin/
    #   2012 Mar 26
    if Cmd_object is not None:
        self.Cmd_object = Cmd_object
    else:
        raise Exception('Cmd_object must be provided to Parser.__init__().')
    #   @FIXME
    #       Refactor methods into this class later
    preparse = self.Cmd_object.preparse
    postparse = self.Cmd_object.postparse
    self._allow_blank_lines = False
    self.abbrev = True  # Recognize abbreviated commands
    self.case_insensitive = True  # Commands recognized regardless of case
    # make sure your terminators are not in legal_chars!
    self.legal_chars = u'!#$%.:?@_' + PYP.alphanums + PYP.alphas8bit
    # the membership test and the lookup must use the same key
    self.multiln_commands = kwargs.get('multiln_commands', [])
    self.no_special_parse = {'ed','edit','exit','set'}
    self.redirector = '>'  # for sending output to file
    self.reserved_words = []
    self.shortcuts = { '?' : 'help' ,
                       '!' : 'shell',
                       '@' : 'load' ,
                       '@@': '_relative_load'
                     }
    #   self._init_grammars()
    #
    # def _init_grammars(self):
    #   @FIXME
    #       Add Docstring
    #   ----------------------------
    #   Tell PYP how to parse
    #   file input from '< filename'
    #   ----------------------------
    FILENAME = PYP.Word(self.legal_chars + '/\\')
    INPUT_MARK = PYP.Literal('<')
    INPUT_MARK.setParseAction(lambda x: '')
    INPUT_FROM = FILENAME('INPUT_FROM')
    INPUT_FROM.setParseAction( self.Cmd_object.replace_with_file_contents )
    #   ----------------------------
    #OUTPUT_PARSER = (PYP.Literal('>>') | (PYP.WordStart() + '>') | PYP.Regex('[^=]>'))('output')
    OUTPUT_PARSER = (PYP.Literal( 2 * self.redirector) | \
                     (PYP.WordStart() + self.redirector) | \
                     PYP.Regex('[^=]' + self.redirector))('output')
    PIPE = PYP.Keyword('|', identChars='|')
    STRING_END = PYP.stringEnd ^ '\nEOF'
    TERMINATORS = [';']
    TERMINATOR_PARSER = PYP.Or([
        (hasattr(t, 'parseString') and t)
        or
        PYP.Literal(t) for t in TERMINATORS
    ])('terminator')
    self.comment_grammars = PYP.Or([ PYP.pythonStyleComment,
                                     PYP.cStyleComment ])
    self.comment_grammars.ignore(PYP.quotedString)
    self.comment_grammars.setParseAction(lambda x: '')
    self.comment_grammars.addParseAction(lambda x: '')
    self.comment_in_progress = '/*' + PYP.SkipTo(PYP.stringEnd ^ '*/')
    #   QuickRef: Pyparsing Operators
    #   ----------------------------
    #   ~   creates NotAny using the expression after the operator
    #
    #   +   creates And using the expressions before and after the operator
    #
    #   |   creates MatchFirst (first left-to-right match) using the
    #       expressions before and after the operator
    #
    #   ^   creates Or (longest match) using the expressions before and
    #       after the operator
    #
    #   &   creates Each using the expressions before and after the operator
    #
    #   *   creates And by multiplying the expression by the integer operand;
    #       if expression is multiplied by a 2-tuple, creates an And of
    #       (min,max) expressions (similar to "{min,max}" form in
    #       regular expressions); if min is None, interpret as (0,max);
    #       if max is None, interpret as expr*min + ZeroOrMore(expr)
    #
    #   -   like + but with no backup and retry of alternatives
    #
    #   *   repetition of expression
    #
    #   ==  matching expression to string; returns True if the string
    #       matches the given expression
    #
    #   <<  inserts the expression following the operator as the body of the
    #       Forward expression before the operator
    #   ----------------------------
    DO_NOT_PARSE = self.comment_grammars | \
                   self.comment_in_progress | \
                   PYP.quotedString
    # moved here from class-level variable
    self.URLRE = re.compile('(https?://[-\\w\\./]+)')
    self.keywords = self.reserved_words + [fname[3:] for fname in dir( self.Cmd_object ) if fname.startswith('do_')]
    # not to be confused with `multiln_parser` (below)
    self.multiln_command = PYP.Or([
        PYP.Keyword(c, caseless=self.case_insensitive)
        for c in self.multiln_commands
    ])('multiline_command')
    ONELN_COMMAND = ( ~self.multiln_command +
                      PYP.Word(self.legal_chars)
                    )('command')
    #self.multiln_command.setDebug(True)
    # Configure according to `allow_blank_lines` setting
    if self._allow_blank_lines:
        self.blankln_termination_parser = PYP.NoMatch
    else:
        BLANKLN_TERMINATOR = (2 * PYP.lineEnd)('terminator')
        #BLANKLN_TERMINATOR('terminator')
        self.blankln_termination_parser = (
            (self.multiln_command ^ ONELN_COMMAND)
            + PYP.SkipTo(
                BLANKLN_TERMINATOR,
                ignore=DO_NOT_PARSE
            ).setParseAction(lambda x: x[0].strip())('args')
            + BLANKLN_TERMINATOR
        )('statement')
    # CASE SENSITIVITY for
    # ONELN_COMMAND and self.multiln_command
    if self.case_insensitive:
        # Set parsers to account for case insensitivity (if appropriate)
        self.multiln_command.setParseAction(lambda x: x[0].lower())
        ONELN_COMMAND.setParseAction(lambda x: x[0].lower())
    self.save_parser = ( PYP.Optional(PYP.Word(PYP.nums)^'*')('idx')
                       + PYP.Optional(PYP.Word(self.legal_chars + '/\\'))('fname')
                       + PYP.stringEnd)
    AFTER_ELEMENTS = PYP.Optional(PIPE +
                                  PYP.SkipTo(
                                      OUTPUT_PARSER ^ STRING_END,
                                      ignore=DO_NOT_PARSE
                                  )('pipeTo')
                     ) + \
                     PYP.Optional(OUTPUT_PARSER +
                                  PYP.SkipTo(
                                      STRING_END,
                                      ignore=DO_NOT_PARSE
                                  ).setParseAction(lambda x: x[0].strip())('outputTo')
                     )
    self.multiln_parser = (((self.multiln_command ^ ONELN_COMMAND)
                            + PYP.SkipTo(
                                TERMINATOR_PARSER,
                                ignore=DO_NOT_PARSE
                            ).setParseAction(lambda x: x[0].strip())('args')
                            + TERMINATOR_PARSER)('statement')
                           + PYP.SkipTo(
                               OUTPUT_PARSER ^ PIPE ^ STRING_END,
                               ignore=DO_NOT_PARSE
                           ).setParseAction(lambda x: x[0].strip())('suffix')
                           + AFTER_ELEMENTS
                          )
    #self.multiln_parser.setDebug(True)
    self.multiln_parser.ignore(self.comment_in_progress)
    self.singleln_parser = (
        ( ONELN_COMMAND + PYP.SkipTo(
            TERMINATOR_PARSER
            ^ STRING_END
            ^ PIPE
            ^ OUTPUT_PARSER,
            ignore=DO_NOT_PARSE
        ).setParseAction(lambda x:x[0].strip())('args'))('statement')
        + PYP.Optional(TERMINATOR_PARSER)
        + AFTER_ELEMENTS)
    #self.multiln_parser = self.multiln_parser('multiln_parser')
    #self.singleln_parser = self.singleln_parser('singleln_parser')
    self.prefix_parser = PYP.Empty()
    self.parser = self.prefix_parser + (STRING_END |
                                        self.multiln_parser |
                                        self.singleln_parser |
                                        self.blankln_termination_parser |
                                        self.multiln_command +
                                        PYP.SkipTo(
                                            STRING_END,
                                            ignore=DO_NOT_PARSE)
                                        )
    self.parser.ignore(self.comment_grammars)
    # a not-entirely-satisfactory way of distinguishing
    # '<' as in "import from" from
    # '<' as in "lesser than"
    self.input_parser = INPUT_MARK + \
                        PYP.Optional(INPUT_FROM) + \
                        PYP.Optional('>') + \
                        PYP.Optional(FILENAME) + \
                        (PYP.stringEnd | '|')
    self.input_parser.ignore(self.comment_in_progress)

Posted 2012-04-11 18:39:50
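I fixed it!
Pyparsing was not at fault!
I was. ☹
By separating the parsing code into a different object, I created the problem. Originally, one attribute used to "update itself" based on the contents of a second attribute. Since this all used to be contained in one "god class", it worked fine.
Simply by separating the code into another object, the first attribute was set at instantiation, but it no longer "updated itself" when the second attribute it depended on changed.
Specifics
The attribute multiln_command (not to be confused with multiln_commands; aargh, what confusing naming!) was a pyparsing grammar definition. The multiln_command attribute should have updated its grammar whenever multiln_commands changed.
Although I knew these two attributes had similar names but very different purposes, the similarity definitely made the problem harder to track down. I have since renamed multiln_command to multiln_grammar.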
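The stale-attribute situation can be sketched without pyparsing. In the sketch below, `build_grammar`, `StaleParser`, and `LiveParser` are hypothetical names, and a compiled regular expression stands in for the real pyparsing grammar; it only illustrates the difference between a grammar built once at instantiation and one rebuilt from the current command list.

```python
import re


def build_grammar(commands):
    """Stand-in for the pyparsing expression: an alternation of keywords."""
    if not commands:
        # An empty negative lookahead never matches anything.
        return re.compile(r'(?!)')
    return re.compile(r'(?:%s)\b' % '|'.join(map(re.escape, commands)))


class StaleParser:
    """Builds the grammar once, so later list changes are not seen."""

    def __init__(self):
        self.multiln_commands = []
        self.multiln_grammar = build_grammar(self.multiln_commands)


class LiveParser:
    """Rebuilds the grammar from the current list on every access."""

    def __init__(self):
        self.multiln_commands = []

    @property
    def multiln_grammar(self):
        return build_grammar(self.multiln_commands)


stale, live = StaleParser(), LiveParser()
for parser in (stale, live):
    parser.multiln_commands.append('multiline')

text = 'multiline command ends'
print(bool(stale.multiln_grammar.match(text)))  # False
print(bool(live.multiln_grammar.match(text)))   # True
```

Rebuilding on every access is the simplest way to keep the two in sync; caching the grammar and invalidating it when the list changes would be an alternative if construction is expensive.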
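However! ☺
I am grateful for Paul McGuire's awesome answer, and I hope it saves me (and others) some grief in the future. Although I feel a bit foolish for having caused the problem (and for misdiagnosing it as a pyparsing issue), I'm glad some good, in the form of Paul's advice, has come of asking this question.
Happy parsing, everybody. :)
Posted 2012-04-11 03:53:53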
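I suspect the problem is pyparsing's built-in whitespace skipping, which skips newlines by default. Even though setDefaultWhitespaceChars is used to tell pyparsing that newlines are significant, this setting only affects expressions that are created after the call to setDefaultWhitespaceChars. The problem is that pyparsing tries to help by defining a number of convenience expressions when it is imported, like empty for Empty(), lineEnd for LineEnd(), and so on. But since these are all created at import time, they are defined with the original default whitespace characters, which include '\n'.
I should probably just do this in setDefaultWhitespaceChars, but you can clean this up for yourself too. Right after calling setDefaultWhitespaceChars, redefine these module-level expressions: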
PYP.ParserElement.setDefaultWhitespaceChars(' \t')
# redefine module-level constants to use new default whitespace chars
PYP.empty = PYP.Empty()
PYP.lineEnd = PYP.LineEnd()
PYP.stringEnd = PYP.StringEnd()

I think this will help restore the significance of your embedded newlines.
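The creation-time effect described in this answer can be modeled with a toy class. The sketch below is not pyparsing's implementation: `Expr` and its methods are hypothetical stand-ins, written on the assumption that each expression copies the class-wide default when it is constructed.

```python
class Expr:
    """Toy stand-in for a parser element (not pyparsing's real class)."""

    DEFAULT_WHITE_CHARS = ' \t\n\r'

    def __init__(self):
        # Each instance copies the class default at creation time.
        self.white_chars = Expr.DEFAULT_WHITE_CHARS

    @classmethod
    def set_default_whitespace_chars(cls, chars):
        # Only changes what instances created from now on will copy.
        cls.DEFAULT_WHITE_CHARS = chars

    def skips_newlines(self):
        return '\n' in self.white_chars


# Mimics a convenience expression created when the module is imported.
import_time_expr = Expr()

Expr.set_default_whitespace_chars(' \t')

# Mimics redefining the expression after changing the default.
redefined_expr = Expr()

print(import_time_expr.skips_newlines())  # True: still holds the old default
print(redefined_expr.skips_newlines())    # False
```

Under that assumption, any expression built before the call keeps skipping newlines, which is why the module-level names have to be rebound afterwards.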
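Some other bits on your parser code:
self.blankln_termination_parser = PYP.NoMatch

should be
self.blankln_termination_parser = PYP.NoMatch()

Your original author might have been overly aggressive in using '^' over '|'. Only use '^' if there is some potential for parsing one expression by accident when you would really have wanted a longer one that comes later in the list of alternatives. For instance, in:
self.save_parser = ( PYP.Optional(PYP.Word(PYP.nums)^'*')('idx')

there is no possible confusion between a Word of numeric digits and a lone '*'. Or (or the '^' operator) tells pyparsing to evaluate all of the alternatives and then pick the longest match; in case of a tie, it picks the left-most alternative in the list. If you parse '*', there is no need to see whether it might also match a longer integer, and if you parse an integer, there is no need to see whether it might also pass as a lone '*'. So change this to:
self.save_parser = ( PYP.Optional(PYP.Word(PYP.nums)|'*')('idx')

Using a parse action to replace a string with '' is more simply written with a PYP.Suppress wrapper or, if you prefer, by calling expr.suppress(), which returns Suppress(expr). Combined with the preference for '|' over '^', this: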
self.comment_grammars = PYP.Or([ PYP.pythonStyleComment,
                                 PYP.cStyleComment ])
self.comment_grammars.ignore(PYP.quotedString)
self.comment_grammars.setParseAction(lambda x: '')

becomes:
self.comment_grammars = (PYP.pythonStyleComment | PYP.cStyleComment
                        ).ignore(PYP.quotedString).suppress()

Keywords have built-in logic to avoid ambiguity automatically, so Or is completely unnecessary with them:
self.multiln_command = PYP.Or([
    PYP.Keyword(c, caseless=self.case_insensitive)
    for c in self.multiln_commands
])('multiline_command')

should be:
self.multiln_command = PYP.MatchFirst([
    PYP.Keyword(c, caseless=self.case_insensitive)
    for c in self.multiln_commands
])('multiline_command')

(In the next release, I'll loosen up those initializers to accept generator expressions, so that the []'s become unnecessary.)
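The difference between first-match and longest-match alternation, and why keyword-style matching makes the cheaper one safe, can be shown with a toy model over plain string prefixes. The functions below are hypothetical illustrations and do not use pyparsing.

```python
def match_first(alternatives, text):
    """Return the first alternative (left to right) that prefixes text."""
    for alt in alternatives:
        if text.startswith(alt):
            return alt
    return None


def match_longest(alternatives, text):
    """Try every alternative and return the longest matching prefix."""
    matches = [alt for alt in alternatives if text.startswith(alt)]
    # max() keeps the first of equally long matches, so ties go left-most.
    return max(matches, key=len) if matches else None


def match_keyword(word, text, ident_chars='abcdefghijklmnopqrstuvwxyz_'):
    """Match word only if the next character cannot extend an identifier."""
    if not text.startswith(word):
        return None
    rest = text[len(word):]
    return None if rest and rest[0] in ident_chars else word


# Overlapping alternatives: the two strategies disagree.
print(match_first(['in', 'int'], 'int x'))    # in
print(match_longest(['in', 'int'], 'int x'))  # int

# Disjoint alternatives: both agree, so first-match is enough.
print(match_first(['42', '*'], '*'), match_longest(['42', '*'], '*'))  # * *

# Keyword-style matching rejects 'in' inside 'int', so order stops mattering.
print(match_keyword('in', 'int x'), match_keyword('int', 'int x'))  # None int
```

Longest-match has to try every alternative, so it only pays for itself when alternatives can overlap the way 'in' and 'int' do.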
That's all I can see for now. Hope this helps.
https://stackoverflow.com/questions/10095299