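I have a set of 500-600 files that I want to search through and extract data from. I'm trying to use pyparsing, with only limited success. A file contains only three kinds of things: (1) comments, (2) simple assignments, and (3) nested assignments. The nesting goes about 6 levels deep.
My goal is to look at a particular level-3 field and, if it has a specific value, extract the value of another level-3 field that belongs to the same level-2 field.
First, is pyparsing the right tool for this? If not, what would you suggest instead?
I know how to build a list of files and iterate over them. Let me show a sample file, then the code I'm trying.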
# TOP_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TOP_OBJECT=
(
obj_fmt=
(
obj_name="foo"
obj_cre_date=737785182 # = Tue May 18 23:19:42 1993
opj_data=
(
a="continue"
b="quit"
)
obj_version=264192 # = Version 4.8.0
)
# LEVEL1_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
LEVEL1_OBJECT=
(
OBJ_part=
(
obj_type=1005
obj_size=120
)
# LEVEL2_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
LEVEL2_OBJECT_A=
(
OBJ_part=
(
obj_type=3001
obj_size=128
)
Another_part=
(
another_attr=
(
another_style=0
another_param=2
)
)
) ### End of LEVEL2_OBJECT_A ###
# LEVEL2_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
LEVEL2_OBJECT_B=
(
OBJ_part=
(
obj_type=3005
obj_size=128
)
Another_part=
(
another_attr=
(
another_style=0
another_param=8
)
)
) ### End of LEVEL2_OBJECT_B ###
) ### End of LEVEL1 OBJECT
) ### End of TOP_OBJECT ###
My code to process the file looks like this:
from pyparsing import *
def Syntax():
    comment = Group("#" + restOfLine).suppress()
    eq = Literal('=')
    lpar = Literal( '(' ).suppress()
    rpar = Literal( ')' ).suppress()
    num = Word(nums)
    var = Word(alphas + "_")
    simpleAssign = var + eq
    nestedAssign = Group(lpar + OneOrMore(simpleAssign) + rpar)
    expr = Forward()
    atom = nestedAssign | simpleAssign
    expr << atom
    expr.ignore(comment)
    return expr
def main():
    expr = Syntax()
    results = expr.parseFile( "for_show.asc" )
    print(results)
if __name__ == '__main__':
    main()
My results don't get any further than: 'TOP_OBJECT', '='
For now I'm not handling quoted strings or numbers; I'm just trying to understand how to parse the nested lists.
Posted on 2013-01-06 05:23:28
Mostly, you just have a few gaps in your parser. Compare the commented-out original lines with the current code:
def Syntax():
    comment = Group("#" + restOfLine).suppress()
    eq = Literal('=')
    lpar = Literal( '(' ).suppress()
    rpar = Literal( ')' ).suppress()
    num = Word(nums)
    #~ var = Word(alphas + "_")
    var = Word(alphas + "_", alphanums+"_")
    #~ simpleAssign = var + eq
    expr = Forward()
    simpleAssign = var + eq + (num | quotedString)
    #~ nestedAssign = Group(lpar + OneOrMore(simpleAssign) + rpar)
    nestedAssign = var + eq + Group(lpar + OneOrMore(expr) + rpar)
    atom = nestedAssign | simpleAssign
    expr << atom
    expr.ignore(comment)
    return expr
This gives:
['TOP_OBJECT',
'=',
['obj_fmt',
'=',
['obj_name',
'=',
'"foo"',
'obj_cre_date',
'=',
'737785182',
'opj_data',
'=',
['a', '=', '"continue"', 'b', '=', '"quit"'],
'obj_version',
'=',
'264192'],
'LEVEL1_OBJECT',
'=',
['OBJ_part',
'=',
['obj_type', '=', '1005', 'obj_size', '=', '120'],
'LEVEL2_OBJECT_A',
'=',
['OBJ_part',
'=',
['obj_type', '=', '3001', 'obj_size', '=', '128'],
'Another_part',
'=',
['another_attr',
'=',
['another_style', '=', '0', 'another_param', '=', '2']]],
'LEVEL2_OBJECT_B',
'=',
['OBJ_part',
'=',
['obj_type', '=', '3005', 'obj_size', '=', '128'],
'Another_part',
'=',
['another_attr',
'=',
['another_style', '=', '0', 'another_param', '=', '8']]]]]]
If you also wrap expr in a Group inside nestedAssign's OneOrMore:
    nestedAssign = var + eq + Group(lpar + OneOrMore(Group(expr)) + rpar)
then I think you will get a better structure for your repeated nested assignments:
['TOP_OBJECT',
'=',
[['obj_fmt',
'=',
[['obj_name', '=', '"foo"'],
['obj_cre_date', '=', '737785182'],
['opj_data', '=', [['a', '=', '"continue"'], ['b', '=', '"quit"']]],
['obj_version', '=', '264192']]],
['LEVEL1_OBJECT',
'=',
[['OBJ_part',
'=',
[['obj_type', '=', '1005'], ['obj_size', '=', '120']]],
['LEVEL2_OBJECT_A',
'=',
[['OBJ_part',
'=',
[['obj_type', '=', '3001'], ['obj_size', '=', '128']]],
['Another_part',
'=',
[['another_attr',
'=',
[['another_style', '=', '0'], ['another_param', '=', '2']]]]]]],
['LEVEL2_OBJECT_B',
'=',
[['OBJ_part',
'=',
[['obj_type', '=', '3005'], ['obj_size', '=', '128']]],
['Another_part',
'=',
[['another_attr',
'=',
[['another_style', '=', '0'], ['another_param', '=', '8']]]]]]]]]]]
Also, the code you originally posted contained tabs. I find them more trouble than they are worth; it is better to indent with 4 spaces.
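With the grouped shape shown above, the original goal (check one field in a block and, if it matches, read a sibling field from the same block) might be sketched roughly as follows. This is only an illustration under some assumptions: the parse results have been converted to plain nested lists (for example with `results.asList()`), names are not repeated within a single block, and all values are still strings as pyparsing returns them. The helpers `to_dict` and `find_siblings` and the trimmed sample data are hypothetical and not part of the original post.

```python
def to_dict(groups):
    """Turn a list of [name, '=', value] groups into a nested dict.

    Assumes names are unique within one block; a repeated name
    would overwrite the earlier entry.
    """
    out = {}
    for name, _eq, value in groups:
        # A list value is a nested block of further [name, '=', value] groups.
        out[name] = to_dict(value) if isinstance(value, list) else value
    return out


def find_siblings(block, match_key, wanted, extract_key, path=()):
    """Yield (path, value) for every block whose match_key equals wanted."""
    if block.get(match_key) == wanted and extract_key in block:
        yield path, block[extract_key]
    for name, value in block.items():
        if isinstance(value, dict):
            # Descend into nested blocks, remembering how we got there.
            for hit in find_siblings(value, match_key, wanted,
                                     extract_key, path + (name,)):
                yield hit


# Hypothetical, trimmed-down stand-in for results.asList() on the sample file.
obj_part_l1 = ['OBJ_part', '=', [['obj_type', '=', '1005'], ['obj_size', '=', '120']]]
obj_part_b = ['OBJ_part', '=', [['obj_type', '=', '3005'], ['obj_size', '=', '128']]]
level2_b = ['LEVEL2_OBJECT_B', '=', [obj_part_b]]
level1 = ['LEVEL1_OBJECT', '=', [obj_part_l1, level2_b]]
parsed = ['TOP_OBJECT', '=', [level1]]

# The top level is a single [name, '=', value] triple, so wrap it in a list.
tree = to_dict([parsed])
hits = list(find_siblings(tree, 'obj_type', '3005', 'obj_size'))
for path, size in hits:
    print("/".join(path), size)
# prints: TOP_OBJECT/LEVEL1_OBJECT/LEVEL2_OBJECT_B/OBJ_part 128
```

Converting to dicts first keeps the lookup independent of the order of fields within a block. If names can repeat inside one block, walking the nested lists directly would be the safer choice.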
https://stackoverflow.com/questions/14174273