I am trying to use pyparsing to scan text for chemical formulas. I have the following sample code:
from pyparsing import *
caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()
digits = "0123456789"
integer = Word( digits )
parl = Literal("(").suppress()
parr = Literal(")").suppress()
element = oneOf( """H He Li Be B C N O F Ne Na Mg Al Si P S Cl
Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge
As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag
Cd In Sn Sb Te I Xe Cs Ba Lu Hf Ta W Re Os
Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Lr Rf
Db Sg Bh Hs Mt Ds Rg Uub Uut Uuq Uup Uuh Uus
Uuo La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm
Yb Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No """ )
separator = Literal( "," ).setParseAction(lambda s,l,t: t[0].replace(',','.')) | Literal( "." )
nreal = (Combine( integer + Optional( separator +\
Optional( integer ) ))\
| Combine( separator + integer )).setParseAction( lambda s,l,t: [ float(t[0]) ] )
block = Forward()
groupElem = (Group( element('elem') + Optional( nreal, default=1)('esteq') ))('dupla') | \
Group( parl + block + parr + Optional( nreal,default=1 )('modi'))
block << groupElem + ZeroOrMore( groupElem )
formula = OneOrMore( block )+ Optional(Or([Literal("-"), Literal("+")]))
s = '''Water is H2O not h2o, methane is CH4 and of course there is PtCl4.
What about H+ and OH-? and carbon or Carbon or H2SO4?'''
for match, start, stop in formula.scanString(s):
    print(match, s[start:stop])
And it outputs:
[['W', 1]] W
[['H', 2.0], ['O', 1]] H2O
[['C', 1], ['H', 4.0]] CH4
[['Pt', 1], ['Cl', 4.0], ['W', 1]] PtCl4.
W
[['H', 1], '+'] H+
[['O', 1], ['H', 1], '-'] OH-
[['Ca', 1]] Ca
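[['H', 2.0], ['S', 1], ['O', 4.0]] H2SO4
This is roughly right, but there are some false hits. For example, the W from "Water" and "What" and the Ca from "Carbon" should not be listed. I am not sure how to modify the grammar to say that the Ca in "Carbon" is not a chemical formula. The parser works well with parseString on a formula by itself, but it is not specific enough in mixed text. Any hints on how to fix this?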
Posted on 2015-07-01 23:43:30
I think you want every formula to be a self-contained group of letters, so just change your formula definition to:
formula = (WordStart() +
           OneOrMore( block ) + Optional(Or([Literal("-"), Literal("+")])) +
           WordEnd())
https://stackoverflow.com/questions/31162954
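To see why requiring a boundary on both sides removes the false hits, here is a minimal sketch of the same idea using only the standard library `re` module. It is not pyparsing and not the grammar from the question: the element list is a small illustrative subset, counts are integers only, parentheses are not handled, and the names `ELEMENT`, `TERM`, `FORMULA` and `find_formulas` are made up for this example.

```python
import re

# Illustrative subset of element symbols (an assumption, not the full table).
# Two-letter symbols are listed first so that "Cl" is tried before "C".
ELEMENT = r"(?:Pt|Cl|Ca|H|C|O|S|W)"

# One element followed by an optional integer count, e.g. "H2" or "O".
TERM = ELEMENT + r"\d*"

# A formula is one or more terms plus an optional charge sign. The lookbehind
# and lookahead require that no letter or digit touches either end, which is
# the "self-contained group" condition described in the answer.
FORMULA = re.compile(
    r"(?<![A-Za-z0-9])(?:" + TERM + r")+[+-]?(?![A-Za-z0-9])"
)


def find_formulas(text):
    """Return every substring of text that looks like a standalone formula."""
    return FORMULA.findall(text)


sample = (
    "Water is H2O not h2o, methane is CH4 and of course there is PtCl4.\n"
    "What about H+ and OH-? and carbon or Carbon or H2SO4?"
)

print(find_formulas(sample))
# ['H2O', 'CH4', 'PtCl4', 'H+', 'OH-', 'H2SO4']
```

The W in "Water" and "What" and the Ca in "Carbon" are rejected because a letter follows them directly. Note that this sketch treats only letters and digits as word characters, so punctuation such as "." or "?" may follow a formula. In pyparsing, WordStart and WordEnd take a wordChars argument that defaults to all printable characters, so you may need to pass a narrower set (for example alphanums) if formulas followed by punctuation should still match; I have not tested that against the grammar above.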