首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >基于字符python合并xml中的行

基于字符python合并xml中的行
EN

Stack Overflow用户
提问于 2020-03-03 12:21:50
回答 1查看 47关注 0票数 0

我有多个xml文件,这些是一个PDF文档的xml版本。首先,我必须合并xml文件,然后读取以连字符结尾的单词。如果一个单词以连字符结尾,则在XML中创建了一个单独的标记(TCL CHAR = '-'),我需要识别这些标记,并将前一行的最后一个单词和下一行的第一个单词合并到一个单独的标记中,称为。我有以下用于合并的代码

代码语言:javascript
复制
def run(files):
first = None
for filename in files:
    data = ET.parse(filename).getroot()
    if first is None:
        first = data
    else:
        first.extend(data)
if first is not None:
    root = ET.tostring(first)
return root

和下面的word合并代码

beg_line_cont = [] end_line_cont = [] for block in root: for para in block: for line in para: for word in line: if word.tag == 'TC': line = word.text if word.tag == 'TCL' and word.attrib['CHAR']=='-': beg_line_cont.append(line) if word.tag == 'TC': line = word.text end_line_cont.append(line)

合并代码不起作用,我可以转到TCL CHAR = '-‘之前的前一行,但不能转到下一行...有人能帮忙吗??

XML文件示例位于以下位置:

代码语言:javascript
复制
</PAR>
<LPAR PBDPL="[D]137[L]120" PBCAMGTI="[G]LP6[T]Lead VJ" STRIKE="0"></LPAR>
<PAR PBDPL="[D]3360[P]3m" PBCAMGTI="[G]I2AS[I]0" TAPARADV="[HYP]1" BLMODE="3" STRIKE="0" UNIQID="d180d82ee84ff937">
<LINE>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>Diese Angebotsunterlage (die &#132;</TC>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="BOLD" PNTSZSTR="" FONTNAME="" FACE="B" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>Angebotsunterlage</TC>
<FRMDEF NAME="BOLD" PNTSZSTR="" FONTNAME="" FACE="B" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>&#147;) beschreibt das freiwillige &#246;ffentliche &#220;bernahme</TC>
<TCL CHAR="-" WIDTH="67" CTLCHAR="-" CTLSTR="" TYPE="SYSTEMHYPHEN" VISIBLE="1" USE_SF_LDRVALUES="1"/></LINE>
<LINE>
<TC>angebot in Form eines Tauschangebots (das &#132;</TC>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="BOLD" PNTSZSTR="" FONTNAME="" FACE="B" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>Angebot</TC>
<FRMDEF NAME="BOLD" PNTSZSTR="" FONTNAME="" FACE="B" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>&#147;) der ADO Properties S.A., einer Aktiengesell</TC>
<TCL CHAR="-" WIDTH="67" CTLCHAR="-" CTLSTR="" TYPE="SYSTEMHYPHEN" VISIBLE="1" USE_SF_LDRVALUES="1"/></LINE>
<LINE>
<TC>schaft nach luxemburgischem Recht </TC>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="ITALIC" PNTSZSTR="" FONTNAME="" FACE="I" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC>(soci&#233;t&#233; anonyme)</TC>
<FRMDEF NAME="ITALIC" PNTSZSTR="" FONTNAME="" FACE="I" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="1" SAVFRM="0" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<FRMDEF NAME="ROMAN" PNTSZSTR="" FONTNAME="" FACE="R" SETWIDTHSTR="" SLANTSTR="" BASESTR="" COLORSTR="" SCREENSTR="" SMALLCAPS="2" ALLCAPS="2" KNOCKOUT="2" ENDFRM="0" SAVFRM="1" UNDLEAD1="" UNDLEAD2="" UNDTHICK1="" UNDTHICK2="" UNDCOLOR="" UNDSCREEN="" UNDLKNOCKOUT="1"/>
<TC> mit Sitz in Senningerberg, eingetragen im </TC>
</LINE>
<LINE>
EN

回答 1

Stack Overflow用户

发布于 2020-03-03 12:49:27

当您遇到连字符时,您需要:

  1. 删除上一行中的最后一个单词。
  2. 开始换行。
  3. 将下一行中的第一个单词复制到新行中,前面不留空格。

这个想法是为新行保留一个“前缀”变量。例如,如果您有blah-<newline>blahblah,您将看到"blah",将其设置为下一行的前缀,当看到"blahblah“时,将其连接到"blah”。

代码语言:javascript
复制
text = prefix = ''
for block in root:
    for para in block:
        for line in para:
            for word in line:
                if word.tag == 'TC':
                    text += prefix + word.text
                    prefix = ''
                if word.tag == 'TCL' and word.attrib['CHAR']=='-':
                    # Find the last word
                    last_space_index = text.rfind(' ')
                    prefix = text[last_space_index + 1:]
                    text = text[:last_space_index]
票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/60500024

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档