文章/答案/技术大牛

发布

社区首页 >问答首页 >混合内容和字符串操作清理

问混合内容和字符串操作清理
EN

Stack Overflow用户

提问于 2012-10-02 12:23:33

回答 2查看 478关注 0票数 5

我正处于将基于文字的文档转换为XML的非常痛苦的过程中。我遇到了以下问题：

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <p>
        <element>This one is taken care of.</element> Some more text. „<hi rend="italics">Is this a
            quote</hi>?” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is a
            quote</hi>” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is
            definitely a quote</hi>!” (Source). </p>

    <p>
        <element>This one is taken care of.</element> Some more text.„<hi rend="italics">This is a
            first quote</hi>” (Source). „<hi rend="italics">Sometimes there is a second quote as
            well</hi>!?” (Source). </p>

</root>

<p>节点具有混合内容。我在上一次迭代中处理过<element>。但现在的问题是引号和源部分出现在<hi rend= "italics"/>中，部分以文本节点的形式出现。

如何使用XSLT2.0：

匹配前面紧跟文本节点的所有<hi rend="italics">节点，该节点的最后一个字符是“”？
将<hi rend="italics">的内容输出为<quote>...</quote>，去掉引号(“”和“”)，但在<quote/>中包含任何出现在<hi rend="italics">兄弟姐妹之后的问号和感叹号
将<hi rend="italics">节点后面的"(“和")”之间的文本节点转换为不带括号的<source>...</source>。
包括最后的句号。

换句话说，我的输出应该如下所示：

<root>
<p>
<element>This one is taken care of.</element> Some more text. <quote>Is this a quote?</quote> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is a quote</hi> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is definitely a quote!</hi> <source>Source</source>.
</p>

<p>
<element>This one is taken care of.</element> Some more text. <quote>This is a first quote</quote> <source>Source</source>. <quote>Sometimes there is a second quote as well!?</quote> <source>Source</source>. 
</p>

</root>

我从来没有处理过像这样的混合内容和字符串操作，整件事真的让我失望了。我会非常感激你的建议。

xml

xslt

xslt-2.0

回答 2

Stack Overflow用户

回答已采纳

发布于 2012-10-02 15:57:08

这是另一种解决办法。它允许更多的叙事风格的输入文档(引号中的引号，一个文本节点中的多个(Source)片段，“”作为数据，当后面没有hi元素时)。

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:so="http://stackoverflow.com/questions/12690177"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xsl xs so">
<xsl:output omit-xml-declaration="yes" indent="yes" />
<xsl:strip-space elements="*" />  

<xsl:template match="@*|comment()|processing-instruction()">
  <xsl:copy />
</xsl:template>

<xsl:template match="*">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()" />
  </xsl:copy>
</xsl:template>

<xsl:function name="so:clip-start" as="xs:string">
  <xsl:param name="in-text" as="xs:string" />
  <xsl:value-of select="substring($in-text,1,string-length($in-text)-1)" />
</xsl:function>

<xsl:function name="so:clip-end" as="xs:string">
  <xsl:param name="in-text" as="xs:string" />
  <xsl:value-of select="substring-after($in-text,'”')" />
</xsl:function>

<xsl:function name="so:matches-start" as="xs:boolean">
  <xsl:param name="text-node" as="text()" />
  <xsl:value-of select="$text-node/following-sibling::node()/self::hi[@rend='italics'] and
                        ends-with($text-node, '„')" />
</xsl:function>

<xsl:template match="text()[so:matches-start(.)]"    priority="2">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-start(.)" />
  </xsl:call-template>
</xsl:template>

<xsl:function name="so:matches-end" as="xs:boolean">
  <xsl:param name="text-node" as="text()" />
  <xsl:value-of select="$text-node/preceding-sibling::node()/self::hi[@rend='italics'] and
                        matches($text-node,'^[!?]*”')" />
</xsl:function>

<xsl:template match="text()[so:matches-end(.)]"   priority="2">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-end(.)" />
  </xsl:call-template>
</xsl:template>

<xsl:template match="text()[so:matches-start(.)][so:matches-end(.)]" priority="3">
  <xsl:call-template name="parse-text">
   <xsl:with-param name="text" select="so:clip-end(so:clip-start(.))" />
  </xsl:call-template>
</xsl:template>

<xsl:template match="text()" name="parse-text" priority="1">
  <xsl:param name="text" select="." />
  <xsl:analyze-string select="$text" regex="\(([^)]*)\)">
    <xsl:matching-substring>
      <source>
        <xsl:value-of select="regex-group(1)" />
      </source>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="." />
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>

<xsl:template match="hi[@rend='italics']">
  <quote>
    <xsl:apply-templates select="(@* except @rend) | node()" />
    <xsl:for-each select="following-sibling::node()[1]/self::text()[matches(.,'^[!?]')]">
      <xsl:value-of select="replace(., '^([!?]+).*$', '$1')" />
    </xsl:for-each>   
  </quote>
</xsl:template>

</xsl:stylesheet>

票数 1

Stack Overflow用户

发布于 2012-10-02 13:23:08

这个转换

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes"/>

 <xsl:template match="node()|@*">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <xsl:template match=
  "hi[@rend='italics'
     and
      preceding-sibling::node()[1][self::text()[ends-with(., '„')]]
      ]">

  <quote>
    <xsl:value-of select=
     "concat(.,
             if(matches(following-sibling::text()[1], '^[?!]+'))
              then replace(following-sibling::text()[1], '^([?!]+).*$', '$1')
              else()
             )
      "/>
  </quote>
 </xsl:template>

 <xsl:template match="text()[true()]">
  <xsl:variable name="vThis" select="."/>
  <xsl:variable name="vThis2" select="translate($vThis, '„”?!', '')"/>

  <xsl:value-of select="substring-before(concat($vThis2, '('), '(')"/>
  <xsl:if test="contains($vThis2, '(')">
   <source>
    <xsl:value-of select=
      "substring-before(substring-after($vThis2, '('), ')')"/>
   </source>
   <xsl:value-of select="substring-after($vThis2, ')')"/>
  </xsl:if>
 </xsl:template>
</xsl:stylesheet>

在提供的XML文档上应用时

<root>
        <p>
            <element>This one is taken care of.</element> Some more text. „<hi rend="italics">Is this a
                quote</hi>?” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is a
                quote</hi>” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text. „<hi rend="italics">This is
                definitely a quote</hi>!” (Source). </p>

        <p>
            <element>This one is taken care of.</element> Some more text.„<hi rend="italics">This is a
                first quote</hi>” (Source). „<hi rend="italics">Sometimes there is a second quote as
                well</hi>!?” (Source). </p>

</root>

生成所需的、正确的结果

<root>
        <p>
            <element>This one is taken care of.</element> Some more text. <quote>Is this a
                quote?</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text. <quote>This is a
                quote</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text. <quote>This is
                definitely a quote!</quote> <source>Source</source>. </p>

        <p>
            <element>This one is taken care of.</element> Some more text.<quote>This is a
                first quote</quote> <source>Source</source>. <quote>Sometimes there is a second quote as
                well!?</quote> <source>Source</source>. </p>

</root>

票数 2

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/12690177

复制

相似问题

问混合内容和字符串操作清理
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问混合内容和字符串操作清理EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问混合内容和字符串操作清理
EN