首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >不尝试实现递归解析器

不尝试实现递归解析器
EN

Stack Overflow用户
提问于 2013-10-12 02:29:18
回答 2查看 260关注 0票数 1

我正试图解析Wikipedia的XML转储,以便使用Haskell库在每个页面上找到某些链接。链接由两个括号表示:texttext[[link]]texttext。为了尽可能简化这个场景,假设我正在寻找第一个链接,它不包含在双大括号中(可以嵌套):{{ {{ [[Wrong Link]] }} [[Wrong Link]] }} [[Right Link]]。我编写了一个解析器,用于丢弃包含在非嵌套双大括号中的链接:

代码语言:javascript
复制
import Text.Parsec

getLink :: String -> Either ParseError String
getLink = parse linkParser "Links"

linkParser = do
    beforeLink
    link <- many $ noneOf "]"
    string "]]"
    return link

beforeLink = manyTill (many notLink) (try $ string "[[")

notLink = try doubleCurlyBrac <|> (many1 normalText)

normalText = noneOf "[{"
           <|> notFollowedByItself '['
           <|> notFollowedByItself '{'

notFollowedByItself c = try ( do x <- char c
                                 notFollowedBy $ char c
                                 return x)

doubleCurlyBrac = between (string "{{") (string "}}") (many $ noneOf "}")

getLinkTest = fmap getLink testList
    where testList = ["   [[rightLink]]   "                            --Correct link is found
                     , "  {{    [[Wrong_Link]]    }}  [[rightLink]]"   --Correct link is found
                     , "  {{  {{ }} [[Wrong_Link]] }} [[rightLink]]" ] --Wrong link is found 

我已经尝试让doubleCurlyBrac解析器递归地删除嵌套大括号中的链接,但没有成功:

代码语言:javascript
复制
doubleCurlyBrac = between (string "{{") (string "}}") betweenBraces
        where betweenBraces = doubleCurlyBrac <|> (many $ try $ noneOf "}")

在嵌套的示例中,这个解析器在第一个}}之后而不是最后一个}}之后停止使用输入。是否有一种优雅的方法来编写递归解析器(在本例中)正确忽略嵌套双大括号中的链接?而且,可以不使用try来完成吗?我已经发现,由于try不使用输入,它常常导致解析器挂起意外的、格式错误的输入。

EN

回答 2

Stack Overflow用户

回答已采纳

发布于 2013-10-12 08:04:04

这里有一个更直接的版本,它不使用自定义的lexer。不过,它确实使用了try,在这里我不知道如何避免它。问题是,我们似乎需要一个不提交的前瞻性来区分双括号和单括号;try代表的是不提交前瞻性。

高级别方法与my first answer中的方法相同。我非常小心地让三个节点解析器通勤--通过使用trynotFollowedBy,使代码对更改更加健壮。

代码语言:javascript
复制
{-# LANGUAGE TupleSections #-}
import Text.Parsec hiding (string)
import qualified Text.Parsec
import Control.Applicative ((<$>) , (<*) , (<*>))
import Control.Monad (forM_)
import Data.List (find)

import Debug.Trace

----------------------------------------------------------------------
-- Token parsers.

llink , rlink , lbrace , rbrace :: Parsec String u String
[llink , rlink , lbrace , rbrace] = reserved
reserved = map (try . Text.Parsec.string) ["[[" , "]]" , "{{" , "}}"]

----------------------------------------------------------------------
-- Node parsers.

-- Link, braces, or string.
data Node = L [Node] | B [Node] | S String deriving Show

nodes :: Parsec String u [Node]
nodes = many node

node :: Parsec String u Node
node = link <|> braces <|> string

link , braces , string :: Parsec String u Node
link   = L <$> between llink  rlink  nodes
braces = B <$> between lbrace rbrace nodes
string = S <$> many1 (notFollowedBy (choice reserved) >> anyChar)

----------------------------------------------------------------------

parseNodes :: String -> Either ParseError [Node]
parseNodes = parse (nodes <* eof) "<no file>"

----------------------------------------------------------------------
-- Tests.

getLink :: [Node] -> Maybe Node
getLink = find isLink where
  isLink (L _) = True
  isLink _     = False

parseLink :: String -> Either ParseError (Maybe Node)
parseLink = either Left (Right . getLink) . parseNodes

testList = [ "   [[rightLink]]   "
           , "  {{    [[Wrong_Link]]    }}  [[rightLink]]"
           , "  {{  {{ }} [[Wrong_Link]] }} [[rightLink]]"
           , " [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}"
           -- Pathalogical example from comments.
           , "{{ab}cd}}"
           -- A more pathalogical example.
           , "{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf" 
           -- No top level link.
           , "{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}"
           -- Too many '{{'.
           , "{{ {{ {{ [[ asdf ]] }} }}"
           -- Too many '}}'.
           , "{{ {{ [[ asdf ]] }} }} }}"
           -- Too many '[['.
           , "[[ {{ [[{{[[asdf]]}}]]}}"
           ]

main =
  forM_ testList $ \ t -> do
  putStrLn $ "Test: ^" ++ t ++ "$"
  let parses = ( , ) <$> parseNodes t <*> parseLink t
      printParses (n , l) = do
        putStrLn $ "Nodes: " ++ show n
        putStrLn $ "Link: " ++ show l
      printError = putStrLn . show
  either printError printParses parses
  putStrLn ""

在非错误情况下,输出是相同的:

代码语言:javascript
复制
Test: ^   [[rightLink]]   $
Nodes: [S "   ",L [S "rightLink"],S "   "]
Link: Just (L [S "rightLink"])

Test: ^  {{    [[Wrong_Link]]    }}  [[rightLink]]$
Nodes: [S "  ",B [S "    ",L [S "Wrong_Link"],S "    "],S "  ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])

Test: ^  {{  {{ }} [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S "  ",B [S "  ",B [S " "],S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])

Test: ^ [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}$
Nodes: [S " ",L [B [L [S "someLink"]]],S " ",B [],S " ",B [L [S "asdf"]]]
Link: Just (L [B [L [S "someLink"]]])

Test: ^{{ab}cd}}$
Nodes: [B [S "ab}cd"]]
Link: Nothing

Test: ^{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf$
Nodes: [S "{ [ { {asf{",L [S "[asdfa"],S "]}aasdff ] ] ] ",B [L [S "asdf"]],S "asdf"]
Link: Just (L [S "[asdfa"])

Test: ^{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}$
Nodes: [B [L [S "Wrong_Link"],S "asdf",L [S "WRong_Link"],B []],B [L [L [S "Wrong"]]]]
Link: Nothing

但是,在不匹配的空缺情况下,解析错误消息的信息量不高:

代码语言:javascript
复制
Test: ^{{ {{ {{ [[ asdf ]] }} }}$
"<no file>" (line 1, column 26):
unexpected end of input
expecting "[[", "{{", "]]" or "}}"

Test: ^{{ {{ [[ asdf ]] }} }} }}$
"<no file>" (line 1, column 26):
unexpected "}}"

Test: ^[[ {{ [[{{[[asdf]]}}]]}}$
"<no file>" (line 1, column 25):
unexpected end of input
expecting "[[", "{{", "]]" or "}}"

我也想不出怎么解决它们。

票数 2
EN

Stack Overflow用户

发布于 2013-10-12 07:28:50

我的解决方案不使用try,但相对比较复杂:我以您的问题为借口,学习如何在Parsec中不使用makeTokenParser :D创建一个lexer :d我避免使用try,因为唯一的展望发生在lexer (tokenize)中,其中标识了各种括号对。

高级思想是将{{}}[[]]视为特殊的标记,并将输入解析为AST。您没有精确地指定语法,所以我选择了一个生成示例的简单语法:

代码语言:javascript
复制
node ::= '{{' node* '}}'
       | '[[' node* ']]'
       | string
string ::= <non-empty string without '{{', '}}', '[[', or ']]'>

我将输入字符串解析为节点列表。第一个顶级链接([[)节点(如果有的话)是您要寻找的链接。

我所采用的方法对于语法的变化应该是相对健壮的。例如,如果希望仅允许链接中的字符串,则将'[[' node* ']]'更改为'[[' string ']]'。(在代码中)

代码语言:javascript
复制
link = L <$> between llink  rlink  nodes

变成了

代码语言:javascript
复制
link = L <$> between llink  rlink  string

)。

代码相当长,但大多很简单。其中大部分涉及创建令牌流(lexing)和解析单个令牌。在此之后,实际的Node解析非常简单。

下面是:

代码语言:javascript
复制
{-# LANGUAGE TupleSections #-}
import Text.Parsec hiding (char , string)
import Text.Parsec.Pos (updatePosString , updatePosChar)
import Control.Applicative ((<$>) , (<*) , (<*>))
import Control.Monad (forM_)
import Data.List (find)

----------------------------------------------------------------------
-- Lexing.

-- Character or punctuation.
data Token = C Char | P String deriving Eq

instance Show Token where
  show (C c) = [c]
  show (P s) = s

tokenize :: String -> [Token]
tokenize [] = []
tokenize [c] = [C c]
tokenize (c1:c2:cs) = case [c1,c2] of
  "[[" -> ts
  "]]" -> ts
  "{{" -> ts
  "}}" -> ts
  _    -> C c1 : tokenize (c2:cs)
  where
    ts = P [c1,c2] : tokenize cs

----------------------------------------------------------------------
-- Token parsers.

-- We update the 'sourcePos' while parsing the tokens.  Alternatively,
-- we could have annotated the tokens with positions above in
-- 'tokenize', and then here we would use 'token' instead of
-- 'tokenPrim'.
llink , rlink , lbrace , rbrace :: Parsec [Token] u Token
[llink , rlink , lbrace , rbrace] =
  map (t . P) ["[[" , "]]" , "{{" , "}}"]
  where
    t x = tokenPrim show update match where
      match y = if x == y then Just x else Nothing
      update pos (P s) _ = updatePosString pos s

char :: Parsec [Token] u Char
char = tokenPrim show update match where
  match (C c) = Just c
  match (P _) = Nothing
  update pos (C c) _ = updatePosChar pos c

----------------------------------------------------------------------
-- Node parsers.

-- Link, braces, or string.
data Node = L [Node] | B [Node] | S String deriving Show

nodes :: Parsec [Token] u [Node]
nodes = many node

node :: Parsec [Token] u Node
node = link <|> braces <|> string

link , braces , string :: Parsec [Token] u Node
link   = L <$> between llink  (rlink <?> "]]")  nodes
braces = B <$> between lbrace (rbrace <?> "}}") nodes
string = S <$> many1 char

----------------------------------------------------------------------

parseNodes :: String -> Either ParseError [Node]
parseNodes = parse (nodes <* eof) "<no file>" . tokenize

----------------------------------------------------------------------
-- Tests.

getLink :: [Node] -> Maybe Node
getLink = find isLink where
  isLink (L _) = True
  isLink _     = False

parseLink :: String -> Either ParseError (Maybe Node)
parseLink = either Left (Right . getLink) . parseNodes

testList = [ "   [[rightLink]]   "
           , "  {{    [[Wrong_Link]]    }}  [[rightLink]]"
           , "  {{  {{ }} [[Wrong_Link]] }} [[rightLink]]"
           , " [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}"
           -- Pathalogical example from comments.
           , "{{ab}cd}}"
           -- A more pathalogical example.
           , "{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf" 
           -- No top level link.
           , "{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}"
           -- Too many '{{'.
           , "{{ {{ {{ [[ asdf ]] }} }}"
           -- Too many '}}'.
           , "{{ {{ [[ asdf ]] }} }} }}"
           -- Too many '[['.
           , "[[ {{ [[{{[[asdf]]}}]]}}"
           ]

main =
  forM_ testList $ \ t -> do
  putStrLn $ "Test: ^" ++ t ++ "$"
  let parses = ( , ) <$> parseNodes t <*> parseLink t
      printParses (n , l) = do
        putStrLn $ "Nodes: " ++ show n
        putStrLn $ "Link: " ++ show l
      printError = putStrLn . show
  either printError printParses parses
  putStrLn ""

main的输出是:

代码语言:javascript
复制
Test: ^   [[rightLink]]   $
Nodes: [S "   ",L [S "rightLink"],S "   "]
Link: Just (L [S "rightLink"])

Test: ^  {{    [[Wrong_Link]]    }}  [[rightLink]]$
Nodes: [S "  ",B [S "    ",L [S "Wrong_Link"],S "    "],S "  ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])

Test: ^  {{  {{ }} [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S "  ",B [S "  ",B [S " "],S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])

Test: ^ [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}$
Nodes: [S " ",L [B [L [S "someLink"]]],S " ",B [],S " ",B [L [S "asdf"]]]
Link: Just (L [B [L [S "someLink"]]])

Test: ^{{ab}cd}}$
Nodes: [B [S "ab}cd"]]
Link: Nothing

Test: ^{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf$
Nodes: [S "{ [ { {asf{",L [S "[asdfa"],S "]}aasdff ] ] ] ",B [L [S "asdf"]],S "asdf"]
Link: Just (L [S "[asdfa"])

Test: ^{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}$
Nodes: [B [L [S "Wrong_Link"],S "asdf",L [S "WRong_Link"],B []],B [L [L [S "Wrong"]]]]
Link: Nothing

Test: ^{{ {{ {{ [[ asdf ]] }} }}$
"<no file>" (line 1, column 26):
unexpected end of input
expecting }}

Test: ^{{ {{ [[ asdf ]] }} }} }}$
"<no file>" (line 1, column 24):
unexpected }}
expecting end of input

Test: ^[[ {{ [[{{[[asdf]]}}]]}}$
"<no file>" (line 1, column 25):
unexpected end of input
expecting ]]
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/19329829

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档