
Extracting from .bib file with Raku (previously aka Perl 6)

Stack Overflow user
Asked 2017-07-19 00:10:15
1 answer, 1.7K views, 0 followers, 21 votes

I have this .bib file for reference management while writing my thesis in LaTeX:

@article{garg2017patch,
  title={Patch testing in patients with suspected cosmetic dermatitis: A retrospective study},
  author={Garg, Taru and Agarwal, Soumya and Chander, Ram and Singh, Aashim and Yadav, Pravesh},
  journal={Journal of Cosmetic Dermatology},
  year={2017},
  publisher={Wiley Online Library}
}

@article{hauso2008neuroendocrine,
  title={Neuroendocrine tumor epidemiology},
  author={Hauso, Oyvind and Gustafsson, Bjorn I and Kidd, Mark and Waldum, Helge L and Drozdov, Ignat and Chan, Anthony KC and Modlin, Irvin M},
  journal={Cancer},
  volume={113},
  number={10},
  pages={2655--2664},
  year={2008},
  publisher={Wiley Online Library}
}

@article{siperstein1997laparoscopic,
  title={Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases},
  author={Siperstein, Allan E and Rogers, Stanley J and Hansen, Paul D and Gitomirsky, Alexis},
  journal={Surgery},
  volume={122},
  number={6},
  pages={1147--1155},
  year={1997},
  publisher={Elsevier}
}

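If anyone wants to know what a .bib file is, the original post links to a detailed description.

I'd like to parse this with Perl 6 to extract the key along with the title, like this:
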
garg2017patch: Patch testing in patients with suspected cosmetic dermatitis: A retrospective study

hauso2008neuroendocrine: Neuroendocrine tumor epidemiology

siperstein1997laparoscopic: Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases

Can you please help me do this, maybe in two ways:

- Using basic Perl 6
- Using a Perl 6 grammar

1 Answer

Stack Overflow user
Accepted answer
Posted 2017-07-19 13:12:05

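This answer aims to be both:

- A complete and detailed answer that does exactly what @Suman asks.
- An introductory general answer to "I want to parse X. Can anyone help?"

I'll start with basic Raku and then work up to more substantial solutions.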

A one-liner solution

Raku works well for those who like to use the shell:

> raku -e 'for slurp() ~~ m:g / "@article\{" (<-[,]>+) \, \s+
"title=\{" (<-[}]>+) \} / -> $/ { put "$0: $1\n" }' < derm.bib

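Pretty cryptic, right? :)

(Actually, while this is "line noise" to some, it's pretty straightforward for those who know the basics of using a shell and Raku, no?)

Sometimes it's best to keep code short like the above. [1]

That said, the rest of this answer covers code that's more readable, more powerful, or both.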

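A single statement, but in a script

In this script I:

- Skip the shell escapes.
- Space the code out.
- Add some introductory comments. (They're unrealistic, since devs generally assume that those reading Raku code know Raku basics, but I figure they're better than nothing for those who don't know Raku.)

Run the following script with the .bib file supplied on standard input:
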
for slurp()                     # "slurp" (read all of) stdin and then

~~ m :global                    # match "globally" (all matches) against a

/ '@article{' (<-[,]>+) ',' \s+ # (Raku dialect) regex that uses (`(...)`)
  'title={' (<-[}]>+) '}'  /    # to capture the article id and title and

-> $/ { put "$0: $1\n" }        # for each article, print "article id: title".

Don't worry if some or even all of the above still looks like pure gobbledygook. Read on for simpler code and a detailed explanation of the above.

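Breaking things down into four statements

So far we've seen a super-compressed shell one-liner and a less compressed but still compact script. Such solutions make sense for short-lived or throwaway code.

But it's often appropriate to write code that's easy to read and maintain. In this section we stick with basic Raku while taking a step toward more readable and maintainable code:
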
my \input = slurp;

my \pattern = rule { '@article{'       ( <-[,]>+ ) ','
                       'title={' ~ '}' ( <-[}]>+ ) }

my \articles = input .match: pattern, :global;

for articles -> $/ { put "$0: $1\n" }

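my declares a lexical variable. Raku supports sigils (https://en.wikipedia.org/wiki/Sigil_(computer_programming%29), but it also lets devs "slash them out" as I have done.

The pattern is almost identical to the one in the one-liner, and is worth discussing in detail:
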
my \pattern = rule { '@article{'       ( <-[,]>+ ) ','
                       'title={' ~ '}' ( <-[}]>+ ) }

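Raku patterns are generally called regexes or rules. There are several regex/rule declarators. The pattern language (used inside /.../ or {...}) is the same; the different declarators just alter how a pattern is applied.

The /.../ syntax is Raku's equivalent of the classic regex declaration.

In many contexts, code of the form m/.../ (and often plain /.../) immediately invokes matching, just like a classic regex. You typically write rx/.../ or regex {...} if you just want the regex as a value to match against later; the latter, more verbose syntax is typically used to name a regex.

A regex backtracks when it matches, which is another way in which it is just like a classic regex.

There's no need for backtracking to match the .bib format. Unless you need backtracking, it's wise to consider using one of the other regex/rule declarators instead. I've switched to a rule declared with the keyword rule.

A rule declared with rule is identical to one declared with regex (or /.../) except that A) it doesn't backtrack and B) it interprets whitespace in its pattern as corresponding to possible whitespace in the input. Did you notice that I dropped the \s+ from the pattern immediately before 'title={'? That's because a rule takes care of that automatically.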
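To make point B concrete, here is a minimal sketch comparing the same pattern declared with regex and with rule. The names compact and spaced and the sample string are made up for this illustration; they are not part of the question or the solutions above.

```raku
# Hypothetical sample input with whitespace around the '='.
my $text = 'year   =   2017';

# In a regex, whitespace in the pattern is ignored, so this only
# matches the literal text "year=2017".
my regex compact { 'year' '=' \d+ }

# In a rule, whitespace after an atom in the pattern matches
# optional whitespace in the input.
my rule spaced { 'year' '=' \d+ }

my $compact-ok = ($text ~~ / <compact> /).so;
my $spaced-ok  = ($text ~~ / <spaced> /).so;

say $compact-ok;   # False
say $spaced-ok;    # True
```

The only difference between the two declarations is the keyword, which is why switching a whitespace-tolerant pattern from regex to rule can remove explicit \s+ atoms.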

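The other difference is that I wrote:

'title={' ~ '}' (...)

instead of:

'title={' (...) '}'

i.e. I moved the (...) bit so that it comes after the closing brace, and put a ~ between the braces where the (...) used to be.

The above two patterns match the same input. I could have written the title pattern either way, both in the /.../ form I used in the one-liner and in this solution's rule pattern. But I wanted this solution to be more "best practice" oriented. I'll defer a full explanation of the difference between the two patterns, and all other details of this rule, until the "Explanation of the 'bib' grammar" section below.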
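As a quick check of the claim that both forms match the same input, here is a small sketch. The variable names are chosen for this illustration, and the field text is taken from the sample .bib file in the question.

```raku
my $field = 'title={Neuroendocrine tumor epidemiology}';

# Closing delimiter written after the captured content.
my $plain = $field ~~ / 'title={' (<-[}]>+) '}' /;

# Same match, written with the ~ (tilde) operator between the delimiters.
my $tilde = $field ~~ / 'title={' ~ '}' (<-[}]>+) /;

say $plain[0].Str;                    # Neuroendocrine tumor epidemiology
say $plain[0].Str eq $tilde[0].Str;   # True
```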

my \articles = input .match: pattern, :global;

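This line uses the method form (.match) of the m routine used in the one-liner solution.

:g is an alias for :global. Add :g / :global to the argument list when invoking the .match method (or the m routine) if you want to find as many matches as there are in the input string, not just the first one.

The result of a match when :global is used is a list of Match objects rather than just one. In this case we get three, corresponding to the three articles in the input file.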
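The difference between a single match and a :global match can be seen in a small sketch. The sample string and variable names here are invented for illustration only.

```raku
my $text = 'a1 b2 c3';

# Without :global, .match returns just the first match.
my $first = $text.match: / (\w) (\d) /;

# With :global, .match returns a list of all the matches.
my @all = $text.match: / (\w) (\d) /, :global;

say $first.Str;                  # a1
say @all.elems;                  # 3
say @all.map(*.Str).join(' ');   # a1 b2 c3
```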

for articles -> $/ { put "$0: $1\n" }

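This for statement successively binds the Match object corresponding to each article in your sample file to $/ inside the block.

Per the Raku documentation on $/, "$/ is the match variable ... so usually contains objects of type Match". It also provides some other conveniences, and we take advantage of one of them here, as explained next.

The pattern used in the preceding match contains two pairs of parentheses. These generate "positional captures" (i.e. numbered captures rather than named ones).

The overall Match object ($/) provides access to its two positional captures via positional subscripting (postfix []).

So, within the for block, $/[0] and $/[1] provide access to the two positional captures for the given article. But Raku aliases $0 and $1 (and so on) to them for convenience, so most devs use the shorter syntax.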
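A short sketch of that aliasing, using an invented input string and a hypothetical @seen array to collect results: inside a block whose parameter is $/, the long form $/[0] and the short form $0 refer to the same capture.

```raku
my @seen;

for 'k1=v1 k2=v2'.match(/ (\w+) '=' (\w+) /, :global) -> $/ {
    # $0 and $1 are aliases for $/[0] and $/[1].
    @seen.push: ($/[0].Str, $0.Str, $/[1].Str, $1.Str).join(' ');
}

.say for @seen;   # prints "k1 k1 v1 v1" then "k2 k2 v2 v2"
```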

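Interlude

This would be a good time to take a break. Maybe just have a cup of tea, or come back another day.

The last part of this answer explains a grammar-based approach in detail. Reading it may provide further insight into the solutions above and shows how to extend Raku parsing to more complex scenarios.

But first...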

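Solution 3: the "boring" practical approach

"I want to parse this with Raku. Can anyone help?"

Raku may make writing parsers less tedious than it is with other tools. But less tedious is still tedious. And Raku parsing is currently slow.

In most cases, when you want to parse a well-known format and/or really big files, the practical answer is to find and use an existing parser. That might mean not using Raku at all, or using an existing Raku module, or using an existing non-Raku parser from Raku.

A suggested starting point is to search for the file format on modules.raku.org or raku.land, hoping to find a publicly shared parsing module already packaged for Raku for the given file format, either a pure Raku library or a wrapper around a non-Raku library. Then do some simple testing to see whether you have a good solution.

At the time of writing there are no matches for 'bib'.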

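There is almost certainly a 'bib' parsing C library already available, and it is likely to be the fastest solution. It's also likely that you can use an external parsing library packaged as a C lib relatively easily and elegantly from your own Raku code, even if you don't know C, by using NativeCall. If the NativeCall documentation is too much or too little, consider visiting the freenode IRC channel #raku and asking for whatever NativeCall help you need or want.

If a C lib isn't right for a particular use case, you can probably still use packages written in another language, such as Perl, Python, Ruby, Lua, etc., via their respective language adapters. Just install the Perl, Python, or whatever package you want; make sure it runs using that other language; install the appropriate language adapter; then use the package and its features as if it were a Raku package containing exported Raku functions, classes, objects, values, etc.

The Perl adapter is the most mature, so I'll use it as an example. Let's say you use Perl's Text::BibTeX packages and now wish to use Raku with the existing Text::BibTeX::BibFormat module from Perl. First, set up the Perl packages as described in their READMEs etc. Then, in Raku, write something like:
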
use Text::BibTeX::BibFormat:from<Perl5>;
...
@blocks = $entry.format;

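(The first line is how you tell Raku that you wish to load a Perl module. It won't work unless Inline::Perl5 is already installed and working. It should be if you're using a popular Rakudo Raku bundle. If not, you should at least have the module installer zef, so you can run zef install Inline::Perl5.)

The last line is just a mechanical Raku translation of the @blocks = $entry->format; line from the SYNOPSIS of the Perl package.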

A Raku grammar / parser

OK. Enough of the "boring" practical advice. Let's now try to create a Raku parser that is good enough for the example in your question:

unit grammar bib;

rule TOP { <article>* }

rule article { '@article{' $<id>=<-[,]>+ ','
                  <kv-pairs>
               '}'
}

rule kv-pairs { <kv-pair>* % ',' }

rule kv-pair { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }

With this grammar in place, we can now write something like:

die "Use CommaIDE?" unless bib .parsefile: 'derm.bib';

for $<article> -> $/ { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }

to generate exactly the same output as the previous solutions.

When a match or parse fails, by default Raku just returns Nil, which is... notably minimal feedback!

There are several nice debugging options to figure out what's going on with a regex or grammar, but the best option by far is to use CommaIDE's Grammar-Live-View.

If you haven't already installed and used Comma, you're missing one of the best parts of using Raku. The features built in to the free version of Comma ("Community Edition") include outstanding grammar tracing / match debugging.

Explanation of the 'bib' grammar

unit grammar bib;

The unit declarator is used at the start of a source file to tell Raku that the rest of the file declares a named package of code.
The grammar keyword specifies which sort of package. A grammar is like a class, but a grammar contains named "rules" -- not just named methods, but also named regexes, tokens, and rules. A grammar also inherits a bunch of general purpose rules from a base grammar.

rule TOP {

Unless you specify otherwise, parsing routines (.parse and .parsefile) start out by calling a grammar's rule named TOP (declared with rule, token, regex, or method).
As a, er, rule of thumb, if you don't know if you should be using a rule, regex, token, or method for some bit of parsing, use a token. (Unlike regex patterns, tokens don't backtrack. So they eliminate the risk of pathologically slow backtracking.)
I've used a rule. Like token patterns, rules also avoid backtracking. But in addition they interpret whitespace in the pattern to be significant in a suitable manner. This is typically appropriate towards the top of the parse tree. (Tokens, and the occasional regex, are typically appropriate towards the leaves.)
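A minimal sketch of the backtracking difference mentioned above, using made-up names (greedy-regex, greedy-token) and a toy input rather than anything from the bib grammar:

```raku
# a+ first grabs every 'a'. A regex may then backtrack and give one
# 'a' back so that the final a can match; a token never does that.
my regex greedy-regex { a+ a }
my token greedy-token { a+ a }

my $regex-ok = ('aaa' ~~ / ^ <greedy-regex> $ /).so;
my $token-ok = ('aaa' ~~ / ^ <greedy-token> $ /).so;

say $regex-ok;   # True
say $token-ok;   # False
```

For a format like .bib, where no pattern needs to give characters back, the non-backtracking declarators lose nothing and avoid the risk of pathological slowdowns.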

rule TOP { <article>* }

The space at the end of the rule means the grammar will match any amount of whitespace at the end of the input.
<article> invokes another named rule in this grammar.
Because it looks like one should allow for any number of articles per bib file, I added a * (zero or more quantifier) at the end of <article>*.

rule article { '@article{' $<id>=<-[,]>+ ','
                  <kv-pairs>
               '}'
}

If you compare this article pattern with the ones I wrote for the other solutions, you'll see various changes. In the original one-liner I made no attempt to echo the layout of typical input and I kept the pattern as simple as could possibly work. For this grammar I've made the pattern a bit more extensive, with a distinct <kv-pairs> and closing }, and I've visually laid these parts out to correspond to the input.
<[...]> is the Raku syntax for a character class, like [...] in traditional regex syntax. It's more powerful, but for now all you need to know is that the - in <-[,]> indicates negation, i.e. the same as the ^ in the [^,] syntax of ye olde regex. So <-[,]>+ attempts a match of one or more characters, none of which are ,.
$<id>=<-[,]>+ tells Raku to attempt to match the quantified atom on the right of the = (i.e. the <-[,]>+ bit) and store the results at the key <id> within the current match object. The latter will be hung from a branch of the parse tree; we'll get to precisely where later.
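As a small standalone illustration of a negated character class combined with a named capture, here is a sketch that uses an ordinary regex outside any grammar. The variable names are chosen for this example; the input is the first line of the first article in the question.

```raku
my $line = '@article{garg2017patch,';

# <-[,]>+ matches one or more characters that are not commas;
# the $<id>= prefix stores that match under the key "id".
my $m = $line ~~ / '@article{' $<id>=<-[,]>+ ',' /;

say $m<id>.Str;   # garg2017patch
```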

rule kv-pairs { <kv-pair>* % ',' }

This pattern illustrates one of several convenient Raku regex features. It declares you want to match zero or more kv-pairs separated by commas.
(In more detail, the % regex infix operator requires that matches of the quantified atom on its left are separated by the atom on its right.)
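The effect of % can be tried in isolation. The sketch below uses a plain regex (no whitespace handling) and invented sample strings; $list-of-words is a name made up for this example.

```raku
# Zero or more runs of word characters, separated by commas.
my $list-of-words = / ^ [\w+]* % ',' $ /;

my $good  = ('title,author,year' ~~ $list-of-words).so;
my $empty = ('' ~~ $list-of-words).so;
my $bad   = ('title,,year' ~~ $list-of-words).so;

say $good;    # True
say $empty;   # True  (zero items is allowed by *)
say $bad;     # False (nothing between the two commas)
```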

rule kv-pair { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }

The new bit here is '={' ~ '}'. This is another convenient regex feature. The regex Tilde operator parses a delimited structure (in this case one with a ={ opener and } closer) with the bit between the delimiters matching the quantified regex atom on the right of the closer. This confers several benefits but the main one is that error messages can be much clearer.

Constructing / deconstructing the parse tree

for $<article> -> $/ { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }

$<article>, $<id> etc. refer to named match objects that are stored somewhere in the "parse tree". But how did they get there? And exactly where is "there"?
Returning to the top of the grammar:
rule TOP {

If a .parse is successful, a single 'TOP' level match object is returned. (After a parse is complete the variable $/ is also bound to that top match object.) During parsing a tree will have been formed by hanging other match objects off this top object, and then others hung off those, and so on.
Addition of match objects to a parse tree is done by assigning either a single generated match object, or a list of them, to either a Positional (numbered) or Associative (named) capture of a "parent" match object. This process is explained below.

rule TOP { <article>* }

A rule invocation like <article> has two effects. First, Raku tries to match the rule. Second, if it matches, Raku captures that match by generating a corresponding match object and adding it to the parse tree as a named key "child" of the parent match object. In this case it'll be stored in the <article> key of the top match object of the parse tree.
If the successfully matched pattern had been just <article>, rather than <article>*, then only one match would have been attempted and only one value, a single match object, would have been generated and added under the key <article>.
But the pattern was <article>*, not merely <article>. So Raku attempts to match the article rule multiple times, and, if it matches at all, then a list of one or more match objects is stored as the value of the <article> key. (See my answer to "How do I access the captures within a match?" for a more detailed explanation.)
$<article> is short for $/<article>. It refers to the value stored under the <article> key of the current match object (which is stored in $/). In this case that value is a list of 3 match objects corresponding to the 3 articles in the input.
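The single-versus-list distinction can be observed with a tiny sketch outside the grammar. The token name entry-word and the input are invented for this illustration.

```raku
my token entry-word { \w+ }

# Unquantified: the key holds a single match object.
my $single = 'ab cd' ~~ / <entry-word> /;

# Quantified: the key holds a list of match objects.
my $multi = 'ab cd' ~~ / <entry-word>+ % ' ' /;

say $single<entry-word>.Str;                     # ab
say $multi<entry-word>.elems;                    # 2
say $multi<entry-word>.map(*.Str).join('|');     # ab|cd
```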

rule article { '@article{' $<id>=<-[,]>+ ','

Each of the three match objects stored in the list stored under the <article> key of the top match object has several "child" match objects generated in accord with the article rule's pattern. The first of these is due to the $<id>=<-[,]>+ part.
To explain this, let's consider just the first of the three article match objects, the one corresponding to the text that starts "@article{garg2017patch,...". The article rule matches this article. As it's doing that matching, the $<id>=<-[,]>+ part tells Raku to store the match object corresponding to the id part of the article under that article match object's <id> key.

Hopefully this is enough (quite possibly way too much!) and I can at last exhaustively (exhaustingly?) explain the last line of code, which, once again, was:
for $<article> -> $/ { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }

At the level of the for, the variable $/ refers to the top of the parse tree generated by the parse that just completed. Thus $<article>, which is shorthand for $/<article>, refers to the list of three article match objects.
The for then iterates over that list, binding $/ (within the lexical scope of the -> $/ { ... } block) to each of those 3 article match objects in turn.
The $<id> bit is shorthand for $/<id>, which inside the block refers to the <id> key within the article match object that $/ has been bound to. In other words, $<id> inside the block is equivalent to the <id> key of the corresponding element of $<article> outside the block (for example, $<article>[0]<id> for the first article).
The $<kv-pairs><kv-pair>[0]<value> follows the same scheme, albeit with more levels and a positional child (the [0]) in the midst of all the named (associative) children.
(Note that there was no need for the article pattern to include a $<kv-pairs>=<kv-pairs> because Raku just presumes a pattern of the form <foo> should store its results under the key <foo>. If you wish to disable that, write a pattern with a non-alpha character as the first symbol. For example, use <.foo> if you want to have exactly the same matching effect as <foo> but just not store the matched input in the parse tree.)
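Here is a small sketch of the capturing versus non-capturing call forms. The grammar name CaptureDemo and its tokens are hypothetical, written only for this illustration.

```raku
grammar CaptureDemo {
    # <field> is captured under the key "field";
    # <.comma> matches the same way but is not stored.
    token TOP   { <field> <.comma> <field> }
    token field { \w+ }
    token comma { ',' }
}

my $m = CaptureDemo.parse('ab,cd');

say $m.hash.keys.sort.join(',');   # field
say $m<field>.elems;               # 2
```

Only the field key appears in the resulting match object, even though the comma token was needed for the parse to succeed.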
Phew!

When the automatically generated parse tree isn't what you want

As if all the above were not enough, I need to mention one more thing.
The parse tree strongly reflects the implicit tree structure of the grammar. But getting this structure as a result of a parse is sometimes inconvenient. One may want a different tree structure instead, perhaps a much simpler tree, perhaps some non-tree data structure.
The primary mechanism for generating exactly what you want from a parse, when the automatic results aren't suitable, is make. (This can be used in code blocks inside rules or factored out into Action classes that are separate from grammars.)
In turn, the primary use case for make is to generate a sparse tree of nodes hanging off the parse tree.
Finally, the primary use case for these sparse trees is storing an AST.
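As a rough sketch of make used in code blocks inside rules, the hypothetical grammar below turns a comma-separated list of key=value fields into a plain hash instead of a tree of match objects. The grammar name KeyValue, its tokens, and the input string are all invented for this example; this is not how the bib grammar above is written.

```raku
grammar KeyValue {
    # Build one hash from the Pair objects made by each field.
    token TOP   { <field>+ % ',' { make %( $<field>.map(*.made) ) } }

    # Attach a Pair to each field's match object.
    token field {
        $<key>=\w+ '=' $<value>=\w+
        { make Pair.new($<key>.Str, $<value>.Str) }
    }
}

# .made retrieves whatever make attached to the top match object.
my %fields = KeyValue.parse('year=2017,volume=113').made;

say %fields<year>;     # 2017
say %fields<volume>;   # 113
```

The same make calls could instead live in a separate actions class passed to .parse, which keeps the grammar itself free of data-structure concerns.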

Footnotes

[1] Basic Raku is good for exploratory programming, spikes, one-offs, PoCs and other scenarios where the emphasis is on quickly producing working code that can be refactored later if need be.
39 votes

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/45172113
