
Question: Faux parser for FDF files

Code Review user

Asked 2019-01-24 03:34:39

1 answer · 227 views · 0 followers · 2 votes

Introduction

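If you know what an FDF file is, you can skip this section.

PDF files sometimes include form fields. These fields can be extracted with certain utilities and represented in plain text as FDF. The important things to note about this file are:

A field is represented by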

/V ()
/T (Field)

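where "Field" is the field name, and whatever is written in the parentheses next to /V becomes the input for the utility that fills the form.

Fields can be nested to arbitrary depth. A basic example is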

/Kids [
<<
/V ()
/T (node1)
>> 
<<
/V ()
/T (node2)
>>]
/T (root)

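Here there are two fields, node1 and node2, both nested under root. When fields are nested, the name the user sees is made up of the components of the hierarchy, separated by periods. For example, the name of node1 is root.node1.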
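The naming rule can be sketched in a few lines of R. The nested-list representation and the `fieldNames` helper below are assumptions made for illustration only; they are not part of the code under review.

```r
# Assumed representation: a field is a list with a `name` and, for
# non-terminal nodes, a list of `kids`. This mirrors the basic example above.
root <- list(name = "root", kids = list(
    list(name = "node1"),
    list(name = "node2")
))

# Hypothetical helper: recursively collect the user-visible names of all
# terminal fields by joining the path components with ".".
fieldNames <- function(node, prefix = NULL) {
    path <- c(prefix, node$name)
    if (is.null(node$kids)) {
        return(paste(path, collapse = "."))
    }
    unlist(lapply(node$kids, fieldNames, prefix = path))
}

fieldNames(root)
# "root.node1" "root.node2"
```

A field that is not nested simply keeps its own name, since its path has a single component.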

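Code

The overall goal is to determine which lines need to be modified when the user supplies a set of field names (e.g. Field, root.node1).

This code reads an FDF file and labels the lines that should be modified in order to fill the field with a given name. My approach is to iterate over every line to detect the roots of trees and prepend the name of the root to each of its children. Since this will end up on CRAN, I want to make sure the method isn't overly convoluted.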

fdfAnnotate = function(fdfLines){
  fields = vector(length = length(fdfLines),mode= 'character')
  nests = 0
  # iterate over every line
  for (i in seq_along(fdfLines)){
    if(grepl('/T \\(',fdfLines[i])){
      # /T represents a field or a root name
      # take the name
      name = stringr::str_extract(fdfLines[i],'(?<=\\().*?(?=\\))')
      if(grepl('/V',fdfLines[i-1])){
        # if the line before the naming line contains /V
        # there is no hierarchy, just name the line
        fields[i-1] = name
      } else if(grepl('>>\\]',fdfLines[i-1])){
        # if the line above the name is >>] the name represents a root
        # start reading from the line above
        z = i-2
        # this keeps track of the nest levels.
        # we will be reading the file backwards trying to
        # reach to the end of this root
        nest = 1
        while(nest!=0){
          if(grepl('/V',fdfLines[z])){
            # if a field is found, append the name of the root to the left
            # separated by a "."
            fields[z] = paste0(name,'.',fields[z])
          } else if(grepl('>>\\]',fdfLines[z])){
            # if another nest stops, that means we are inside another root
            nest = nest + 1
          } else if(grepl('/Kids \\[',fdfLines[z])){
            # reading backwards, "/Kids [" marks the end of a root:
            # reduce the nest level. once it reaches 0 we are done
            nest = nest - 1
          }
          # go back one line in the file.
          z = z - 1
        }
      }
    }
  }
  data.frame(fdfLines,fields,stringsAsFactors = FALSE)
}

Usage

You can use it by running

fdfLines = readLines([pathToFDFfile])
fdfAnnotate(fdfLines)

Below is the FDF file used for testing. It includes single-level and two-level hierarchies, as well as a set of ordinary fields.

%FDF-1.2
%âãÏÓ
1 0 obj 
<<
/FDF 
<<
/Fields [
<<
/V ()
/T (node1)
>> 
<<
/Kids [
<<
/Kids [
<<
/V ()
/T (node1)
>> 
<<
/V ()
/T (node3)
>> 
<<
/V ()
/T (node2)
>>]
/T (child)
>> 
<<
/Kids [
<<
/V ()
/T (node1)
>> 
<<
/V ()
/T (node2)
>>]
/T (child2)
>>]
/T (hierarchy2)
>> 
<<
/V ()
/T (TextField1)
>> 
<<
/V ()
/T (TextField2)
>> 
<<
/V ()
/T (TextFieldPage2)
>> 
<<
/V ()
/T (List Box)
>> 
<<
/V ()
/T (TextFieldPage3)
>> 
<<
/Kids [
<<
/V ()
/T (node1)
>> 
<<
/V ()
/T (node4)
>> 
<<
/V ()
/T (node3)
>> 
<<
/V ()
/T (node2)
>>]
/T (hierarchy)
>> 
<<
/V ()
/T (betweenHierarch)
>> 
<<
/V /Off
/T (RadioGroup)
>> 
<<
/V /Off
/T (checkBox)
>>]
>>
>>
endobj 
trailer

<<
/Root 1 0 R
>>
%%EOF

This file represents the form fields of a PDF file.

Using my function, the output is:

                fdfLines                  fields
1               %FDF-1.2                        
2                  %âãÏÓ                        
3               1 0 obj                         
4                     <<                        
5                  /FDF                         
6                     <<                        
7              /Fields [                        
8                     <<                        
9                  /V ()                   node1
10            /T (node1)                        
11                   >>                         
12                    <<                        
13               /Kids [                        
14                    <<                        
15               /Kids [                        
16                    <<                        
17                 /V ()  hierarchy2.child.node1
18            /T (node1)                        
19                   >>                         
20                    <<                        
21                 /V ()  hierarchy2.child.node3
22            /T (node3)                        
23                   >>                         
24                    <<                        
25                 /V ()  hierarchy2.child.node2
26            /T (node2)                        
27                   >>]                        
28            /T (child)                        
29                   >>                         
30                    <<                        
31               /Kids [                        
32                    <<                        
33                 /V () hierarchy2.child2.node1
34            /T (node1)                        
35                   >>                         
36                    <<                        
37                 /V () hierarchy2.child2.node2
38            /T (node2)                        
39                   >>]                        
40           /T (child2)                        
41                   >>]                        
42       /T (hierarchy2)                        
43                   >>                         
44                    <<                        
45                 /V ()              TextField1
46       /T (TextField1)                        
47                   >>                         
48                    <<                        
49                 /V ()              TextField2
50       /T (TextField2)                        
51                   >>                         
52                    <<                        
53                 /V ()          TextFieldPage2
54   /T (TextFieldPage2)                        
55                   >>                         
56                    <<                        
57                 /V ()                List Box
58         /T (List Box)                        
59                   >>                         
60                    <<                        
61                 /V ()          TextFieldPage3
62   /T (TextFieldPage3)                        
63                   >>                         
64                    <<                        
65               /Kids [                        
66                    <<                        
67                 /V ()         hierarchy.node1
68            /T (node1)                        
69                   >>                         
70                    <<                        
71                 /V ()         hierarchy.node4
72            /T (node4)                        
73                   >>                         
74                    <<                        
75                 /V ()         hierarchy.node3
76            /T (node3)                        
77                   >>                         
78                    <<                        
79                 /V ()         hierarchy.node2
80            /T (node2)                        
81                   >>]                        
82        /T (hierarchy)                        
83                   >>                         
84                    <<                        
85                 /V ()         betweenHierarch
86  /T (betweenHierarch)                        
87                   >>                         
88                    <<                        
89               /V /Off              RadioGroup
90       /T (RadioGroup)                        
91                   >>                         
92                    <<                        
93               /V /Off                checkBox
94         /T (checkBox)                        
95                   >>]                        
96                    >>                        
97                    >>                        
98               endobj                         
99               trailer                        
100                                             
101                   <<                        
102          /Root 1 0 R                        
103                   >>                        
104                %%EOF
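
Given a table like the one above, the stated goal (finding the line to modify for a given field name) reduces to a lookup on the fields column. The sketch below is only an illustration: `fillField` is a hypothetical helper, the table is a hand-built excerpt rather than real output, and it assumes a text field whose value contains no parentheses or backslashes that would need escaping.

```r
# Hand-built excerpt in the shape that fdfAnnotate() returns.
annotated <- data.frame(
    fdfLines = c("<<", "/V ()", "/T (TextField1)", ">> ", "<<",
                 "/V ()", "/T (node1)", ">>]", "/T (hierarchy)"),
    fields = c("", "TextField1", "", "", "",
               "hierarchy.node1", "", "", ""),
    stringsAsFactors = FALSE
)

# Hypothetical helper: write `value` into the /V line of the named field.
fillField <- function(annotated, fieldName, value) {
    i <- which(annotated$fields == fieldName)
    if (length(i) != 1) {
        stop("expected exactly one line for field '", fieldName, "'")
    }
    annotated$fdfLines[i] <- paste0("/V (", value, ")")
    annotated
}

filled <- fillField(annotated, "hierarchy.node1", "hello")
filled$fdfLines[6]
# "/V (hello)"
```

Radio groups and check boxes in the sample use a different value syntax (/V /Off), so they would need separate handling.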

parsing

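1 Answer

Code Review user

Accepted answer

Posted 2019-01-25 03:34:23

I tried wrapping my head around this file format (pretty odd?!) and realized that building the tree structure is much easier if you read the file from the bottom up, don't you agree? Based on that, I came up with a much simpler implementation in which I only maintain a stack (last in, first out) of field names. The output is identical for the sample data; I hope I didn't miss anything.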

fdfAnnotate <- function(fdfLines) {
    fields <- vector(length = length(fdfLines), mode = "character")
    store <- NULL
    for (i in rev(seq_along(fdfLines))) {
        line <- fdfLines[i]
        if (grepl("/V", line)) {
            fields[i] <- paste(store, collapse = ".")
            store <- head(store, -1)
        } else if (grepl("/T [(]", line)) {
            name <- sub(".*[(](.*)[)].*", "\\1", line)
            store <- c(store, name)
        } else if (grepl("/Kids \\[", line)) {
            store <- head(store, -1)
        }
    }
    data.frame(fdfLines, fields, stringsAsFactors = FALSE)
}
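
To see how the stack evolves, the same bottom-up logic can be instrumented to record the stack after each line. `traceStack` below is a hypothetical tracing variant written for illustration, applied to the basic root/node1/node2 example from the question; it is not part of the answer's implementation.

```r
# The basic example from the question, one FDF line per element.
fdfLines <- c("/Kids [", "<<", "/V ()", "/T (node1)", ">> ", "<<",
              "/V ()", "/T (node2)", ">>]", "/T (root)")

# Same bottom-up rules as above, but also recording the stack contents
# after each line has been processed.
traceStack <- function(fdfLines) {
    n <- length(fdfLines)
    field <- character(n)
    stack <- character(n)
    store <- NULL
    for (i in rev(seq_along(fdfLines))) {
        line <- fdfLines[i]
        if (grepl("/V", line)) {
            # a value line takes the full path, then its own name is popped
            field[i] <- paste(store, collapse = ".")
            store <- head(store, -1)
        } else if (grepl("/T [(]", line)) {
            # a name line pushes its name
            store <- c(store, sub(".*[(](.*)[)].*", "\\1", line))
        } else if (grepl("/Kids \\[", line)) {
            # the opening of a /Kids array pops the root's name
            store <- head(store, -1)
        }
        stack[i] <- paste(store, collapse = " > ")
    }
    data.frame(fdfLines, field, stack, stringsAsFactors = FALSE)
}

trace <- traceStack(fdfLines)
# Print in processing order (last line first).
print(trace[rev(seq_len(nrow(trace))), ])
```

Reading the printed table from top to bottom, root is pushed first, each node name is pushed and then popped at its /V line, and the stack is empty again once /Kids [ is reached.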

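Some general comments and other suggestions

Given that this is going on CRAN, I would limit package dependencies as much as possible. Here I replaced stringr::str_extract with a call to base sub.
You are not using the commonly accepted syntax. To list a few points: use <- for assignment; use double quotes rather than single quotes; put spaces after commas, around binary operators, after if, and before {.
The code makes a lot of assumptions about the input, essentially that it comes from a valid FDF file. Maybe some of these assumptions should be tested.
Are the regular expressions robust enough? For example, should you be using ^ and $ where appropriate?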
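
As a rough sketch of the last two points, anchored patterns and a basic sanity check might look like the following. The line shapes are assumptions inferred only from the sample file in the question (one token per line, optional trailing whitespace), and `isValueLine` and `checkFdf` are hypothetical names, not an established API.

```r
# Anchored test for a value line: /V must start the line.
isValueLine <- function(x) grepl("^/V( |$)", x)

# An unanchored pattern is fooled by a field whose name contains "/V".
grepl("/V", "/T (a/V)")    # TRUE: would be mistaken for a value line
isValueLine("/T (a/V)")    # FALSE
isValueLine("/V /Off")     # TRUE

# Hypothetical sanity check: returns a character vector of problems,
# empty if none were found.
checkFdf <- function(fdfLines) {
    problems <- character(0)
    if (!grepl("^%FDF-", fdfLines[1])) {
        problems <- c(problems, "first line is not an %FDF- header")
    }
    # In the sample, every "/Kids [" or "/Fields [" is closed by a ">>]" line.
    opens <- sum(grepl("^/(Kids|Fields) \\[[[:space:]]*$", fdfLines))
    closes <- sum(grepl("^>>\\][[:space:]]*$", fdfLines))
    if (opens != closes) {
        problems <- c(problems, "unbalanced array openings and '>>]' lines")
    }
    problems
}

example <- c("%FDF-1.2", "/Fields [", "<<", "/Kids [", "<<", "/V ()",
             "/T (a/V)", ">>]", "/T (root)", ">>]")
checkFdf(example)          # character(0)
```

Files produced by other tools may lay out tokens differently, in which case these patterns would need to be relaxed.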

Votes: 1

Original page content provided by Code Review. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://codereview.stackexchange.com/questions/212118
