Q: Loop through Word/PDF documents and extract specific text into a table in R

Stack Overflow user

Asked 2018-01-05 11:39:23

Answers: 1 · Views: 958 · Followers: 0 · Votes: 0

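I have a folder containing about 150 Word and PDF files (the same text in both formats). The data is here: sheet2003.pdf

The text always looks similar to this (after loading it with pdftools):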

library(pdftools)
u <- pdf_text("AB0003_ERP57_AB_data_sheet200.pdf")

[1] "                                                                     Product Data Sheet\r\n                                                                                      001 Rev1 Jan 2012 by JR\r\nCatalogue No. AB0003-200\r\nQty: 400 µg (2 mg/ml)\r\n                                  ERp57 Polyclonal Antibody\r\nSource: Goat                                               phospholipase C alpha, PI PLC, protein disulfide\r\n                                                           isomerase A3 antibody.\r\nGeneral description: Goat polyclonal to ERp57 -\r\nendoplasmic reticulum lumen marker. This                   Form: Polyclonal antibody supplied as a 200 µl\r\nendoplasmic reticulum protein interacts with lectin        (2 mg/ml) aliquot in PBS, 20% glycerol and 0.05%\r\nchaperones calreticulin and calnexin to modulate           sodium azide. This antibody is epitope-affinity\r\nfolding of newly synthesized glycoproteins. It has         purified from goat antiserum.\r\ndisulfide isomerase activity and complexes of\r\nlectins and this protein mediate protein folding by        Immunogen: Recombinant peptide derived from\r\npromoting formation of disulfide bonds in their            within residues 300 aa to the C-terminus of human\r\nglycoprotein substrates.                                   ERp57 produced in E. 
coli.\r\nAlternative names: 58 kDa glucose regulated                Specificity: Detects a band of 60 kDa by Western\r\nprotein, 58 kDa microsomal protein, disulfide              blot in the following canine, human, monkey,\r\nisomerase ER 60, endoplasmic reticulum resident            mouse, rat whole cell lysates.\r\nprotein 57, endoplasmic reticulum resident protein\r\n60, ER protein 57, ER protein 60, ER protein 61,\r\nERP57, ERp60, ERp61, glucose regulated protein\r\n58 Kd, GRP57, GRP58, HsT17083, P58, PDIA3,\r\nReactivity: Reacts against human, rat, mouse, canine and monkey proteins.\r\nSample                Western blot      Immuno-        Histochemistry (paraffin)     Histochemistry (frozen)\r\n                                        fluorescence\r\nhuman                 +++               +++            +++                           +++\r\nrat                   +++               +++            +++                           +++\r\nmouse                 +++               +++            +++                           +++\r\ncanine                +++               +++            +++                           +++\r\nmonkey                +++               +++            +++                           +++\r\n+++ excellent, ++ good, + poor, ND not determined\r\nUsage: Western blot                    1:500-1:2,000       Storage: Store at -20 C for long-term storage. Store\r\nImmunofluorescence                        1:50-1:500       at 2-8 C for up to one month.\r\nImmunohistochemistry (paraffin)        1:200-1:1,000\r\nImmunohistochemistry (frozen)          1:200-1:1,000       Special instructions: Avoid freeze/thaw cycles.\r\nSICGEN - Research and Development in Biotechnology Ltd\r\nEstrada do Pombalinho, Rabaçal, 3230-544 PENELA – PORTUGAL\r\nwww.sicgen.pt                                                                           information@sicgen.pt\r\n"
[2] "                                                                          Product Data Sheet\r\n                                                                                             001 Rev1 Jan 2012 by JR\r\nReferences:\r\n                                    For research use only, not for diagnostic use\r\nSICGEN's Proprietary Immunogen Policy\r\nIn order to produce high specific antibodies SICGEN has invested a lot of time and effort into selecting immunogen\r\nsequences. SICGEN has decided to protect this information by not publishing it on the website. However, these sequences\r\nare available on request.\r\nSICGEN - Research and Development in Biotechnology Ltd\r\nEstrada do Pombalinho, Rabaçal, 3230-544 PENELA – PORTUGAL\r\nwww.sicgen.pt                                                                                  information@sicgen.pt\r\n"

I would like to convert this into a data frame or table, in R or Excel:

 Catalogue.No.  Name Source.
1    AB0003-200 ERp57    Goat
2    AB0004-500 (...)   (...)
                                                                                                  General.Description
1 Goat polyclonal to ERp57 -  endoplasmic reticulum lumen marker.  This endoplasmic reticulum protein interacts (...)
2                                                                                                               (...)
                        Alternative.names.
1 58 kDa glucose  regulated protein, (...)
2                                    (...)
                                                               Form.
1 Polyclonal antibody supplied as a  200 µl (2 mg/ml) aliquot in PBS
2                                                              (...)
                                                       Immunogen
1 Recombinant peptide derived  from within residues 300 aa (...)
2                                                          (...)
                       Specificity.                     Reactivity.
1 Detects a band of  60 kDa by(...) Reacts against  human, rat, ...
2                             (...)                           (...)
                                         Usage.
1 Western blot 1:500-1:2,000 Immunofluorescence
2                                         (...)

I would like to format this as a table. The input is the text imported from the PDF file with pdf_text("AB0003_ERP57_AB_data_sheet200.pdf"), exactly as shown in the first output above.

Please let me know if you have any suggestions.

excel

pdf

ms-word

text-mining

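1 Answer

Stack Overflow user

Answered 2018-01-05 15:18:30

I can't post code in a comment, so here is one possible approach using pdftools and regular expressions.

Data

I used the same data you provided and saved it to a PDF named "pdf_catalogue.pdf".

Code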

library(pdftools)
u <- pdf_text("pdf_catalogue.pdf")

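# Return the text captured by the first (...) group in `pattern`,
# with carriage returns and line feeds removed; NA if there is no match.
# Only the first element of `string` (the first PDF page) is used.
get_string <- function(pattern, string){
  inter_list <- regmatches(string, regexec(pattern, string))
  # inter_list[[1]] holds the full match followed by the capture group(s);
  # it is empty when the pattern does not match.
  if(length(inter_list) > 0 && length(inter_list[[1]]) > 1){

    replace_patterns_list <- list("\r", "\n") # add others as required
    replace_patterns <- paste(unlist(replace_patterns_list), collapse = "|")

    inter_string <- gsub(replace_patterns, "", inter_list[[1]][2])
    return(inter_string)
  }
  NA_character_
}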

pat_source <- "Source: (.*)General description"
pat_description <- "General description: (.*)Alternative"
pat_form <- "Form: (.*)Immunogen"
pat_names <- "Alternative names: (.*)Form"

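dat <- list(Source = get_string(pat_source, u),
        General_description = get_string(pat_description, u),
        # Use pat_form here. The post as written passed pat_source,
        # which is why $Form shows "Goat" in the output below.
        Form = get_string(pat_form, u),
        Alternative_names = get_string(pat_names, u))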

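The get_string function returns whatever lies between the strings placed before and after (.*) in the pattern. This rests on the assumption, implied by your question, that the file structure is consistent. Because (.*) is greedy and matches as much as it can, you may need (.*?) for a "lazy" match that stops at the first occurrence of the closing string. If you are not familiar with regular expressions, there is a very good video on them.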
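To make the greedy versus lazy difference concrete, here is a small sketch. The string x and the helper first_group are made up for illustration and are not part of the code above; the behaviour shown is that of R's default regex engine.

```r
# Toy string (invented for this example): " Form" occurs twice after the field.
x <- "Alternative names: GRP57, GRP58 Form: liquid. Form factor: vial"

# Hypothetical helper: return the first capture group, or NA if nothing matches.
first_group <- function(pattern, string) {
  m <- regmatches(string, regexec(pattern, string))[[1]]
  if (length(m) >= 2) m[2] else NA_character_
}

# Greedy: (.*) runs on to the LAST " Form" in the string.
greedy <- first_group("Alternative names: (.*) Form", x)

# Lazy: (.*?) stops at the FIRST " Form".
lazy <- first_group("Alternative names: (.*?) Form", x)

print(greedy)
print(lazy)
```

If a closing word such as "Form" can appear more than once in a data sheet, the lazy version is usually the safer default.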

Output

> dat
$Source
[1] "Goat"

$General_description
[1] "Goat polyclonal to ERp57 - endoplasmic reticulum lumen marker.This endoplasmic reticulum protein interacts with lectin chaperones calreticulin andcalnexin to modulate folding of newly synthesized glycoproteins. It has disulfideisomerase activity and complexes of lectins and this protein mediate protein folding bypromoting formation of disulfide bonds in their glycoprotein substrates."

$Form
[1] "Goat"

$Alternative_names
[1] "58 kDa glucose regulated protein, 58 kDa microsomal protein,disulfide isomerase ER 60, endoplasmic reticulum resident protein 57, endoplasmicreticulum resident protein 60, ER protein 57, ER protein 60, ER protein 61, ERP57,ERp60, ERp61, glucose regulated protein 58 Kd, GRP57, GRP58, HsT17083, P58,PDIA3, phospholipase C alpha, PI PLC, protein disulfide isomerase A3 antibody."

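You may want to split the output further, depending on its structure. For example, in Alternative names the names are all separated by commas. You could try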

> strsplit(dat$Alternative_names, ", ")

which gives

[[1]]
 [1] "58 kDa glucose regulated protein"                   
 [2] "58 kDa microsomal protein,disulfide isomerase ER 60"
 [3] "endoplasmic reticulum resident protein 57"          
 [4] "endoplasmicreticulum resident protein 60"           
 [5] "ER protein 57"                                      
 [6] "ER protein 60"                                      
 [7] "ER protein 61"                                      
 [8] "ERP57,ERp60"                                        
 [9] "ERp61"                                              
[10] "glucose regulated protein 58 Kd"                    
[11] "GRP57"                                              
[12] "GRP58"                                              
[13] "HsT17083"                                           
[14] "P58,PDIA3"                                          
[15] "phospholipase C alpha"                              
[16] "PI PLC"                                             
[17] "protein disulfide isomerase A3 antibody."

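Note that splitting on a comma followed by a space (", ") leaves two names in the second element, because the line break that separated them was removed and no space follows that comma. Split on "," alone to avoid such errors. This matters especially for .pdf files, where line breaks fall in arbitrary places. You can also split a multi-line block into separate fields by defining suitable break points, for example a period followed by an uppercase letter. Regular expressions should let you handle all such cases.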
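A minimal sketch of both ideas, using shortened sample strings rather than the full output above. The variable names are illustrative only.

```r
# Names where some commas are followed by a space and some are not.
names_raw <- "58 kDa microsomal protein,disulfide isomerase ER 60, ERP57,ERp60, ERp61"

# Split on a comma plus any whitespace after it, then trim stray spaces.
parts <- trimws(strsplit(names_raw, ",[[:space:]]*")[[1]])
print(parts)

# Sentence-like break points: a period, optional whitespace, then an uppercase letter.
desc <- paste0(
  "Goat polyclonal to ERp57 - endoplasmic reticulum lumen marker.",
  "This endoplasmic reticulum protein interacts with lectin chaperones. ",
  "It has disulfide isomerase activity."
)

# Mark each break point with a newline, keeping the uppercase letter, then split on it.
marked <- gsub("\\.[[:space:]]*([A-Z])", ".\n\\1", desc)
sentences <- strsplit(marked, "\n", fixed = TRUE)[[1]]
print(sentences)
```

The sentence rule is only a heuristic: an abbreviation followed by a capitalised word would also be treated as a break point, so check the result against a few real data sheets.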

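This is a very minimal example, but you can easily build on it to cover the other fields or combinations you want from the files.

For multiple files, I suggest wrapping all of this in a function (once you have finalised the code) and using lapply to iterate over the directory. I use a similar approach to process .txt and .csv files.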
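One way the wrap-and-lapply idea could look is sketched below. The names extract_fields and process_files, the reader argument, the "data_sheets" directory, and the output file name are all assumptions made for this example. The patterns assume the fields appear in the order Source, General description, Alternative names in the extracted text. With a two-column layout like the pdf_text output in the question, text from the right-hand column is interleaved, so the patterns would need adjusting.

```r
# Extract a few fields from the text of one document into a one-row data frame.
extract_fields <- function(text) {
  # pdf_text() returns one string per page: join the pages, then collapse
  # line breaks and runs of spaces into single spaces.
  flat <- gsub("[[:space:]]+", " ", paste(text, collapse = " "))

  # First capture group of `pattern`, or NA when the field is missing.
  grab <- function(pattern) {
    m <- regmatches(flat, regexec(pattern, flat))[[1]]
    if (length(m) >= 2) trimws(m[2]) else NA_character_
  }

  data.frame(
    Catalogue_No        = grab("Catalogue No\\. ([^ ]+)"),
    Source              = grab("Source: (.*?) General description"),
    General_description = grab("General description: (.*?) Alternative names"),
    stringsAsFactors    = FALSE
  )
}

# Apply extract_fields() to every path and stack the rows.
# `reader` defaults to pdftools::pdf_text but can be any function(path) -> character.
process_files <- function(paths, reader = pdftools::pdf_text) {
  rows <- lapply(paths, function(p) {
    out <- extract_fields(reader(p))
    out$File <- basename(p)
    out
  })
  do.call(rbind, rows)
}

# Usage (not run here; "data_sheets" is a placeholder directory):
# files <- list.files("data_sheets", pattern = "\\.pdf$", full.names = TRUE)
# catalogue <- process_files(files)
# write.csv(catalogue, "catalogue.csv", row.names = FALSE)
```

Missing fields come back as NA rather than stopping the loop, which makes it easier to spot the files whose layout differs from the rest.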

Hope this helps. Cheers!


Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/48112885
