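I have a folder with about 150 Word and PDF files (each pair contains the same text). The data is here: sheet2003.pdf
After loading with pdftools, the text always looks like this: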
library(pdftools)
u <- pdf_text("AB0003_ERP57_AB_data_sheet200.pdf")
[1] " Product Data Sheet\r\n 001 Rev1 Jan 2012 by JR\r\nCatalogue No. AB0003-200\r\nQty: 400 µg (2 mg/ml)\r\n ERp57 Polyclonal Antibody\r\nSource: Goat phospholipase C alpha, PI PLC, protein disulfide\r\n isomerase A3 antibody.\r\nGeneral description: Goat polyclonal to ERp57 -\r\nendoplasmic reticulum lumen marker. This Form: Polyclonal antibody supplied as a 200 µl\r\nendoplasmic reticulum protein interacts with lectin (2 mg/ml) aliquot in PBS, 20% glycerol and 0.05%\r\nchaperones calreticulin and calnexin to modulate sodium azide. This antibody is epitope-affinity\r\nfolding of newly synthesized glycoproteins. It has purified from goat antiserum.\r\ndisulfide isomerase activity and complexes of\r\nlectins and this protein mediate protein folding by Immunogen: Recombinant peptide derived from\r\npromoting formation of disulfide bonds in their within residues 300 aa to the C-terminus of human\r\nglycoprotein substrates. ERp57 produced in E. coli.\r\nAlternative names: 58 kDa glucose regulated Specificity: Detects a band of 60 kDa by Western\r\nprotein, 58 kDa microsomal protein, disulfide blot in the following canine, human, monkey,\r\nisomerase ER 60, endoplasmic reticulum resident mouse, rat whole cell lysates.\r\nprotein 57, endoplasmic reticulum resident protein\r\n60, ER protein 57, ER protein 60, ER protein 61,\r\nERP57, ERp60, ERp61, glucose regulated protein\r\n58 Kd, GRP57, GRP58, HsT17083, P58, PDIA3,\r\nReactivity: Reacts against human, rat, mouse, canine and monkey proteins.\r\nSample Western blot Immuno- Histochemistry (paraffin) Histochemistry (frozen)\r\n fluorescence\r\nhuman +++ +++ +++ +++\r\nrat +++ +++ +++ +++\r\nmouse +++ +++ +++ +++\r\ncanine +++ +++ +++ +++\r\nmonkey +++ +++ +++ +++\r\n+++ excellent, ++ good, + poor, ND not determined\r\nUsage: Western blot 1:500-1:2,000 Storage: Store at -20 C for long-term storage. 
Store\r\nImmunofluorescence 1:50-1:500 at 2-8 C for up to one month.\r\nImmunohistochemistry (paraffin) 1:200-1:1,000\r\nImmunohistochemistry (frozen) 1:200-1:1,000 Special instructions: Avoid freeze/thaw cycles.\r\nSICGEN - Research and Development in Biotechnology Ltd\r\nEstrada do Pombalinho, Rabaçal, 3230-544 PENELA – PORTUGAL\r\nwww.sicgen.pt information@sicgen.pt\r\n"
[2] " Product Data Sheet\r\n 001 Rev1 Jan 2012 by JR\r\nReferences:\r\n For research use only, not for diagnostic use\r\nSICGEN's Proprietary Immunogen Policy\r\nIn order to produce high specific antibodies SICGEN has invested a lot of time and effort into selecting immunogen\r\nsequences. SICGEN has decided to protect this information by not publishing it on the website. However, these sequences\r\nare available on request.\r\nSICGEN - Research and Development in Biotechnology Ltd\r\nEstrada do Pombalinho, Rabaçal, 3230-544 PENELA – PORTUGAL\r\nwww.sicgen.pt information@sicgen.pt\r\n"
I would like to convert this into a data frame or table, in R or Excel:
Catalogue.No. Name Source.
1 AB0003-200 ERp57 Goat
2 AB0004-500 (...) (...)
General.Description
1 Goat polyclonal to ERp57 - endoplasmic reticulum lumen marker. This endoplasmic reticulum protein interacts (...)
2 (...)
Alternative.names.
1 58 kDa glucose regulated protein, (...)
2 (...)
Form.
1 Polyclonal antibody supplied as a 200 µl (2 mg/ml) aliquot in PBS
2 (...)
Immunogen
1 Recombinant peptide derived from within residues 300 aa (...)
2 (...)
Specificity. Reactivity.
1 Detects a band of 60 kDa by(...) Reacts against human, rat, ...
2 (...) (...)
Usage.
1 Western blot 1:500-1:2,000 Immunofluorescence
2 (...)
I want to get the text into this tabular format; the input is the pdf_text() output shown above. If you have any suggestions, please let me know.
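Posted on 2018-01-05 15:18:30
I can't post code in a comment, so here is one possible approach using pdftools and regular expressions.
Data
I used the same data you provided and saved it to a PDF named "pdf_catalogue.pdf".
Code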
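library(pdftools)
# pdf_text() returns a character vector with one element per page
u <- pdf_text("pdf_catalogue.pdf")
# Return the text captured by the (...) group in `pattern`, with line breaks removed
get_string <- function(pattern, string){
  inter_list <- regmatches(string, regexec(pattern, string))
  if(length(inter_list) > 0){
    replace_patterns_list <- list("\r", "\n") # add others as required
    replace_patterns <- paste(unlist(replace_patterns_list), collapse = "|")
    # [[1]] is the first page; element 2 is the captured group (NA if there is no match)
    inter_string <- gsub(replace_patterns, "", inter_list[[1]][2])
    return(inter_string)
  }
}
# Each pattern captures whatever sits between one heading and the next
pat_source <- "Source: (.*)General description"
pat_description <- "General description: (.*)Alternative"
pat_form <- "Form: (.*)Immunogen"
pat_names <- "Alternative names: (.*)Form"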
dat <- list(Source = get_string(pat_source, u),
            General_description = get_string(pat_description, u),
            Form = get_string(pat_form, u),
            Alternative_names = get_string(pat_names, u))
The get_string function returns whatever lies between the strings before and after (.*). This relies on the assumption, implied by your question, that the file structure is consistent. By default (.*) is greedy and runs to the last occurrence of the closing string; if that captures too much, use (.*?) for a lazy match that stops at the first occurrence. If you are not familiar with regular expressions, there is a very good video on them.
Output (excerpt; the $Form entry is not shown)
> dat
$Source
[1] "Goat"
$General_description
[1] "Goat polyclonal to ERp57 - endoplasmic reticulum lumen marker.This endoplasmic reticulum protein interacts with lectin chaperones calreticulin andcalnexin to modulate folding of newly synthesized glycoproteins. It has disulfideisomerase activity and complexes of lectins and this protein mediate protein folding bypromoting formation of disulfide bonds in their glycoprotein substrates."
$Alternative_names
[1] "58 kDa glucose regulated protein, 58 kDa microsomal protein,disulfide isomerase ER 60, endoplasmic reticulum resident protein 57, endoplasmicreticulum resident protein 60, ER protein 57, ER protein 60, ER protein 61, ERP57,ERp60, ERp61, glucose regulated protein 58 Kd, GRP57, GRP58, HsT17083, P58,PDIA3, phospholipase C alpha, PI PLC, protein disulfide isomerase A3 antibody."
You may want to split the output further depending on its structure. For example, the entries in Alternative names are all separated by commas. You could try
> strsplit(dat$Alternative_names, ", ")
which gives
[[1]]
[1] "58 kDa glucose regulated protein"
[2] "58 kDa microsomal protein,disulfide isomerase ER 60"
[3] "endoplasmic reticulum resident protein 57"
[4] "endoplasmicreticulum resident protein 60"
[5] "ER protein 57"
[6] "ER protein 60"
[7] "ER protein 61"
[8] "ERP57,ERp60"
[9] "ERp61"
[10] "glucose regulated protein 58 Kd"
[11] "GRP57"
[12] "GRP58"
[13] "HsT17083"
[14] "P58,PDIA3"
[15] "phospholipase C alpha"
[16] "PI PLC"
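[17] "protein disulfide isomerase A3 antibody."
Note that splitting on a comma followed by a space (", ") leaves two names in elements 2, 8 and 14, because the line breaks that followed those commas were removed without leaving a space behind. Split on "," alone to avoid this kind of error. This matters especially for .pdf files, where line breaks in the extracted text can fall anywhere. You can also easily split multi-line text into separate fields by defining the break points appropriately (for example, a period followed by a capital letter). Regular expressions should let you handle all such cases.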
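As a minimal sketch of the two splitting ideas above, the snippet below works on shortened sample strings copied from the output rather than on the real dat list. The variable names (alt_names, desc, names_vec, sentences) are made up for illustration, and the period-plus-capital rule is only a heuristic that would also split after abbreviations followed by a capital letter.

```r
# Shortened sample from the output above, including commas with no space after them
alt_names <- "58 kDa glucose regulated protein, 58 kDa microsomal protein,disulfide isomerase ER 60, ERP57,ERp60"

# Split on a comma plus any whitespace that may follow it
names_vec <- strsplit(alt_names, ",[[:space:]]*")[[1]]
print(names_vec)

# Shortened sample of a description in which a line break was removed after "marker."
desc <- "Goat polyclonal to ERp57 - endoplasmic reticulum lumen marker.This endoplasmic reticulum protein interacts with lectin chaperones. It has disulfide isomerase activity."

# Put a newline at every break point (period, optional whitespace, capital letter),
# then split on that newline
marked <- gsub("\\.[[:space:]]*([A-Z])", ".\n\\1", desc)
sentences <- strsplit(marked, "\n", fixed = TRUE)[[1]]
print(sentences)
```

Inserting a marker with gsub and then splitting on it avoids relying on how strsplit treats zero-length matches.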
This is a very small example, but you can easily build on it to cover the other fields and combinations you may want to pull from the file.
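To illustrate building on it, here is a rough sketch that pulls two more fields and shows why the lazy form (.*?) can matter once a heading word appears more than once. The page string is a hypothetical, shortened stand-in for one element of the pdf_text() result, and extract_field is a made-up helper name; it follows the same idea as get_string but works on a single string and replaces line breaks with a space.

```r
# Hypothetical, shortened page text; the real string would come from pdf_text()
page <- paste0(
  "Catalogue No. AB0003-200\r\n",
  "Source: Goat\r\n",
  "Form: Polyclonal antibody in PBS\r\n",
  "Immunogen: Recombinant peptide\r\n",
  "SICGEN's Proprietary Immunogen Policy\r\n"
)

# Return the first capture group, or NA when the pattern does not match
extract_field <- function(pattern, text) {
  m <- regmatches(text, regexec(pattern, text))[[1]]
  if (length(m) < 2) return(NA_character_)
  trimws(gsub("[\r\n]+", " ", m[2]))
}

# Greedy: runs up to the LAST "Immunogen" in the text
greedy <- extract_field("Form: (.*)Immunogen", page)
# Lazy: stops at the FIRST "Immunogen"
lazy <- extract_field("Form: (.*?)Immunogen", page)
# A field that is a single token does not need a closing heading
catalogue_no <- extract_field("Catalogue No\\. ([A-Z0-9-]+)", page)

print(greedy)
print(lazy)
print(catalogue_no)
```

Which variant is right depends on the layout of the real files, so it is worth checking the captured text on a few of them before settling on the patterns.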
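For multiple files, I suggest wrapping all of this in a function (once you have finalized the code) and using lapply to iterate over the files in the directory. I use a similar approach to process .txt and .csv files.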
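A possible shape for that wrapper is sketched below, under several assumptions: the headings and their order are the same in every file, the fields of interest are on the first page, and the names patterns, parse_sheet and parse_folder are invented for this example. The demonstration runs on two in-memory strings (the second record is made up), so it does not need any PDF files; parse_folder shows where pdftools::pdf_text would come in.

```r
# One pattern per column; lazy matches stop at the next heading
patterns <- c(
  Catalogue_No = "Catalogue No\\. ([A-Z0-9-]+)",
  Source       = "Source: (.*?)General description",
  Form         = "Form: (.*?)Immunogen"
)

# Parse the text of one page into a one-row data frame
parse_sheet <- function(text) {
  fields <- lapply(patterns, function(p) {
    m <- regmatches(text, regexec(p, text))[[1]]
    if (length(m) < 2) NA_character_ else trimws(gsub("[\r\n]+", " ", m[2]))
  })
  as.data.frame(fields, stringsAsFactors = FALSE)
}

# Read the first page of every PDF in a folder and stack the rows
# (assumes the pdftools package is installed)
parse_folder <- function(pdf_dir) {
  files <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE)
  rows <- lapply(files, function(f) parse_sheet(pdftools::pdf_text(f)[1]))
  do.call(rbind, rows)
}

# Demonstration on in-memory text instead of real files
sample_pages <- c(
  "Catalogue No. AB0003-200\r\nSource: Goat\r\nGeneral description: ...\r\nForm: Polyclonal antibody\r\nImmunogen: ...",
  "Catalogue No. AB0004-500\r\nSource: Goat\r\nGeneral description: ...\r\nForm: Polyclonal antibody\r\nImmunogen: ..."
)
catalogue <- do.call(rbind, lapply(sample_pages, parse_sheet))
print(catalogue)

# write.csv(catalogue, "catalogue.csv", row.names = FALSE)  # the CSV opens in Excel
```

Returning one data frame row per file keeps the result close to the table asked for in the question, and a field that fails to match shows up as NA instead of stopping the loop.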
Hope this helps. Cheers!
https://stackoverflow.com/questions/48112885