
Python regex to omit complex citation styles from text

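Stack Overflow user
Asked on 2019-01-08 16:18:59
Answers: 3 · Views: 98 · Followers: 0 · Votes: 1

I have read the contents of a file into Python and want to strip out all citations, which all follow the same format: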

(Author et al., .............. \nGoogle Scholar) # there could be many '\nGoogle Scholar's within the brackets

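Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses (Gutierrez etal., 2017Gutierrez G.D. Gromada J. Sussel L. Heterogeneity of the pancreatic beta cell.Front. Genet. 2017; 8: 22Crossref\nPubMed\nScopus (11)\nGoogle Scholar, Roscioni etal., 2016Roscioni S.S. Migliorini A. Gegg M. Lickert H. Impact of islet architecture on -cell heterogeneity, plasticity and function.Nat. Rev. Endocrinol. 2016; 12: 695-709Crossref\nPubMed\nScopus (36)\nGoogle Scholar). Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes (Cui etal., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence insitu hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar). We have optimized the standard tissue smFISH protocol (Lyubimova etal., 2013Lyubimova A. Itzkovitz S. Junker J.P. Fan Z.P. Wu X. van Oudenaarden A. Single-molecule mRNA detection and counting in mammalian tissue.Nat. Protoc. 2013; 8: 1743-1758Crossref\nPubMed\nScopus (62)\nGoogle Scholar) by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.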

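Desired output

Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses . Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes . We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.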

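I could not find a single regex that removes all citations in one pass, so I had to do it in two parts:

  1. Find every position where '\nGoogle Scholar)' occurs.
  2. From each position, extend backwards until the corresponding opening bracket appears, then omit the characters between those indices.

My attempt is as follows: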

import re

def remove(test_str):
    regex = re.compile(r'\nGoogle Scholar\)')
    starts = []
    ends = []
    ret = ''
    for m in regex.finditer(test_str):   # find every '\nGoogle Scholar)'
        ends.append(m.end())
    for e in ends:                       # walk back to an opening bracket
        i = e
        while True:
            # '(' followed by a non-digit, so that counts like '(11)' are skipped
            if re.match(r'\(\D+', test_str[i-2:i]):
                starts.append(i-2)
                break
            else:
                i -= 1
    start = test_str[:starts[0]]         # omit all characters in between
    starts = starts[1:]
    end = test_str[ends[-1]:]
    ends = ends[:-1]
    for i, j in zip(starts, ends):
        ret = ret + test_str[j:i]        # text between one citation's end and the next one's start
    return start + ret + end

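However, this strategy fails because the regex I use to find each opening bracket (\(\D+) is not precise enough: citations often contain parentheses of their own.

(Cui etal., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence insitu hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar)

So in this case, the search for the correct opening bracket stops too early...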
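The premature stop can be reproduced on that one citation. The sketch below is only an illustration: the citation is abridged, and the helper names naive_start and balanced_start are hypothetical. The first helper applies the same test as the attempt above; the second counts brackets from right to left, which is one possible way to find the corresponding opening bracket, assuming the brackets inside each citation are balanced.

```python
import re

# Abridged version of the Cui citation, with real newlines before each link label.
citation = (
    "dyes (Cui etal., 2018Cui Y. Orr G. Fluctuation localization imaging-based "
    "fluorescence insitu hybridization (fliFISH) for accurate detection and "
    "counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: "
    "e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar)."
)
marker = "\nGoogle Scholar)"
end = citation.index(marker) + len(marker)


def naive_start(text, end):
    """Walk left until a '(' followed by a non-digit is found (the original test)."""
    i = end
    while i >= 2:
        if re.match(r'\(\D+', text[i - 2:i]):
            return i - 2
        i -= 1
    return -1


def balanced_start(text, end):
    """Walk left counting brackets; return the '(' that brings the depth back to zero."""
    depth = 0
    for i in range(end - 1, -1, -1):
        if text[i] == ')':
            depth += 1
        elif text[i] == '(':
            depth -= 1
            if depth == 0:
                return i
    return -1


print(repr(citation[naive_start(citation, end):][:9]))     # '(fliFISH)'
print(repr(citation[balanced_start(citation, end):][:4]))  # '(Cui'
```

The naive scan skips "(2)" because of the \D test but then stops at "(fliFISH)", which is exactly the failure described above.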

Can anyone recommend a good way to consistently remove all citations?


3 Answers

Stack Overflow user

Accepted answer

Posted on 2019-01-08 16:41:55

Based on the pattern you describe, you can use this regex:

(?s)\(.*?Google Scholar\) ?

and replace each match with an empty string. Here, (?s) is needed so that . also matches newlines.
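The effect of (?s) can be checked on a small sample. This is a minimal sketch with a shortened, made-up citation body; passing flags=re.DOTALL to re.sub is the keyword equivalent of the inline (?s).

```python
import re

# Shortened sample; the citation spans several lines, as in the question.
sample = ("smFISH protocol (Lyubimova etal., 2013Crossref\nPubMed\n"
          "Google Scholar) by increasing")

# Without the flag, '.' stops at the first newline, so nothing matches.
no_flag = re.sub(r'\(.*?Google Scholar\) ?', '', sample)

# With the inline flag, or the equivalent keyword flag, the citation is removed.
inline_flag = re.sub(r'(?s)\(.*?Google Scholar\) ?', '', sample)
keyword_flag = re.sub(r'\(.*?Google Scholar\) ?', '', sample, flags=re.DOTALL)

print(no_flag == sample)  # True
print(inline_flag)        # smFISH protocol by increasing
```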

Check here

Here is a Python code demo:

import re

s = 'Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses (Gutierrez etal., 2017Gutierrez G.D. Gromada J. Sussel L. Heterogeneity of the pancreatic beta cell.Front. Genet. 2017; 8: 22Crossref\nPubMed\nScopus (11)\nGoogle Scholar, Roscioni etal., 2016Roscioni S.S. Migliorini A. Gegg M. Lickert H. Impact of islet architecture on -cell heterogeneity, plasticity and function.Nat. Rev. Endocrinol. 2016; 12: 695-709Crossref\nPubMed\nScopus (36)\nGoogle Scholar). Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes (Cui etal., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence insitu hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar). We have optimized the standard tissue smFISH protocol (Lyubimova etal., 2013Lyubimova A. Itzkovitz S. Junker J.P. Fan Z.P. Wu X. van Oudenaarden A. Single-molecule mRNA detection and counting in mammalian tissue.Nat. Protoc. 2013; 8: 1743-1758Crossref\nPubMed\nScopus (62)\nGoogle Scholar) by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.'

replacedStr = re.sub(r'(?s)\(.*?Google Scholar\) ?','',s)
print(replacedStr)

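It prints the following, as you mentioned in your post.

Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses . Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes . We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.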
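One caveat worth checking before using this pattern on other text: the lazy .*? always starts at the first opening bracket it finds, so an ordinary parenthetical that appears before a citation is removed together with it. A small sketch, using a made-up sentence and a hypothetical "Smith" citation:

```python
import re

pattern = re.compile(r'(?s)\(.*?Google Scholar\) ?')

# Made-up text: "(see above)" is not a citation, "(Smith ...)" is.
text = ("Beta cells (see above) vary in function (Smith etal., 2017Crossref\n"
        "PubMed\nGoogle Scholar). They respond to glucose.")

cleaned = pattern.sub('', text)
print(cleaned)  # Beta cells . They respond to glucose.
```

The sample in the question has no brackets outside its citations, so it is not affected.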

Votes: 1

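Stack Overflow user

Posted on 2019-01-08 16:51:33

I would solve it the following way; it matches what you want and can handle parentheses in the text that are not citations:

  1. Look for the opening \(
  2. Look for repetitions of [^()]+(?:\([^()]+\))?, that is, one or more non-bracket characters, optionally followed by a ( ) pair that encloses one or more non-bracket characters.
  3. Look for the closing \nGoogle Scholar\)
  4. Split on whitespace and join with single spaces to collapse multiple spaces

Code: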

import re
text = 'Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses (Gutierrez etal., 2017Gutierrez G.D. Gromada J. Sussel L. Heterogeneity of the pancreatic beta cell.Front. Genet. 2017; 8: 22Crossref\nPubMed\nScopus (11)\nGoogle Scholar, Roscioni etal., 2016Roscioni S.S. Migliorini A. Gegg M. Lickert H. Impact of islet architecture on -cell heterogeneity, plasticity and function.Nat. Rev. Endocrinol. 2016; 12: 695-709Crossref\nPubMed\nScopus (36)\nGoogle Scholar). Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes (Cui etal., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence insitu hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar). We have optimized the standard tissue smFISH protocol (Lyubimova etal., 2013Lyubimova A. Itzkovitz S. Junker J.P. Fan Z.P. Wu X. van Oudenaarden A. Single-molecule mRNA detection and counting in mammalian tissue.Nat. Protoc. 2013; 8: 1743-1758Crossref\nPubMed\nScopus (62)\nGoogle Scholar) by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.'
fixed_text = ' '.join(re.sub(r'\((?:[^()]+(?:\([^()]+\))?)+\nGoogle Scholar\)', '', text).split())
print(fixed_text)

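Output:

Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses . Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes . We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.

This can be improved by changing the code as follows, which also removes the space before the opening \(; however, the result then no longer matches your desired output (which contains that stray space):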

fixed_text = re.sub(r' ?\((?:[^()]+(?:\([^()]+\))?)+\nGoogle Scholar\)', '', text)

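Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses. Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes. We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.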
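To make the steps easier to follow, the same pattern can be written with re.VERBOSE, one step per line. This is only a sketch: the name CITATION, the sample sentence, and the "Smith" citation are made up for illustration. It also shows that a short parenthetical that is not a citation is left in place.

```python
import re

CITATION = re.compile(r"""
    [ ]?                    # optional space before the citation
    \(                      # step 1: the opening bracket
    (?:
        [^()]+              # step 2: a run of non-bracket characters...
        (?:\([^()]+\))?     # ...optionally followed by one inner bracket pair
    )+
    \nGoogle[ ]Scholar\)    # step 3: the closing marker
""", re.VERBOSE)

# Made-up text: "(see above)" is ordinary prose, "(Smith ...)" is a citation
# that itself contains "(pilot)" and "(3)".
text = ("Beta cells (see above) vary in function (Smith etal., 2017Smith A. "
        "A study (pilot) of islets. 2017; 1: 1Crossref\nPubMed\nScopus (3)\n"
        "Google Scholar). They respond to glucose.")

cleaned = CITATION.sub('', text)
print(cleaned)  # Beta cells (see above) vary in function. They respond to glucose.
```

In verbose mode unescaped whitespace in the pattern is ignored, which is why the literal spaces are written as [ ].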

Votes: 0

Stack Overflow user

Posted on 2019-01-08 16:55:37

import re

if __name__ == '__main__':
    source = """Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses (Gutierrez etal., 2017Gutierrez G.D. Gromada J. Sussel L. Heterogeneity of the pancreatic beta cell.Front. Genet. 2017; 8: 22Crossref\nPubMed\nScopus (11)\nGoogle Scholar, Roscioni etal., 2016Roscioni S.S. Migliorini A. Gegg M. Lickert H. Impact of islet architecture on -cell heterogeneity, plasticity and function.Nat. Rev. Endocrinol. 2016; 12: 695-709Crossref\nPubMed\nScopus (36)\nGoogle Scholar). Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes (Cui etal., 2018Cui Y. Hu D. Markillie L.M. Chrisler W.B. Gaffrey M.J. Ansong C. Sussel L. Orr G. Fluctuation localization imaging-based fluorescence insitu hybridization (fliFISH) for accurate detection and counting of RNA copies in single cells.Nucleic Acids Res. 2018; 46: e7Crossref\nPubMed\nScopus (2)\nGoogle Scholar). We have optimized the standard tissue smFISH protocol (Lyubimova etal., 2013Lyubimova A. Itzkovitz S. Junker J.P. Fan Z.P. Wu X. van Oudenaarden A. Single-molecule mRNA detection and counting in mammalian tissue.Nat. Protoc. 2013; 8: 1743-1758Crossref\nPubMed\nScopus (62)\nGoogle Scholar) by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr."""
    # Raw string avoids invalid-escape warnings; the regex engine reads \n as a newline.
    output = re.sub(r' \(.*? etal\., .*?\nGoogle Scholar\)', '', source, flags=re.DOTALL)

    print(output)

Output

Introduction The endocrine cells in the pancreatic islets of Langerhans secrete insulin and glucagon in response to glucose perturbations to maintain glucose homeostasis. The insulin-secreting beta cells exhibit morphological, functional, and molecular variations, suggesting that they may consist of sub-populations with specialized tasks and physiological responses. Features of beta cell heterogeneity include glucose responsiveness and secretory activity ..... Visualizing transcripts in the pancreas, however, has been infeasible without the use of specialized techniques such as photoswitchable dyes. We have optimized the standard tissue smFISH protocol by substantially increasing the period of mRNA denaturation, which precedes the probe hybridization steps, from 5min to at least 3hr.
Votes: 0

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/54095734
