
Extracting multiple abstracts from a list of PubMed IDs using Biopython

Stack Overflow user
Asked on 2022-07-15 22:41:35
1 answer · 105 views · 0 followers · 3 votes

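I am trying to extract the abstracts of 60K articles from PubMed using their PubMed IDs, and to export those abstracts into a dictionary. I think there is a problem with the code I am using, especially in how the PubMed IDs are parsed. Please help me correct the code and let me know where the error is.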

from Bio import Entrez
import sys

Entrez.email = 'anonymous@gmail.com'

abstract_dict = {}
without_abstract = []

pub_ids = sys.argv[1]
f = open(pub_ids, "r")
for i in f:
    handle = Entrez.efetch(db="pubmed", id=','.join(map(str, i)),
                        rettype="xml", retmode="text")
    records = Entrez.read(handle)
    abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
        if 'Abstract' in pubmed_article['MedlineCitation']['Article'].keys()
        else pubmed_article['MedlineCitation']['Article']['ArticleTitle']
            for pubmed_article in records['PubmedArticle']]
    abstract_dict = dict(zip(i, abstracts))
print(abstract_dict)

Some example PubMed IDs:

17284678
15531828
11791095
10708056

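The result I get is either only a few lines of abstracts or an empty dictionary. Is it possible to export the results from the dictionary to a tab-delimited text file?

Any suggestions would be greatly appreciated.

Thanks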


1 Answer

Stack Overflow user

Accepted answer

Posted on 2022-07-20 08:31:37

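Note that Entrez.efetch only returns 1,000 records. Since you said you want to download 60K abstracts, I have modified your code to download the abstracts in batches.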
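As for the ID parsing in the question: iterating over the open file yields one line at a time as a string, so ','.join(map(str, i)) joins the individual characters of that line rather than a list of IDs. In addition, abstract_dict = dict(zip(i, abstracts)) rebinds the dictionary on every pass and keys it by those characters, so earlier results are lost. A minimal sketch of the difference, using the sample IDs from the question (the names broken_id, good_id and lines are only for illustration):

```python
# One line as yielded by iterating over the open file (note the trailing newline)
line = "17284678\n"

# What the question's loop builds: join() iterates over the characters of the string
broken_id = ",".join(map(str, line))

# Stripping the newline and collecting whole lines gives the intended ID list
lines = ["17284678\n", "15531828\n", "11791095\n", "10708056\n"]
pub_ids = [raw.strip() for raw in lines if raw.strip()]
good_id = ",".join(pub_ids)

print(repr(broken_id))  # the digits of one ID, comma-separated, plus the newline
print(good_id)          # whole IDs, comma-separated
```

The batched script that follows reads the whole file into a list of IDs first, which avoids this problem.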

from Bio import Entrez
import sys
import csv

Entrez.email = 'anonymous@gmail.com'


def fetch_abstracts(pub_ids, retmax=1000, output_file='abstracts.csv'):
    # Open the output once in write mode so a rerun does not append a second header
    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['pub_id', 'abstract']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter='\t')
        writer.writeheader()

        # Make sure requests to NCBI are not too big
        for i in range(0, len(pub_ids), retmax):
            # Slicing clamps at the end of the list, so no bounds check is needed
            batch = pub_ids[i:i + retmax]

            print(f"Fetching abstracts from {i} to {i + len(batch)}.")
            handle = Entrez.efetch(db="pubmed", id=','.join(batch),
                                   retmode="xml", retmax=retmax)
            records = Entrez.read(handle)
            handle.close()

            for pubmed_article in records['PubmedArticle']:
                citation = pubmed_article['MedlineCitation']
                article = citation['Article']
                # First abstract section if present, otherwise fall back to the title
                if 'Abstract' in article:
                    text = article['Abstract']['AbstractText'][0]
                else:
                    text = article['ArticleTitle']
                # Take the PMID from the record itself: zipping against the request
                # list misaligns rows whenever an ID returns no PubmedArticle
                writer.writerow({'pub_id': str(citation['PMID']), 'abstract': text})


if __name__ == '__main__':
    filename = sys.argv[1]
    with open(filename, "r") as f:
        # Skip blank lines so they are not sent to NCBI as empty IDs
        pub_ids = [line.strip() for line in f if line.strip()]
    fetch_abstracts(pub_ids)

If you run it like this:

python stack73000220.py pubids.txt

where pubids.txt looks like:

17284678
15531828
11791095
10708056

then you will get the following output in abstracts.csv:

pub_id  abstract
17284678    Eimeria tenella is an intracellular protozoan parasite that infects the intestinal tracts of domestic fowl and causes coccidiosis, a serious and sometimes lethal enteritis. Eimeria falls in the same phylum (Apicomplexa) as several human and animal parasites such as Cryptosporidium, Toxoplasma, and the malaria parasite, Plasmodium. Here we report the sequencing and analysis of the first chromosome of E. tenella, a chromosome believed to carry loci associated with drug resistance and known to differ between virulent and attenuated strains of the parasite. The chromosome--which appears to be representative of the genome--is gene-dense and rich in simple-sequence repeats, many of which appear to give rise to repetitive amino acid tracts in the predicted proteins. Most striking is the segmentation of the chromosome into repeat-rich regions peppered with transposon-like elements and telomere-like repeats, alternating with repeat-free regions. Predicted genes differ in character between the two types of segment, and the repeat-rich regions appear to be associated with strain-to-strain variation.
15531828    To study the occurrence of nosocomial diarrhea in pediatric wards and the role of infections in its causation.
11791095    Based on single case reports, parvovirus B19 (B19) has repeatedly been proposed as an etiologic agent in patients with Henoch-Schönlein purpura (HSP), perhaps causing vasculitis by direct invasion of vascular endothelial cells because of the tissue distribution of the cellular B19 receptor. A cohort of children with HSP and other vasculitic diseases was investigated and compared with healthy control children to assess the role of B19 as well as parvovirus V9 (a putative emerging B19-like virus).
10708056    The effects of chemokine and chemokine receptor genetic polymorphisms such as stromal derived factor 1 (SDF1-3'A), CCR2-64I, and CCR5-delta32 associated with HIV-1 transmission and/or rate of disease progression in infected study subjects remain highly controversial and have been analyzed primarily only in adults. We have investigated whether these polymorphisms may provide similar beneficial effects in children exposed to HIV-1 perinatally. The prevalence of CCR2-64I allele was significantly increased (p = .03) and the CCR2-64I genotype distribution was not in Hardy-Weinberg equilibrium, among HIV-1-exposed uninfected infants. Moreover, in the HIV-1-infected group, a delay to AIDS progression was observed among carriers of CCR2-64I allele. This is the first report that suggests a protective role of CCR2-64I allele in mother-to-infant HIV-1 transmission and documents a delay in disease progression, after the child has been infected with HIV-1. However, SDFI-3'A and CCR5-delta32 alleles did not modify the rate of HIV-1 transmission or disease progression in HIV-1-infected children.
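
The row for 15531828 is short, which appears to be because AbstractText[0] keeps only the first element of the abstract, and structured abstracts are split into several sections. If the full text is wanted, the sections can be joined. The following is a sketch only: full_abstract is a hypothetical helper, and it assumes the parsed record layout used above, with section labels exposed through an attributes dictionary on each element. The sample data is made up so the block runs without contacting NCBI.

```python
def full_abstract(article):
    """Join every AbstractText section of a parsed Article, or fall back to the title.

    `article` is assumed to be record['MedlineCitation']['Article'] as returned
    by Entrez.read; plain dicts and strings behave the same way for illustration.
    """
    if 'Abstract' not in article:
        return str(article['ArticleTitle'])
    parts = []
    for section in article['Abstract']['AbstractText']:
        # Assumption: labelled sections carry a 'Label' attribute (e.g. OBJECTIVE)
        label = getattr(section, 'attributes', {}).get('Label')
        parts.append(f"{label}: {section}" if label else str(section))
    return ' '.join(parts)


# Stand-in data shaped like a parsed record (made up, not real PubMed content)
structured = {'Abstract': {'AbstractText': ['First section.', 'Second section.']},
              'ArticleTitle': 'A title'}
no_abstract = {'ArticleTitle': 'Only a title'}

print(full_abstract(structured))
print(full_abstract(no_abstract))
```

In the script, the result of full_abstract(pubmed_article['MedlineCitation']['Article']) could then be written to the abstract column in place of the first-section lookup.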
Votes: 1
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/73000220
