文章/答案/技术大牛

发布

社区首页 >问答首页 >通过python (Py2neo)将大型数据集转录为Neo4j

问通过python (Py2neo)将大型数据集转录为Neo4j
EN

Stack Overflow用户

提问于 2020-02-04 21:25:42

回答 1查看 72关注 0票数 0

在过去的几周里，我一直试图使用Scikit Allel库将基因组数据集加载到Neo4j中。我已经设法将外显体的所有变体加载到VCF文件中，并使用相关的表型数据加载所有主题，现在我只是尝试创建变体和主题之间的关系。我对python不是很有经验，我认为这个问题不需要对基因组学或Scikit-Allel库有很好的理解，所以不要被它吓跑了。

下面的代码可以工作，但速度非常慢。我认为为每个主题创建一个数据帧或列表集可能是一种更好的方法，而不是为j for循环中的每个元素创建一个图形事务，但是任何关于如何最好地做到这一点的建议都会受到欢迎。我也考虑过将Numba包含在其中，但不知道从哪里开始。

这个过程是每小时在~1750个对象和23个染色体中每条染色体创建~1个新主题，因此当前设置将需要很长时间才能工作。

import pandas as pd
import csv
import math
import allel
import zarr
from py2neo import Graph, Node, Relationship, NodeMatcher

zarr_path = '.../chroms.zarr'

callset = zarr.open_group(zarr_path, mode='r')

samples = callset[chrom]['samples']

graph = Graph(user="neo4j", password="password")

chrom_list = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,'X']

for chrom in chrom_list:


    variants = allel.VariantChunkedTable(callset[chrom]['variants'], names=['AC','AF_AFR', 'AF_AMR', 'AF_ASN', 'AF_EUR', 'AF_MAX', 'CGT', 'CLR', 'CSQ', 'DP', 'DP4', 'ESP_MAF', 'FILTER_LowQual', 'FILTER_MinHWE', 'FILTER_MinVQSLOD', 'FILTER_PASS', 'HWE', 'ICF', 'ID', 'IS', 'PC2', 'PCHI2', 'POS', 'PR', 'QCHI2', 'QUAL', 'REF', 'ALT', 'INDEL', 'SHAPEIT', 'SNP_ID', 'TYPE', 'UGT', 'VQSLOD', 'dbSNPmismatch', 'is_snp', 'numalt'], index='POS')
    pos = variants['POS'][:]
    SNPid = variants['ID'][:]
    ref = variants['REF'][:]
    alt = variants['ALT'][:]
    dp = variants['DP'][:]
    ac = variants['AC'][:]
    vartype = variants['TYPE'][:]
    qual = variants['QUAL'][:]
    vq = variants['VQSLOD'][:]
    numalt = variants['numalt'][:]
    csq = variants['CSQ'][:]
    vcfv = 'VCFv4.1'
    refv = 'file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta'
    dpz = callset[chrom]['calldata/DP']
    psz = callset[chrom]['calldata/PS']
    plz = callset[chrom]['calldata/PL']
    gpz = callset[chrom]['calldata/GP']
    calldata = callset[chrom]['calldata']
    gt = allel.GenotypeDaskArray(calldata['GT'])
    hap = gt.to_haplotypes()
    hap1 = hap[:, ::2]
    hap2 = hap[:, 1::2]
    i = 0
    for i in range(len(samples)):
        subject = samples[i]
        subject_node = matcher.match("Subject", subject_id= subject)
        if subject_node.first() is None:
            continue
        seq_tech = 'Illumina HiSeq 2000 (ILLUMINA)'
        dp = dpz[:, i]
        ps = psz[:, i]
        pl = plz[:, i]
        gp = gpz[:, i]
        list_h1 = hap1[:, i].compute()
        list_h2 = hap2[:, i].compute()
        chrom_label = "Chromosome_" + str(chrom)
        j = 0
        for j in range(len(pos)):  
            h1 = int(list_h1[j])
            h2 = int(list_h2[j])
            read_depth = int(dp[j])
            ps1 = int(ps[j])
            PL0 = int(pl[j][0])
            PL1 = int(pl[j][1])
            PL2 = int(pl[j][2])
            genotype = str(h1) + '|' + str(h2)
            GP0 = float(gp[j][0])
            GP1 = float(gp[j][1])
            GP2 = float(gp[j][2])
            k = int(pos[j])
            l = str(ref[j])
            m = str(alt[j][h1-1])
            o = str(alt[j][h2-1])

            if h1 == 0 and h2 == 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
                r2 = Relationship(subject_node.first(), "Homozygous", a1.first(), HTA=h1, HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)

            elif h1 == 0 and h2 > 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
                r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)
                a2 = matcher.match(chrom_label, "Allele", pos= k, bp = o)
                r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r3)

            elif h1 > 0 and h2 == 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
                r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)
                a2 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
                r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r3)

            elif h1 == h2 and h1 > 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
                r2 = Relationship(subject_node.first(), "Homozygous", a1.first(), HTA = h1, HTB = h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)

            else:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
                r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)

                a2 = matcher.match(chrom_label, "Allele", pos= k, bp = o)
                r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r3)
        print("Subject " + subject + " completed.")
print(chrom_label + "completed.")

任何帮助都是非常感谢的！

neo4j

py2neo

genome

python

performance

回答 1

Stack Overflow用户

发布于 2020-02-05 15:36:13

单独创建大量元素总是很慢，这很大程度上是因为所需的网络跳数。您还可以在两者之间进行匹配，这将进一步增加时间。

解决这类问题的最好方法是查看批处理，无论是读还是写。虽然您也不能一次完成所有操作，但将您的操作批处理为一次至少几百个操作将具有显著的效果。在这种情况下，您可能需要执行批量读取，然后执行批量写入，依此类推。

因此，具体地说，希望对多个实体执行匹配(您可以为此使用"in“修饰符，或者您可能需要使用原始Cypher)。对于写入，使用相关节点和关系在本地构建一个子图，并在单个调用中创建该子图。

您的最佳批处理大小只能通过实验来发现，因此您可能不会第一次得到正确的结果。但是批处理绝对是这里的关键。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/60058437

复制

相似问题

问通过python (Py2neo)将大型数据集转录为Neo4j
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问通过python (Py2neo)将大型数据集转录为Neo4jEN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问通过python (Py2neo)将大型数据集转录为Neo4j
EN