在过去的几周里,我一直试图使用Scikit Allel库将基因组数据集加载到Neo4j中。我已经设法将外显体的所有变体加载到VCF文件中,并使用相关的表型数据加载所有主题,现在我只是尝试创建变体和主题之间的关系。我对python不是很有经验,我认为这个问题不需要对基因组学或Scikit-Allel库有很好的理解,所以不要被它吓跑了。
下面的代码可以工作,但速度非常慢。我认为为每个主题创建一个数据帧或列表集可能是一种更好的方法,而不是为j for循环中的每个元素创建一个图形事务,但是任何关于如何最好地做到这一点的建议都会受到欢迎。我也考虑过将Numba包含在其中,但不知道从哪里开始。
这个过程是每小时在~1750个对象和23个染色体中每条染色体创建~1个新主题,因此当前设置将需要很长时间才能工作。
import pandas as pd
import csv
import math
import allel
import zarr
from py2neo import Graph, Node, Relationship, NodeMatcher
zarr_path = '.../chroms.zarr'
callset = zarr.open_group(zarr_path, mode='r')
samples = callset[chrom]['samples']
graph = Graph(user="neo4j", password="password")
chrom_list = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,'X']
for chrom in chrom_list:
variants = allel.VariantChunkedTable(callset[chrom]['variants'], names=['AC','AF_AFR', 'AF_AMR', 'AF_ASN', 'AF_EUR', 'AF_MAX', 'CGT', 'CLR', 'CSQ', 'DP', 'DP4', 'ESP_MAF', 'FILTER_LowQual', 'FILTER_MinHWE', 'FILTER_MinVQSLOD', 'FILTER_PASS', 'HWE', 'ICF', 'ID', 'IS', 'PC2', 'PCHI2', 'POS', 'PR', 'QCHI2', 'QUAL', 'REF', 'ALT', 'INDEL', 'SHAPEIT', 'SNP_ID', 'TYPE', 'UGT', 'VQSLOD', 'dbSNPmismatch', 'is_snp', 'numalt'], index='POS')
pos = variants['POS'][:]
SNPid = variants['ID'][:]
ref = variants['REF'][:]
alt = variants['ALT'][:]
dp = variants['DP'][:]
ac = variants['AC'][:]
vartype = variants['TYPE'][:]
qual = variants['QUAL'][:]
vq = variants['VQSLOD'][:]
numalt = variants['numalt'][:]
csq = variants['CSQ'][:]
vcfv = 'VCFv4.1'
refv = 'file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta'
dpz = callset[chrom]['calldata/DP']
psz = callset[chrom]['calldata/PS']
plz = callset[chrom]['calldata/PL']
gpz = callset[chrom]['calldata/GP']
calldata = callset[chrom]['calldata']
gt = allel.GenotypeDaskArray(calldata['GT'])
hap = gt.to_haplotypes()
hap1 = hap[:, ::2]
hap2 = hap[:, 1::2]
i = 0
for i in range(len(samples)):
subject = samples[i]
subject_node = matcher.match("Subject", subject_id= subject)
if subject_node.first() is None:
continue
seq_tech = 'Illumina HiSeq 2000 (ILLUMINA)'
dp = dpz[:, i]
ps = psz[:, i]
pl = plz[:, i]
gp = gpz[:, i]
list_h1 = hap1[:, i].compute()
list_h2 = hap2[:, i].compute()
chrom_label = "Chromosome_" + str(chrom)
j = 0
for j in range(len(pos)):
h1 = int(list_h1[j])
h2 = int(list_h2[j])
read_depth = int(dp[j])
ps1 = int(ps[j])
PL0 = int(pl[j][0])
PL1 = int(pl[j][1])
PL2 = int(pl[j][2])
genotype = str(h1) + '|' + str(h2)
GP0 = float(gp[j][0])
GP1 = float(gp[j][1])
GP2 = float(gp[j][2])
k = int(pos[j])
l = str(ref[j])
m = str(alt[j][h1-1])
o = str(alt[j][h2-1])
if h1 == 0 and h2 == 0:
a1 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
r2 = Relationship(subject_node.first(), "Homozygous", a1.first(), HTA=h1, HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
graph.create(r2)
elif h1 == 0 and h2 > 0:
a1 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
graph.create(r2)
a2 = matcher.match(chrom_label, "Allele", pos= k, bp = o)
r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
graph.create(r3)
elif h1 > 0 and h2 == 0:
a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
graph.create(r2)
a2 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
graph.create(r3)
elif h1 == h2 and h1 > 0:
a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
r2 = Relationship(subject_node.first(), "Homozygous", a1.first(), HTA = h1, HTB = h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
graph.create(r2)
else:
a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
graph.create(r2)
a2 = matcher.match(chrom_label, "Allele", pos= k, bp = o)
r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
graph.create(r3)
print("Subject " + subject + " completed.")
print(chrom_label + "completed.")任何帮助都是非常感谢的!
发布于 2020-02-05 15:36:13
单独创建大量元素总是很慢,这很大程度上是因为所需的网络跳数。您还可以在两者之间进行匹配,这将进一步增加时间。
解决这类问题的最好方法是查看批处理,无论是读还是写。虽然您也不能一次完成所有操作,但将您的操作批处理为一次至少几百个操作将具有显著的效果。在这种情况下,您可能需要执行批量读取,然后执行批量写入,依此类推。
因此,具体地说,希望对多个实体执行匹配(您可以为此使用"in“修饰符,或者您可能需要使用原始Cypher)。对于写入,使用相关节点和关系在本地构建一个子图,并在单个调用中创建该子图。
您的最佳批处理大小只能通过实验来发现,因此您可能不会第一次得到正确的结果。但是批处理绝对是这里的关键。
https://stackoverflow.com/questions/60058437
复制相似问题