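Thanks in advance for your help. I am trying to use the following Python code to read and process data from an Affymetrix microarray dataset. I want to study differential gene expression between Crohn's disease and ulcerative colitis in mononuclear cells. The code runs without errors, but when I inspect the contents of X, I get an empty array (i.e. array([], dtype=float64)), which is of course useless. Here is a link to the original dataset: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1615 I have been trying to work out why I get an empty, unusable output, but without success. The code is below: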
import gzip
import numpy as np

"""
Read in a SOFT format data file. The following values can be exported:

GID : A list of gene identifiers of length d
SID : A list of sample identifiers of length n
STP : A list of sample descriptions of length d
X : A dxn array of gene expression values
"""

#path to the data file
fname = "../data/GDS1615_full.soft.gz"

## Open the data file directly as a gzip file
with gzip.open(fname) as fid:

    SIF = {}
    for line in fid:
        if line.startswith(line, len("!dataset_table_begin")):
            break
        elif line.startswith(line, len("!subject_description")):
            subset_description = line.split("=")[1].strip()
        elif line.startswith(line, len("!subset_sample_id")):
            subset_ids = [x.strip() for x in subset_ids]
            for k in subset_ids:
                SIF[k] = subset_description

    ## Next line is the column headers (sample id's)
    SID = next(fid).split("\t")

    ## The column indices that contain gene expression data
    I = [i for i,x in enumerate(SID) if x.startswith("GSM")]

    ## Restrict the column headers to those that we keep
    SID = [SID[i] for i in I]

    ## Get a list of sample labels
    STP = [SIF[k] for k in SID]

    ## Read the gene expression data as a list of lists, also get the gene
    ## identifiers
    GID,X = [],[]
    for line in fid:

        ## This is what signals the end of the gene expression data
        ## section in the file
        if line.startswith("!dataset_table_end"):
            break

        V = line.split("\t")

        ## Extract the values that correspond to gene expression measures
        ## and convert the strings to numbers
        x = [float(V[i]) for i in I]

        X.append(x)
        GID.append(V[0] + ";" + V[1])

X = np.array(X)

## The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]

## The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]

X.shape
Out[95]: (0,)
Thanks again for your advice.
Posted on 2017-04-13 16:57:29
This works well:
import gzip
import numpy as np

"""
Read in a SOFT format data file. The following values can be exported:

GID : A list of gene identifiers of length d
SID : A list of sample identifiers of length n
STP : A list of sample descriptions of length n
X : A dxn array of gene expression values
"""

#path to the data file
fname = "../data/GDS1615_full.soft.gz"

## Open the data file directly as a gzip file
## (the default mode is binary, so every line is a bytes object)
with gzip.open(fname) as fid:

    SIF = {}
    for line in fid:
        if line.startswith(b"!dataset_table_begin"):
            break
        elif line.startswith(b"!subset_description"):
            subset_description = line.decode('utf8').split("=")[1].strip()
        elif line.startswith(b"!subset_sample_id"):
            subset_ids = line.decode('utf8').split("=")[1].split(",")
            subset_ids = [x.strip() for x in subset_ids]
            for k in subset_ids:
                SIF[k] = subset_description

    ## Next line is the column headers (sample id's)
    SID = next(fid).split(b"\t")

    ## The column indices that contain gene expression data
    I = [i for i,x in enumerate(SID) if x.startswith(b"GSM")]

    ## Restrict the column headers to those that we keep
    SID = [SID[i] for i in I]

    ## Get a list of sample labels
    STP = [SIF[k.decode('utf8')] for k in SID]

    ## Read the gene expression data as a list of lists, also get the gene
    ## identifiers
    GID,X = [],[]
    for line in fid:

        ## This is what signals the end of the gene expression data
        ## section in the file
        if line.startswith(b"!dataset_table_end"):
            break

        V = line.split(b"\t")

        ## Extract the values that correspond to gene expression measures
        ## and convert the strings to numbers
        x = [float(V[i]) for i in I]

        X.append(x)
        GID.append(V[0].decode() + ";" + V[1].decode())

X = np.array(X)

## The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]

## The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]

Result:

X.shape
Out[4]: (22283, 127)
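Two points may help explain why this version behaves differently. In Python 3, gzip.open defaults to binary mode and yields bytes lines, which is why the code above uses b"..." prefixes and .decode(). Separately, a two-argument call such as line.startswith(line, n) tests whether line[n:] begins with the whole line, which cannot be true for a non-empty line when n is at least 1, so a branch written that way never matches. As an alternative, the file can be opened in text mode ("rt") so that every line is already a str. The sketch below assumes that approach; the helper name read_soft and the tiny synthetic file are hypothetical, invented here for illustration, and are not part of the original post or of the real GDS1615 data.

```python
import gzip
import os
import tempfile


def read_soft(fname):
    """Parse a GDS SOFT file opened in text mode (hypothetical helper)."""
    SIF = {}
    with gzip.open(fname, "rt", encoding="utf8") as fid:
        # Header section: map each sample id to its subset description
        for line in fid:
            if line.startswith("!dataset_table_begin"):
                break
            elif line.startswith("!subset_description"):
                subset_description = line.split("=")[1].strip()
            elif line.startswith("!subset_sample_id"):
                subset_ids = [x.strip() for x in line.split("=")[1].split(",")]
                for k in subset_ids:
                    SIF[k] = subset_description

        # Column header row: keep only the sample (GSM) columns
        SID = next(fid).rstrip("\n").split("\t")
        I = [i for i, x in enumerate(SID) if x.startswith("GSM")]
        SID = [SID[i] for i in I]
        STP = [SIF[k] for k in SID]

        # Data rows, up to the end-of-table marker
        GID, X = [], []
        for line in fid:
            if line.startswith("!dataset_table_end"):
                break
            V = line.rstrip("\n").split("\t")
            X.append([float(V[i]) for i in I])
            GID.append(V[0] + ";" + V[1])
    return GID, SID, STP, X


# The two-argument form never matches; the one-argument form does.
line = "!dataset_table_begin\n"
print(line.startswith(line, len("!dataset_table_begin")))  # False
print(line.startswith("!dataset_table_begin"))             # True

# Tiny synthetic SOFT-like file (made-up values, for illustration only)
demo = (
    "!subset_description = ulcerative colitis\n"
    "!subset_sample_id = GSM1,GSM2\n"
    "!subset_description = Crohn's disease\n"
    "!subset_sample_id = GSM3\n"
    "!dataset_table_begin\n"
    "ID_REF\tIDENTIFIER\tGSM1\tGSM2\tGSM3\n"
    "1007_s_at\tDDR1\t1.0\t2.0\t3.0\n"
    "1053_at\tRFC2\t4.0\t5.0\t6.0\n"
    "!dataset_table_end\n"
)
path = os.path.join(tempfile.mkdtemp(), "demo.soft.gz")
with gzip.open(path, "wt", encoding="utf8") as f:
    f.write(demo)

GID, SID, STP, X = read_soft(path)
print(len(X), len(X[0]))  # 2 3
```

The sketch returns X as a list of lists only to stay dependency-free; wrapping it in np.array(X) would give the d x n array described in the post.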
https://stackoverflow.com/questions/43368128