How do I get microarray data?

Stack Overflow user

Asked 2017-04-12 19:20:56

Answers: 1, Views: 246, Followers: 0, Votes: 0

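Thanks for your help. I want to use the following Python code to read and process data from an Affymetrix microarray dataset. My goal is to identify genes that are differentially expressed between Crohn's disease and ulcerative colitis in mononuclear cells. The code runs, but when I inspect the contents of X I get an empty array (that is, array([], dtype=float64)), which is of course useless. Here is a link to the original dataset: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1615 . I have been trying to work out why the output is empty and cannot be processed, without success. The code is below: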

import gzip
import numpy as np

"""
Read in a SOFT format data file.  The following values can be exported:

GID : A list of gene identifiers of length d
SID : A list of sample identifiers of length n
STP : A list of sample descriptions of length d
X   : A dxn array of gene expression values
"""
#path to the data file
fname = "../data/GDS1615_full.soft.gz"

## Open the data file directly as a gzip file
with gzip.open(fname) as fid:
    SIF = {}
    for line in fid:
        if line.startswith(line, len("!dataset_table_begin")):
            break
        elif line.startswith(line, len("!subject_description")):
            subset_description = line.split("=")[1].strip()
        elif line.startswith(line, len("!subset_sample_id")):
            subset_ids = [x.strip() for x in subset_ids]
            for k in subset_ids:
                SIF[k] = subset_description
    ## Next line is the column headers (sample id's)
    SID = next(fid).split("\t")

    ## The column indices that contain gene expression data
    I = [i for i,x in enumerate(SID) if x.startswith("GSM")]

    ## Restrict the column headers to those that we keep
    SID = [SID[i] for i in I]

    ## Get a list of sample labels
    STP = [SIF[k] for k in SID]

    ## Read the gene expression data as a list of lists, also get the gene
    ## identifiers
    GID,X = [],[]
    for line in fid:

        ## This is what signals the end of the gene expression data
        ## section in the file
        if line.startswith("!dataset_table_end"):
            break

        V = line.split("\t")

        ## Extract the values that correspond to gene expression measures
        ## and convert the strings to numbers
        x = [float(V[i]) for i in I]

        X.append(x)
        GID.append(V[0] + ";" + V[1])
X = np.array(X)

## The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]

## The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]

X.shape
Out[95]: (0,)

Thanks again for any advice.

python-3.x

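1 Answer

Stack Overflow user

Answered 2017-04-13 16:57:29

This works. Compared with the code in the question, the lines read from gzip.open are handled as bytes (b"..." prefixes, with .decode('utf8') wherever text is needed), the metadata key is !subset_description, and the line that builds subset_ids from the !subset_sample_id entry is in place: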

    import gzip
    import numpy as np

    """
    Read in a SOFT format data file.  The following values can be exported:

    GID : A list of gene identifiers of length d
    SID : A list of sample identifiers of length n
    STP : A list of sample descriptions of length n
    X   : A dxn array of gene expression values
    """
    # Path to the data file
    fname = "../data/GDS1615_full.soft.gz"

    ## Open the data file directly as a gzip file.  In Python 3, gzip.open
    ## defaults to binary mode, so every line read below is a bytes object.
    with gzip.open(fname) as fid:
        SIF = {}
        for line in fid:
            if line.startswith(b"!dataset_table_begin"):
                break
            elif line.startswith(b"!subset_description"):
                subset_description = line.decode('utf8').split("=")[1].strip()
            elif line.startswith(b"!subset_sample_id"):
                subset_ids = line.decode('utf8').split("=")[1].split(",")
                subset_ids = [x.strip() for x in subset_ids]
                ## Map each sample id to the most recent subset description
                for k in subset_ids:
                    SIF[k] = subset_description

        ## Next line is the column headers (sample id's)
        SID = next(fid).split(b"\t")

        ## The column indices that contain gene expression data
        I = [i for i,x in enumerate(SID) if x.startswith(b"GSM")]

        ## Restrict the column headers to those that we keep
        SID = [SID[i] for i in I]

        ## Get a list of sample labels
        STP = [SIF[k.decode('utf8')] for k in SID]

        ## Read the gene expression data as a list of lists, also get the gene
        ## identifiers.  This loop must stay inside the with block, because
        ## fid is closed as soon as the block ends.
        GID,X = [],[]
        for line in fid:
            ## This is what signals the end of the gene expression data
            ## section in the file
            if line.startswith(b"!dataset_table_end"):
                break
            V = line.split(b"\t")
            ## Extract the values that correspond to gene expression measures
            ## and convert the strings to numbers
            x = [float(V[i]) for i in I]
            X.append(x)
            GID.append(V[0].decode() + ";" + V[1].decode())

    X = np.array(X)

    ## The indices of samples for the ulcerative colitis group
    UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]

    ## The indices of samples for the Crohn's disease group
    CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]

Result:

X.shape
Out[4]: (22283, 127)
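As a possible alternative to sprinkling b"..." prefixes and .decode() calls through the parser, the file can be opened in text mode with gzip.open(fname, "rt"), so every line is already a str. The following is only a sketch under that assumption: the read_soft helper and the tiny in-memory SOFT fragment are made up for illustration (the values are not taken from GDS1615), and it assumes every data cell is numeric.

```python
import gzip
import io

import numpy as np


def read_soft(fileobj):
    """Parse a GDS SOFT file opened in text mode; return (GID, SID, STP, X).

    Hypothetical helper written for this sketch, not part of any library.
    """
    SIF = {}
    subset_description = None
    for line in fileobj:
        if line.startswith("!dataset_table_begin"):
            break
        elif line.startswith("!subset_description"):
            subset_description = line.split("=", 1)[1].strip()
        elif line.startswith("!subset_sample_id"):
            for k in line.split("=", 1)[1].split(","):
                SIF[k.strip()] = subset_description

    # The first line of the table holds the column headers
    header = next(fileobj).rstrip("\n").split("\t")
    I = [i for i, x in enumerate(header) if x.startswith("GSM")]
    SID = [header[i] for i in I]
    STP = [SIF[k] for k in SID]

    GID, X = [], []
    for line in fileobj:
        if line.startswith("!dataset_table_end"):
            break
        V = line.rstrip("\n").split("\t")
        X.append([float(V[i]) for i in I])
        GID.append(V[0] + ";" + V[1])
    return GID, SID, STP, np.array(X)


# Tiny made-up SOFT fragment, compressed in memory to stand in for the real file
sample = (
    "!subset_description = ulcerative colitis\n"
    "!subset_sample_id = GSM1,GSM2\n"
    "!subset_description = Crohn's disease\n"
    "!subset_sample_id = GSM3\n"
    "!dataset_table_begin\n"
    "ID_REF\tIDENTIFIER\tGSM1\tGSM2\tGSM3\n"
    "1007_s_at\tDDR1\t1.0\t2.0\t3.0\n"
    "1053_at\tRFC2\t4.0\t5.0\t6.0\n"
    "!dataset_table_end\n"
)
compressed = gzip.compress(sample.encode("utf8"))

# "rt" asks gzip.open for decoded text lines instead of bytes
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf8") as fid:
    GID, SID, STP, X = read_soft(fid)

print(X.shape)  # (2, 3)
```

With the real dataset, the same call would take the path instead of the BytesIO object, for example gzip.open("../data/GDS1615_full.soft.gz", "rt").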

Votes: 1
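Once X, UC and CD are populated, the index lists can be used to slice the columns of X by group. The snippet below is a minimal sketch of that step using made-up stand-in values for X and STP (they are not from GDS1615); it only computes a raw per-gene difference of group means, which is a starting point and not a statistical test of differential expression.

```python
import numpy as np

# Hypothetical stand-ins for the X (genes x samples) and STP built by the parser
X = np.array([[1.0, 2.0, 3.0, 5.0],
              [4.0, 4.0, 6.0, 2.0]])
STP = ["ulcerative colitis", "ulcerative colitis",
       "Crohn's disease", "Crohn's disease"]

# Same index lists as in the answer
UC = [i for i, x in enumerate(STP) if x == "ulcerative colitis"]
CD = [i for i, x in enumerate(STP) if x == "Crohn's disease"]

# Per-gene mean expression within each group (average over sample columns)
mean_uc = X[:, UC].mean(axis=1)
mean_cd = X[:, CD].mean(axis=1)

# Raw difference in group means, one value per gene
diff = mean_cd - mean_uc
print(diff)
```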

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/43368128
