How do I get microarray data?

Stack Overflow user
Asked 2017-04-12 19:20:56

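Thanks for your help. I want to use the following Python code to read and process data from an Affymetrix microarray dataset. My goal is to elucidate differential gene expression between Crohn's disease and ulcerative colitis in mononuclear cells. The code runs without errors, but when I inspect the contents of X, I get an empty array (i.e. array([], dtype=float64)), which is of course useless. Here is a link to the original dataset: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1615 I have been trying to work out why the output is empty and unusable, but without success. The code is as follows: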

import gzip
import numpy as np

"""
Read in a SOFT format data file.  The following values can be exported:

GID : A list of gene identifiers of length d
SID : A list of sample identifiers of length n
STP : A list of sample descriptions of length d
X   : A dxn array of gene expression values
"""
#path to the data file
fname = "../data/GDS1615_full.soft.gz"

## Open the data file directly as a gzip file
with gzip.open(fname) as fid:
    SIF = {}
    for line in fid:
        if line.startswith(line, len("!dataset_table_begin")):
            break
        elif line.startswith(line, len("!subject_description")):
            subset_description = line.split("=")[1].strip()
        elif line.startswith(line, len("!subset_sample_id")):
            subset_ids = [x.strip() for x in subset_ids]
            for k in subset_ids:
                SIF[k] = subset_description
    ## Next line is the column headers (sample id's)
    SID = next(fid).split("\t")

    ## The column indices that contain gene expression data
    I = [i for i,x in enumerate(SID) if x.startswith("GSM")]

    ## Restrict the column headers to those that we keep
    SID = [SID[i] for i in I]

    ## Get a list of sample labels
    STP = [SIF[k] for k in SID]

    ## Read the gene expression data as a list of lists, also get the gene
    ## identifiers
    GID,X = [],[]
    for line in fid:

        ## This is what signals the end of the gene expression data
        ## section in the file
        if line.startswith("!dataset_table_end"):
            break

        V = line.split("\t")

        ## Extract the values that correspond to gene expression measures
        ## and convert the strings to numbers
        x = [float(V[i]) for i in I]

        X.append(x)
        GID.append(V[0] + ";" + V[1])
X = np.array(X)

## The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]

## The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]

X.shape
Out[95]: (0,)

Thanks again for any advice.


1 Answer

Stack Overflow user

Answered 2017-04-13 16:57:29

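This works well. `gzip.open` reads in binary mode by default, so each line is a bytes object: the prefixes are compared as bytes literals and the values are decoded wherever a string is needed.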

import gzip
import numpy as np

"""
Read in a SOFT format data file.  The following values can be exported:

GID : A list of gene identifiers of length d
SID : A list of sample identifiers of length n
STP : A list of sample descriptions of length n
X   : A dxn array of gene expression values
"""
# Path to the data file
fname = "../data/GDS1615_full.soft.gz"

## Open the data file directly as a gzip file (binary mode, so lines are bytes)
with gzip.open(fname) as fid:
    ## Map each sample id to the description of the subset it belongs to
    SIF = {}
    for line in fid:
        if line.startswith(b"!dataset_table_begin"):
            break
        elif line.startswith(b"!subset_description"):
            subset_description = line.decode('utf8').split("=")[1].strip()
        elif line.startswith(b"!subset_sample_id"):
            subset_ids = line.decode('utf8').split("=")[1].split(",")
            subset_ids = [x.strip() for x in subset_ids]
            for k in subset_ids:
                SIF[k] = subset_description

    ## Next line is the column headers (sample id's)
    SID = next(fid).split(b"\t")

    ## The column indices that contain gene expression data
    I = [i for i,x in enumerate(SID) if x.startswith(b"GSM")]

    ## Restrict the column headers to those that we keep
    SID = [SID[i] for i in I]

    ## Get a list of sample labels
    STP = [SIF[k.decode('utf8')] for k in SID]

    ## Read the gene expression data as a list of lists, also get the gene
    ## identifiers. This loop must stay inside the with block, because the
    ## file is closed as soon as the block exits.
    GID,X = [],[]
    for line in fid:

        ## This is what signals the end of the gene expression data
        ## section in the file
        if line.startswith(b"!dataset_table_end"):
            break

        V = line.split(b"\t")

        ## Extract the values that correspond to gene expression measures
        ## and convert the strings to numbers
        x = [float(V[i]) for i in I]

        X.append(x)
        GID.append(V[0].decode() + ";" + V[1].decode())

X = np.array(X)

## The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]

## The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]

Result:

X.shape
Out[4]: (22283, 127)
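
For readers wondering why the version in the question never found the table, the sketch below illustrates two things that appear to be involved: the second argument of `startswith` is a start offset rather than a prefix, and `gzip.open` yields bytes unless a text mode is requested. It also shows a possible alternative, opening the file in text mode (`"rt"`) so that no `decode` calls are needed. Everything here is an assumption made for illustration: the `read_soft` helper, the `toy.soft.gz` file name and the tiny SOFT-like sample are invented and are not GDS1615 data. To keep the sketch standard-library only, `X` is left as a list of lists; wrapping it in `np.array(X)` as in the answer would give the array.

```python
import gzip
import os
import tempfile

# Tiny synthetic SOFT-like content (made-up values, NOT taken from GDS1615).
SAMPLE = "\n".join([
    "!subset_description = ulcerative colitis",
    "!subset_sample_id = GSM1,GSM2",
    "!subset_description = Crohn's disease",
    "!subset_sample_id = GSM3",
    "!dataset_table_begin",
    "ID_REF\tIDENTIFIER\tGSM1\tGSM2\tGSM3",
    "p1\tgeneA\t1.0\t2.0\t3.0",
    "p2\tgeneB\t4.0\t5.0\t6.0",
    "!dataset_table_end",
]) + "\n"


def read_soft(path):
    """Hypothetical helper: parse a gzipped SOFT file opened in text mode."""
    SIF = {}
    with gzip.open(path, "rt", encoding="utf8") as fid:
        # Header section: remember which subset each sample id belongs to.
        for line in fid:
            if line.startswith("!dataset_table_begin"):
                break
            elif line.startswith("!subset_description"):
                subset_description = line.split("=")[1].strip()
            elif line.startswith("!subset_sample_id"):
                for k in line.split("=")[1].split(","):
                    SIF[k.strip()] = subset_description
        # Column headers, restricted to the expression columns (GSM...).
        SID = next(fid).rstrip("\n").split("\t")
        I = [i for i, x in enumerate(SID) if x.startswith("GSM")]
        SID = [SID[i] for i in I]
        STP = [SIF[k] for k in SID]
        # Data rows, read while the file is still open.
        GID, X = [], []
        for line in fid:
            if line.startswith("!dataset_table_end"):
                break
            V = line.rstrip("\n").split("\t")
            X.append([float(V[i]) for i in I])
            GID.append(V[0] + ";" + V[1])
    return GID, SID, STP, X


# startswith(prefix, start): the second argument is an offset into the string.
line = "!dataset_table_begin\n"
misuse = line.startswith(line, len("!dataset_table_begin"))
correct = line.startswith("!dataset_table_begin")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.soft.gz")
    with gzip.open(path, "wt", encoding="utf8") as out:
        out.write(SAMPLE)

    # Default mode is binary, so each line comes back as bytes.
    with gzip.open(path) as fid:
        first = next(fid)

    GID, SID, STP, X = read_soft(path)

print(misuse, correct, type(first).__name__, len(X), len(X[0]))
# prints: False True bytes 2 3
```

Text mode trades a little decoding overhead for simpler comparisons; the bytes-based version in the answer avoids decoding every line and is equally valid.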

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/43368128
