首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >如何将线性回归应用到包含NaNs的大型多维阵列中的每个像素上?

如何将线性回归应用到包含NaNs的大型多维阵列中的每个像素上?
EN

Stack Overflow用户
提问于 2018-08-31 04:26:40
回答 4查看 4.9K关注 0票数 18

我有一个自变量值(x_array)的一维数组,它与多个时间步骤(y_array)空间数据的三维numpy数组中的时间步骤相匹配。我的实际数据要大得多: 300+时间步骤和最多3000 * 3000像素:

代码语言:javascript
复制
import numpy as np
from scipy.stats import linregress

# Independent variable: four time-steps of 1-dimensional data 
x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],

                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],

                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],

                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

我想要计算一个每像素线性回归,并获得R-平方,P-值,截取和斜率的每个xy像素在y_array,在x_array中的每一个时间步骤的值作为我的自变量。

我可以对数据进行整形,以一种形式将其输入到np.polyfit中,这是矢量化和快速的:

代码语言:javascript
复制
# Reshape so rows = number of time-steps and columns = pixels:
y_array_reshaped = y_array.reshape(len(y_array), -1)

# Do a first-degree polyfit
np.polyfit(x_array, y_array_reshaped, 1)

但是,这会忽略包含任何NaN值的像素(np.polyfit不支持NaN值),也不会计算我所需的统计信息(R-平方、P-值、截取和斜率)。

回答这里使用scipy.stats import linregress来计算我需要的统计信息,并建议通过屏蔽这些NaN值来避免NaN问题。但是,这个示例是针对两个一维数组的,我无法解决如何将类似的掩蔽方法应用到我的情况中,即y_array_reshaped中的每一列都有不同的NaN值集。

我的问题:如何计算一个大的多维数组(300x3000x3000)中的每个像素的回归统计数据,其中包含了许多NaN值,并以一种合理的、矢量化的方式进行了计算?

期望的结果:A3x3回归统计值数组(例如R-平方)用于y_array中的每个像素,即使该像素在时间序列中的某个点包含NaN

EN

回答 4

Stack Overflow用户

回答已采纳

发布于 2018-10-08 22:53:32

上述评论中提到的这篇博客文章包含了一个非常快的矢量化函数,用于Python中的多维数据的互相关、协方差和回归。它产生了我需要的所有回归输出,并以毫秒为单位完成,因为它完全依赖于xarray中简单的矢量数组操作。

https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html

我做了一个小改动(#3之后的第一行),以确保函数正确地计算每个像素中不同数量的NaN值:

代码语言:javascript
复制
def lag_linregress_3D(x, y, lagx=0, lagy=0):
"""
Input: Two xr.Datarrays of any dimensions with the first dim being time. 
Thus the input data could be a 1D time series, or for example, have three 
dimensions (time,lat,lon). 
Datasets can be provided in any order, but note that the regression slope 
and intercept will be calculated for y with respect to x.
Output: Covariance, correlation, regression slope and intercept, p-value, 
and standard error on regression between the two datasets along their 
aligned time dimension.  
Lag values can be assigned to either of the data, with lagx shifting x, and
lagy shifting y, with the specified lag amount. 
""" 
#1. Ensure that the data are properly alinged to each other. 
x,y = xr.align(x,y)

#2. Add lag information if any, and shift the data accordingly
if lagx!=0:

    # If x lags y by 1, x must be shifted 1 step backwards. 
    # But as the 'zero-th' value is nonexistant, xr assigns it as invalid 
    # (nan). Hence it needs to be dropped
    x   = x.shift(time = -lagx).dropna(dim='time')

    # Next important step is to re-align the two datasets so that y adjusts
    # to the changed coordinates of x
    x,y = xr.align(x,y)

if lagy!=0:
    y   = y.shift(time = -lagy).dropna(dim='time')
    x,y = xr.align(x,y)

#3. Compute data length, mean and standard deviation along time axis: 
n = y.notnull().sum(dim='time')
xmean = x.mean(axis=0)
ymean = y.mean(axis=0)
xstd  = x.std(axis=0)
ystd  = y.std(axis=0)

#4. Compute covariance along time axis
cov   =  np.sum((x - xmean)*(y - ymean), axis=0)/(n)

#5. Compute correlation along time axis
cor   = cov/(xstd*ystd)

#6. Compute regression slope and intercept:
slope     = cov/(xstd**2)
intercept = ymean - xmean*slope  

#7. Compute P-value and standard error
#Compute t-statistics
tstats = cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
stderr = slope/tstats

from scipy.stats import t
pval   = t.sf(tstats, n-2)*2
pval   = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)

return cov,cor,slope,intercept,pval,stderr
票数 8
EN

Stack Overflow用户

发布于 2018-08-31 07:12:05

我不确定这将如何扩展(也许您可以使用达斯克),但下面是使用应用方法对熊猫DataFrame这样做的非常简单的方法:

代码语言:javascript
复制
import pandas as pd
import numpy as np
from scipy.stats import linregress

# Independent variable: four time-steps of 1-dimensional data 
x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],

                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],

                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],

                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

def lin_regress(col):
    "Mask nulls and apply stats.linregress"
    col = col.loc[~pd.isnull(col)]
    return linregress(col.index.tolist(), col)

# Build the DataFrame (each index represents a pixel)
df = pd.DataFrame(y_array.reshape(len(y_array), -1), index=x_array.tolist())

# Apply a our custom linregress wrapper to each function, split the tuple into separate columns
final_df = df.apply(lin_regress).apply(pd.Series)

# Name the index and columns to make this easier to read
final_df.columns, final_df.index.name = 'slope, intercept, r_value, p_value, std_err'.split(', '), 'pixel_number'

print(final_df)

输出:

代码语言:javascript
复制
                 slope  intercept   r_value       p_value   std_err
pixel_number                                                       
0             0.071429  -0.192857  0.188982  8.789623e-01  0.371154
1            -0.071429  -0.207143 -0.188982  8.789623e-01  0.371154
2             0.357143  -0.464286  0.944911  2.122956e-01  0.123718
3             0.105263  -0.289474  0.229416  7.705843e-01  0.315789
4             1.000000  -0.700000  1.000000  9.003163e-11  0.000000
5            -0.285714  -0.328571 -0.188982  8.789623e-01  1.484615
6             0.105263  -0.289474  0.132453  8.675468e-01  0.557000
7            -0.285714  -0.228571 -0.755929  4.543711e-01  0.247436
8             0.071429  -0.392857  0.188982  8.789623e-01  0.371154
票数 5
EN

Stack Overflow用户

发布于 2019-06-17 14:21:42

这里给出的答案是绝对好的,因为它主要利用了https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html矢量化和广播的强大力量,但它假设要分析的数据是完整的,这在实际的研究周期中通常不是这样的。上面的一个答案旨在解决缺失的数据问题,但我个人认为需要更新更多的代码,因为如果数据中有nan,np.mean()将返回nan。幸运的是,numpy为我们提供了nanmean()nanstd()等,通过忽略数据中的nans来计算平均值、标准错误等等。同时,原博客中的程序以netCDF格式的数据为目标。有些人可能不知道这一点,但更熟悉原始的numpy.array格式。因此,我在这里提供了一个代码示例,展示了如何计算两个三维数组之间的协方差、相关系数等(n-D维是同一逻辑)。请注意,为了方便起见,我让x_array作为y_array第一个维度的索引,但在实际分析中,x_array肯定可以从外部读取。

代码

代码语言:javascript
复制
def linregress_3D(y_array):
    # y_array is a 3-D array formatted like (time,lon,lat)
    # The purpose of this function is to do linear regression using time series of data over each (lon,lat) grid box with consideration of ignoring np.nan
    # Construct x_array indicating time indexes of y_array, namely the independent variable.
    x_array=np.empty(y_array.shape)
    for i in range(y_array.shape[0]): x_array[i,:,:]=i+1 # This would be fine if time series is not too long. Or we can use i+yr (e.g. 2019).
    x_array[np.isnan(y_array)]=np.nan
    # Compute the number of non-nan over each (lon,lat) grid box.
    n=np.sum(~np.isnan(x_array),axis=0)
    # Compute mean and standard deviation of time series of x_array and y_array over each (lon,lat) grid box.
    x_mean=np.nanmean(x_array,axis=0)
    y_mean=np.nanmean(y_array,axis=0)
    x_std=np.nanstd(x_array,axis=0)
    y_std=np.nanstd(y_array,axis=0)
    # Compute co-variance between time series of x_array and y_array over each (lon,lat) grid box.
    cov=np.nansum((x_array-x_mean)*(y_array-y_mean),axis=0)/n
    # Compute correlation coefficients between time series of x_array and y_array over each (lon,lat) grid box.
    cor=cov/(x_std*y_std)
    # Compute slope between time series of x_array and y_array over each (lon,lat) grid box.
    slope=cov/(x_std**2)
    # Compute intercept between time series of x_array and y_array over each (lon,lat) grid box.
    intercept=y_mean-x_mean*slope
    # Compute tstats, stderr, and p_val between time series of x_array and y_array over each (lon,lat) grid box.
    tstats=cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
    stderr=slope/tstats
    from scipy.stats import t
    p_val=t.sf(tstats,n-2)*2
    # Compute r_square and rmse between time series of x_array and y_array over each (lon,lat) grid box.
    # r_square also equals to cor**2 in 1-variable lineare regression analysis, which can be used for checking.
    r_square=np.nansum((slope*x_array+intercept-y_mean)**2,axis=0)/np.nansum((y_array-y_mean)**2,axis=0)
    rmse=np.sqrt(np.nansum((y_array-slope*x_array-intercept)**2,axis=0)/n)
    # Do further filteration if needed (e.g. We stipulate at least 3 data records are needed to do regression analysis) and return values
    n=n*1.0 # convert n from integer to float to enable later use of np.nan
    n[n<3]=np.nan
    slope[np.isnan(n)]=np.nan
    intercept[np.isnan(n)]=np.nan
    p_val[np.isnan(n)]=np.nan
    r_square[np.isnan(n)]=np.nan
    rmse[np.isnan(n)]=np.nan
    return n,slope,intercept,p_val,r_square,rmse

样本输出

我用这个程序测试了两个227x3601x6301像素的三维阵列,它在20分钟内完成了工作,每个都不到10分钟。

票数 5
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/52108417

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档