Q: What is an efficient way to run a logistic regression on a large dataset (200 million observations × 2 variables)?

Stack Overflow user

Asked 2014-07-30 20:32:59

1 answer · 3.3K views · 0 followers · score 5

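I am currently trying to run a logistic regression model. My data has two variables, one response variable and one predictor variable. The catch is that I have 200 million observations. I have tried to fit the model in R, Stata, and MATLAB, and it is extremely difficult even with the help of EC2 instances on Amazon. I believe the problem lies in how the logistic regression functions are defined in the languages themselves. Is there another way to run a logistic regression quickly? Currently the problem is that my data quickly fills up whatever space it is using. I have even tried using up to 30 GB of RAM, to no avail. Any solutions would be most welcome.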

python

matlab

hadoop

stata

1 Answer

Stack Overflow user

Accepted answer

Posted 2014-09-19 05:55:36

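If your main issue is the ability to estimate a logit model under memory constraints, rather than the speed of the estimation, you can take advantage of the additivity of maximum likelihood estimation and write a custom program for ml. A logit model is simply maximum likelihood estimation using the logistic distribution, and the total log likelihood is the sum of the log likelihoods of the individual observations, so it can be accumulated one piece of the data at a time. The fact that you have only one independent variable simplifies the problem. I have simulated the problem below. You should create two do-files from the following code blocks.

If you have no trouble loading the whole dataset (and you shouldn't: my simulation used only about 2 GB of RAM with 200 million observations and 2 variables, though your mileage may vary), the first step is to break the dataset into manageable pieces. For example:

depvar = your dependent variable (0s or 1s); indepvar = your independent variable (some numeric data type)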

cd "/path/to/largelogit"

clear all
set more off

set obs 200000000

// We have two variables, an independent variable and a dependent variable.
gen indepvar = 10*runiform()
gen depvar = .

// As indepvar increases, the probability of depvar being 1 also increases.
replace depvar = 1 if indepvar > ( 5 + rnormal(0,2) )
replace depvar = 0 if depvar == .

save full, replace
clear all

// Need to split the dataset into manageable pieces

local max_opp = 20000000    // maximum observations per piece

local obs_num = `max_opp'

local i = 1
while `obs_num' == `max_opp' {

    clear

    local h = `i' - 1

    local obs_beg = (`h' * `max_opp') + 1
    local obs_end = (`i' * `max_opp')

    capture noisily use in `obs_beg'/`obs_end' using full

    if _rc == 198 {
        capture noisily use in `obs_beg'/l using full
    }
    if _rc == 198 { 
        continue, break
    }

    save piece_`i', replace

    sum
    local obs_num = `r(N)'

    local i = `i' + 1

}

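From here, to minimize memory usage, close Stata and reopen it. When you create a dataset this large, Stata keeps some memory allocated for overhead even after you clear the dataset. You can type memory after save full and again after clear all to see what I mean.

Next, you must define your own custom ml program, which feeds in each of these pieces one at a time, computes and sums the log likelihoods of the observations in each piece, and adds the piece totals together. You need to use the d0 ml method rather than the lf method, because the optimization routine used with lf requires all of the data to be loaded into Stata.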

clear all
set more off

cd "/path/to/largelogit"

// This local stores the names of all the pieces 
local p : dir "/path/to/largelogit" files "piece*.dta"

local i = 1
foreach j of local p {    // Loop through all the names to count the pieces

    global pieces = `i'    // This is important for the program
    local i = `i' + 1

}

// Generate our custom MLE logit program. This is using the d0 ml method 

program define llogit_d0

    args todo b lnf 

    tempvar y xb llike                // temporary variables
    tempname tot_llike it_llike       // temporary scalars

quietly {

    forvalues i=1/$pieces {

        capture drop _merge
        capture drop depvar indepvar
        capture drop `y'
        capture drop `xb'
        capture drop `llike' 
        capture scalar drop `it_llike'

        merge 1:1 _n using piece_`i'

        generate int `y' = depvar

        generate double `xb' = (indepvar * `b'[1,1]) + `b'[1,2]    // The linear combination of the coefficients and independent variable and the constant

        generate double `llike' = .

        replace `llike' = ln(invlogit( `xb')) if `y'==1    // the log of the probability should the dependent variable be 1
        replace `llike' = ln(1-invlogit(`xb')) if `y'==0   // the log of the probability should the dependent variable be 0

        sum `llike' 
        scalar `it_llike' = `r(sum)'    // The sum of the logged probabilities for this iteration

        if `i' == 1     scalar `tot_llike' = `it_llike'    // Total log likelihood for first iteration
        else            scalar `tot_llike' = `tot_llike' + `it_llike' // Total log likelihood is the sum of all the iterated log likelihoods `it_llike'

    }

    scalar `lnf' = `tot_llike'   // The total log likelihood which must be returned to ml

}

end

//This should work

use piece_1, clear

ml model d0 llogit_d0 (beta : depvar = indepvar )
ml search
ml maximize

I just ran the two code blocks above and received the following output:

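Pros and cons of this approach:

Pros:

- The smaller max_opp is, the lower the memory usage. I never used more than about 1 GB with the simulation above.
- You end up with unbiased estimators, the full log likelihood of the estimator for the entire dataset, and the correct standard errors: basically everything that matters for inference.

Cons:

- What you save in memory you pay for in CPU time. I ran this on my personal laptop with an i5 processor using Stata (one core), and it took overnight.
- The Wald chi2 statistic is wrong, but I believe you can calculate it from the correct figures mentioned above.
- You do not get a pseudo R2 as you would with logit.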
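Recomputing the two missing statistics by hand can be sketched as follows. This is a Python illustration and not part of the answer's Stata code. It assumes that the reported coefficient, standard error, and total log likelihood are correct, as stated above, and that the pseudo R2 wanted is McFadden's (the one Stata's logit reports). The function names and the input numbers are hypothetical.

```python
import math

def wald_chi2_one_predictor(coef, se):
    """Wald chi-squared (1 df) for a single slope, plus its p-value."""
    chi2 = (coef / se) ** 2
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def mcfadden_pseudo_r2(ll_model, n_ones, n_zeros):
    """McFadden pseudo R2 = 1 - ll_model / ll_null.

    ll_null is the log likelihood of an intercept-only logit, which
    depends only on how many 1s and 0s the dependent variable has.
    """
    n = n_ones + n_zeros
    p = n_ones / n
    ll_null = n_ones * math.log(p) + n_zeros * math.log(1.0 - p)
    return 1.0 - ll_model / ll_null

# Hypothetical inputs for illustration only (not the author's output).
chi2, p = wald_chi2_one_predictor(coef=0.5, se=0.1)
r2 = mcfadden_pseudo_r2(ll_model=-60.0, n_ones=50, n_zeros=50)
print(chi2, p, r2)
```

With a single predictor the Wald statistic reduces to the squared z-ratio, and the null log likelihood needs only the counts of 1s and 0s, so neither calculation requires the full dataset in memory.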

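To test whether the coefficients really are the same as those from a standard logit, set obs to something relatively small, such as 100000, and set max_opp to about 1000. Run my code and look at the output, then run logit depvar indepvar and look at its output: the two are the same apart from what I mention under "Cons" above. Setting obs equal to max_opp will correct the Wald chi2 statistic.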
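The same check can be sketched outside Stata. The Python example below is an illustration only, not the author's method: it swaps Stata's ml optimizer for a hand-written Newton-Raphson loop, and the names chunk_stats, fit_chunked, and total_loglik are made up for this sketch. It relies on the same additivity, because the log likelihood, gradient, and Hessian of a logit model are all sums over observations and can be accumulated one chunk at a time. The simulated data mirror the answer's simulation at a much smaller scale (2,000 observations in chunks of 100).

```python
import math
import random

def chunk_stats(xs, ys, b0, b1):
    """Log likelihood, gradient and information-matrix sums for one chunk."""
    ll = g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        xb = b0 + b1 * x
        # Numerically stable log(1 + exp(xb)).
        log1pexp = max(xb, 0.0) + math.log1p(math.exp(-abs(xb)))
        ll += y * xb - log1pexp
        p = math.exp(xb - log1pexp)   # invlogit(xb)
        r, w = y - p, p * (1.0 - p)
        g0 += r
        g1 += r * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return ll, (g0, g1), (h00, h01, h11)

def fit_chunked(chunks, iters=25, tol=1e-10):
    """Newton-Raphson for a one-predictor logit, visiting one chunk at a time."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xs, ys in chunks:          # only this chunk is needed in memory
            _, (c0, c1), (c00, c01, c11) = chunk_stats(xs, ys, b0, b1)
            g0 += c0
            g1 += c1
            h00 += c00
            h01 += c01
            h11 += c11
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = gradient.
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + s0, b1 + s1
        if abs(s0) + abs(s1) < tol:
            break
    return b0, b1

def total_loglik(chunks, b0, b1):
    """Total log likelihood as the sum of per-chunk log likelihoods."""
    return sum(chunk_stats(xs, ys, b0, b1)[0] for xs, ys in chunks)

# Small simulation in the spirit of the answer's do-file.
random.seed(0)
n, chunk_size = 2000, 100
xs = [10 * random.random() for _ in range(n)]
ys = [1 if x > 5 + random.gauss(0, 2) else 0 for x in xs]
chunks = [(xs[i:i + chunk_size], ys[i:i + chunk_size])
          for i in range(0, n, chunk_size)]

b_chunked = fit_chunked(chunks)        # 20 chunks of 100 observations
b_full = fit_chunked([(xs, ys)])       # everything as a single chunk
ll_chunked = total_loglik(chunks, *b_chunked)
ll_full = total_loglik([(xs, ys)], *b_full)
print(b_chunked, b_full, ll_chunked, ll_full)
```

The chunked and single-chunk fits should agree up to floating-point rounding, which is the property the Stata comparison against logit depvar indepvar is checking.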

Score: 4

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/25046395
