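This question refers to Chapter 3 of O'Reilly's "Practical Statistics for Data Scientists", 2nd edition, which covers the chi-square test.
The book gives an example of a chi-square test case: a website tests three different headlines, each shown to 1,000 visitors. The results show the number of clicks on each headline.
The observed data are as follows: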
Headline A B C
Click 14 8 12
No-click 986 992 988

The expected values are calculated as follows:

Headline A B C
Click 11.33 11.33 11.33
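No-click 988.67 988.67 988.67

Pearson residuals are defined as:

$$R = \frac{\text{Observed} - \text{Expected}}{\sqrt{\text{Expected}}}$$

The table now becomes:

Headline A B C
Click 0.792 -0.990 0.198
No-click -0.085 0.106 -0.021

The chi-square statistic is the sum of the squared Pearson residuals:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} R_{ij}^2$$

which is 1.666.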
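As a quick check of the arithmetic above, here is a minimal NumPy sketch (not the book's code) that recomputes the expected counts, the Pearson residuals, and the chi-square statistic from the observed table. Taking the row mean as the expected count is a shortcut that only works here because every headline has the same 1,000 visitors; the variable names are my own.

```python
import numpy as np

# Observed counts: rows are Click / No-click, columns are Headline A, B, C
observed = np.array([[14, 8, 12],
                     [986, 992, 988]], dtype=float)

# Expected counts under the null hypothesis. Because all three columns
# have the same total (1,000), each row's expected count is its row mean.
expected = observed.mean(axis=1, keepdims=True) * np.ones_like(observed)

# Pearson residual: (observed - expected) / sqrt(expected)
pearson_residuals = (observed - expected) / np.sqrt(expected)

# Chi-square statistic: sum of the squared Pearson residuals
chi2_stat = (pearson_residuals ** 2).sum()

print(np.round(expected, 2))
print(np.round(pearson_residuals, 3))
print(round(chi2_stat, 4))
```

The click row of the expected table works out to 34 / 3, about 11.33, which is the value that reproduces the residuals shown above.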
So far so good. Now for the resampling part:
1. Assume a box of 34 ones and 2,966 zeros.
2. Shuffle, take three samples of 1,000, and count the ones (clicks) in each.
3. Find the squared differences between the shuffled counts and the expected counts, then sum them.
4. Repeat steps 2 and 3 a few thousand times.
5. The p-value is how often the resampled sum of squared deviations exceeds the observed one.

The Python resampling code provided with the book is as follows (available from https://github.com/gedeck/practical-statistics-for-data-scientists/tree/master/python/code):
## Practical Statistics for Data Scientists (Python)
## Chapter 3. Statistical Experiments and Significance Testing
# > (c) 2019 Peter C. Bruce, Andrew Bruce, Peter Gedeck
# Import required Python packages.
from pathlib import Path
import random
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats import power
import matplotlib.pylab as plt
DATA = Path('.').resolve().parents[1] / 'data'
# Define paths to data sets. If you don't keep your data in the same directory as the code, adapt the path names.
CLICK_RATE_CSV = DATA / 'click_rates.csv'
...
## Chi-Square Test
### Chi-Square Test: A Resampling Approach
# Table 3-4
click_rate = pd.read_csv(CLICK_RATE_CSV)
clicks = click_rate.pivot(index='Click', columns='Headline', values='Rate')
print(clicks)
# Table 3-5
row_average = clicks.mean(axis=1)
pd.DataFrame({
    'Headline A': row_average,
    'Headline B': row_average,
    'Headline C': row_average,
})
# Resampling approach
box = [1] * 34
box.extend([0] * 2966)
random.shuffle(box)
def chi2(observed, expected):
    pearson_residuals = []
    for row, expect in zip(observed, expected):
        pearson_residuals.append([(observe - expect) ** 2 / expect
                                  for observe in row])
    # return sum of squares
    return np.sum(pearson_residuals)
expected_clicks = 34 / 3
expected_noclicks = 1000 - expected_clicks
expected = [34 / 3, 1000 - 34 / 3]
chi2observed = chi2(clicks.values, expected)
def perm_fun(box):
    sample_clicks = [sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000))]
    sample_noclicks = [1000 - n for n in sample_clicks]
    return chi2([sample_clicks, sample_noclicks], expected)
perm_chi2 = [perm_fun(box) for _ in range(2000)]
resampled_p_value = sum(perm_chi2 > chi2observed) / len(perm_chi2)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'Resampled p-value: {resampled_p_value:.4f}')
chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chi2observed:.4f}')
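print(f'p-value: {pvalue:.4f}')

Now, I ran perm_fun(box) 2,000 times and obtained a resampled p-value of 0.4775. However, when I ran perm_fun(box) 10,000 times and 100,000 times, I obtained a resampled p-value of 0.84 both times. It seems to me the p-value should be around 0.84. Why does stats.chi2_contingency show such a small number?
The result I get from running it 2,000 times is: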
Observed chi2: 1.6659
Resampled p-value: 0.8300
Observed chi2: 1.6659
p-value: 0.4348

If I run it 10,000 times, the result is:
Observed chi2: 1.6659
Resampled p-value: 0.8386
Observed chi2: 1.6659
p-value: 0.4348

Software versions:
pandas.__version__: 0.25.1
numpy.__version__: 1.16.5
scipy.__version__: 1.3.1
statsmodels.__version__: 0.10.1
sys.version_info: 3.7.4

Posted on 2020-11-27 11:19:22
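I ran your code with 2,000, 10,000, and 100,000 iterations, and all three times I got close to .47. However, I did get an error on this line, which I had to fix:

resampled_p_value = sum(perm_chi2 > chi2observed) / len(perm_chi2)

Here perm_chi2 is a list and chi2observed is a float, so I wonder how this code ran for you (perhaps whatever you did to fix it is the source of the error). In any case, changing it to the intended

resampled_p_value = sum([1*(x > chi2observed) for x in perm_chi2]) / len(perm_chi2)

allowed me to run it and get close to .47.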
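To illustrate the point about the list-versus-float comparison, here is a small sketch with made-up statistic values (they are for illustration only, not real resampling output). It shows that Python 3 refuses to order a list against a float, and two equivalent ways to count the exceedances.

```python
import numpy as np

# Made-up resampled statistics, for illustration only
perm_chi2 = [0.5, 2.0, 3.1, 1.0]
chi2observed = 1.6659

# Comparing a plain list with a float raises TypeError in Python 3
try:
    perm_chi2 > chi2observed
    raised = False
except TypeError:
    raised = True

# Option 1: compare element by element and count the True values
p_list = sum(x > chi2observed for x in perm_chi2) / len(perm_chi2)

# Option 2: convert to a NumPy array so the comparison is element-wise
p_numpy = np.mean(np.array(perm_chi2) > chi2observed)

print(raised, p_list, p_numpy)
```

Either option gives the fraction of resampled statistics that exceed the observed one, which is what the original line was meant to compute.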
Make sure that when you change the number of iterations, you do so only by changing the 2000 and no other numbers.
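One way to guard against accidentally editing the wrong number is to make the iteration count the only parameter. The following is a rough sketch of that idea, not the book's code: the function name `resampled_p_value` and the `seed` argument are my own, and it assumes NumPy 1.17 or later for `np.random.default_rng`. It also swaps `random.sample(box, 1000)` for a hypergeometric draw, which has the same distribution as counting the ones in 1,000 items drawn without replacement from a box of 34 ones and 2,966 zeros.

```python
import numpy as np

def resampled_p_value(n_iter, seed=0):
    """Resampled chi-square p-value; n_iter is the only thing to vary."""
    rng = np.random.default_rng(seed)
    observed = np.array([[14, 8, 12], [986, 992, 988]], dtype=float)
    # Expected clicks and no-clicks per headline, as a column vector
    expected = np.array([34 / 3, 1000 - 34 / 3])[:, None]

    def chi2(table):
        # Works on one (2, 3) table or a stack of shape (n, 2, 3)
        return ((table - expected) ** 2 / expected).sum(axis=(-2, -1))

    chi2_observed = chi2(observed)
    # Three independent draws of 1,000 from the box, per iteration
    clicks = rng.hypergeometric(34, 2966, 1000, size=(n_iter, 3))
    tables = np.stack([clicks, 1000 - clicks], axis=1)
    return np.mean(chi2(tables) > chi2_observed)

for n in (2000, 10000, 100000):
    print(n, resampled_p_value(n))
```

With the box size, sample size, and expected counts fixed inside the function, the result should stay in the neighborhood of the .47 I reported above no matter how many iterations are used.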
https://stackoverflow.com/questions/64600459