I am using XGBClassifier to predict whether a user will click an ad.
I am looking for advice on increasing recall for the minority class.
About my data:
1. Total rows: 1,266,267
2. Total clicks: 1960 rows (0.15%) => imbalanced dataset
3. Features used:
- Num of views
- Device used
- Time (categorized into 6 buckets)
- Ad category
- Site id (338 unique site ids)
- User features (583 unique features; note: features available for 60% of the data)

After one-hot encoding, the total number of columns/features is 943.
The final data is in sparse matrix format.
Model results:
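For context, the encoding step described above can be sketched as follows. This is a minimal illustration with made-up feature values, assuming scikit-learn's OneHotEncoder, which returns a SciPy sparse matrix by default:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the categorical features in the question (made-up values).
X = np.array([
    ["mobile",  "morning", "news"],
    ["desktop", "evening", "sports"],
    ["mobile",  "evening", "news"],
])

# OneHotEncoder emits a SciPy sparse matrix by default, which keeps a
# wide design matrix (943 columns in the question) memory-friendly.
enc = OneHotEncoder(handle_unknown="ignore")
X_sparse = enc.fit_transform(X)

print(sparse.issparse(X_sparse))  # True
print(X_sparse.shape)             # (3, 6): 2 devices + 2 time buckets + 2 categories
```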
Model                               | AUC    | Logloss | Recall* | Precision*
------------------------------------|--------|---------|---------|-----------
Using all 943 features              | 0.7359 | 0.05392 | 0.47    | 0.85
Clustered user features into groups | 0.7548 | 0.05470 | 0.51    | 0.80
(final model, 361 features)         |        |         |         |
*Recall and precision refer to the minority class (click=1).
**Recall and precision for the majority class (click=0) are 1.

To increase recall on the imbalanced dataset, I have tried:
1. Undersampling (highest recall 0.92, but precision at 0.03)
2. SMOTE (highest recall 0.77, but precision at 0.05)
3. Different algorithm (best is XGBoost)
4. Hyperparameter tuning (recall increased by 0.01)

Question: The CTR-prediction results I have seen report AUC between 0.7 and 0.82, but none of them show recall or a confusion matrix. For anyone who has done CTR prediction before, may I know your recall/confusion matrix?

Posted on 2020-10-28 14:28:04
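Beyond the resampling experiments listed above, XGBoost has a built-in scale_pos_weight parameter for imbalanced binary targets. A common heuristic (a rule of thumb, not something stated in the question) is the ratio of negative to positive samples; a minimal sketch using the class counts from the question:

```python
import numpy as np

# Labels mirroring the imbalance in the question: 1,266,267 rows, 1,960 clicks.
n_total, n_pos = 1_266_267, 1_960
y = np.zeros(n_total, dtype=np.int8)
y[:n_pos] = 1

# Usual heuristic: scale_pos_weight = (# negatives) / (# positives),
# so errors on the minority (click) class are penalized more heavily.
scale_pos_weight = (y == 0).sum() / (y == 1).sum()
print(round(scale_pos_weight, 1))  # 645.1

# This would then be passed to the classifier, e.g.:
# XGBClassifier(scale_pos_weight=scale_pos_weight)
```

Note that a weight this large tends to trade precision for recall, much like the undersampling result above, so it usually needs tuning downward from the raw ratio.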
In addition, I would suggest trying other things, for example: if your data is imbalanced, balance it with SMOTE (50%/50%); if you have many categorical variables, try other types of encoding, etc.

Posted on 2022-08-05 16:55:53
Try increasing the sample_weight of the minority-class samples:

clf.fit(X_train, Y_train, sample_weight=sample_weight)

where sample_weight is an array of floats with length equal to the number of samples.
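A minimal sketch of building such an array, assuming scikit-learn's compute_sample_weight helper ("balanced" weights each class inversely to its frequency); clf stands in for the XGBClassifier from the question:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy labels: 8 non-clicks, 2 clicks (made up to mirror the imbalance).
Y_train = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# "balanced" mode: weight = n_samples / (n_classes * class_count),
# so minority-class rows get proportionally larger weights.
sample_weight = compute_sample_weight(class_weight="balanced", y=Y_train)

print(sample_weight)
# Class 0 rows get 10 / (2 * 8) = 0.625; class 1 rows get 10 / (2 * 2) = 2.5,
# i.e. each click counts 4x as much as each non-click during training.

# Then, as in the answer:
# clf.fit(X_train, Y_train, sample_weight=sample_weight)
```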
https://stackoverflow.com/questions/54086757