Magic Feature: https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600 — this kernel introduces a frequency-encoding helper, `encode_FE`.
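A minimal frequency-encoding helper in the spirit of that kernel's `encode_FE` (this version is a sketch, not the kernel's exact code): each categorical value is replaced by the fraction of rows in which it appears.

```python
import pandas as pd

def encode_FE(df, cols):
    """Frequency-encode each column: map every value to its relative frequency."""
    for col in cols:
        freq = df[col].value_counts(normalize=True)  # value -> fraction of rows
        df[col + "_FE"] = df[col].map(freq)
    return df

# Toy example: 'visa' appears in 3 of 4 rows, 'mc' in 1 of 4
df = pd.DataFrame({"card_type": ["visa", "visa", "visa", "mc"]})
encode_FE(df, ["card_type"])
```

Frequency encoding is useful for tree models because rare categories (often correlated with fraud) get distinctly small values without exploding dimensionality the way one-hot encoding would.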
entropy = -(fraud_prob * np.log2(fraud_prob + 1e-10) + (1 - fraud_prob) * np.log2(1 - fraud_prob + 1e-10))
fraud_corr = self.risk_system.df[self.risk_system.df['is_fraud'] == 1].corr()['is_fraud'].drop('is_fraud')
normal_corr = self.risk_system.df[self.risk_system.df['is_fraud'] == 0].corr()['is_fraud'].drop('is_fraud')
tp = np.sum(predicted_fraud & actual_fraud)    # true positives
fp = np.sum(predicted_fraud & ~actual_fraud)   # false positives
tn = np.sum(~predicted_fraud & ~actual_fraud)  # true negatives
fn = np.sum(~predicted_fraud & actual_fraud)   # false negatives
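These four counts are exactly what precision and recall are built from; a self-contained sketch with illustrative boolean arrays (the label values here are made up):

```python
import numpy as np

# Illustrative ground truth and predictions for five transactions
actual_fraud = np.array([True, True, False, False, False])
predicted_fraud = np.array([True, False, True, False, False])

tp = np.sum(predicted_fraud & actual_fraud)    # true positives
fp = np.sum(predicted_fraud & ~actual_fraud)   # false positives
tn = np.sum(~predicted_fraud & ~actual_fraud)  # true negatives
fn = np.sum(~predicted_fraud & actual_fraud)   # false negatives

precision = tp / (tp + fp)  # of flagged transactions, how many were fraud
recall = tp / (tp + fn)     # of actual fraud, how much we caught
```

In fraud work, recall usually matters more than accuracy, since the negative class dominates the dataset.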
With the rise of online shopping comes a rise in fraud. Experian's 2022 Future of Fraud Forecast highlights BNPL, romance, and cryptocurrency schemes on the rise. Without the right identity verification and fraud mitigation tools in place, fraudsters will take advantage. Experian predicts BNPL lenders will see an uptick in two types of fraud: identity theft and synthetic identity fraud, in which a fraudster uses a combination of real and fake information to create an entirely new identity.
data['Amount_max_fraud'] = 1
data.loc[data.Amount <= 2125.87, 'Amount_max_fraud'] = 0
# 0.172% of transactions were fraud.
Fraud = data[data.Fraud == 1]
Normal = data[data.Normal == 1]
# Set X_train equal to 80% of the fraudulent transactions, then drop the label columns.
X_train = X_train.drop(['Fraud', 'Normal'], axis=1)
X_test = X_test.drop(['Fraud', 'Normal'], axis=1)
y_train.Fraud *= ratio
y_test.Fraud *= ratio
print('Count of training data:\n', y_train.Fraud)
print('Count of test data:\n', y_test.Fraud)
However, we can only produce fraud labels after we have already suffered a fraud attack. A useful system must additionally track new fraud methods and constantly adapt to an ever-changing environment in order to build anti-fraud machine learning models. To a certain extent, then, risk control and anti-fraud work are universal. Meanwhile, the cost of committing fraud remains comparatively low.
Steps: first determine the degree of class imbalance via value_counts(), checking each class's count and share. In this example, once we know the number of fraud rows, we sample the same number of non-fraud rows, giving a 50%:50% split. Take the fraud rows, then take the same number of rows from the non-fraud data:
# The fraud rows
fraud_df = df[df["Class"] == 1]
# Take len(fraud_df) rows from the non-fraud data
no_fraud_df = df[df["Class"] == 0][:len(fraud_df)]
# 492 + 492
normal_distributed_df = pd.concat([fraud_df, no_fraud_df])
Positive correlation: the larger the value, the more likely the transaction is fraud. Negative correlation: features V17, V14, V12 and V10 are negatively correlated; the smaller the value, the more likely the transaction is fraud. Boxplots of the negatively correlated features:
v14_fraud = new_df["V14"].loc[new_df["Class"] == 1]
q1, q3 = v14_fraud.quantile(0.25), v14_fraud.quantile(0.75)
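The q1/q3 quantiles above feed the standard IQR outlier rule; a sketch on synthetic V14-like values (the 1.5 multiplier is the conventional choice, and the numbers here are made up, not from the credit-card dataset):

```python
import pandas as pd

# Synthetic stand-in for the V14 values of fraud rows
v14_fraud = pd.Series([-10.0, -9.5, -9.0, -8.5, -8.0, -2.0, -20.0])

q1, q3 = v14_fraud.quantile(0.25), v14_fraud.quantile(0.75)
iqr = q3 - q1
# Anything outside [q1 - 1.5*iqr, q3 + 1.5*iqr] is treated as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
trimmed = v14_fraud[(v14_fraud >= lower) & (v14_fraud <= upper)]
```

Trimming extreme outliers from the minority class this way is a common preprocessing step before training on the balanced 50/50 subset.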
', description='Features for credit card fraud detection', tags=['fraud', 'credit_card'])
# Query the registered features
registered_features = fr.get_features('fraud_detection_features', version='1.0')
# Feature lineage graph
lineage_graph = fr.get_lineage('fraud_detection_features')
lineage_graph.visualize()
100, 'max_depth': 5})
# Bind features to the model
mt.bind_features(features)
# Train the model
training_metrics = mt.train(target='is_fraud', validation_split=0.2)
# Save the model
mt.save('fraud_detection_model', version='1.0')
fraud_risk_level = 'low risk'
action = 'approve'
return {
    'fraud_score': fraud_score,
    'fraud_risk_level': fraud_risk_level,
    'action': action,
    'z_score': z_score,
}
plt.figure(figsize=(10, 6))
plt.bar(['fraud risk score'], [fraud_detection['fraud_score']],
        color='red' if fraud_detection['fraud_risk_level'] == 'high risk'
        else 'orange' if fraud_detection['fraud_risk_level'] == 'medium risk'
        else 'green')
plt.ylim(0, 1)
plt.title(f"Transaction fraud detection result - {fraud_detection['fraud_risk_level']}")
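A score-to-bucket mapping like the one above can be sketched as a small pure function; the 0.7/0.4 thresholds here are illustrative assumptions, not values from the original system:

```python
def risk_level(fraud_score):
    """Map a fraud score in [0, 1] to a (risk bucket, action) pair."""
    if fraud_score >= 0.7:   # assumed high-risk cutoff
        return 'high risk', 'block'
    if fraud_score >= 0.4:   # assumed medium-risk cutoff
        return 'medium risk', 'manual review'
    return 'low risk', 'approve'

print(risk_level(0.85))  # → ('high risk', 'block')
```

Keeping the bucketing in one function makes the thresholds easy to tune against precision/recall targets later.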
performance_df['fraud_amount'] = performance_df['Amount'] * performance_df['Actual']
performance_df['fraud_prevented'] = performance_df['fraud_amount'] * performance_df['Pred']
performance_df['fraud_realized'] = performance_df['fraud_amount'] - performance_df['fraud_prevented']
financial_recall = (performance_df['fraud_prevented'].sum()
                    / (performance_df['fraud_prevented'].sum()
                       + performance_df['fraud_realized'].sum())) * 100
In this situation, we can pass a dictionary to class_weight like this: fraud_class_weights = {0: 1, 1: 10}. But the sklearn API actually makes this process even easier.
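The "easier" route the text alludes to is sklearn's built-in class_weight handling: pass the explicit dictionary, or the string 'balanced' to set weights inversely proportional to class frequency. A minimal sketch on tiny synthetic clusters (the data values are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 8 legitimate transactions near (0, 0), 2 fraud-like ones near (2, 2)
X = np.array([[0.0, 0.0]] * 8 + [[2.0, 2.0]] * 2)
y = np.array([0] * 8 + [1] * 2)

# Explicit dictionary: misclassifying fraud (class 1) costs 10x
clf_manual = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)

# Or let sklearn weight classes inversely to their frequency
clf_balanced = LogisticRegression(class_weight="balanced").fit(X, y)
```

Both options reweight the loss rather than resampling the data, so no rows are duplicated or discarded.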
The canonical example is fraud detection, where most transactions aren't fraud, but the cost associated with missing a fraud case is far higher than the cost of flagging a non-fraud case. We may also weigh fraud cases more heavily than non-fraud cases because of a business rule, so we might alter how we weight the classes. Because we care more about fraud cases, let's oversample the fraud relative to non-fraud cases. Put in the context of the problem: if the estimated cost associated with fraud is sufficiently large, it can eclipse the cost associated with tracking fraud.
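One simple way to oversample fraud relative to non-fraud, as described above, is to draw minority rows with replacement until the classes match; a pandas sketch on toy data (the column values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "Class":  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],  # 2 fraud, 8 non-fraud
})
fraud = df[df["Class"] == 1]
non_fraud = df[df["Class"] == 0]
# Draw fraud rows with replacement until the classes are balanced
fraud_upsampled = fraud.sample(n=len(non_fraud), replace=True, random_state=42)
# Recombine and shuffle so fraud rows aren't grouped at the end
balanced = pd.concat([non_fraud, fraud_upsampled]).sample(frac=1, random_state=42)
```

The trade-off versus undersampling is that no majority-class information is thrown away, at the cost of duplicated minority rows, which can encourage overfitting.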
num_fraud = np.sum(d['Class'] == 1)
plt.bar(['Fraud', 'non-fraud'], [num_fraud, num_nonfraud], color='dodgerblue')
mse_test = np.mean(np.power(X_test - pred_test, 2), axis=1)
mse_fraud = np.mean(np.power(X_fraud - pred_fraud, 2), axis=1)
mae_test = np.mean(np.abs(X_test - pred_test), axis=1)
mae_fraud = np.mean(np.abs(X_fraud - pred_fraud), axis=1)
mse_df = pd.DataFrame()
mse_df['MSE'] = np.hstack([mse_test, mse_fraud])
mse_df['MAE'] = np.hstack([mae_test, mae_fraud])
mse_df = mse_df.sample(frac=1).reset_index(drop=True)
# Scatter plots of MSE and MAE
markers = ['o', '^']
colors = ['dodgerblue', 'coral']
labels = ['Non-fraud', 'Fraud']
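Reconstruction errors like these are turned into fraud predictions by thresholding; a sketch that uses a quantile of the normal-traffic error as the cutoff (the error values are synthetic, and the 95th percentile is an assumed choice, not from the original notebook):

```python
import numpy as np

# Synthetic reconstruction errors: fraud rows reconstruct poorly
mse_normal = np.array([0.10, 0.20, 0.15, 0.12])
mse_fraud = np.array([2.0, 3.5, 1.8])

# Flag anything worse than the 95th percentile of normal-traffic error
threshold = np.quantile(mse_normal, 0.95)
preds_fraud = mse_fraud > threshold
```

Because the autoencoder is trained only on normal transactions, fraud tends to sit in the right tail of the error distribution, and the quantile choice trades recall against false-positive rate.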
The following code shows a simple approach:
# Shuffle the dataset.
shuffled_df = credit_df.sample(frac=1, random_state=4)
# Put all the fraud class in a separate dataset.
fraud_df = shuffled_df.loc[shuffled_df['Class'] == 1]
# Randomly select 492 observations from the non-fraud (majority) class.
non_fraud_df = shuffled_df.loc[shuffled_df['Class'] == 0].sample(n=492, random_state=42)
# Concatenate both dataframes again.
normalized_df = pd.concat([fraud_df, non_fraud_df])
# Plot the dataset after the undersampling.
plt.figure(figsize=(8, 8))
sns.countplot(x='Class', data=normalized_df)
…‘Cannabis’, ‘Stimulants’, ‘Hash’]
Packstation24 [‘Accounts’, ‘Benzos’, ‘IDs & Passports’, ‘SIM Cards’, ‘Fraud’, …‘Stimulants’, ‘Prescription’, ‘Sildenafil Citrate’]
OzVendor [‘Software’, ‘Erotica’, ‘Dumps’, ‘E-Books’, ‘Fraud’, …]
[‘Stimulants’] [‘MDMA’] 310 0.768
[‘Speed’, ‘Weed’, ‘Stimulants’] [‘Cannabis’, ‘Ecstasy’] 68 0.623
[‘Fraud’, ‘Hacking’] [‘Accounts’] 53 0.623
[‘Fraud’, ‘CC & CVV’, ‘Accounts’] [‘Paypal’] 43 0.492
[‘Documents…
creditcard_data['flag_as_fraud'] = np.where(… (creditcard_data['V3'] < -5), 1, 0)
print(pd.crosstab(creditcard_data['Class'], creditcard_data['flag_as_fraud'],
                  rownames=['Actual Fraud'], colnames=['Flagged Fraud']))
# Output (truncated): Flagged Fraud 0 1 / Actual Fraud …
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)
print(pd.crosstab(y_test, predictions, rownames=['Actual Fraud'], colnames=['Flagged Fraud']))
# Output (truncated): Flagged Fraud 0.0 1.0 / Actual Fraud 0.0 1504 …
# Second model's crosstab (truncated): Flagged Fraud 0.0 1.0 / Actual Fraud 0.0 1496 …
Source notebook: https://github.com/wmlba/innovate2019/blob/master/Credit_Card_Fraud_Detection.ipynb
1. Resampling (oversampling and undersampling)
A simple implementation is the shuffle-and-undersample snippet shown earlier: separate the 492 fraud rows, randomly sample 492 non-fraud rows, and concatenate the two.
df_total['type'].unique()
df_fraud = df_total[df_total['isFraud'] == 1]
df_fraud['type'].unique()
# Count normal vs. fraudulent transactions by transaction type.
df_fraudTransfer = df_fraud[df_fraud['type'] == 'TRANSFER']
df_fraudCashout = df_fraud[df_fraud['type'] == 'CASH_OUT']
X_fraud.loc[(X_fraud['oldbalanceDest'] == 0) & (X_fraud['newbalanceDest'] == 0) & (X_fraud['amount'] != 0)]
print('Transactions where the origin balance is 0 both before and after, yet the amount is non-zero')
print('Rate among fraudulent transactions:\t {}'.format(
    len(X_fraud.loc[(X_fraud['oldbalanceOrg'] == 0)
                    & (X_fraud['newbalanceOrig'] == 0)
                    & (X_fraud['amount'] != 0)]) / len(X_fraud)))
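The zero-balance condition can be wrapped into an explicit boolean feature; a sketch assuming PaySim-style column names (the three rows of values below are synthetic):

```python
import pandas as pd

X_fraud = pd.DataFrame({
    "oldbalanceDest": [0.0, 100.0, 0.0],
    "newbalanceDest": [0.0, 150.0, 0.0],
    "amount":         [500.0, 50.0, 0.0],
})
# Destination balance is zero before and after, yet the amount is non-zero:
# a strong signal that the money never actually arrived
mask = (
    (X_fraud["oldbalanceDest"] == 0)
    & (X_fraud["newbalanceDest"] == 0)
    & (X_fraud["amount"] != 0)
)
rate = mask.mean()  # share of rows matching the pattern
```

Turning the rule into a named feature lets a downstream model learn how much weight the pattern deserves instead of hard-coding it as a filter.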
fraud_indices = np.random.choice(n_samples, 50, replace=False)
data['refund_amount'][fraud_indices] = np.random.uniform(…)
data['…'][fraud_indices] = np.random.choice([0, 1, 2, 3, 23], 50)
data['device_entropy'][fraud_indices] = np.random.uniform(0.01, 0.1, 50)
data['ip_distance'][fraud_indices] = np.random.uniform(500, 2000, 50)
data['typing_variance'][fraud_indices] = np.random.uniform(0.01, 0.05, 50)
df = pd.DataFrame(data)
for ring in fraud_rings:
    for node_idx in ring:
        df.loc[node_idx, 'in_fraud_ring'] = True
# Composite risk score
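The ring-marking loop above can be made self-contained like this (the fraud_rings row indices and the account column are hypothetical, chosen only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"account": ["a", "b", "c", "d", "e"]})
# Hypothetical ring membership expressed as lists of row indices
fraud_rings = [[0, 1], [3]]

# Default: nobody is in a ring; then flag ring members
df["in_fraud_ring"] = False
for ring in fraud_rings:
    for node_idx in ring:
        df.loc[node_idx, "in_fraud_ring"] = True
```

Initializing the column to False before the loop avoids the NaN/object dtype you get when assigning True only to some rows of a brand-new column.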
March 11 to May 18) to the reopening phase (May 19 to July 25), noted the quarterly report on global online fraud. "Almost overnight, fraudsters tried to take advantage," said Shai Cohen, senior vice president of global fraud solutions. As those businesses ramped up their digital fraud prevention solutions, he continued, the fraudsters adapted.
v12_fraud = new_df["V12"].loc[new_df["Class"] == 1].values
sns.distplot(v12_fraud, ax=ax2, fit=norm, color="#56F9BB")
ax2.set_title("V12", fontsize=14)
v10_fraud = new_df["V10"].loc[new_df["Class"] == 1].values
sns.distplot(v10_fraud, ax=ax3, fit=norm, color="#C5B3F9")
ax3.set_title("V10", fontsize=14)
# The minority data
fraud_df = df[df["Class"] == 1]
# Take len(fraud_df) rows from the non-fraud data
no_fraud_df = df[df["Class"] == 0][:len(fraud_df)]
# Combine
normal_distributed_df = pd.concat([fraud_df, no_fraud_df])
# Shuffle the data
new_df = normal_distributed_df.sample(frac=1)