Sex: gender, male for men and female for women. Age: age; ages under 1 year are given as decimals. SibSp: number of siblings and spouses aboard with the passenger. Parch: number of parents and children aboard with the passenger. Ticket: ticket number. Fare You can first put the fields to keep in a list, then select those columns: cols = ['survived', 'name', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked'] #cols = ['survived', 'name', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare'] all_data = df['fare'].mean() df['fare'] = df['fare'].fillna(fare_mean) df['sex'] = df['sex'].map({'female = df['fare'].mean() df['fare'] = df['fare'].fillna(fare_mean) For the sex column, use the map method to set female to 0, male
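The two cleaning steps in this snippet (fill missing fares with the mean, then encode sex as integers) can be sketched in plain Python, so the logic is visible without pandas. The toy rows are hypothetical, and mapping male to 1 is an assumption, since the snippet is cut off after "female to 0".

```python
from statistics import mean

# Hypothetical toy rows; None marks a missing fare.
rows = [
    {"sex": "female", "fare": 7.25},
    {"sex": "male", "fare": None},
    {"sex": "male", "fare": 71.75},
]

# Mean over the fares that are present (df['fare'].mean() also skips missing values).
fare_mean = mean(r["fare"] for r in rows if r["fare"] is not None)

# female -> 0 comes from the text; male -> 1 is an assumption.
sex_codes = {"female": 0, "male": 1}

cleaned = [
    {
        "sex": sex_codes[r["sex"]],
        "fare": fare_mean if r["fare"] is None else r["fare"],
    }
    for r in rows
]

print(fare_mean)
print(cleaned)
```

In pandas the same two steps would be the `fillna` and `map` calls quoted in the snippet.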
def replace_fare(fare): if fare == 'Adulto ida': return 1 elif fare == 'Promo +': return 2 elif fare == 'Promo': return 3 elif fare == 'Flexible': return 4 elif fare == 'Individual-Flexible': return 5 elif fare == 'Mesa': return 6 data['fare_encode'] = data['fare'].apply(lambda x: replace_fare(x)) label_encoder = preprocessing.LabelEncoder fare_like = pm.Normal('fare_like', mu=fare_est, sd=eps, observed=data.price) with hierarchical_model
We bin the fares into less than $10, $10 to $20, $20 to $30, and $30 or more, and store the result in a new variable: > train$Fare2 <- '30+' > train$Fare2[train$Fare < 30 & train$Fare >= 20] <- '20-30' > train$Fare2[train$Fare < 20 & train$Fare >= 10] <- '10-20' > train$Fare2[train$Fare < 10] <- '<10' Now let's run a longer aggregate function and see whether anything interesting shows up: > aggregate(Survived ~ Fare2 + Pclass + Sex, data=train, FUN=function(x) {sum(x)/length(x)}) Fare2 Pclass Sex Survived 1 20-30 1 female Survived[test$Sex == 'female'] <- 1 > test$Survived[test$Sex == 'female' & test$Pclass == 3 & test$Fare
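As a rough Python counterpart to the R binning and the `sum(x)/length(x)` aggregate above, the sketch below assigns the same four labels and computes a survival rate per label. The passenger tuples are made up for illustration, and grouping by the fare band alone (without Pclass and Sex) is a simplification.

```python
from collections import defaultdict


def fare_band(fare):
    """Return the same label the R snippet stores in Fare2."""
    if fare < 10:
        return "<10"
    if fare < 20:
        return "10-20"
    if fare < 30:
        return "20-30"
    return "30+"


# Hypothetical (fare, survived) pairs, not real Titanic rows.
passengers = [(7.25, 0), (8.05, 1), (15.0, 1), (26.0, 0), (80.0, 1), (53.1, 1)]

# Per band: [number of survivors, number of passengers].
totals = defaultdict(lambda: [0, 0])
for fare, survived in passengers:
    counts = totals[fare_band(fare)]
    counts[0] += survived
    counts[1] += 1

# Survival rate = survivors / passengers, i.e. sum(x) / length(x) in the R code.
rates = {label: s / n for label, (s, n) in totals.items()}
print(rates)
```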
= train["Fare"].mean() test.loc[pd.isnull(test.Fare),'Fare'] = fare_mean embarked_mode = train['Embarked'] = X_train['Fare'] + 1 X_test['Fare'] = X_test['Fare'] + 1 X_train['Fare'] = X_train['Fare'].apply train['Fare'] = train['Fare'] + 1 test['Fare'] = test['Fare'] + 1 train['Fare'] = train['Fare'].apply(np.log) test['Fare'] = test['Fare'].apply(np.log) train['Fare (test['Fare'] <= 3) ) test['Fare_3_4'] = np.int32((test['Fare'] > 3) & (test['Fare'] <= 4) ) test['Fare
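The snippet adds 1 to every fare before taking the logarithm, presumably so that a fare of 0 stays finite, and then builds 0/1 indicator columns on the log scale. A minimal standard-library sketch of both steps follows; the sample fares are hypothetical.

```python
import math

# Hypothetical fares, including a zero fare that would break a plain log().
fares = [0.0, 7.25, 30.0, 71.2833]

# Same transform as fare + 1 followed by np.log in the snippet.
log_fares = [math.log(f + 1) for f in fares]

# Indicator for the (3, 4] interval on the log scale, like Fare_3_4 in the snippet.
fare_3_4 = [int(3 < v <= 4) for v in log_fares]

print(log_fares)
print(fare_3_4)
```

`math.log1p(f)` computes the same quantity in one call and is more accurate for very small `f`.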
Fill missing values with the median # fill missing Fare with median fare for each Pclass train["Fare"].fillna(train.groupby("Pclass")["Fare"].transform("median"), inplace=True) # Fill missing values of the Fare attribute: group by Pclass and fill with the median Fare of each group test[ Bin the Fare attribute for dataset in train_test_data: dataset.loc[dataset['Fare'] <= 17, 'Fare'] = 0 dataset.loc[(dataset['Fare'] > 17) & (dataset['Fare'] <= 30), 'Fare'] = 1 dataset.loc[(dataset['Fare'] > 30) & (dataset['Fare'] <= 100), 'Fare'] = 2 dataset.loc[dataset['Fare'] > 100, 'Fare'] = 3 Cabin
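Filling a missing Fare with the median of its own Pclass group, as the `groupby(...).transform("median")` line does, can be illustrated without pandas. This is only a sketch under assumed data: the `(pclass, fare)` tuples are invented.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (pclass, fare) rows; None marks a missing fare.
rows = [(1, 80.0), (1, 60.0), (1, None), (3, 7.0), (3, 9.0), (3, 8.0), (3, None)]

# Collect the known fares of each class.
fares_by_class = defaultdict(list)
for pclass, fare in rows:
    if fare is not None:
        fares_by_class[pclass].append(fare)

# One median per class, computed from the non-missing fares only.
class_median = {pclass: median(values) for pclass, values in fares_by_class.items()}

# Replace each missing fare with the median of that row's class.
filled = [(pclass, class_median[pclass] if fare is None else fare) for pclass, fare in rows]
print(filled)
```

Compared with a single global median, this keeps a missing first-class fare from being filled with a value typical of third class.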
scatter plot against passenger age and sex lm = sns.lmplot(x='Age', y='Fare', data=titanic, hue='Sex', fit_reg=False) lm.set(title='Fare x Age') axes = lm.axes axes[0,0].set_ylim(-5,) axes[0,0].set_xlim(-5,85) plt.show() How many people survived? print(titanic.Survived.sum()) Plot a histogram of the ticket fares df = titanic.Fare.sort_values(ascending=False) print(df) binsVal = np.arange(0,600,10) print(binsVal) plt.hist(df, bins=binsVal) plt.xlabel("Fare") plt.ylabel('Frequency') plt.title("Fare Paid Histogram") plt.show() This article was written by 所长 and is licensed under the Creative Commons Attribution 4.0 International License. Articles on this site, unless marked as reposted
Passenger fare cabin Cabin number embarked Port of Embarkation C = Cherbourg, Q = Queenstown, S = column and Age from continuous values to discrete values (for classification) for dataset in full_data: dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0 dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1 dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2 dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3 dataset['Fare'] = dataset['Fare'].astype(int) dataset.loc[ dataset['Age']
0] fare_survived = train_data['Fare'][train_data['Survived'] == 1] average_fare = pd.DataFrame([fare_not_survived.mean(), fare_survived.mean()]) std_fare = pd.DataFrame([fare_not_survived.std(), fare_survived.std()]) average_fare.plot The code below bins Fare. # Divide all fares into quintiles train_data['Fare_bin'] = pd.qcut(train_data['Fare'], 5) train_data['Fare_bin 'Fare_bin']).rename(columns=lambda x: 'Fare_' + str(x)) train_data = pd.concat([train_data, fare_bin_dummies_df
column and create a new feature CategoricalFare for dataset in full_data: dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median()) train['CategoricalFare'] = pd.qcut(train['Fare'], 4) # Create dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0 dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1 dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2 dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3 dataset['Fare'] = dataset['Fare']
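The chain of `dataset.loc[...]` assignments maps each fare to a class 0 to 3 using the cut points 7.91, 14.454 and 31, which presumably are the quartile edges that `pd.qcut(train['Fare'], 4)` reported. One compact way to express the same lookup, shown here as a sketch rather than the original author's method, is a binary search over the edges; the helper name `fare_class` is hypothetical.

```python
from bisect import bisect_left

# Upper bounds of classes 0, 1 and 2, copied from the snippet.
EDGES = [7.91, 14.454, 31]


def fare_class(fare):
    """Map a fare to 0..3 with the same <= comparisons as the snippet.

    bisect_left counts the edges strictly below the fare, so a fare equal
    to an edge stays in the lower class.
    """
    return bisect_left(EDGES, fare)


for sample in (5.0, 7.91, 7.92, 14.454, 31, 512.33):
    print(sample, fare_class(sample))
```

Unlike the in-place `.loc` version, this never overwrites the raw fare, so the result does not depend on the order in which the ranges are applied.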
combined_train_test['Fare'] = combined_train_test[['Fare']].fillna(combined_train_test.groupby('Pclass']).transform('count') combined_train_test['Fare'] = combined_train_test['Fare'] / combined_train_test _mean_fare, pclass2_mean_fare, pclass3_mean_fare): if df['Pclass'] == 1: if df['Fare'] <= (pclass_fare_category, args=( Pclass1_mean_fare, Pclass2_mean_fare, Pclass3_mean_fare), axis=1) pclass_level ['Age','Fare', 'Name_length']]) combined_train_test[['Age','Fare', 'Name_length']] = scale_age_fare.transform
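Query data: query rows by condition, fare greater than 20; 6 rows are returned, as follows: 5. Update data: by condition, set fare to 25 for the rows where rider = 'rider-D', as follows: 6. Merging Data: create the fare_adjustment table and insert 4 rows. Then merge fare_adjustment into hudi_table, summing the fare of the two tables by uuid and assigning the result to the fare column of hudi_table. The original hudi_table has 8 rows and fare_adjustment has 4 rows. Of these, 2 rows share the same uuid, so after the merge there are 10 (8+4-2) rows in total, and the fare of those 2 rows is the sum of the fares from the two tables, as follows: 7.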
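The row arithmetic of the merge (8 + 4 - 2 = 10) can be checked with a small model in which plain dictionaries keyed by uuid stand in for the two tables. This is not Hudi or Spark API; the uuids and fare values are invented, and only the matched-rows-are-summed, unmatched-rows-are-inserted behaviour described above is modelled.

```python
# Stand-in for hudi_table: 8 rows, uuid -> fare (values are hypothetical).
hudi_table = {f"id-{i}": 10.0 * i for i in range(1, 9)}

# Stand-in for fare_adjustment: 4 rows, 2 of which share a uuid with hudi_table.
fare_adjustment = {"id-1": 5.0, "id-2": -2.0, "id-9": 7.0, "id-10": 3.0}

merged = dict(hudi_table)
for uuid, fare in fare_adjustment.items():
    if uuid in merged:
        merged[uuid] += fare   # matched: fares of both tables are summed
    else:
        merged[uuid] = fare    # not matched: the row is inserted

matched = set(hudi_table) & set(fare_adjustment)
print(len(merged), sorted(matched))
```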
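variable; the figure below shows that the higher the Fare, the higher the survival rate. is.na(data$Fare)) & row(data[, 'Fare']) <= 891, ], aes(x = Fare, color=Survived)) + geom_line(aes(label =..count..), stat = 'bin', binwidth=10) + labs(title = "How Fare impact survivor", x = "Fare", y = Here the missing value is filled with the median passenger Fare. data$Fare[is.na(data$Fare)] <- median(data$Fare, na.rm=TRUE) Set missing Cabin to a default value. Many records lack Cabin information, so filling with the median or mean is not appropriate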
basePath + "/*/*/*/*") tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot") spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show() spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show() , begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show() This queries all changes committed after the begin time , begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show() 7.
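Sex 'numeric', # Age 'integer', # SibSp 'integer', # Parch 'character', # Ticket 'numeric', # Fare Next, briefly analyze the data: for example, the ratio of survivors to deaths and whether passenger class (Pclass) affects survival. We know that the elderly, women, and children were given priority, so check whether sex and age have an effect. Is the ticket price (fare) also related? In addition, the median third-class fare is computed here, because one record in the test set has a missing Fare value. ,pclass,pclass.filter) { for(v in pclass.filter) { fare[is.na(fare)]=median(fare[pclass==v],na.rm=T) } return(fare)} pclass.filter=c(1,2,3) train.raw$fare=imputeFare(train.raw$Fare,train.raw$Pclass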
Split Fare into 10 groups according to the ranges of the GroupFare feature: train.loc[train['Fare'] < 60, 'Fare'] = 0 train.loc[(train['Fare'] >= 60) & (train['Fare'] < 120), 'Fare'] = 1 train.loc[(train['Fare'] >= 120) & (train['Fare'] < 180), 'Fare'] = 2 train.loc[(train['Fare'] >= 180) & (train['Fare'] < 240), 'Fare'] = 3 train.loc[(train['Fare'] ['Fare'] >= 420) & (train['Fare'] < 480), 'Fare'] = 7 train.loc[(train['Fare'] >= 480) & (train['Fare'] < 540), 'Fare'] = 8 train.loc[(train['Fare'] >= 540) & (train['Fare'] < 600), 'Fare'] = 9 train.Fare.unique
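Because the ten groups above all have the same width of 60, the group index can also be computed by integer division instead of ten separate assignments. The sketch below assumes fares in the range 0 to 600, as in the snippet; the helper name `fare_group` is hypothetical.

```python
BIN_WIDTH = 60  # every range in the snippet spans 60 fare units


def fare_group(fare):
    """Return 0 for [0, 60), 1 for [60, 120), ... 9 for [540, 600)."""
    return int(fare // BIN_WIDTH)


for sample in (0.0, 59.99, 60.0, 179.5, 512.33, 599.0):
    print(sample, fare_group(sample))
```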
The Columns we'll use to predict the target predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare 'test.csv') titanic_test['Age'] = titanic_test['Age'].fillna(titanic['Age'].median()) titanic_test['Fare'] = titanic_test['Fare'].fillna(titanic_test['Fare'].median()) titanic_test.loc[titanic_test['Sex'] See how "Pclass", "Sex", "Title", and "Fare" are the best? ,"Embarked","FamilySize","Title"]], [LogisticRegression(random_state=1), ["Pclass","Sex","Age","Fare
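After merging there are 1309 rows in total, and we can see that four features (age, cabin, embarked, Fare) have missing values, with cabin missing most heavily. df.describe() Fare distribution and feature analysis # Box plot feature analysis fig, [ax1,ax2] = plt.subplots(1,2,figsize=(20,6)) sns.boxplot(x="Pclass", y="Fare .set_title('Box plot of Fare by Pclass', fontsize = 18) ax2.set_title("KDE plot of Fare by Pclass", fontsize = 18) fig.show Conclusion: Pclass 1 has higher Fare than Pclass 2 and 3, because of higher status and greater wealth. However, Pclass 1 contains several outliers above 500; let's look at these records. df.loc[df['Fare']>500] The other two people also have the same Embarked, Ticket, and Fare, which suggests that this Fare above 500 may not be an outlier. Later we will handle these specially in feature engineering.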
Parch Ticket Fare Cabin Embarked 0 1 0 3 Parch Ticket Fare Cabin Embarked 1 2 1 1 Cumings, Mrs. Survived Pclass Name Sex Age SibSp Parch Ticket Fare "]].median() Out[36]: Age 28.0000 Fare 14.4542 dtype: float64 More information: titanic[["Age", "Fare"]].describe() Out[37]: Age Fare count 714.000000 891.000000 mean 29.699118
['Fare'].astype(int) # Get the Fare values of the passengers who did not survive and of those who did (this indexing is much like the which() function in R): fare_not_survived = titanic_df["Fare"][titanic_df["Survived"] == 0] fare_survived = titanic_df["Fare"][titanic_df["Survived"] == 1] # Convert to a DataFrame and plot: average_fare = DataFrame([fare_not_survived.mean(), fare_survived.mean()]) std_fare = DataFrame([fare_not_survived.std(), fare_survived.std()]) titanic_df['Fare'].plot = std_fare.index.names = ["Survived"] average_fare.plot(yerr=std_fare,kind='bar',legend=False)
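The numbers behind that bar chart are just the mean and the sample standard deviation of Fare within each Survived group. A dependency-free sketch, using invented records, is shown below; `statistics.stdev` uses the n - 1 denominator, which matches the default of pandas `.std()`.

```python
from statistics import mean, stdev

# Hypothetical (fare, survived) records.
records = [(7.0, 0), (8.0, 0), (12.0, 0), (26.0, 1), (50.0, 1), (74.0, 1)]

# Split the fares by outcome, mirroring the boolean indexing in the snippet.
fare_not_survived = [fare for fare, survived in records if survived == 0]
fare_survived = [fare for fare, survived in records if survived == 1]

# (mean, sample standard deviation) per group, keyed by the Survived value.
summary = {
    0: (mean(fare_not_survived), stdev(fare_not_survived)),
    1: (mean(fare_survived), stdev(fare_survived)),
}
print(summary)
```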
Sum the tips for each driver (traditional event-time tumbling window): DataStream<Tuple3<Long, Long, Float>> hourlyTips = fares .keyBy((TaxiFare fare) -> fare.driverId) .window(TumblingEventTimeWindows.of(Duration.ofSeconds(5))) .process ) -> fare.driverId) .process(new PseudoWindow(Duration.ofSeconds(5))); // Pseudo-window: assign each record, by event time, to the hour-long interval it falls in <TaxiFare>forBoundedOutOfOrderness(Duration.ofSeconds(5)) .withTimestampAssigner((fare, ts) -> fare.getEventTime()), "Kafka Fares"); DataStream<Tuple3<Long, Long, Float>> hourlyTips
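The core idea of a tumbling event-time window, that every event belongs to exactly one fixed-size interval determined by its timestamp, can be sketched in a few lines of Python. This is a batch simulation only, not Flink code: it ignores watermarks, late events and state cleanup, and the event tuples and function names are hypothetical.

```python
from collections import defaultdict

WINDOW_MS = 5_000  # 5-second windows, matching Duration.ofSeconds(5) in the snippet


def window_start(event_time_ms, size_ms=WINDOW_MS):
    """Start timestamp of the tumbling window that contains the event."""
    return event_time_ms - (event_time_ms % size_ms)


def sum_tips(fares):
    """Sum tips per (window start, driver id) over (driver_id, event_time_ms, tip) tuples."""
    totals = defaultdict(float)
    for driver_id, event_time_ms, tip in fares:
        totals[(window_start(event_time_ms), driver_id)] += tip
    return dict(totals)


# Hypothetical events: (driver_id, event_time_ms, tip).
events = [(1, 1000, 2.0), (1, 4999, 3.0), (1, 5000, 1.0), (2, 1200, 4.0)]
tips = sum_tips(events)
print(tips)
```

Keying on the pair (window, driver) plays the role of `keyBy` followed by the window assignment; in a real streaming job the totals for a window can only be emitted once the watermark has passed the end of that window.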