I have a dataset here: https://raw.githubusercontent.com/bryonbaker/datasets/main/SIT720/Ass1/hypothyroid.csv
The code to load it is:
import ssl

import numpy as np
import pandas as pd

# Disable HTTPS certificate verification for the download.
ssl._create_default_https_context = ssl._create_unverified_context
url = 'https://raw.githubusercontent.com/bryonbaker/datasets/main/SIT720/Ass1/hypothyroid.csv'
fullht_df = pd.read_csv(url)
fullht_df.head(n=100)
# Get the first 500 rows from the dataset and use that for the rest of the assignment.
ht_df = fullht_df.head(n=500)

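I am trying to iterate over the "sex" column and replace the unknown values ('?') with a sensible value. That value will be a computed "M" or "F", depending on another algorithm that is not important to the question.
I am new to Pandas, and for some reason this is proving harder than I expected.
What is the best way to iterate over a column Series and test its values?
Because there are many unknown values, I first replaced every '?' with np.NaN: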
# Replace with NaN so many of the Pandas functions will work.
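ht_df = ht_df.replace('?', np.NaN)

This let me update all the missing numeric values with the column mean (this is not important to the question, other than to explain why I replaced everything with NaN):
# Replace the NaN's of the numeric columns with the mean
ht_df["TSH"] = ht_df["TSH"].fillna(mean["TSH"])
ht_df["T3"] = ht_df["T3"].fillna(mean["T3"])
ht_df["TT4"] = ht_df["TT4"].fillna(mean["TT4"])
ht_df["FTI"] = ht_df["FTI"].fillna(mean["FTI"])

But now all that is left is to iterate over the "sex" column and replace its missing values, and I can't work out how to iterate over it properly.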
I used the code below to help me understand what is happening. I have included only a sample of the output.
for item in ht_df["sex"]:
    print(f"{item} {type(item)}")

Output:
F <class 'str'>
F <class 'str'>
... <snip> ...
F <class 'str'>
F <class 'str'>
M <class 'str'>
F <class 'str'>
nan <class 'float'>
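F <class 'str'>

nan is a float, which makes sense. But I can't test for it like this:
for item in ht_df["sex"]:
    if item == np.NaN:
        print(f"{item} is NaN\n")
    print(f"{item} {type(item)}")

The if condition never fires.
How do I test a value for NaN while iterating, and then update that cell with a new value?
Here is the complete test code: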
import pandas as pd
import numpy as np
import ssl


def main():
    ssl._create_default_https_context = ssl._create_unverified_context
    url = 'https://raw.githubusercontent.com/bryonbaker/datasets/main/SIT720/Ass1/hypothyroid.csv'
    fullht_df = pd.read_csv(url)
    print(fullht_df.head(n=100))

    # Get the first 500 rows from the dataset and use that for the rest of the assignment.
    ht_df = fullht_df.head(n=500)

    # Display the dataset's dimension
    print(f"Working dataset dimension is: {ht_df.shape}\n")

    # Cells with missing data have a '?' in them.
    # First replace ? with np.NaN so we can utilise some other nice Pandas dataframe methods. We can use a global replace because, upon dataset inspection, the unknown ('?') only exists in the numeric columns.
    # Convert the value columns from text to numeric.
    # Calculate the median value for the numeric-data columns
    # Replace the NaN values with a reasonable value. For this exercise we have chosen the mean for the column
    # Recalculate the median value for the numeric-data columns

    # Prepare the data so it is calculable
    ht_df = ht_df.replace('?', np.NaN)  # Replace with NaN so many of the Pandas functions will work.
    ht_df[["TSH","T3","TT4","FTI"]] = ht_df[["TSH","T3","TT4","FTI"]].apply(pd.to_numeric)  # CSV loads as text. Convert the cells to numeric

    # Calculate the Mean and Median prior to replacing missing values
    mean = ht_df[["TSH","T3","TT4","FTI"]].mean(skipna=True)
    median = ht_df[["TSH","T3","TT4","FTI"]].median(skipna=True)

    # Replace the NaN's of the numeric columns with the mean
    ht_df["TSH"] = ht_df["TSH"].fillna(mean["TSH"])
    ht_df["T3"] = ht_df["T3"].fillna(mean["T3"])
    ht_df["TT4"] = ht_df["TT4"].fillna(mean["TT4"])
    ht_df["FTI"] = ht_df["FTI"].fillna(mean["FTI"])

    # Replace the M/F missing values with the most frequently occurring gender provided "pregnant" is false. Otherwise set the value to F.
    print("@@@@@@@@@@@@@@")
    for item in ht_df["sex"]:
        if item == np.NaN:
            print(f"{item} is NaN\n")
        print(f"{item} {type(item)}")
    print("@@@@@@@@@@@@@@")


if __name__ == "__main__":
    main()

Posted on 2022-03-20 10:46:54
You can't check item == np.NaN, because NaN never compares equal to anything, including itself. You have to use pd.isna(item):
for item in ht_df["sex"]:
    if pd.isna(item):
        print(f"{item} is NaN\n")
    print(f"{item} {type(item)}")

Output:
...
M <class 'str'>
F <class 'str'>
nan is NaN
nan <class 'float'>
F <class 'str'>
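
The behaviour behind this can be checked in isolation. The following is a minimal sketch, not part of the original answer: the small `sex` Series is made-up stand-in data, and the value is spelled `np.nan` because the `np.NaN` alias is not available in NumPy 2.0 and later.

```python
import math

import numpy as np
import pandas as pd

nan = np.nan  # same value the question calls np.NaN

# NaN is unordered under IEEE 754, so an equality test against it is always False.
print(nan == nan)   # False
print(nan != nan)   # True

# Dedicated checks succeed where == does not.
print(pd.isna(nan))     # True
print(math.isnan(nan))  # True

# pd.isna also accepts non-float values such as the 'F'/'M' strings,
# whereas math.isnan("F") would raise a TypeError.
print(pd.isna("F"))     # False

# Hypothetical stand-in for ht_df["sex"].
sex = pd.Series(["F", "M", np.nan, "F"])
flags = [bool(pd.isna(item)) for item in sex]
print(flags)
```

Because the column mixes strings and floats, `pd.isna` is probably the most convenient test inside a loop like the one above.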
...

Posted on 2022-03-20 10:42:57
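I'm not sure why you would use iteration to print each item. But if you just want to print which rows have np.nan in the 'sex' column, then:
print(ht_df["sex"].isna())

This shows True or False for each row, depending on whether its value is np.nan.
If you only want to see the dataframe elements for those rows, you can try something like this:
print(ht_df.loc[ht_df["sex"].isna(), ["sex", "pregnant", "TSH"]])

It prints every row where the 'sex' column is np.nan, showing the values of those three columns (I chose them arbitrarily; you can pass any list of columns you like).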
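
As a rough illustration of the boolean-mask selection described above, here is a small sketch on made-up data. The toy frame `df`, its values, and the 't'/'f' coding of the "pregnant" column are assumptions for the example, not taken from the real dataset. The fill rule is the one stated in the question's code comment (F if pregnant, otherwise the most frequent sex), and the same mask is reused to assign the guessed values with `.loc`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ht_df; values and the 't'/'f' coding are assumptions.
df = pd.DataFrame(
    {
        "sex": ["M", np.nan, "M", np.nan, "F"],
        "pregnant": ["f", "t", "f", "f", "f"],
        "TSH": [1.3, 4.1, 0.9, 2.2, 1.8],
    }
)

# Boolean Series that is True where "sex" is missing.
missing = df["sex"].isna()
print(df.loc[missing, ["sex", "pregnant", "TSH"]])

# Rule from the question's comment: F if pregnant, else the most frequent sex.
most_common = df["sex"].mode().iloc[0]  # mode() ignores NaN by default
guess = np.where(df["pregnant"] == "t", "F", most_common)

# Assign only into the rows selected by the mask.
df.loc[missing, "sex"] = guess[missing.to_numpy()]
print(df["sex"].tolist())
```

Selecting and assigning through one mask avoids a Python-level loop over the column, which tends to be both shorter and faster on larger frames.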
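Finally, if you want to create a formula that guesses whether an np.nan row should be M or F, I would normally create a "sex_predict" column, fill it with whatever values your formula produces, and then use fillna:
ht_df["sex"] = ht_df["sex"].fillna(ht_df["sex_predict"])

Posted on 2022-03-20 11:40:00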
Thanks, everyone (especially Tom and 科伦, since both were correct), for the answers. They were very instructive about how I should be working with data and Pandas. I combined the two into the solution below:
tmp_col = "sex-predict"
ht_df[tmp_col] = ht_df["sex"]
for (index, row_series) in ht_df.iterrows():
    if pd.isna(row_series["sex"]):
        ht_df.at[index, tmp_col] = calc_gender()  # Calculate the value for the missing value.
# Copy over any NaN values in the sex column using the value from the temporary column
ht_df["sex"] = ht_df["sex"].fillna(ht_df[tmp_col])
ht_df = ht_df.drop([tmp_col], axis=1)  # Drop the temporary column

https://stackoverflow.com/questions/71545872