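So, I have a left DF and a right DF. Both have the same columns. But when I try to join them on the index with left.join(right, lsuffix='_CAN', rsuffix='_US'), the right dataset (US) comes out with all NaN values.
Thanks for your help.
Dataset used: this is the Kaggle dataset: https://www.kaggle.com/datasnaek/youtube-new. I only used the Canada and USA files (see the Data Explorer on the left side of the page, CAvideos.csv and USvideos.csv). I read them in, and now I want to join them on the index.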
Posted on 2020-10-25 01:54:55
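No, only 35675 of the 87284 records are NA. You can check the number of NAs with Series.isna().sum() (or its alias .isnull().sum()) and compare it with len(df). You can't draw that conclusion just from looking at the head and tail.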
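To illustrate the check on something small, here is a minimal sketch with made-up data (the toy frames, the `views` column and its values are hypothetical, not taken from the Kaggle files). It relies on `DataFrame.join` defaulting to a left join, so right-hand columns are NaN only for index values that have no match on the right:

```python
import pandas as pd

# Toy stand-ins for the CA and US tables (hypothetical data, not the Kaggle files)
left = pd.DataFrame(
    {"video_id": ["a", "b", "c"], "views": [10, 20, 30]}
).set_index("video_id")
right = pd.DataFrame(
    {"video_id": ["b", "c", "d"], "views": [200, 300, 400]}
).set_index("video_id")

# join() defaults to how="left": every left row is kept, and the
# right-hand columns are NaN where the index value has no match.
joined = left.join(right, lsuffix="_CAN", rsuffix="_US")

n_missing = int(joined["views_US"].isna().sum())  # left rows with no match on the right
n_total = len(joined)
print(n_missing, n_total)
```

Here only "a" is missing from the right frame, so one row out of three has NaN in views_US. The same comparison is applied to the real files in the code that follows.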
import pandas as pd

ca = pd.read_csv("/mnt/ramdisk/CAvideos.csv")
us = pd.read_csv("/mnt/ramdisk/USvideos.csv")
ca.set_index("video_id", inplace=True)  # I assume this is the intended logic
us.set_index("video_id", inplace=True)
# ca=left, us=right
df = ca.join(us, lsuffix='_CAN', rsuffix='_US')
print(df["description_US"].isna().sum())  # 35675
print(len(df))  # 87284

You can also filter down to the non-NA records:
print(df[~df["description_US"].isna()])
            trending_date_CAN  ...                                      description_US
video_id                       ...
n1WpP7iowLc          17.14.11  ...  Eminem's new track Walk on Water ft. Beyoncé i...
n1WpP7iowLc          17.14.11  ...  Eminem's new track Walk on Water ft. Beyoncé i...
n1WpP7iowLc          17.14.11  ...  Eminem's new track Walk on Water ft. Beyoncé i...
n1WpP7iowLc          17.14.11  ...  Eminem's new track Walk on Water ft. Beyoncé i...
n1WpP7iowLc          17.14.11  ...  Eminem's new track Walk on Water ft. Beyoncé i...
...                       ...  ...                                                 ...
Gi56dSh8Fq8          18.14.06  ...  It's your good ol' McDonald's Big Mac, but lik...
JGm9Y_hFqNk          18.14.06  ...  First Take's Stephen A. Smith and Max Kellerma...
Fmdb-KmlzD8          18.14.06  ...  The Master Chief returns in Halo Infinite – th...
Fmdb-KmlzD8          18.14.06  ...  The Master Chief returns in Halo Infinite – th...
Fmdb-KmlzD8          18.14.06  ...  The Master Chief returns in Halo Infinite – th...

https://stackoverflow.com/questions/64516079