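I am trying to build a multi-class, multi-label model that classifies movies by genre based on their plot. There are 24 distinct genres; this is the number of movies per genre: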
genre number_of_movies
Drama 3965
Comedy 3046
Thriller 2024
Romance 1892
Crime 1447
Action 1303
Adventure 1024
Horror 954
Mystery 759
Sci-Fi 723
Fantasy 707
Family 682
Documentary 419
Biography 373
War 348
Music 341
History 273
Musical 271
Sport 261
Animation 260
Western 237
Film-Noir 168
Short 92
News 7
I am creating the features with CountVectorizer(), as follows:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(max_features=4412, stop_words='english', ngram_range=(1, 3), binary=True)
X = vect.fit_transform(df['plot'])
X.shape
Output:
(7895, 4412)
MultiLabelBinarizer() is used to create y_genres:
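from sklearn.preprocessing import MultiLabelBinarizer
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])
y_genres.shape
Output:
(7895, 24)
The goal is to resample every class except the majority class, using RandomOverSampler and SMOTE from imblearn.over_sampling. However, when using: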
from imblearn.over_sampling import RandomOverSampler, SMOTE
ros = RandomOverSampler(random_state=42)
X_resampled, Y_resampled = ros.fit_sample(X, y_genres)
Y_resampled.shape
Output:
(52690, 22)
And when using SMOTE:
sm = SMOTE(random_state=42)
X_resampled, Y_resampled = sm.fit_sample(X, y_genres)
Error:
Expected n_neighbors <= n_samples, but n_samples = 2, n_neighbors = 6
What should I do to solve the two problems described above?
Posted on 2019-04-14 08:10:43
sm.fit_resample may come to the rescue.
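Renaming the call may not be enough on its own. As far as I know, the imblearn samplers are built for single-label targets, and SMOTE needs at least k_neighbors + 1 samples in every class it interpolates (k_neighbors defaults to 5, which is where n_neighbors = 6 in the message comes from). One possible workaround, which is not an imblearn feature, is to oversample by hand on the raw genre lists. The sketch below is plain random oversampling by row duplication rather than SMOTE; the function name oversample_indices and its parameters are hypothetical.

```python
import random
from collections import Counter


def oversample_indices(label_sets, target=None, seed=42):
    """Return row indices that randomly oversample rows with rare labels.

    label_sets: one iterable of labels per row (e.g. a movie's genres).
    target: minimum count wanted for every label; defaults to the count
            of the most frequent label.
    """
    rng = random.Random(seed)
    label_sets = [set(s) for s in label_sets]
    counts = Counter(label for s in label_sets for label in s)
    if target is None:
        target = max(counts.values())

    # Rows that carry each label, so duplicates can be drawn from them.
    rows_with = {}
    for i, s in enumerate(label_sets):
        for label in s:
            rows_with.setdefault(label, []).append(i)

    indices = list(range(len(label_sets)))
    # Rarest labels first (ties broken by name for reproducibility).
    # Copying a row also raises the counts of its other labels.
    for label in sorted(counts, key=lambda l: (counts[l], l)):
        while counts[label] < target:
            i = rng.choice(rows_with[label])
            indices.append(i)
            counts.update(label_sets[i])
    return indices
```

Assuming the rows of dataTraining['genres'] line up with X and y_genres, the result could be applied as idx = oversample_indices(dataTraining['genres']), followed by X_resampled = X[idx] and Y_resampled = y_genres[idx]. Because a duplicated row carries all of its genres, frequent genres grow as well, so the label counts end up closer together but generally not equal.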
https://stackoverflow.com/questions/55670092