This takes too long:
```python
# Document-frequency
phrases_final["doc_freq"] = len(phrases_final) * [0]
# For each phrase, count the number of clusters that phrase occurs in
for phrase in phrases_final["extracted_phrases"]:
    for i in cluster_name:
        all_tweets = ""
        for tweet in df["tweets_to_consider"][df.cl_num == i]:
            all_tweets = all_tweets + tweet + ". "
        if phrase in all_tweets:
            phrases_final["doc_freq"][
                (phrases_final.extracted_phrases == phrase)
                & (phrases_final.cluster_num == i)
            ] = (
                phrases_final["doc_freq"][
                    (phrases_final.extracted_phrases == phrase)
                    & (phrases_final.cluster_num == i)
                ]
                + 1
            )
```

Posted on 2020-11-08 14:54:44
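For reference, the code above assumes roughly this data layout. The column names (`tweets_to_consider`, `cl_num`, `extracted_phrases`, `cluster_num`) are taken from the question itself; the toy values here are invented purely for illustration:

```python
# Minimal toy setup matching the names used in the question's code.
# In the real question the data is much larger; these values are made up.
import pandas as pd

# One row per tweet, tagged with its cluster number.
df = pd.DataFrame({
    "tweets_to_consider": ["cats are nice", "dogs bark", "cats purr"],
    "cl_num": [0, 0, 1],
})

# The set of cluster ids iterated over in the inner loop.
cluster_name = [0, 1]

# One row per (phrase, cluster) pair.
phrases_final = pd.DataFrame({
    "extracted_phrases": ["cats", "dogs", "cats"],
    "cluster_num": [0, 0, 1],
})
```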
Build `all_tweets` once per cluster instead of recomputing it for every phrase. You may also not want to construct that long string at all, because `if phrase in (long_string_here)` will be slow.

Also (since `phrases_final["doc_freq"]` is initialized as a plain list of integers, the indexing above may be completely bogus), consider a `collections.Counter()` keyed by `(cluster_num, phrase)` tuples (or a `collections.defaultdict(collections.Counter)` keyed by `cluster_num`); if that is still too slow, parallelize over the phrases or the clusters with `multiprocessing.Pool()`.
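A minimal sketch of the Counter-based rewrite suggested above. The column names are taken from the question; the `doc_freq` helper function and the exact data layout are assumptions for illustration:

```python
# Sketch: build each cluster's text once, count hits in a Counter keyed by
# (cluster, phrase) tuples, then map the counts back onto the DataFrame.
from collections import Counter

import pandas as pd


def doc_freq(df: pd.DataFrame, phrases_final: pd.DataFrame, cluster_names) -> pd.DataFrame:
    # Concatenate each cluster's tweets exactly once, not once per phrase.
    cluster_text = {
        i: ". ".join(df.loc[df.cl_num == i, "tweets_to_consider"])
        for i in cluster_names
    }
    # Count (cluster, phrase) occurrences; a missing key reads as 0.
    counts = Counter()
    for phrase in phrases_final["extracted_phrases"].unique():
        for i, text in cluster_text.items():
            if phrase in text:
                counts[(i, phrase)] += 1
    # Write the counts back in one pass instead of boolean-indexed updates.
    phrases_final = phrases_final.copy()
    phrases_final["doc_freq"] = [
        counts[(c, p)]
        for c, p in zip(phrases_final["cluster_num"], phrases_final["extracted_phrases"])
    ]
    return phrases_final
```

This keeps the substring test (`phrase in text`), so it is only a constant-factor improvement on that step; replacing the long string with per-cluster token sets, or splitting the work across a `multiprocessing.Pool()`, are the next steps if it is still too slow.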
https://stackoverflow.com/questions/64739405