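I am trying to filter a dataset by checking whether each row's id is in a list. This approach is too slow. The dataset is an Arrow dataset, loaded from Hugging Face.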
import numpy as np
from datasets import load_dataset, DatasetDict
from collections import Counter
import pyarrow as pa
import pandas as pd
responses = load_dataset('peixian/rtGender', 'responses', split = 'train')
# post_id_test_list contains list of ids
responses_test = responses.filter(lambda x: x['post_id'] in post_id_test_list)
Posted on 2021-09-23 08:51:17
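The dataset you get from load_dataset is not an Arrow Dataset but a Hugging Face Dataset. It is, however, backed by an Arrow table.
Applying a lambda filter is going to be slow. If you want a faster vectorized operation, you could try working on the underlying Arrow table directly: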
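import datasets
import pyarrow as pa
import pyarrow.compute as compute
# Arrow table backing the Hugging Face Dataset
table = responses.data
# Boolean mask: True where post_id appears in post_id_test_list
flags = compute.is_in(table['post_id'], value_set=pa.array(post_id_test_list, pa.int32()))
filtered_table = table.filter(flags)
# Unverified attempt to rebuild a Dataset from the filtered table
filtered_responses = datasets.Dataset(filtered_table, responses.info, responses.split)
Though I am not 100% sure the last line is the correct way to recreate a dataset from an Arrow table.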
Posted on 2021-09-26 18:54:50
That almost got me there. As noted, the last line does not work, but I can convert to pandas, save it out, etc. Thanks!
import pyarrow as pa
import pyarrow.compute as compute
table = responses.data
flags = compute.is_in(table['post_id'], value_set=pa.array(post_id_test_list, pa.int32()))
filtered_table = table.filter(flags)
filtered_table.to_pandas()
https://stackoverflow.com/questions/69290604