我很难用我需要的方式掩盖我的数据帧。我的数据是针对产品的,其中一个产品可以以不同的格式或语言出现。它看起来像是:
import pandas as pd
from numpy.random import choice
prods = [1234,1234,1234,1234,12344,12344,12344,12344,3462,3462,3462,3462,12314,12314,12314,12314,12857,12857,12857,12857]
formats = choice(['Hrd','Elc','Sft'],size=20)
language = choice(['Eng','Spa','Jpn','Chn','Port','Fnch','Rus'],size=20)
restricted = choice(range(5,9),size=20)
df = pd.DataFrame({'products': prods,'formats':formats,'language': language, 'restricted': restricted})
df['instances'] = df['products'].astype(str) + '-' + df['formats'] + '-' + df['language']
md = pd.MultiIndex.from_tuples(list(zip(df['products'],df['instances'])))
df.set_index(md)
df
Out[1]:
formats language products restricted instances
1234 1234-Sft-Port Sft Port 1234 5 1234-Sft-Port
1234-Elc-Jpn Elc Jpn 1234 7 1234-Elc-Jpn
1234-Hrd-Jpn Hrd Jpn 1234 7 1234-Hrd-Jpn
1234-Hrd-Chn Hrd Chn 1234 5 1234-Hrd-Chn
12344 12344-Sft-Chn Sft Chn 12344 5 12344-Sft-Chn
12344-Hrd-Spa Hrd Spa 12344 7 12344-Hrd-Spa
12344-Elc-Jpn Elc Jpn 12344 6 12344-Elc-Jpn
12344-Sft-Port Sft Port 12344 5 12344-Sft-Port
3462 3462-Hrd-Jpn Hrd Jpn 3462 5 3462-Hrd-Jpn
3462-Hrd-Jpn Hrd Jpn 3462 7 3462-Hrd-Jpn
3462-Sft-Port Sft Port 3462 6 3462-Sft-Port
3462-Elc-Jpn Elc Jpn 3462 7 3462-Elc-Jpn
12314 12314-Sft-Rus Sft Rus 12314 5 12314-Sft-Rus
12314-Elc-Spa Elc Spa 12314 5 12314-Elc-Spa
12314-Hrd-Port Hrd Port 12314 7 12314-Hrd-Port
12314-Elc-Port Elc Port 12314 7 12314-Elc-Port
12857 12857-Elc-Jpn Elc Jpn 12857 8 12857-Elc-Jpn
12857-Elc-Spa Elc Spa 12857 5 12857-Elc-Spa
12857-Hrd-Chn Hrd Chn 12857 5 12857-Hrd-Chn
12857-Sft-Port Sft Port 12857 7 12857-Sft-Port如何为多个变量屏蔽或索引?我想指定这样的东西:“选择一个产品,其中一个电子格式是西班牙语,另一个格式是俄罗斯精装书”。我不能像df[(df['language'] == 'Spa') & (df['format'] == 'Elc')]那样简单地屏蔽我的数据框架,因为这不会过滤包含精装的产品的其他格式。
我使用了复杂的groupby lambda函数,但是对于大数据帧(我的是200,000行),这是非常慢的:
mask = df.groupby('products')
mask.apply(lambda x:
'spa' in x['formats'].values and
'Hrd' in x[x['language']=='Rus']['formats'].values
)我已经研究过df.query()和许多其他方法/函数,但似乎找不到一种方法来与我的数据文件进行交互,因为产品需要分组。有更好的办法吗?
发布于 2018-08-22 22:16:00
我无法重现您的确切结果,因为您没有使用确定性的随机种子,但我可以使用‘或’运算符|通过两个交叉点的联合进行索引。
import pandas as pd
from numpy.random import RandomState
...:
...: rand = RandomState(4321)
...: prods = [1234,1234,1234,1234,12344,12344,12344,12344,3462,3462,3462,3462,12314,1231
...: 4,12314,12314,12857,12857,12857,12857]
...: formats = rand.choice(['Hrd','Elc','Sft'],size=20)
...: language = rand.choice(['Eng','Spa','Jpn','Chn','Port','Fnch','Rus'],size=20)
...: restricted = rand.choice(range(5,9),size=20)
...: df = pd.DataFrame({'products': prods,'formats':formats,'language': language, 'restr
...: icted': restricted})
...: df['instances'] = df['products'].astype(str) + '-' + df['formats'] + '-' + df['lang
...: uage']
...: md = pd.MultiIndex.from_tuples(list(zip(df['products'],df['instances'])))
...: df2 = df.set_index(md)
...: df2
...:
...:
Out[1]:
products formats language restricted instances
1234 1234-Elc-Spa 1234 Elc Spa 8 1234-Elc-Spa
1234-Sft-Rus 1234 Sft Rus 8 1234-Sft-Rus
1234-Hrd-Spa 1234 Hrd Spa 7 1234-Hrd-Spa
1234-Sft-Spa 1234 Sft Spa 7 1234-Sft-Spa
12344 12344-Hrd-Spa 12344 Hrd Spa 8 12344-Hrd-Spa
12344-Sft-Rus 12344 Sft Rus 5 12344-Sft-Rus
12344-Elc-Fnch 12344 Elc Fnch 7 12344-Elc-Fnch
12344-Elc-Spa 12344 Elc Spa 6 12344-Elc-Spa
3462 3462-Elc-Fnch 3462 Elc Fnch 8 3462-Elc-Fnch
3462-Sft-Jpn 3462 Sft Jpn 6 3462-Sft-Jpn
3462-Hrd-Port 3462 Hrd Port 6 3462-Hrd-Port
3462-Sft-Eng 3462 Sft Eng 8 3462-Sft-Eng
12314 12314-Elc-Spa 12314 Elc Spa 7 12314-Elc-Spa
12314-Hrd-Spa 12314 Hrd Spa 7 12314-Hrd-Spa
12314-Elc-Fnch 12314 Elc Fnch 7 12314-Elc-Fnch
12314-Hrd-Port 12314 Hrd Port 5 12314-Hrd-Port
12857 12857-Hrd-Port 12857 Hrd Port 7 12857-Hrd-Port
12857-Sft-Rus 12857 Sft Rus 5 12857-Sft-Rus
12857-Elc-Rus 12857 Elc Rus 6 12857-Elc-Rus
12857-Elc-Jpn 12857 Elc Jpn 8 12857-Elc-Jpn当然还有一个更优雅的解决方案(或者更短的黑客,比如将products和formats中的字符串连接起来,并对结果进行过滤),但这是有效的:
filter_df = df2[((df2.formats == 'Elc') & (df2.language == 'Spa')) | ((df2.formats == 'Sft') & (df2.language == 'Rus'))]
filter_groups = filter_df.groupby(level=0)['products'].count()
filter_index = filter_groups[filter_groups > 1].index
df3 = df2[df2.index.get_level_values(0).isin(filter_index)]
Out[3]:
products formats language restricted instances
1234 1234-Elc-Spa 1234 Elc Spa 8 1234-Elc-Spa
1234-Sft-Rus 1234 Sft Rus 8 1234-Sft-Rus
1234-Hrd-Spa 1234 Hrd Spa 7 1234-Hrd-Spa
1234-Sft-Spa 1234 Sft Spa 7 1234-Sft-Spa
12344 12344-Hrd-Spa 12344 Hrd Spa 8 12344-Hrd-Spa
12344-Sft-Rus 12344 Sft Rus 5 12344-Sft-Rus
12344-Elc-Fnch 12344 Elc Fnch 7 12344-Elc-Fnch
12344-Elc-Spa 12344 Elc Spa 6 12344-Elc-Spa您已经要求选择产品;如果只想缩小到与产品匹配的那些实例,则需要再次对这些结果进行筛选。
https://stackoverflow.com/questions/49160285
复制相似问题