我有一个50000农民的数据集,他们在一些村庄种植庄稼。我必须弄清楚有多少农民在同一调查数量的土地和有多少他的作物面积产出图像附加。
这是我的虚拟数据集
df
Out[5]:
Name Village Survey_no Land_Area
0 Farmer_1 Village_1 26 0.33
1 Farmer_1 Village_1 26 0.40
2 Farmer_2 Village_1 26 0.30
3 Farmer_2 Village_1 26 0.40
4 Farmer_2 Village_1 26 0.50
5 Farmer_3 Village_1 26 0.52
6 Farmer_3 Village_1 26 0.40
7 Farmer_4 Village_1 151 0.23
8 Farmer_5 Village_1 151 0.25
9 Farmer_5 Village_1 151 0.10这是实际需要的输出

以下是我到目前为止所拥有的:
df = (df.set_index(['Village','Survey_no', df.groupby(['Village','Survey_no']).cumcount().add(1)]).unstack().sort_index(axis=1, level=1))
df.columns = ['{}-{}'.format(x, y) for x, y in df.columns]
df = df.reset_index()
df
Village Survey_no Land_Area-1 ... Name-6 Land_Area-7 Name-7
0 Village_1 26 0.33 ... Farmer_3 0.4 Farmer_3
1 Village_1 151 0.23 ... NaN NaN NaN产出是不正确的,因为我没有得到实际农民明智的土地总面积和在同一土地上的农民人数。
发布于 2022-05-31 17:37:22
更新:
dfs= df.groupby(['Name', 'Village', 'Survey_no']).agg('sum')
dfs = dfs.reset_index(level=0).set_index([dfs.groupby(['Village', 'Survey_no']).cumcount() + 1], append=True)\
.unstack().sort_index(level=1, axis=1)
dfs.columns = [f'{i}_{j}' for i, j in dfs.columns]
dfs = dfs.assign(Total_Land_Area=dfs.filter(like='Land_Area').sum(axis=1))
dfs输出:
Land_Area_1 Name_1 Land_Area_2 Name_2 Land_Area_3 Name_3 Total_Land_Area
Village Survey_no
Village_1 26 0.73 Farmer_1 1.20 Farmer_2 0.92 Farmer_3 2.85
151 0.23 Farmer_4 0.35 Farmer_5 NaN NaN 0.58试试这个:
cnt = df.groupby(['Village', 'Survey_no']).cumcount()+1
dfs= df.groupby(['Village', 'Survey_no', cnt]).agg({'Name':'first',
'Land_Area':'sum'})\
.unstack()\
.sort_index(level=1, axis=1)
dfs = dfs.assign(Total_Land_Area=dfs.filter(like='Land_Area').sum(axis=1))
dfs.columns = [f'{i}_{j}' if j else f'{i}' for i, j in dfs.columns]
dfs输出:
Land_Area_1 Name_1 ... Name_7 Total_Land_Area
Village Survey_no ...
Village_1 26 0.33 Farmer_1 ... Farmer_3 2.85
151 0.23 Farmer_4 ... NaN 0.58
[2 rows x 15 columns]发布于 2022-05-31 12:46:42
只有经验和力量才能做到这一点。如何加入bbb到aaa,想到超复杂的解决方案。但我不喜欢。
bbb = df.groupby(['Name'])['Land_Area'].aggregate(['sum'])
aaa = df.groupby(['Village', 'Survey_no']).aggregate({'Land_Area': 'sum', 'Name': 'nunique'}).reset_index()
aaa = aaa.rename(columns={"Name": "No.of Farmers"})输出bbb
sum
Name
Farmer_1 0.73
Farmer_2 1.20
Farmer_3 0.92
Farmer_4 0.23
Farmer_5 0.35输出aaa
Village Survey_no Land_Area No.of Farmers
0 Village_1 26 2.85 3
1 Village_1 151 0.58 2https://stackoverflow.com/questions/72447604
复制相似问题