I have an xlsx file containing survey data, with one column per question, like this:
df = pd.DataFrame({
'Question 1': ['5-6 hours', '6-7 hours', '9-10 hours'],
'Question 2': ['Very restful', 'Somewhat restful', 'Somewhat restful'],
'Question 3': ['[Home (dorm; apartment)]', '[Vehicle;None of the above; Other]', '[Campus;Home (dorm; apartment);Vehicle]'],
'Question 4': ['[Family;No one; alone]', '[Classmates; students;Family;No one; alone]', '[Family]'],
})
>>> df
Question 1 Question 2 Question 3 Question 4
5-6 hours Very restful [Home (dorm; apartment)] [Family;No one; alone]
6-7 hours Somewhat restful [Vehicle;None of the above; Other] [Classmates; students;Family;No one; alone]
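9-10 hours Somewhat restful [Campus;Home (dorm; apartment);Vehicle] [Family]
Questions 3 and 4 are checkbox-style inputs that allow multiple answers. How can I get value counts for each individual answer option, rather than value counts for the whole cell?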
For example:
Question 4
Family 3
No one; alone 2
Classmates; students 1
This is what I am doing at the moment:
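import os
import pandas as pd

files = os.listdir()
for filename in files:
    if filename.endswith(".xlsx"):
        df = pd.read_excel(filename)
        for column in df:
            # value_counts() here counts whole cells, not individual options
            x = pd.Series(df[column].values).value_counts()
            print(x)
However, this does not let me separate the cells that contain multiple answers. Thanks!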
Answered 2020-01-07 03:24:59
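This is only a partial answer, because I am not sure how your data should be parsed. For example, if the semicolon is used as the delimiter in Question 3, the option 'Home (dorm; apartment)' is itself split into ['Home (dorm', ' apartment)'].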
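One possible way around this is a minimal sketch that rests on an assumption taken only from the sample data: the semicolon that separates two options is never followed by a space, while a semicolon inside an option always is. If the real export does not follow that pattern, this will not work as is. The helper name `option_counts` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Question 3': ['[Home (dorm; apartment)]',
                   '[Vehicle;None of the above; Other]',
                   '[Campus;Home (dorm; apartment);Vehicle]'],
    'Question 4': ['[Family;No one; alone]',
                   '[Classmates; students;Family;No one; alone]',
                   '[Family]'],
})


def option_counts(column):
    """Count individual checkbox options in a column of '[a;b;c]' strings.

    Assumption: ';' directly followed by whitespace belongs to the option
    text, and any other ';' separates two options.
    """
    options = (
        column.str.strip('[]')        # drop the surrounding brackets
              .str.split(r';(?!\s)')  # split only on ';' not followed by whitespace
              .explode()              # one row per selected option
              .str.strip()
    )
    return options.value_counts()


q3_counts = option_counts(df['Question 3'])
q4_counts = option_counts(df['Question 4'])
print(q4_counts)
```

On the sample data this counts 'Family' three times, 'No one; alone' twice and 'Classmates; students' once. Splitting on every semicolon, by contrast, breaks those options apart: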
>>> pd.Series([choice.strip()
...            for choice in df['Question 4'].str[1:-1].str.split(';').sum()]
...           ).value_counts()
Family 3
alone 2
No one 2
Classmates 1
students 1
dtype: int64
Answered 2020-01-07 03:26:35
Do you mean groupby? https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/
df1 = df.groupby('Question 4')
or groupby('...').agg(...)
https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
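Note that grouping on the raw column groups by the whole cell string, so each combination of answers becomes its own group. A rough sketch of how groupby could still be used, assuming the cells are first reshaped to one option per row and that option separators are semicolons not followed by a space (true for the sample data, not verified for the real file); the column name `option` is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Question 4': ['[Family;No one; alone]',
                   '[Classmates; students;Family;No one; alone]',
                   '[Family]'],
})

# Reshape to long form: one row per selected option.
# Assumption: ';' followed by whitespace is part of the option text.
long_df = (
    df['Question 4']
    .str.strip('[]')
    .str.split(r';(?!\s)')
    .explode()
    .str.strip()
    .to_frame(name='option')
)

# Each group now corresponds to a single answer option.
counts = long_df.groupby('option').size().sort_values(ascending=False)
print(counts)
```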
https://stackoverflow.com/questions/59617712