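I received a tab-delimited file from a server that outputs the answers to each question per respondent. I would like to import the data into a pandas DataFrame where the columns are the questions and the rows are each respondent's answers. Here is what one respondent looks like: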
[2072] Anonymous
Q-0 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.14 Student (Graduate/ Undergraduate)
Q-1 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00
Q-1 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 1|1|1|1|4|
Q-2 [01] Sat 25 May 2013 7:43 PM UTC +0000 1.00 1-3
Q-3 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.50 Male
Q-4 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.33 18-24
Q-5 [01] Sat 25 May 2013 7:43 PM UTC +0000 1.00
Q-6 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 Prefer not to answer
Q-7 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.50 Yes
Q-8 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.13 Bachelor's Degree
Q-9 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 Other
Q-10 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 Mathematics
Q-11 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.33 High school
Q-11 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.33 College (introductory courses)
Q-12 [01] Sat 25 May 2013 7:43 PM UTC +0000 1.00 Professional
Q-13 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.50 Mac OS X
Q-14 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.25 Every week
Q-15 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 A test that proves or disproves of some abstract theory about the world
Q-16 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00
Q-17 [01] Sat 25 May 2013 7:43 PM UTC +0000 2.00 Yes
Q-18 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00
Q-19 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.20 Timely feedback from the instructor
Q-20 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00
There is a carriage return between each respondent's answers. Thanks for your help!
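Posted on 2013-05-30 23:28:02
The important step is delineating each respondent's block. What if you rewrote the file, prepending the respondent's ID to each line? For example, in the case of "Anonymous" here, I see "2072".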
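import re

with open('filename') as src, open('new_file', 'w') as f:
    for line in src:
        # line might be like [####] Student_Name or Q-...
        m = re.match(r'\[(\d+)\] .*', line)
        if m:
            # Line is like [####] Student_Name.
            respondent_id = m.group(1)
            continue
        if not line.strip():
            # Blank separator line between respondents.
            continue
        # Line is like Q-...
        # Write new line like ####<tab>Q-...
        f.write(respondent_id + '\t' + line)

Then load the modified file with pandas read_csv, assigning the first two columns to the index. (They will become a MultiIndex.) Then use unstack to pivot the Q level of the index into columns.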
(Full disclosure: I tested the regex, but I have not tested the whole thing.)
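For what it's worth, here is a rough sketch of the read_csv and unstack step. It makes several assumptions that are not confirmed by the question: that the tabs split each line into ID, question, a bracketed code, timestamp, score and answer (the column names are my own labels), and that repeated questions such as Q-11 should be joined into one cell. The 2073 rows are invented purely to show a second respondent. Because unstack raises on duplicate index entries, this sketch builds the MultiIndex with groupby instead of passing index_col to read_csv.

```python
import io

import pandas as pd

# Hypothetical sample of the rewritten file (respondent ID prepended to each
# answer line). The 2073 rows are made up for illustration only.
sample = (
    "2072\tQ-0\t[01]\tSat 25 May 2013 7:43 PM UTC +0000\t0.14\tStudent\n"
    "2072\tQ-3\t[01]\tSat 25 May 2013 7:43 PM UTC +0000\t0.50\tMale\n"
    "2072\tQ-11\t[01]\tSat 25 May 2013 7:43 PM UTC +0000\t0.33\tHigh school\n"
    "2072\tQ-11\t[01]\tSat 25 May 2013 7:43 PM UTC +0000\t0.33\tCollege\n"
    "2073\tQ-0\t[01]\tSun 26 May 2013 9:10 AM UTC +0000\t0.14\tFaculty\n"
    "2073\tQ-3\t[01]\tSun 26 May 2013 9:10 AM UTC +0000\t0.50\tFemale\n"
)

# Assumed column layout; adjust to wherever the tabs really fall in the export.
columns = ["respondent", "question", "code", "timestamp", "score", "answer"]
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=columns)

# A question can appear more than once per respondent (e.g. Q-11), so join
# repeated answers before reshaping. Rows with no answer text are dropped.
answers = (
    df.dropna(subset=["answer"])
      .groupby(["respondent", "question"])["answer"]
      .agg("; ".join)
)

# The result is indexed by (respondent, question); pivot the question level
# into columns so that each row is one respondent.
wide = answers.unstack("question")
print(wide)
```

With a real file you would pass the path of the rewritten file to read_csv instead of the StringIO object. If you would rather keep repeated answers separate, you could number them within each group and add that counter as a third index level before unstacking.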
Posted on 2013-05-31 00:21:41
Here is what worked for me:
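import re

f = open('new_file', 'w')
for line in open('filename'):
    # Matches the respondent header at the start of a line, e.g. "[2072]".
    m = re.match(r'\[\d+\]*', line)
    if m:
        # group() returns the whole match, brackets included.
        respondent_id = m.group()
        continue
    f.write(str(respondent_id) + line)
f.close()

https://stackoverflow.com/questions/16839329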