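I have a dataframe with two fields of interest: docID and category. Note that the actual content, along with other fields, is also part of this dataframe.
JAN001 news,sports
JAN212 politics
FEB208 business news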
I am trying to use pandas to create a new dataframe that looks like this:
JAN001 news
JAN001 sports
JAN212 politics
...
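I know I could loop over the data, but I am new to pandas and I assume there is a more efficient way. I have looked at several questions and tried various examples, but none has worked so far. I am also curious whether the index is part of the solution, but I have not explored that route yet. Thanks for any help or suggestions.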
Update: here is the code, with the printed data shown as comments.
{
foo = pd.read_csv("dtu_topic.txt", sep = "\t")
foo = foo[:20]
print foo
# id dtu_docid dtu_topic \
#0 21523 2012-1553 Energy Taxation,State & Local Taxation
#1 21522 2012-1552 Legislation & Policy\Financial Services
#2 25470 2010-0227 Quantitative Economics and Statistics
#3 25477 2010-0215 International Taxation\Asia
#4 21539 2012-1529 Ernst & Young Newsletters\This Week in Tax Reform
#5 25483 2010-0207 State & Local Taxation
#6 21536 2012-1533 Payroll & Employment Tax\State & Local
#7 21537 2012-1532 Payroll & Employment Tax\State & Local
#8 24943 2010-0929 IRS Practice & Procedure,Tax Quality & Risk Ma...
#9 25500 2010-0185 Financial Services Industries
#10 21542 2012-1524 Payroll & Employment Tax\State & Local
#11 21551 2012-1507 Personal Finance
#12 25523 2010-0159 International Taxation\Europe
#13 21549 2012-1510 Payroll & Employment Tax\State & Local
#14 21557 2012-1501 Payroll & Employment Tax\Federal,Payroll & Emp...
#15 21558 2012-1498 Accounting Methods & Inventories
#16 25567 2010-0104 Real Estate
#17 25529 2010-0152 Financial Services Industries,International Ta...
#18 21564 2012-1495 IRS Practice & Procedure
#19 21563 2012-1494 Payroll & Employment Tax\Federal
#parse dtu_topic into a list of categories
foo["dtu_topic_split"] = foo.dtu_topic.str.replace(',','\\')
foo["dtu_topic_split"] = foo.dtu_topic_split.str.split('\\').tolist()
# from example on stack overflow - get syntax error
dcm = foo[,list(dtu_docid = dtu_docid,
dtu_topic = unlist(dtu_topic.split),
by = 1:nrow(foo)]
#dt.2 <- dt[, list(Probe.Id = Probe.Id,
# Gene.Id = unlist(Gene.Id_split),
# Score.d = Score.d), by = 1:nrow(dt)]
#dcm = unlist(foo.dtu_topic_split)
print dcm}
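Posted on 2013-06-29 16:20:06
It looks like you are trying to convert a frame of lists into something useful (your example actually has only one list, in the column you are interested in).
Try something like this: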
In [100]: from pandas import DataFrame, Series, concat
In [101]: df = DataFrame(dict(A = [['foo','bar','bah']] * 4, B = [['foo','bah']] * 4, C = [['foo']] * 4), index=range(4))
In [102]: df
Out[102]:
                 A           B      C
0  [foo, bar, bah]  [foo, bah]  [foo]
1  [foo, bar, bah]  [foo, bah]  [foo]
2  [foo, bar, bah]  [foo, bah]  [foo]
3  [foo, bar, bah]  [foo, bah]  [foo]
In [103]: concat(dict([ (row[0],row[1].apply(lambda y: Series(y))) for row in df.iterrows() ]))
Out[103]:
       0    1    2
0 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
1 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
2 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
3 A  foo  bar  bah
  B  foo  bah  NaN
  C  foo  NaN  NaN
https://stackoverflow.com/questions/17381582
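For readers on a newer pandas (0.25 or later), `DataFrame.explode` may be a shorter route to the one-row-per-category layout asked for in the question. This is only a minimal sketch, assuming the categories are comma-separated strings; the column names `docID` and `category` and the three sample rows are made up to mirror the question and are not taken from the real file.

```python
import pandas as pd

# Hypothetical sample mirroring the question's docID / category layout.
df = pd.DataFrame({
    "docID": ["JAN001", "JAN212", "FEB208"],
    "category": ["news,sports", "politics", "business news"],
})

# Split each comma-separated string into a list of categories,
# then emit one row per list element, repeating docID as needed.
long_df = (
    df.assign(category=df["category"].str.split(","))
      .explode("category")
      .reset_index(drop=True)
)

print(long_df)
```

For a column like `dtu_topic` in the question, which mixes commas and backslashes as separators, passing a regular-expression pattern such as `r"[,\\]"` to `str.split` is one option worth trying in place of the separate `replace` step.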