首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >如何从带有多个逗号分隔值的列中计数一个单词的实例数?

如何从带有多个逗号分隔值的列中计数一个单词的实例数?
EN

Stack Overflow用户
提问于 2017-06-16 04:01:01
回答 2查看 126关注 0票数 2

所以我基本上是在分析一个调查数据集。数据集如下所示:

代码语言:javascript
复制
   Respondent       Country           HaveWorkedLanguage
0     1             United States     Swift
1     2             United Kingdom    JavaScript; Python; Ruby; SQL
2     3             United Kingdom    Java; PHP; Python
3     4             United States     Matlab; Python; R; SQL
4     5             Switzerland       NaN
5     6             New Zealand       JavaScript; PHP; Rust

如您所见,列HaveWorkedLanguage在每个单元格中都有具有单个值或多个值的实例。我想做的是分析每个国家最著名的语言。为此,我第一次表演了这样的群:

代码语言:javascript
复制
stu=students.groupby(['Country','HaveWorkedLanguage'])['Respondent'].count().reset_index()
stu.columns=[['Country','Known_Languages','Count']]

我得到了这样的数据:

代码语言:javascript
复制
    Country         Known_Languages                             Count
0   Afghanistan     Assembly; C; C++; Hack; Java; JavaScript    1
1   Afghanistan     C                                           1
2   Albania         C#; Java; Python; SQL                       1
3   Albania         C++; C#; Java; JavaScript; PHP              1
4   Albania         C++; C#; JavaScript; SQL                    1
5   Albania         C++; Java; JavaScript; PHP; SQL             2

我实际上想要一个数据显示国家和每一种语言的计数,这样最高的数字显示了最著名的语言。数据格式应该如下所示:

代码语言:javascript
复制
      Country           Known_Languages     Count
0     United States    Java                 100
1     United States    Python               80

早些时候,我能够使用以下代码找到整个著名的语言:

代码语言:javascript
复制
for i in ['C','C++','C#','Java','Python','R','JavaScript']:
    print(i,':',survey['HaveWorkedLanguage'].apply(lambda x: i in str(x).split('; ')).value_counts()[1]) 

产出如下:

代码语言:javascript
复制
C : 6974
C++ : 8155
C# : 12476
Java : 14524
Python : 11704
R : 1634
JavaScript : 22875

但现在我也想把这个国家和它联系起来。我该怎么做?

EN

回答 2

Stack Overflow用户

回答已采纳

发布于 2017-06-16 04:09:19

代码语言:javascript
复制
hwl = students.HaveWorkedLanguage
cty = students.Country
stu = hwl.str.get_dummies('; ').groupby(cty).sum()
pd.concat(
    [stu.idxmax(1), stu.max(1)],
    axis=1, keys=['Lang', 'Count']
)

                      Lang  Count
Country                          
New Zealand     JavaScript      1
Switzerland           Java      0
United Kingdom      Python      2
United States       Matlab      1

项目/kill

numpy技术

代码语言:javascript
复制
mask = students.HaveWorkedLanguage.notnull().values
fc, uc = pd.factorize(students.Country.values.astype(str))
hwl = students.HaveWorkedLanguage.values.astype(str)
lol = np.core.defchararray.split(hwl, '; ')
lol[np.flatnonzero(~mask)] = [[]]
i = fc.repeat([len(l) for l in lol])
j, ul = pd.factorize(np.concatenate(lol))
n = uc.size
m = ul.size
counts = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
x = counts.argmax(1)
pd.DataFrame(
    np.column_stack([ul[x], counts[np.arange(n), x]]),
    uc, ['Lang', 'Count'])

                      Lang Count
United States        Swift     1
United Kingdom      Python     2
Switzerland          Swift     0
New Zealand     JavaScript     1

定时

代码语言:javascript
复制
%%timeit
hwl = students.HaveWorkedLanguage
cty = students.Country
stu = hwl.str.get_dummies('; ').groupby(cty).sum()
pd.concat(
    [stu.idxmax(1), stu.max(1)],
    axis=1, keys=['Lang', 'Count']
)
100 loops, best of 3: 3.22 ms per loop

%%timeit
mask = students.HaveWorkedLanguage.notnull().values
fc, uc = pd.factorize(students.Country.values.astype(str))
hwl = students.HaveWorkedLanguage.values.astype(str)
lol = np.core.defchararray.split(hwl, '; ')
lol[np.flatnonzero(~mask)] = [[]]
i = fc.repeat([len(l) for l in lol])
j, ul = pd.factorize(np.concatenate(lol))
n = uc.size
m = ul.size
counts = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
x = counts.argmax(1)
pd.DataFrame(np.column_stack([ul[x], counts[np.arange(n), x]]), uc, ['Lang', 'Count'])
1000 loops, best of 3: 570 µs per loop
票数 1
EN

Stack Overflow用户

发布于 2017-06-16 06:36:41

我做了很多步骤,所以也许有人有更多的丙酮溶液:

代码语言:javascript
复制
df = pd.DataFrame({"Country":["UK", "UK", "UK", "USA", "USA", "USA"], "Languages":["Python" , "Python, PHP, Java", "Java", "Python", "Java", "Python, Javascript"]})
df

    Country Languages
0   UK  Python
1   UK  Python, PHP, Java
2   UK  Java
3   USA Python
4   USA Java
5   USA Python, Javascript

df2 = df.Languages.apply(lambda row: pd.Series(row.split(","))).copy() # split the column
df3 = pd.get_dummies(df2, prefix_sep="", prefix="") # get dummies
df3


    Java    Python  Javascript  PHP Java
0   0   1   0   0   0
1   0   1   0   1   1
2   1   0   0   0   0
3   0   1   0   0   0
4   1   0   0   0   0
5   0   1   1   0   0

df4 = pd.merge(df[["Country"]], df3,  left_index=True, right_index=True)
df4

    Country Java    Python  Javascript  PHP Java
0   UK  0   1   0   0   0
1   UK  0   1   0   1   1
2   UK  1   0   0   0   0
3   USA 0   1   0   0   0
4   USA 1   0   0   0   0
5   USA 0   1   1   0   0

df5 = df4.groupby("Country").sum().reset_index().copy() # sum it
df5

    Country Java    Python  Javascript  PHP Java
0   UK  1   2   0   1   1
1   USA 1   2   1   0   0

df6 = pd.melt(df5, id_vars=["Country"], var_name="Language", value_name="Value") # columns to rows
df6

    Country Language    Value
0   UK  Java    1
1   USA Java    1
2   UK  Python  2
3   USA Python  2
4   UK  Javascript  0
5   USA Javascript  1
6   UK  PHP 1
7   USA PHP 0
8   UK  Java    1
9   USA Java    0

df7 = df6.sort_values(by=["Country", "Value"], ascending=False) # sort
df7


    Country Language    Value
3   USA Python  2
1   USA Java    1
5   USA Javascript  1
7   USA PHP 0
9   USA Java    0
2   UK  Python  2
0   UK  Java    1
6   UK  PHP 1
8   UK  Java    1
4   UK  Javascript  0
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/44580587

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档