我在dataframe中有一个列,其类型是object。该列同时包含数组和字符串,其唯一值如下:
array(['[nan]', "['3. Medical and Health Sciences']",
"['1. Natural Sciences']", "['2. Engineering and Technology']",
"['1. Natural Sciences' '3. Medical and Health Sciences']",
"['3. Medical and Health Sciences' '1. Natural Sciences']",
"['5. Social Sciences']",
"['1. Natural Sciences' '4. Agricultural Sciences'\n '3. Medical and Health Sciences']",
"['1. Natural Sciences' '6. Humanities']",
"['4. Agricultural Sciences']",
"['1. Natural Sciences' '2. Engineering and Technology']",
"['2. Engineering and Technology' '1. Natural Sciences']",
"['6. Humanities']",
"['3. Medical and Health Sciences' '6. Humanities']",
"['4. Agricultural Sciences' '3. Medical and Health Sciences']",
"['3. Medical and Health Sciences' '2. Engineering and Technology']",
"['3. Medical and Health Sciences' '4. Agricultural Sciences']",
"['2. Engineering and Technology' '3. Medical and Health Sciences']",
"['1. Natural Sciences' '4. Agricultural Sciences']",
"['3. Medical and Health Sciences' '5. Social Sciences']",
"['4. Agricultural Sciences' '2. Engineering and Technology']",
"['6. Humanities' '1. Natural Sciences']",
"['6. Humanities' '2. Engineering and Technology']",
"['2. Engineering and Technology' '5. Social Sciences']",
"['1. Natural Sciences' '5. Social Sciences']",
"['3. Medical and Health Sciences' '2. Engineering and Technology'\n '1. Natural Sciences']",
"['2. Engineering and Technology' '3. Medical and Health Sciences'\n '1. Natural Sciences']",
"['4. Agricultural Sciences' '1. Natural Sciences']",
"['3. Medical and Health Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
"['5. Social Sciences' '1. Natural Sciences']",
"['2. Engineering and Technology' '6. Humanities']",
"['2. Engineering and Technology' '4. Agricultural Sciences']",
"['5. Social Sciences' '4. Agricultural Sciences']",
"['3. Medical and Health Sciences' '5. Social Sciences'\n '1. Natural Sciences']",
"['6. Humanities' '3. Medical and Health Sciences']",
"['6. Humanities' '5. Social Sciences']",
"['5. Social Sciences' '3. Medical and Health Sciences']",
"['5. Social Sciences' '2. Engineering and Technology']",
"['5. Social Sciences' '6. Humanities']",
"['1. Natural Sciences' '3. Medical and Health Sciences'\n '2. Engineering and Technology']",
"['2. Engineering and Technology' '4. Agricultural Sciences'\n '1. Natural Sciences']",
"['1. Natural Sciences' '2. Engineering and Technology' '6. Humanities']",
"['3. Medical and Health Sciences' '1. Natural Sciences'\n '5. Social Sciences']",
"['4. Agricultural Sciences' '1. Natural Sciences'\n '3. Medical and Health Sciences']",
"['1. Natural Sciences' '2. Engineering and Technology'\n '3. Medical and Health Sciences']",
"['2. Engineering and Technology' '1. Natural Sciences'\n '3. Medical and Health Sciences']",
"['4. Agricultural Sciences' '5. Social Sciences']",
"['2. Engineering and Technology' '1. Natural Sciences'\n '5. Social Sciences']",
"['2. Engineering and Technology' '1. Natural Sciences'\n '4. Agricultural Sciences']",
"['2. Engineering and Technology' '5. Social Sciences'\n '1. Natural Sciences']",
"['1. Natural Sciences' '6. Humanities' '3. Medical and Health Sciences']",
"['4. Agricultural Sciences' '3. Medical and Health Sciences'\n '1. Natural Sciences']",
"['3. Medical and Health Sciences' '5. Social Sciences' '6. Humanities'\n '2. Engineering and Technology']",
"['3. Medical and Health Sciences' '1. Natural Sciences'\n '4. Agricultural Sciences']",
"['6. Humanities' '4. Agricultural Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
"['1. Natural Sciences' '3. Medical and Health Sciences'\n '5. Social Sciences' '2. Engineering and Technology']",
"['1. Natural Sciences' '2. Engineering and Technology'\n '5. Social Sciences']",
"['6. Humanities' '2. Engineering and Technology' '1. Natural Sciences']",
"['1. Natural Sciences' '2. Engineering and Technology'\n '4. Agricultural Sciences']",
"['1. Natural Sciences' '3. Medical and Health Sciences'\n '5. Social Sciences']",
"['3. Medical and Health Sciences' '6. Humanities'\n '2. Engineering and Technology']",
"['3. Medical and Health Sciences' '1. Natural Sciences' '6. Humanities']",
"['2. Engineering and Technology' '6. Humanities' '1. Natural Sciences']",
"['3. Medical and Health Sciences' '4. Agricultural Sciences'\n '1. Natural Sciences']",
"['1. Natural Sciences' '5. Social Sciences'\n '2. Engineering and Technology']",
"['1. Natural Sciences' '3. Medical and Health Sciences'\n '2. Engineering and Technology' '4. Agricultural Sciences']",
"['1. Natural Sciences' '3. Medical and Health Sciences' '6. Humanities']",
"['1. Natural Sciences' '5. Social Sciences'\n '3. Medical and Health Sciences']",
"['6. Humanities' '3. Medical and Health Sciences' '1. Natural Sciences']",
"['1. Natural Sciences' '3. Medical and Health Sciences'\n '4. Agricultural Sciences']",
"['1. Natural Sciences' '6. Humanities' '2. Engineering and Technology']",
"['4. Agricultural Sciences' '2. Engineering and Technology'\n '1. Natural Sciences']",
"['5. Social Sciences' '3. Medical and Health Sciences'\n '1. Natural Sciences']",
"['6. Humanities' '1. Natural Sciences' '2. Engineering and Technology']",
"['3. Medical and Health Sciences' '6. Humanities' '1. Natural Sciences']",
"['5. Social Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
"['2. Engineering and Technology' '4. Agricultural Sciences'\n '6. Humanities']",
"['1. Natural Sciences' '5. Social Sciences'\n '2. Engineering and Technology' '3. Medical and Health Sciences']",
"['6. Humanities' '3. Medical and Health Sciences'\n '2. Engineering and Technology']",
"['4. Agricultural Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
"['6. Humanities' '5. Social Sciences' '1. Natural Sciences']",
"['5. Social Sciences' '1. Natural Sciences'\n '3. Medical and Health Sciences']",
"['6. Humanities' '1. Natural Sciences' '3. Medical and Health Sciences']",
"['5. Social Sciences' '2. Engineering and Technology'\n '1. Natural Sciences']",
"['6. Humanities' '1. Natural Sciences' '4. Agricultural Sciences']",
"['1. Natural Sciences' '4. Agricultural Sciences'\n '2. Engineering and Technology']",
"['4. Agricultural Sciences' '1. Natural Sciences'\n '3. Medical and Health Sciences' '2. Engineering and Technology']",
"['6. Humanities' '3. Medical and Health Sciences' '1. Natural Sciences'\n '2. Engineering and Technology']",
"['4. Agricultural Sciences' '5. Social Sciences' '1. Natural Sciences']",
"['2. Engineering and Technology' '5. Social Sciences'\n '4. Agricultural Sciences']",
"['4. Agricultural Sciences' '2. Engineering and Technology'\n '6. Humanities']",
"['3. Medical and Health Sciences' '1. Natural Sciences'\n '5. Social Sciences' '2. Engineering and Technology']",
"['2. Engineering and Technology' '1. Natural Sciences' '6. Humanities']",
"['3. Medical and Health Sciences' '6. Humanities'\n '2. Engineering and Technology' '5. Social Sciences'\n '1. Natural Sciences']",
"['4. Agricultural Sciences' '3. Medical and Health Sciences'\n '2. Engineering and Technology' '1. Natural Sciences']",
"['3. Medical and Health Sciences' '2. Engineering and Technology'\n '4. Agricultural Sciences' '1. Natural Sciences']",
"['2. Engineering and Technology' '5. Social Sciences'\n '3. Medical and Health Sciences']",
"['3. Medical and Health Sciences' '4. Agricultural Sciences'\n '2. Engineering and Technology']",
"['6. Humanities' '2. Engineering and Technology'\n '3. Medical and Health Sciences']",
"['4. Agricultural Sciences' '6. Humanities']",
"['1. Natural Sciences' '3. Medical and Health Sciences'\n '4. Agricultural Sciences' '2. Engineering and Technology']",
"['3. Medical and Health Sciences' '2. Engineering and Technology'\n '6. Humanities']",
"['2. Engineering and Technology' '1. Natural Sciences'\n '3. Medical and Health Sciences' '4. Agricultural Sciences']",
"['1. Natural Sciences' '6. Humanities' '3. Medical and Health Sciences'\n '2. Engineering and Technology']",
"['4. Agricultural Sciences' '1. Natural Sciences' '5. Social Sciences']",
"['2. Engineering and Technology' '5. Social Sciences' '6. Humanities']",
"['1. Natural Sciences' '2. Engineering and Technology'\n '3. Medical and Health Sciences' '6. Humanities']",
"['1. Natural Sciences' '4. Agricultural Sciences' '5. Social Sciences']",
"['3. Medical and Health Sciences' '5. Social Sciences'\n '2. Engineering and Technology']",
"['5. Social Sciences' '1. Natural Sciences' '4. Agricultural Sciences']",
"['2. Engineering and Technology' '3. Medical and Health Sciences'\n '1. Natural Sciences' '5. Social Sciences']",
"['2. Engineering and Technology' '6. Humanities'\n '3. Medical and Health Sciences' '1. Natural Sciences']",
"['5. Social Sciences' '6. Humanities' '3. Medical and Health Sciences']",
"['2. Engineering and Technology' '6. Humanities'\n '3. Medical and Health Sciences']",
"['5. Social Sciences' '6. Humanities' '1. Natural Sciences']"],
dtype=object)也许一个MWE来复制数据是这样的:
# initialize list of lists
data = [[1, [7,3], [1,1], "['5. Social Sciences' '6. Humanities' '1. Natural Sciences']"], [2, [1,5], [2,1], "['2. Engineering and Technology' '5. Social Sciences'\n '4. Agricultural Sciences']"], [3, [1,2,6], [2,0,2], '[nan]'],[5, [1,2], [2,0], "['1. Natural Sciences']"]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['docdb', 'cited_patents','dist_cited_patents','Fields'])现在,我想获得的实际上是两个变量:一个是字符串列表(new_var1 ),另一个是简单字符串(下一个结果db中的new_var2),如下所示:
docdb. cited_patents dist_cited_patents new_var1
1 [7, 3] [1, 1] ['Social Sciences','Humanities','Natural Sciences']
2 [1, 5] [2, 1] ['Engineering and Technology','Social Sciences', 'Agricultural Sciences']
3 [1, 2, 6] [2, 0, 2] []
5 [1, 2] [2, 0] ['Natural Sciences']var2关心的是什么
docdb. cited_patents dist_cited_patents new_var2
1 [7, 3] [1, 1] 'Social Sciences, Humanities, Natural Sciences'
2 [1, 5] [2, 1] 'Engineering and Technology,Social Sciences,Agricultural Sciences'
3 [1, 2, 6] [2, 0, 2] ''
5 [1, 2] [2, 0] 'Natural Sciences'非常感谢
发布于 2022-10-20 17:20:31
我将使用regex模式查找所有匹配的事件:
df['new_var1'] = df['Fields'].str.findall(r"'\d+\.\s*(.*?)'")
df['new_var2'] = df['new_var1'].str.join(', ') docdb cited_patents dist_cited_patents Fields new_var1 new_var2
0 1 [7, 3] [1, 1] ['5. Social Sciences' '6. Humanities' '1. Natural Sciences'] [Social Sciences, Humanities, Natural Sciences] Social Sciences, Humanities, Natural Sciences
1 2 [1, 5] [2, 1] ['2. Engineering and Technology' '5. Social Sciences'\n '4. Agricultural Sciences'] [Engineering and Technology, Social Sciences, Agricultural Sciences] Engineering and Technology, Social Sciences, Agricultural Sciences
2 3 [1, 2, 6] [2, 0, 2] [nan] []
3 5 [1, 2] [2, 0] ['1. Natural Sciences'] [Natural Sciences] Natural Scienceshttps://stackoverflow.com/questions/74143758
复制相似问题