How to extract acronyms and abbreviations from pandas data?

Stack Overflow user
Asked 2021-12-20 08:08:49

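I have a pandas DataFrame with one column containing text data. I want to extract all unique acronyms and abbreviations from that text column.

So far, I have a function that extracts all acronyms and abbreviations from a given text: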

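import re

def extract_acronyms_abbreviations(text):
    eaa = {}
    # Every parenthesised group is treated as an acronym or abbreviation.
    for match in re.finditer(r"\((.*?)\)", text):
        start_index = match.start()
        abbr = match.group(1)
        size = len(abbr)
        # Take as many preceding words as the acronym has characters.
        words = text[:start_index].split()[-size:]
        definition = " ".join(words)

        eaa[abbr] = definition

    return eaa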

Example call, using the sample text a defined below:

extract_acronyms_abbreviations(a)

Output:

{'FHH': 'family health history', 'NP': 'nurse practitioner'}

I would like to apply this function to the text column and extract all unique acronyms and abbreviations.

Sample data:

s = """The MLCommons Association, an open engineering consortium dedicated to improving machine learning for everyone, today announced the general availability of the People's Speech Dataset and the Multilingual Spoken Words Corpus (MSWC). This trail-blazing and permissively licensed datasets advance innovation in machine learning research and commercial applications. Also today, the MLCommons Association is issuing a call for participation in the new DataPerf benchmark suite, which measures and encourages innovation in data-centric AI."""
k = """The MLCommons Association is a firm proponent of Data-Centric AI (DCAI), the discipline of systematically engineering the data for AI systems by developing efficient software tools and engineering practices to make dataset creation and curation easier. Our open datasets and tools like DataPerf concretely support the DCAI movement and drive machine learning innovation."""
j = """The key global provider of sustainable packaging solutions has now taken a significant step towards reaching these ambitions by signing two 10-year virtual Power Purchase Agreements (VPPA) with global renewable energy developer BayWa r.e covering its operations in Europe. The agreements form the largest solar VPPA for the packaging industry in Europe, as well as the first major solar VPPA by a Finnish company."""
a = """Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP)."""
import pandas as pd

data = {"text":[s,k,j,a,s,k,j]}
df = pd.DataFrame(data)

Expected output

{'MSWC': 'Multilingual Spoken Words Corpus',
'DCAI': 'proponent of Data-Centric AI',
'VPPA': 'virtual Power Purchase Agreements',
'NP': 'nurse practitioner',
'FHH': 'family health history'}

1 Answer

Stack Overflow user

Answered 2021-12-20 08:36:19

Assuming df['text'] contains the text data to process:

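df["acronyms"] = df["text"].apply(extract_acronyms_abbreviations)
# This creates a new column holding, for each row, the dictionary returned by your function.
# Note: apply it to the "text" column; df.apply(...) would pass whole columns (Series) to the function.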

Now build a master dictionary, for example:

master_dict = dict()
for d in df["acronyms"].values:
    master_dict.update(d)
print(master_dict)
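
For reference, the two steps above can be put together as a minimal end-to-end sketch. It assumes the text lives in a column named "text", reuses the question's function unchanged in behaviour, and uses the question's sample a plus a truncated first sentence of sample k (the truncation is only for brevity). Duplicate rows are harmless here because dict.update overwrites repeated keys, so each acronym appears once in the result.

```python
import re

import pandas as pd


def extract_acronyms_abbreviations(text):
    """Same heuristic as in the question: an N-character acronym in
    parentheses is mapped to the N words that precede it."""
    eaa = {}
    for match in re.finditer(r"\((.*?)\)", text):
        abbr = match.group(1)
        words = text[:match.start()].split()[-len(abbr):]
        eaa[abbr] = " ".join(words)
    return eaa


# Sample `a` from the question, and a truncated first sentence of sample `k`.
a = ("Although family health history (FHH) is commonly accepted as an "
     "important risk factor for common, chronic diseases, it is rarely "
     "considered by a nurse practitioner (NP).")
k = "The MLCommons Association is a firm proponent of Data-Centric AI (DCAI)."

# One row is duplicated on purpose to show that repeats collapse.
df = pd.DataFrame({"text": [a, k, a]})

# Apply the function to the column (a Series), one string per call.
per_row = df["text"].apply(extract_acronyms_abbreviations)

# Merge the per-row dictionaries; for repeated keys the later row wins.
master_dict = {}
for d in per_row:
    master_dict.update(d)

print(master_dict)
```

If two rows define the same acronym differently, this merge silently keeps the last definition seen; whether that is acceptable depends on the data.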
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/70418959
