
Extracting information about educational institutes, grades, years, and degrees from text using Python

Stack Overflow user
Asked 2022-11-23 20:14:50
Answers 1, Views 42, Followers 0, Votes 0

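I want to use NLP in Python to extract information about educational institutes, degrees, year of passing, and grades (CGPA/GPA/percentage) from text. For example, given the input:

NBN Sinhgad School Of Engineering,Pune 2016 - 2020 Bachelor of Engineering Computer Science CGPA: 8.78 Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2014 - 2016 Intermediate-PCM,Economics CBSE Percentage: 88.8 Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2003 - 2014 Matriculation,CBSE CGPA: 8.6 EXPERIENCE

I want the output: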

[{
  "Institute": "NBN Sinhgad School Of Engineering",
  "Degree": "Bachelor of Engineering Computer Science",
  "Grades": "8.78",
  "Year of Passing": "2020"
}, {
  "Institute": "Vidya Bharati Chinmaya Vidyalaya",
  "Degree": "Intermediate-PCM,Economics",
  "Grades": "88.8",
  "Year of Passing": "2016"
}, {
  "Institute": "Vidya Bharati Chinmaya Vidyalaya",
  "Degree": "Matriculation,CBSE",
  "Grades": "8.6",
  "Year of Passing": "2014"
}]

Can this be done without training a custom NER model? Is there a pre-trained model that can do this?

Answers: 1

Stack Overflow user

Answered 2022-11-25 05:26:39

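Yes, you can parse this data without training a custom NER model, but you have to build custom rules for it.

In your case, you can extract the data with regular expressions and pattern identification, for example by relying on the institute always appearing before the year range. If the fields do not come in a fixed order, you will have to go by keywords (such as school, institute, college, and so on). It depends on your data.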
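As a rough sketch of the keyword fallback mentioned above: it assumes an institute name is a run of capitalized words that contains one of a few keywords. The keyword list, `INSTITUTE_RE`, and `find_institutes` are illustrative names chosen here, not part of the original answer.

```python
import re

# Hypothetical keyword list; extend it to cover the institutions you expect.
INSTITUTE_KEYWORDS = ("School", "Institute", "College", "University", "Vidyalaya")

WORD = r"[A-Z][A-Za-z.&']*"  # one capitalized word

INSTITUTE_RE = re.compile(
    rf"\b(?:{WORD}\s+)*"                      # capitalized words before the keyword
    rf"(?:{'|'.join(INSTITUTE_KEYWORDS)})\b"  # the keyword itself
    rf"(?:\s+(?:[Oo]f\s+)?{WORD})*"           # optional tail such as "Of Engineering"
)


def find_institutes(text):
    """Return every capitalized phrase in text that contains an institute keyword."""
    return INSTITUTE_RE.findall(text)


# The fields are deliberately out of order here.
sample = ("CGPA: 8.78, Bachelor of Engineering Computer Science, "
          "2016 - 2020, NBN Sinhgad School Of Engineering, Pune")
print(find_institutes(sample))  # ['NBN Sinhgad School Of Engineering']
```

One limitation of this sketch: if a capitalized phrase (for example a degree name) sits directly before the institute with no comma or other separator, it is absorbed into the match.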

import re

txt = '''NBN Sinhgad School Of Engineering,Pune 2016 - 2020 Bachelor of Engineering Computer Science CGPA: 8.78 
Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2014 - 2016 Intermediate-PCM,Economics CBSE Percentage: 88.8
Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2003 - 2014 Matriculation,CBSE CGPA: 8.6 EXPERIENCE'''

# extract grades
grade_regex = r'(?:\d{1,2}\.\d{1,2})'
grades = re.findall(grade_regex, txt)

# extract years
year_regex = r'(?:\d{4}\s?-\s?\d{4})'
years = re.findall(year_regex, txt)


# function to replace a value in string
def replacer(string, noise_list):
    for v in noise_list:
        string = string.replace(v, ":")
    return string


# extract college
data = replacer(txt, years)
# replace every "word:" label (and the ":" that now marks the year range) with "**"
cleaned_text = re.sub(r"(?:\w+\s?:)", "**", data).split('\n')
college = []
degree = []
for i in cleaned_text:
    split_data = i.split("**")
    college.append(split_data[0].replace(',', '').strip())
    degree.append(split_data[1].strip())
parsed_output = []
for i in range(len(grades)):
    parsed_data = {
        "Institute": college[i],
        "Degree": degree[i],
        "Grades": grades[i],
        # strip() drops the space left over from splitting "2016 - 2020" on '-'
        "Year of Passing": years[i].split('-')[1].strip()
    }
    parsed_output.append(parsed_data)
print(parsed_output)

>>>> [{'Institute': 'NBN Sinhgad School Of Engineering', 'Degree': 'Bachelor of Engineering Computer Science', 'Grades': '8.78', 'Year of Passing': '2020'}, {'Institute': 'Vidya Bharati Chinmaya Vidyalaya', 'Degree': 'Intermediate-PCM,Economics CBSE', 'Grades': '88.8', 'Year of Passing': '2016'}, {'Institute': 'Vidya Bharati Chinmaya Vidyalaya', 'Degree': 'Matriculation,CBSE', 'Grades': '8.6', 'Year of Passing': '2014'}]
Votes: 1
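An alternative sketch, not the answer's method: a single regular expression with named groups can match each record as a whole, so the four fields stay together per record instead of being paired up by list index as in the code above. It assumes every record follows the layout `institute,city start - end degree label: grade`, where the label is CGPA, GPA, or Percentage. `RECORD_RE` and `parse_education` are hypothetical names.

```python
import re

# Assumed record layout: "<institute>,<city> <start> - <end> <degree> <label>: <grade>"
RECORD_RE = re.compile(
    r"(?P<institute>[^,\n]+),\s*(?P<city>\w+)\s+"      # name, then a one-word city
    r"(?P<start>\d{4})\s*-\s*(?P<end>\d{4})\s+"        # year range
    r"(?P<degree>.+?)\s+(?:CGPA|GPA|Percentage)\s*:\s*"  # degree up to the grade label
    r"(?P<grade>\d{1,2}(?:\.\d{1,2})?)"                # grade, with optional decimals
)


def parse_education(text):
    """Return one dict per record that matches the assumed layout."""
    records = []
    for match in RECORD_RE.finditer(text):
        records.append({
            "Institute": match.group("institute").strip(),
            "Degree": match.group("degree").strip(),
            "Grades": match.group("grade"),
            "Year of Passing": match.group("end"),
        })
    return records


txt = '''NBN Sinhgad School Of Engineering,Pune 2016 - 2020 Bachelor of Engineering Computer Science CGPA: 8.78
Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2014 - 2016 Intermediate-PCM,Economics CBSE Percentage: 88.8
Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2003 - 2014 Matriculation,CBSE CGPA: 8.6 EXPERIENCE'''

for record in parse_education(txt):
    print(record)
```

As with the answer's code, the board name (CBSE) stays inside the degree field for the second record; removing it would need an extra rule, for instance a list of known board names.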
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link: https://stackoverflow.com/questions/74552512
