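I want to extract information about the educational institution, degree, year of passing, and grades (CGPA/GPA/percentage) from text using NLP in Python. For example, if I have the input:
NBN Sinhgad School Of Engineering,Pune 2016 - 2020 Bachelor of Engineering Computer Science CGPA: 8.78 Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2014 - 2016 Intermediate-PCM,Economics CBSE Percentage: 88.8 Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2003 - 2014 Matriculation,CBSE CGPA: 8.6 EXPERIENCE
I want the output: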
[{
"Institute": "NBN Sinhgad School Of Engineering",
"Degree": "Bachelor of Engineering Computer Science",
"Grades": "8.78",
"Year of Passing": "2020"
}, {
"Institute": "Vidya Bharati Chinmaya Vidyalaya",
"Degree": "Intermediate-PCM,Economics",
"Grades": "88.8",
"Year of Passing": "2016"
}, {
"Institute": "Vidya Bharati Chinmaya Vidyalaya",
"Degree": "Matriculation,CBSE",
"Grades": "8.6",
"Year of Passing": "2014"
}]
Can this be done without training a custom NER model? Is there a pre-trained model that can do this?
Posted on 2022-11-25 05:26:39
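Yes, you can parse this data without training a custom NER model. You have to build custom rules to parse the data.
In your case, you can extract the data with regular expressions and pattern identification, for example by relying on the institute always appearing before the year range. If the fields are not in a fixed order, you will have to go by keywords (such as school, institute, college, and so on); the details depend on your data.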
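The keyword idea for unordered fields can be sketched roughly as follows. This is only an illustration, not a complete parser: the function name `classify_segment`, the keyword tuple, and the assumption that the text has already been split into one field per segment are all hypothetical, and "university" and "vidyalaya" are additions to the keywords mentioned above.

```python
import re

# Hypothetical keyword list; "school", "institute", "college" come from the
# answer, the other two are assumed additions. Extend it for your own data.
INSTITUTE_KEYWORDS = ("school", "institute", "college", "university", "vidyalaya")

GRADE_RE = re.compile(r"\d{1,2}\.\d{1,2}")
YEAR_RANGE_RE = re.compile(r"\d{4}\s?-\s?\d{4}")


def classify_segment(segment):
    """Guess which field a text segment holds, using simple rules."""
    lowered = segment.lower()
    if YEAR_RANGE_RE.search(segment):
        return "years"
    if GRADE_RE.search(segment):
        return "grades"
    if any(keyword in lowered for keyword in INSTITUTE_KEYWORDS):
        return "institute"
    # Anything that matches no rule is treated as the degree.
    return "degree"


# Assumed input: the fields of one record, in arbitrary order.
segments = [
    "Bachelor of Engineering Computer Science",
    "CGPA: 8.78",
    "NBN Sinhgad School Of Engineering,Pune",
    "2016 - 2020",
]
record = {classify_segment(s): s for s in segments}
print(record)
```

The order of the checks matters: the numeric patterns are tested first, so a segment is only compared against the keywords when it contains neither a year range nor a grade.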
import re
txt = '''NBN Sinhgad School Of Engineering,Pune 2016 - 2020 Bachelor of Engineering Computer Science CGPA: 8.78
Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2014 - 2016 Intermediate-PCM,Economics CBSE Percentage: 88.8
Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2003 - 2014 Matriculation,CBSE CGPA: 8.6 EXPERIENCE'''
# extract grades (one or two digits, a dot, one or two digits)
grade_regex = r'(?:\d{1,2}\.\d{1,2})'
grades = re.findall(grade_regex, txt)
# extract year ranges such as "2016 - 2020"
year_regex = r'(?:\d{4}\s?-\s?\d{4})'
years = re.findall(year_regex, txt)
# function to replace every value of noise_list in a string with ":"
def replacer(string, noise_list):
    for v in noise_list:
        string = string.replace(v, ":")
    return string
# extract college and degree
data = replacer(txt, years)
# the word before each ":" (city name or grade label) becomes a "**" separator
cleaned_text = re.sub(r"(?:\w+\s?\:)", "**", data).split('\n')
college = []
degree = []
for i in cleaned_text:
    split_data = i.split("**")
    college.append(split_data[0].replace(',', '').strip())
    degree.append(split_data[1].strip())
parsed_output = []
for i in range(len(grades)):
    parsed_data = {
        "Institute": college[i],
        "Degree": degree[i],
        "Grades": grades[i],
        # the end of the year range, stripped of surrounding spaces
        "Year of Passing": years[i].split('-')[1].strip()
    }
    parsed_output.append(parsed_data)
print(parsed_output)
>>>> [{'Institute': 'NBN Sinhgad School Of Engineering', 'Degree': 'Bachelor of Engineering Computer Science', 'Grades': '8.78', 'Year of Passing': '2020'}, {'Institute': 'Vidya Bharati Chinmaya Vidyalaya', 'Degree': 'Intermediate-PCM,Economics CBSE', 'Grades': '88.8', 'Year of Passing': '2016'}, {'Institute': 'Vidya Bharati Chinmaya Vidyalaya', 'Degree': 'Matriculation,CBSE', 'Grades': '8.6', 'Year of Passing': '2014'}]
https://stackoverflow.com/questions/74552512
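As a possible alternative to the replace-and-split steps in the code above, the same layout can be captured with a single regular expression that uses named groups. This is a sketch under assumptions, not a general solution: it assumes one record per line, a one-word city after the comma, and a grade label of CGPA, GPA, or Percentage. The names `RECORD_RE` and `parse_education` are made up for this example.

```python
import re

# Assumed layout per line:
#   <institute>,<city> <start> - <end> <degree> <CGPA|GPA|Percentage>: <grade>
RECORD_RE = re.compile(
    r"^(?P<institute>[^,]+),\s*(?P<city>\w+)\s+"
    r"(?P<start>\d{4})\s*-\s*(?P<end>\d{4})\s+"
    r"(?P<degree>.+?)\s+"
    r"(?:CGPA|GPA|Percentage)\s*:\s*(?P<grade>\d{1,2}(?:\.\d{1,2})?)"
)


def parse_education(text):
    """Return one dict per line that matches the assumed layout."""
    records = []
    for line in text.splitlines():
        match = RECORD_RE.search(line.strip())
        if match is None:
            continue  # skip lines that do not follow the layout
        records.append({
            "Institute": match["institute"].strip(),
            "Degree": match["degree"].strip(),
            "Grades": match["grade"],
            "Year of Passing": match["end"],
        })
    return records


txt = '''NBN Sinhgad School Of Engineering,Pune 2016 - 2020 Bachelor of Engineering Computer Science CGPA: 8.78
Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2014 - 2016 Intermediate-PCM,Economics CBSE Percentage: 88.8
Vidya Bharati Chinmaya Vidyalaya,Jamshedpur 2003 - 2014 Matriculation,CBSE CGPA: 8.6 EXPERIENCE'''

for record in parse_education(txt):
    print(record)
```

Because every field is matched within the same line, a line that does not fit the layout is skipped instead of shifting the grades and years out of step with the institutes, which can happen when the lists are built independently and joined by index.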