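I want to extract a specific section from a resume or CV, such as Education or Experience. I wrote the code below, but it does not work when Education or another section comes last in the resume.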
`import fitz  # PyMuPDF
import nltk

def extract_experience(ex_cl):  # extract the experience section of a CV
    doc = fitz.open(ex_cl)  # open the PDF file
    text = ""  # accumulates the text of all pages
    for page in doc:
        text = text + str(page.get_text())  # append the page text to the string
    words = nltk.word_tokenize(text)  # split the whole CV text into words
    start = 0
    end = 0
    # manually created list [exp_list] of all possible CV section titles
    # (not including the experience titles), in German and English
    exp_list = ["FÄHIGKEITEN", "KENNTNISSE", "AUSBILDUNG", "Ausbildung", "BILDUNG", "Bildung",
                "Hobbies", "HOBBIES", "Personliche", "Fahigkeiten",
                "Kenntnisse", "Ehrenamtliches", "Engagement",
                "Sprachen", "SPRACHEN", "EHRENAMTLICHES",
                "ENGAGEMENT", "EDUCATION", "Education", "Hochschul",
                "HOCHSCHUL", "Studium", "STUDIUM", "Sprachkurse", "Computerkenntnisse",
                "SPRACHKURSE", "COMPUTERKENNTNISSE",
                "AWARDS", "Awards", "PERSONAL", "Personal", "Information", "INFORMATION",
                "SKILLS", "Skills", "SKILL", "Skill", "Soziales"]
    # manually created list [exp] of experience titles and their synonyms
    exp = ["Erfahrung", "Laufbahn", "ERFAHRUNG", "Erfahrungen", "LAUFBAHN", "Praktische",
           "PRAKTISCHE", "ERFAHRUNGEN", "Praktika", "PRAKTIKA",
           "Berufserfahrung", "EXPERIENCE", "Experience", "BERUFSERFAHRUNG"]
    for vari in words:  # match each word of the CV against [exp]
        if vari in exp:  # on a match, find the index of that word
            st = words.index(vari)
            start = st + 1  # (st + 1) to start at the next word
    # walk forward from the word after the experience title
    i = start
    for j in words:
        # NOTE: words[i] raises IndexError when no title from [exp_list]
        # follows, i.e. when the experience section comes last in the CV
        if words[i] not in exp_list:
            i += 1  # move to the next index until a word from [exp_list] matches
    end = start + (i - start)  # end index of the section
    f_list = []
    for item in words[start:end]:  # slice from the start index to the end index
        f_list.append(item)
    stringlist = ' '.join(f_list)  # convert the word list into a string
    return stringlist

extract_experience('020.pdf')`
Posted on 2022-02-08 11:56:08
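You could first use Apache to extract the text from the PDF. It helps to organize the text properly. Extracting the sections, however, will need some dirty code.
I would try extracting sections based on multiple newline characters (\n\n or more). A dirtier approach is to extract them based on heuristics (such as …).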
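As a rough illustration of the blank-line idea, here is a minimal sketch, not a definitive implementation. It assumes the extracted text keeps blank lines between blocks and that each section begins with a heading on its own line. `SECTION_HEADINGS`, `split_into_blocks` and `extract_sections` are hypothetical names made up for this example, and the heading keywords are only a small illustrative set.

```python
import re

# Hypothetical heading keywords, chosen only for this illustration.
SECTION_HEADINGS = {
    "experience", "education", "skills",
    "berufserfahrung", "ausbildung", "kenntnisse",
}


def split_into_blocks(text):
    """Split text wherever two or more newlines (a blank line) occur."""
    blocks = re.split(r"\n\s*\n", text)
    return [block.strip() for block in blocks if block.strip()]


def extract_sections(text):
    """Map each recognised heading to the text that follows it.

    A block opens a new section when its first line (lower-cased, trailing
    colon removed) is a known heading. Blocks without a heading are appended
    to the section that is currently open, so the last section simply runs
    to the end of the text.
    """
    sections = {}
    current = None
    for block in split_into_blocks(text):
        first_line, _, rest = block.partition("\n")
        heading = first_line.strip().rstrip(":").lower()
        if heading in SECTION_HEADINGS:
            current = heading
            sections[current] = [rest.strip()] if rest.strip() else []
        elif current is not None:
            sections[current].append(block)
    return {name: "\n\n".join(parts) for name, parts in sections.items()}


sample = (
    "John Doe\n\n"
    "Education\nBSc Computer Science\n\n\n"
    "Experience\nAcme Corp, 2019-2021\n\n"
    "Built data pipelines."
)
sections = extract_sections(sample)
print(sections["experience"])
```

Because a section ends either at the next recognised heading or at the end of the text, it makes no difference which section comes last in the CV.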
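A possibly cleaner way is to use NER (named entity recognition) in spaCy. However, you would have to create a dataset of CVs and manually label each section.
See this repo, which uses NER to extract the required information.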
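To give an idea of what manually labelling each section could look like, here is a small sketch under stated assumptions. It turns hand-marked sections into character-offset annotations of the form `(text, {"entities": [(start, end, label)]})`, a layout commonly used for spaCy NER training data. The function name `annotate_sections` and the labels `EXPERIENCE` and `EDUCATION` are hypothetical, and the snippet uses plain Python only; it does not train a model.

```python
def annotate_sections(cv_text, labelled_sections):
    """Build one (text, annotations) example with character offsets.

    labelled_sections is a list of (section_text, label) pairs marked by
    hand, given in the order in which they appear in cv_text.
    """
    entities = []
    search_from = 0
    for section_text, label in labelled_sections:
        start = cv_text.find(section_text, search_from)
        if start == -1:
            raise ValueError(f"section not found in CV text: {section_text!r}")
        end = start + len(section_text)
        entities.append((start, end, label))
        search_from = end  # keeps the spans ordered and non-overlapping
    return cv_text, {"entities": entities}


cv = "Experience\nAcme Corp 2019-2021\nEducation\nBSc Computer Science"
example = annotate_sections(
    cv,
    [("Acme Corp 2019-2021", "EXPERIENCE"), ("BSc Computer Science", "EDUCATION")],
)
print(example[1]["entities"])
```

One such example would be needed per CV in the dataset; a model trained on them could then tag sections regardless of where they appear in the document.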
https://stackoverflow.com/questions/67948506