I want to extract a specific section from a resume

Stack Overflow user
Asked 2021-06-12 11:39:19

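I want to extract a specific section, such as Education or Experience, from a resume or CV. I wrote the code below, but it does not work when Education or another section is the last one in the resume.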

`
import fitz  # PyMuPDF
import nltk  # word_tokenize needs NLTK tokenizer data, e.g. nltk.download("punkt")

# Headings (German and English) that start any section other than experience.
OTHER_HEADINGS = {
    "FÄHIGKEITEN", "KENNTNISSE", "AUSBILDUNG", "Ausbildung", "BILDUNG", "Bildung",
    "Hobbies", "HOBBIES", "Personliche", "Fahigkeiten",
    "Kenntnisse", "Ehrenamtliches", "Engagement",
    "Sprachen", "SPRACHEN", "EHRENAMTLICHES",
    "ENGAGEMENT", "EDUCATION", "Education", "Hochschul",
    "HOCHSCHUL", "Studium", "STUDIUM", "Sprachkurse", "Computerkenntnisse",
    "SPRACHKURSE", "COMPUTERKENNTNISSE",
    "AWARDS", "Awards", "PERSONAL", "Personal", "Information", "INFORMATION",
    "SKILLS", "Skills", "SKILL", "Skill", "Soziales",
}

# The experience heading and its synonyms.
EXPERIENCE_HEADINGS = {
    "Erfahrung", "Laufbahn", "ERFAHRUNG", "Erfahrungen", "LAUFBAHN", "Praktische",
    "PRAKTISCHE", "ERFAHRUNGEN", "Praktika", "PRAKTIKA",
    "Berufserfahrung", "EXPERIENCE", "Experience", "BERUFSERFAHRUNG",
}


def extract_experience(pdf_path):
    """Return the experience section of a CV as a single string."""
    doc = fitz.open(pdf_path)  # open the PDF file
    text = ""
    for page in doc:
        # get_text() was named getText() in older PyMuPDF releases
        text += page.get_text()
    words = nltk.word_tokenize(text)  # split the whole CV into tokens

    # Find the first experience heading; the section starts with the next token.
    start = None
    for index, word in enumerate(words):
        if word in EXPERIENCE_HEADINGS:
            start = index + 1
            break
    if start is None:
        return ""  # no experience heading found

    # Advance until another heading is found OR the tokens run out.
    # The bounds check is what makes a section at the end of the CV work.
    end = start
    while end < len(words) and words[end] not in OTHER_HEADINGS:
        end += 1

    return " ".join(words[start:end])


extract_experience('020.pdf')

`


1 Answer

Stack Overflow user

Answered 2022-02-08 11:56:08

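You can first extract the text from the PDF using Apache. It helps to get the text properly formatted. Extracting the sections, however, takes some dirty code.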

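I would try extracting the sections based on multiple newlines (\n\n or more). A dirtier way is to rely on heuristics, such as:

  • Create a list of possible section headings
  • Iterate over the text
  • If a line is a section heading:
    • Start collecting from that index until you meet another section heading written in uppercase or capitalized (this helps avoid confusing a heading such as SKILLS or Skills with the ordinary word skills). Like I said: dirty.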
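
The two ideas above can be sketched roughly as follows. This is only an illustration under assumptions: the heading set `SECTION_HEADINGS` and the helper names `split_on_blank_lines`, `is_heading`, and `extract_section` are hypothetical, and the text is assumed to have each heading on its own line.

```python
import re

# Hypothetical heading list for illustration; extend it for your own CVs.
SECTION_HEADINGS = {"experience", "education", "skills", "languages", "awards"}


def split_on_blank_lines(text):
    """Split text into blocks separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text)
    return [block.strip() for block in blocks if block.strip()]


def is_heading(line):
    """A line counts as a heading if it is a known title that starts with
    an uppercase letter, so a lowercase 'skills' is not a heading."""
    stripped = line.strip().rstrip(":")
    if stripped.lower() not in SECTION_HEADINGS:
        return False
    return stripped[0].isupper()


def extract_section(text, wanted):
    """Collect the lines after the `wanted` heading until the next heading
    or the end of the text, so a section at the very end still works."""
    collected = []
    inside = False
    for line in text.splitlines():
        if is_heading(line):
            # Entering a new section: keep collecting only if it is the wanted one.
            inside = line.strip().rstrip(":").lower() == wanted.lower()
            continue
        if inside:
            collected.append(line)
    return "\n".join(collected).strip()
```

Because the loop simply ends when the lines run out, a section that comes last in the CV is returned in full instead of failing. Calling `extract_section(cv_text, "experience")` on the extracted text would return everything between the experience heading and the next recognized heading.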

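A possibly cleaner approach is to use NER (named entity recognition) in spaCy. However, you would have to create a dataset of CVs and manually label each section.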

See this repo, which uses NER to extract the required information.

Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/67948506
