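I have created two Python sets from two different CSV files, each containing some strings.
I am trying to match the two sets so that I get their intersection (it should return the strings common to both sets).
Here is my code: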
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import nltk

# using a context manager to open and read the file
# converted the text file into a csv file at the source using Notepad++
with open(r'skills.csv', 'r', encoding="utf-8-sig") as f:
    myskills = f.readlines()
# converting all the strings in the list to lowercase
list_of_myskills = map(lambda x: x.lower(), myskills)
set_of_myskills = set(list_of_myskills)
#print(type(nodup_filtered_content))
print(set_of_myskills)

# open and read by line from the text file
with open(r'list_of_skills.csv', 'r') as f2:
    # using readlines() instead of read(), because it reads line by line
    # (each line as a string object in the Python list)
    contents_f2 = f2.readlines()
# converting all the strings in the list to lowercase
list_of_skills = map(lambda x: x.lower(), contents_f2)
# converting into sets
set_of_skills = set(list_of_skills)
print(set_of_skills)

This is the function I am using:

def set_compare(set1,set2):
    if(set1 & set2):
        return print('The matching skills are: '(set1 & set2))
    else:
        print("No matching skills")

After I run the code:

set_compare(set_of_skills,set_of_myskills)

Output:

No matching skills

The contents of 'skills.csv' are:

{'critical thinking,identify user needs,business intelligence,business analysis,teamwork,database,data visualization,data analysis,relational database,mysql,oracle sql,design,entity-relationship,develop ,use-cases ,scenarios,project development ,user requirement,design,sequence diagram,state diagram,identifying,uml diagrams,html5,css3,php,clean,analyze,plot,data,python,pandas,numpy,matplotlib,ipython notebook,spyder,anaconda,jupyterlab,data analysis,data visualization,tableau,database,surveys,prototyping,logical data models,data models,requirement elicitation.,leadreship,mysq,team,prioratization,analyze,articulate,'}

The contents of the 'list_of_skills.csv' file:

{'assign passwords and maintain database access,agile development,agile project methodology,amazon web services (aws),analytics,analytical,analyze and recommend database improvements,analyze impact of database changes to the business,audit database access and requests,apis,application and server monitoring tools,applications,application development,attention to detail,architecture,big data,business analytics,business intelligence,business process modeling,cloud applications,cloud based visualizations,cloud hosting services,cloud maintenance tasks,cloud management tools,cloud platforms,cloud scalability,cloud services,cloud systems administration,code,coding,computer,communication,configure database software,configuration,configuration management,content strategy,content management,continually review processes for improvement ,continuous deployment,continuous integration,critical thinking,customer support,database,data analysis,data analytics,data imports,data imports,data intelligence,data mining,data modeling,data science,data strategy,data storage,data visualization tools,data visualizations,database administration,deploying applications in a cloud environment,deployment automation tools,deployment of cloud services,design,desktop support,design,design and build database management system,design principles,design prototypes,design specifications,design tools,develop and secure network structures,develop and test methods to synchronize data ,developer,development,documentation,emerging technologies,file systems,flexibility,front end design,google analytics,hardware,help desk,identify user needs ,implement backup and recovery plan ,implementation,information architecture,information design,information systems,interaction design,interaction flows,"install, maintain, and merge databases ",installation,integrated technologies,integrating security protocols with cloud design,internet,it optimization,it security,it soft skills,it solutions,it support,languages,logical 
thinking,leadership,linux,management,messaging,methodology,metrics,microsoft office,migrating existing workloads into cloud systems,mobile applications,motivation,networks,network operations,networking,open source technology integration,operating systems,operations,optimize queries on live data,optimizing user experiences,optimizing website performance,organization,presentation,programming,problem solving,process flows,product design,product development,prototyping methods,product development,product management,product support,product training,project management,repairs,reporting,research emerging technology,responsive design,review existing solutions,search engine optimization (seo),security,self motivated,self starting,servers,software,software development,software engineering,software quality assurance (qa),solid project management capabilities ,solid understanding of company’s data needs ,storage,strong technical and interpersonal communication ,support,systems software,tablets,team building,team oriented,teamwork,technology,tech skills,technical support,technical writing,testing,time management,tools,touch input navigation,training,troubleshooting,troubleshooting break-fix scenarios,user research,user testing,usability,user-centered design,user experience,user flows,user interface,user interaction diagrams,user research,user testing,ui / ux,utilizing cloud automation tools,virtualization,visual design,web analytics,web applications,web development,web design,web technologies,wireframes,work independently,'}

Although I can actually see the matching keywords, I don't understand why I am not getting any output.
I am not getting any errors either.

Posted on 2018-07-13 06:28:10
Comparing two sets of strings does not compare substrings of those strings. What your program is effectively doing is:

foo = {'ABC', 'DEF', 'GHI'}
bar = {'AB', 'CD', 'DE', 'FG', 'HI'}
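foo.intersection(bar)  # returns set()

The fact that strings in different sets share characters does not mean the sets intersect. The string 'ABC' is in the first set but not the second, the string 'AB' is in the second but not the first, and so on.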
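
To make this concrete for the CSV case, here is a minimal sketch. The two sample lines are made up for illustration and stand in for what readlines() returns when each file is one long comma-separated line. It shows that the intersection only matches whole elements, and that splitting on commas is what exposes the individual skills.

```python
# Set intersection compares whole elements for equality; it never looks inside a string.
line_a = "critical thinking,teamwork,database\n"   # hypothetical one-line file content
line_b = "agile development,database,teamwork\n"   # hypothetical one-line file content

# What the question's code builds: one set per file, each holding a single long string.
whole_a = {line_a.lower()}
whole_b = {line_b.lower()}
print(len(whole_a), len(whole_b))   # each set has exactly one element
print(whole_a & whole_b)            # empty: the two long strings are not equal

# Splitting on commas turns every skill into its own set element.
split_a = {item.strip() for item in line_a.lower().split(",")}
split_b = {item.strip() for item in line_b.lower().split(",")}
print(sorted(split_a & split_b))    # the skills present in both lines
```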
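It is a bit unclear what exactly you want to intersect between the two CSVs. Do you want to find the individual cells that appear in both files? Do they also have to match by column? If you provide more information about the expected output, I can edit this answer to give more detail.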
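Edit: per your comment, it looks like what you want is to split those huge strings on commas, so that the elements of the sets become the individual cells. At the moment each set has only one element, and that element is just one giant string containing a large number of skills. If you replace

list_of_myskills = map(lambda x: x.lower(), myskills)

with

list_of_myskills = [y.strip().lower() for x in myskills for y in x.split(',')]

and replace the other similar line accordingly, you will probably get closer to what you expect.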
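
Putting the pieces together, a possible end-to-end sketch follows. It is not the asker's code: read_skills is a hypothetical helper, and it swaps the plain split(',') for the standard csv module so that a quoted cell such as "install, maintain, and merge databases " is not cut at its inner commas. It also rewrites set_compare, because the original expression 'The matching skills are: '(set1 & set2) tries to call a string and would raise a TypeError as soon as the sets actually intersect. The in-memory strings stand in for the two files.

```python
import csv
import io


def read_skills(file_obj):
    """Return a set of lowercase, whitespace-stripped skills from a comma-separated file object."""
    skills = set()
    for row in csv.reader(file_obj):
        for cell in row:
            cell = cell.strip().lower()
            if cell:  # skip the empty cell left behind by a trailing comma
                skills.add(cell)
    return skills


def set_compare(set1, set2):
    """Print a short report and return the elements common to both sets."""
    common = set1 & set2
    if common:
        print("The matching skills are:", sorted(common))
    else:
        print("No matching skills")
    return common


# Stand-ins for the two files (made-up excerpts). With real files, something like
# open('skills.csv', newline='', encoding='utf-8-sig') could be passed instead.
mine = io.StringIO('critical thinking,identify user needs,teamwork,mysql,\n')
wanted = io.StringIO('identify user needs ,"install, maintain, and merge databases ",teamwork,linux,\n')

common = set_compare(read_skills(mine), read_skills(wanted))
# prints: The matching skills are: ['identify user needs', 'teamwork']
```

Stripping each cell is a design choice: it lets 'identify user needs ' (with a trailing space) match 'identify user needs', which an exact comparison would treat as different strings.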
Posted on 2018-07-13 06:56:09
This works: change the .csv files so that they contain the skill words separated by ",", with one line per file.
import pandas as pd
myskills = pd.read_csv("skills.csv",header=None)
set_of_my_skills = set(myskills.iloc[0,])
list_of_skills = pd.read_csv("list_of_skills.csv",header=None)
set_of_skills = set(list_of_skills.iloc[0,])
print(set_of_my_skills & set_of_skills)
{'business intelligence', 'design', 'critical thinking', 'data analysis', 'database', 'teamwork'}
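
Note that 'identify user needs' is missing from the result above even though both files list it: in list_of_skills.csv the cell carries a trailing space, and pandas keeps that space by default. A possible variation, assuming that whitespace and case differences should be ignored, is sketched below. row_to_set is a hypothetical helper, and the in-memory strings are made-up excerpts standing in for the two files.

```python
import io

import pandas as pd


def row_to_set(source):
    """Read the first row of a header-less CSV and return its cleaned cell values as a set."""
    row = pd.read_csv(source, header=None).iloc[0]
    # dropna() discards the empty cell that a trailing comma would create
    return {str(value).strip().lower() for value in row.dropna()}


mine = "critical thinking,identify user needs,teamwork,\n"
wanted = "identify user needs ,teamwork,linux\n"

# Exact comparison, as in the answer: the trailing space prevents one of the matches.
raw = set(pd.read_csv(io.StringIO(mine), header=None).iloc[0]) & set(
    pd.read_csv(io.StringIO(wanted), header=None).iloc[0]
)
# Comparison after stripping whitespace and lowercasing.
clean = row_to_set(io.StringIO(mine)) & row_to_set(io.StringIO(wanted))

print(raw)
print(sorted(clean))
```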
skills.csv : critical thinking,identify user needs,business intelligence,business analysis,teamwork,database,data visualization,data analysis,relational database,mysql,oracle sql,design,entity-relationship,develop ,use-cases ,scenarios,project development ,user requirement,design,sequence diagram,state diagram,identifying,uml diagrams,html5,css3,php,clean,analyze,plot,data,python,pandas,numpy,matplotlib,ipython notebook,spyder,anaconda,jupyterlab,data analysis,data visualization,tableau,database,surveys,prototyping,logical data models,data models,requirement elicitation.,leadreship,mysq,team,prioratization,analyze,articulate
list_of_skills.csv: assign passwords and maintain database access,agile development,agile project methodology,amazon web services (aws),analytics,analytical,analyze and recommend database improvements,analyze impact of database changes to the business,audit database access and requests,apis,application and server monitoring tools,applications,application development,attention to detail,architecture,big data,business analytics,business intelligence,business process modeling,cloud applications,cloud based visualizations,cloud hosting services,cloud maintenance tasks,cloud management tools,cloud platforms,cloud scalability,cloud services,cloud systems administration,code,coding,computer,communication,configure database software,configuration,configuration management,content strategy,content management,continually review processes for improvement ,continuous deployment,continuous integration,critical thinking,customer support,database,data analysis,data analytics,data imports,data imports,data intelligence,data mining,data modeling,data science,data strategy,data storage,data visualization tools,data visualizations,database administration,deploying applications in a cloud environment,deployment automation tools,deployment of cloud services,design,desktop support,design,design and build database management system,design principles,design prototypes,design specifications,design tools,develop and secure network structures,develop and test methods to synchronize data ,developer,development,documentation,emerging technologies,file systems,flexibility,front end design,google analytics,hardware,help desk,identify user needs ,implement backup and recovery plan ,implementation,information architecture,information design,information systems,interaction design,interaction flows,"install, maintain, and merge databases ",installation,integrated technologies,integrating security protocols with cloud design,internet,it optimization,it security,it soft skills,it solutions,it 
support,languages,logical thinking,leadership,linux,management,messaging,methodology,metrics,microsoft office,migrating existing workloads into cloud systems,mobile applications,motivation,networks,network operations,networking,open source technology integration,operating systems,operations,optimize queries on live data,optimizing user experiences,optimizing website performance,organization,presentation,programming,problem solving,process flows,product design,product development,prototyping methods,product development,product management,product support,product training,project management,repairs,reporting,research emerging technology,responsive design,review existing solutions,search engine optimization (seo),security,self motivated,self starting,servers,software,software development,software engineering,software quality assurance (qa),solid project management capabilities ,solid understanding of company’s data needs ,storage,strong technical and interpersonal communication ,support,systems software,tablets,team building,team oriented,teamwork,technology,tech skills,technical support,technical writing,testing,time management,tools,touch input navigation,training,troubleshooting,troubleshooting break-fix scenarios,user research,user testing,usability,user-centered design,user experience,user flows,user interface,user interaction diagrams,user research,user testing,ui / ux,utilizing cloud automation tools,virtualization,visual design,web analytics,web applications,web development,web design,web technologies,wireframes,work independently

https://stackoverflow.com/questions/51315047