I want to clean up the profile column of the following dataframe:

        name      profile
    6   Pedro     ["\n Design ...
    7   Leonardo  ["\n Design ...
    8   Daniel    ["\n JavaScript ...
    9   Mario     ["\n JavaScript ...
    10  Christi   ["\n Design ...

I have tested the following function on individual rows...

    def clean_profile(row):
        for index, row in new_df2["profile"].items():
            str_row = str(row)
            clean_row = (
                '""'.join(str_row)
                .replace(",", "")
                .replace('""', "")
                .replace("\\n ", "")
                .replace(" ", "")
            )
        return clean_row

...and found that it converts this string:

    '["\\n Design ","\\n Design "]'

into this cleaned string:

    '["Design","Design"]'

(The extra replace calls are needed to clean very messy strings, like the one below:)

    '{"Tools "" Google Analytics ":null," Google Adsense ":null," MailChimp ":null," Google Adwords ","Containers "" Docker ","Digital "" SEO ":null," Email Marketing ":null," Article Writing ":null," Market Research ":null," Social Media ":null," Inbound Marketing ","*Nix "" Ubuntu ":null," Linux ","Java "" Java ","Python "" Django ":null," Python ":null," Flask ","Databases "" MySQL Management ":null," MongoDB Management ":null," PostgreSQL Management ","Visual "" Brand Design ":null," Graphic Design ":null," Logo Design ","HTML "" HTML ","Version Control "" Git ","PHP "" Laravel ":null," Wordpress ":null," PHP ":null," Symfony ","Mobile "" React Native ","Ruby "" Ruby ":null," Sinatra ":null," Rails ","Project Management "" Agile Methodology ":null," Client Management ":null," Scrum ","English "" Written English ":null," Spoken English ","Configuration Management "" Chef ","Webserver "" Nginx ":null," Apache ","CDN "" AWS CloudFront ":null," Cloudflare ","Other "" C++ ","Experience "" Creative Direction ":null," UI/UX Design ":null," Wireframing ","JavaScript "" JavaScript ":null," TypeScript ":null," Redux ":null," Angular JS ":null," Angular ":null," D3.js ":null," Node.js ":null," React ":null," Flux ":null," Express ","CSS "" SASS ":null," LESS ":null," CSS ","Hosting "" Heroku ":null," Digital Ocean ":null," AWS ","Automated Testing "" TDD ":null," Automated Testing ":null," BDD ":null," Jest ","Traditional "" Outbound Marketing ":null," Brand Strategy ","Data Science "" Data Science ":null," Data Analysis ":null," Machine Learning ":null," Data Visualization ":null," R ":null," Statistics "}'

When I loop over all the rows of the dataframe, I get the same result repeated for every row:
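
    ["JavaScriptDevOpsPHPJavaScriptDevOpsPHP"]

or this:

    <function clean_profile at 0x0845CB20>

I have tried a few different approaches, but none of them has worked. Is anyone able to explain what is happening here, and perhaps suggest a better way to clean these strings?

Thanks!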
Answered 2020-09-10 14:06:02
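It looks like you did not declare an initial clean_row variable outside the for loop, so your clean_row will always be equal to whatever string you cleaned last.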

    def clean_profile(row):
        clean_row = ""  # added this line
        for index, row in new_df2["profile"].items():
            str_row = str(row)
            clean_row = (
                '""'.join(str_row)
                .replace(",", "")
                .replace('""', "")
                .replace("\\n ", "")
                .replace(" ", "")
            )
        return clean_row

Also, I would look into the strip function for string cleaning. There is a good example here: https://www.geeksforgeeks.org/clean-the-string-data-in-the-given-pandas-dataframe/
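
One more thing that may be worth checking: the function loops over the whole new_df2["profile"] column no matter which row is passed in, so every call ends up returning the value for the last row, which would explain the repeated result. A possible alternative, sketched below, is to write a function that cleans one string only and let pandas call it once per row with Series.map. This is a rough sketch and not a drop-in fix: the regular expression, the profile_clean column name and the two sample rows are my own assumptions for illustration, and it returns a list of labels rather than a string.

```python
import re

import pandas as pd


def clean_profile(value):
    """Clean ONE profile string and return a list of trimmed labels."""
    # Grab everything that sits between a pair of double quotes.
    tokens = re.findall(r'"([^"]*)"', str(value))
    # Remove a literal backslash-n if present, then strip() the remaining
    # whitespace (strip also takes care of real newline characters).
    cleaned = [token.replace("\\n", "").strip() for token in tokens]
    # Drop tokens that are empty after cleaning.
    return [token for token in cleaned if token]


# Small made-up sample shaped like the strings in the question.
new_df2 = pd.DataFrame(
    {
        "name": ["Pedro", "Daniel"],
        "profile": [
            '["\\n Design ","\\n Design "]',
            '["\\n JavaScript ","\\n DevOps ","\\n PHP "]',
        ],
    }
)

# map() calls clean_profile once per row, so each row gets its own result.
# Pass the function itself (no parentheses) to map().
new_df2["profile_clean"] = new_df2["profile"].map(clean_profile)
print(new_df2["profile_clean"].tolist())
```

Output such as <function clean_profile at 0x0845CB20> usually means the function object itself was stored or printed instead of the result of calling it, for example by assigning clean_profile directly to a column. Because the sketch only looks at quoted pieces of text, it should also cope with the messier '{"Tools "" Google Analytics ":null, ...}' strings, although I have only tried it on short samples.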
https://stackoverflow.com/questions/63164346