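I'm looking for an answer like this one, but in Python. How do I apply text preprocessing to multiple columns? I have two text columns (see the screenshot). To clean them, I have to run every step twice, once per column (see my code). Is there a smarter way to do this kind of task? Thanks!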
import requests
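from bs4 import BeautifulSoup
# Strip HTML tags; naming the parser explicitly avoids the bs4 "no parser specified" warning.
df['Summary'] = [BeautifulSoup(text, 'html.parser').get_text() for text in df['Summary']]
df['Text'] = [BeautifulSoup(text, 'html.parser').get_text() for text in df['Text']]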
df.loc[:,"Text"] = df.Text.apply(lambda x : str.lower(x))
df.loc[:,"Summary"] = df.Summary.apply(lambda x : str.lower(x))
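# Remove punctuation. regex=True is required in pandas 2.0+, where str.replace defaults to literal matching.
df["Text"] = df['Text'].str.replace(r'[^\w\s]', '', regex=True)
df["Summary"] = df['Summary'].str.replace(r'[^\w\s]', '', regex=True)

Posted on 2019-09-25 18:19:34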
Try this code.
Using regex:
import re

def preprocess_text(text):
    """Apply any preprocessing methods."""
    text = text.lower()                  # normalize case
    text = re.sub(r'[^\w\s]', '', text)  # drop everything except word characters and whitespace
    return text

df["Text"] = df.Text.apply(preprocess_text)
df["Summary"] = df.Summary.apply(preprocess_text)

Using the string library:
from string import punctuation

def preprocess_text(text):
    """Apply any preprocessing methods."""
    text = text.lower()                                      # normalize case
    text = ''.join(c for c in text if c not in punctuation)  # drop ASCII punctuation characters
    return text

df["Text"] = df.Text.apply(preprocess_text)
df["Summary"] = df.Summary.apply(preprocess_text)

Note: to learn more about text preprocessing tasks, you can read this blog post: https://medium.com/@pemagrg/pre-processing-text-in-python-ad13ea544dae
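Since both columns go through the same function, the repeated assignments can probably be collapsed into a loop over the column names. This is only a minimal sketch: it assumes `df` is a pandas DataFrame and that every listed column contains strings, and `preprocess_columns` is a hypothetical helper name, not something pandas provides.

```python
import re


def preprocess_text(text):
    """Lowercase the text and strip punctuation (same steps as the regex version)."""
    text = text.lower()
    return re.sub(r'[^\w\s]', '', text)


def preprocess_columns(df, columns, func=preprocess_text):
    """Hypothetical helper: apply func to each listed column of a pandas DataFrame.

    Assumes every value in those columns is a string. Returns a cleaned copy
    and leaves the original DataFrame untouched.
    """
    out = df.copy()
    for col in columns:
        out[col] = out[col].apply(func)
    return out
```

With the question's data this would be called as `df = preprocess_columns(df, ["Text", "Summary"])`. Adding another cleaning step (for example the BeautifulSoup tag stripping) then means changing `preprocess_text` once rather than editing a line per column.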
https://stackoverflow.com/questions/58088426