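Q: How to perform good tokenization of words using Python

Stack Overflow user

Asked 2020-07-14 11:54:25

Answers: 1 · Views: 183 · Followers: 0 · Votes: 0

I have a function in Python that uses a tokenizer to split a sentence into words. The problem is that when I run the function, the output comes back as a single word with no spaces.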

Actual sentence:

"is lovin Picture2Life.com! y'all fun apps r for iphone and not blackberry?!"

Result:

'islovinpicturelifecomyallfunappsrforiphoneandnotblackberry'

The result should instead look something like this: is lovin picture2life. com.

Code:

ppt = '''...!@#$%^&*()....{}’‘ “”  “[]|._-`/?:;"'\,~12345678876543'''

# tokenize helper function
def text_process(raw_text):
    '''
    parameters:
    =========
    raw_text: text as input
    functions:
    ==========
    - remove all punctuation
    - remove all stop words
    - return a list of the cleaned text

    '''
    # check characters to see if they are in punctuation
    nopunc = [char for char in list(raw_text) if char not in ppt]

    # join the characters again to form the string
    nopunc = "".join(nopunc)

    # now just remove any stopwords
    words = [word for word in nopunc.lower().split() if word.lower() not in stopwords.words("english")]
    return words

ddt= data.text[2:3].apply(text_process)
print("example: {}".format(ddt))

Tags: python, pandas, dataframe, tokenize

1 Answer

Stack Overflow user

Answered 2020-07-14 18:35:23

Well, in your first line

ppt = '''...!@#$%^&*()....{}’‘ “”  “[]|._-`/?:;"'\,~12345678876543'''

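the sequence ‘ “”  “ contains whitespace characters, so the list comprehension below removes all whitespace (and therefore every space) from the text. With the spaces gone, split() has nothing to split on, and the whole sentence comes back as one word: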

nopunc = [char for char in list(raw_text) if char not in ppt]
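
A minimal sketch of one possible fix, assuming the goal is to keep the same list of unwanted characters but stop deleting whitespace. The names REMOVE and STOP_WORDS are introduced here for illustration and are not in the question. STOP_WORDS is a tiny stand-in for NLTK's stopwords.words("english") so the example runs without any downloads, and the sample sentence is made up, modelled on the output shown in the question.

```python
# Same characters as in the question (the backslash is escaped here).
ppt = '''...!@#$%^&*()....{}’‘ “”  “[]|._-`/?:;"'\\,~12345678876543'''

# Characters to delete: everything in ppt EXCEPT whitespace,
# so the spaces between words survive the filtering step.
REMOVE = {ch for ch in ppt if not ch.isspace()}

# Hypothetical stand-in for stopwords.words("english").
STOP_WORDS = {"is", "for", "and", "not"}


def text_process(raw_text, stop_words=STOP_WORDS):
    """Strip unwanted characters, lowercase, split on whitespace,
    and drop stop words. Returns a list of cleaned words."""
    nopunc = "".join(ch for ch in raw_text if ch not in REMOVE)
    return [word for word in nopunc.lower().split() if word not in stop_words]


# Illustrative input, not the asker's exact data.
sample = "is lovin Picture2Life.com! y'all fun apps r for iphone and not blackberry?!"
print(text_process(sample))
```

Note that digits and dots are still deleted rather than replaced by spaces, so "Picture2Life.com" collapses to "picturelifecom". If separate tokens are wanted there, one option is to replace those characters with a space instead of dropping them.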

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/62894593
