Q: Preprocessing data: removing Italian stop words for text analysis

Stack Overflow user

Asked 2022-05-12 10:39:50

Answers: 2 | Views: 223 | Followers: 0 | Votes: 1

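I want to remove Italian stop words inside this function, but I don't know how to do it. I have seen several scripts that remove stop words, but always after tokenization. Is it possible to do it before? That is, I want the text to be free of stop words before it is tokenized. For the stop words I used this library: stop-words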

! pip install stop-words
from stop_words import get_stop_words

stop = get_stop_words('italian')

import re
# helper function to clean tweets
def processTweet(tweet):
    # Remove HTML special entities (e.g. &amp;)
    tweet = re.sub(r'\&\w*;', '', tweet)
    # Remove @username mentions
    tweet = re.sub(r'@[^\s]+', '', tweet)
    # Remove tickers
    tweet = re.sub(r'\$\w*', '', tweet)
    # To lowercase
    tweet = tweet.lower()
    # Remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*\/\w*', '', tweet)
    # Remove hashtags
    tweet = re.sub(r'#\w*', '', tweet)
    # Replace leftover mentions, '#', URLs, tokens containing digits, and
    # common punctuation with a space, then collapse runs of whitespace
    tweet = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|(#)|(\w+:\/\/\S+)|(\S*\d\S*)|([,;.?!:])",
                            " ", tweet).split())
    #tweet = re.sub(r'[' + punctuation.replace('@', '') + ']+', ' ', tweet)
    # Remove words with 3 or fewer letters
    tweet = re.sub(r'\b\w{1,3}\b', '', tweet)
    # Remove whitespace (including new line characters)
    tweet = re.sub(r'\s\s+', ' ', tweet)
    # Remove single space remaining at the front of the tweet.
    tweet = tweet.lstrip(' ') 
    # Remove characters beyond Basic Multilingual Plane (BMP) of Unicode:
    tweet = ''.join(c for c in tweet if c <= '\uFFFF') 
    return tweet
df['text'] = df['text'].apply(processTweet)

Tags: python, function, stop-words, data-preprocessing

Answers: 2

Stack Overflow user

Accepted answer

Posted 2022-05-12 10:53:29

Just use re.sub(), as you have been doing:

exclusions = '|'.join(stop)
tweet = re.sub(exclusions, '', tweet)
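
One caveat with a bare alternation is that it also matches inside longer words, so short stop words can eat parts of ordinary words. The sketch below compares the bare pattern with the same alternation wrapped in word boundaries. The three-word list and the sample sentence are made up for illustration; they are not the full Italian list returned by get_stop_words.

```python
import re

# Hypothetical three-word stop list, used only for illustration.
stop = ["e", "la", "di"]
text = "la bella vita di elena"

# Bare alternation: every occurrence is removed, even inside other words.
naive = re.sub("|".join(stop), "", text)

# Same alternation wrapped in \b ... \b: only whole words are removed.
bounded_pattern = r"\b(?:" + "|".join(re.escape(w) for w in stop) + r")\b"
bounded = re.sub(bounded_pattern, "", text)

print(repr(naive))    # ' bl vita  lna'
print(repr(bounded))  # ' bella vita  elena'
```

With the bare pattern, "bella" shrinks to "bl" and "elena" to "lna"; with word boundaries both stay intact.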

Votes: 1

Stack Overflow user

Posted 2022-05-12 10:55:11

Consider the following example:

import re
stops = ["and","or","not"] # list of words to remove
text = "Band and nothing else!" # the "and" in Band and the "not" in nothing should stay
pattern = r'\b(?:' + '|'.join(re.escape(s) for s in stops) + r')\b'
clean = re.sub(pattern, '', text)
print(clean)

Output

Band  nothing else!

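Explanation: re.escape takes care of characters that have a special meaning in a regular expression pattern (e.g. .) and turns them into their literal versions (so re.escape(".") matches a literal . rather than any character). | means alternation; joining all the words with | builds one alternative per word. (?:...) is a non-capturing group, which lets us use a single \b at the beginning and a single \b at the end instead of one pair per word. \b is a word boundary, used here to make sure only whole words are removed, so that, for example, Band is not turned into B.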
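
Applied to the question, the same pattern can be built once from the stop list and run on the raw text before any tokenizer sees it. The following is a minimal sketch under some assumptions: the helper names build_stop_pattern and remove_stop_words are hypothetical, the short italian_stops list stands in for get_stop_words('italian') so the snippet runs without the stop-words package, and case-insensitive matching is a choice made here, not something the original answer specifies.

```python
import re

def build_stop_pattern(stop_words):
    """Compile one regex that matches any stop word as a whole word."""
    alternation = "|".join(re.escape(w) for w in stop_words)
    # IGNORECASE is an assumption: it lets "La" match the stop word "la".
    return re.compile(r"\b(?:" + alternation + r")\b", flags=re.IGNORECASE)

def remove_stop_words(text, pattern):
    """Delete stop words, then collapse the whitespace they leave behind."""
    text = pattern.sub("", text)
    return " ".join(text.split())

# Hypothetical sample; in practice this could be get_stop_words('italian').
italian_stops = ["il", "la", "di", "e", "che", "per", "non"]
pattern = build_stop_pattern(italian_stops)

cleaned = remove_stop_words(
    "La casa di Maria e il giardino non sono per tutti", pattern
)
print(cleaned)  # casa Maria giardino sono tutti
```

Inside processTweet, a call such as tweet = remove_stop_words(tweet, pattern) just before the return statement would give stop-word-free text ahead of tokenization. Compiling the pattern once outside the function avoids rebuilding it for every row of the DataFrame.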

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/72214118
