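I am trying to separate the emoji in a given text from the other characters/words/emoji. I want to use the emoji later as features for text classification, so it is important that I treat every emoji in a sentence as a separate character.
Code: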
import re
text = "I am very #happy man but my wife is not "
print(text) #line a
reg = re.compile(u'['
u'\U0001F300-\U0001F64F'
u'\U0001F680-\U0001F6FF'
u'\u2600-\u26FF\u2700-\u27BF]+',
re.UNICODE)
#padding the emoji with space at both the ends
new_text = reg.sub(' \1 ',text)
print(new_text) #line b
# this is just to test if it can still identify the emoji in new_text
new_text2 = reg.sub('#\1#', new_text)
print(new_text2) # line c
Here is the actual output:

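(I had to paste a screenshot, because copying the output from the terminal distorted the already-garbled emoji in lines b and c.)
Here is my expected output: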
I am very #happy man but my wife is not
I am very #happy man but my wife is not
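I am very #happy man but ## ## my wife ## is not ## ##
Questions:
1) Why does the search-and-replace not work as expected? What is the emoji being replaced with (line b)? It is definitely not the Unicode of the original emoji, otherwise line c would have printed the emoji padded with # at both ends.
2) I am not sure whether I am right about this, but why is a group of emoji replaced with a single emoji/Unicode character (line b)?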
Posted on 2017-05-21 19:30:57
There are several issues here.
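There is no capturing group in the pattern, but the replacement defines a \1 backreference to Group 1. The most natural fix is to use a backreference to Group 0, i.e. the whole match: \g<0>.
The \1 in the replacement is not actually parsed as a backreference but as the character with octal value 1, because a backslash in a regular (non-raw) string literal forms an escape sequence. Here, it is an octal escape.
The + after the ] means the regex engine must match one or more occurrences of text matching the character class, so you match sequences of emoji rather than each individual emoji.
Use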
import re
text = "I am very #happy man but my wife is not "
print(text) #line a
reg = re.compile(u'['
u'\U0001F300-\U0001F64F'
u'\U0001F680-\U0001F6FF'
u'\u2600-\u26FF\u2700-\u27BF]',
re.UNICODE)
#padding the emoji with space at both ends
new_text = reg.sub(r' \g<0> ',text)
print(new_text) #line b
# this is just to test if it can still identify the emojis in new_text
new_text2 = reg.sub(r'#\g<0>#', new_text)
print(new_text2) # line c
See the Python demo, which prints:
I am very #happy man but my wife is not
I am very #happy man but my wife is not
I am very #happy man but ## ## my wife ## is not ## ##
https://stackoverflow.com/questions/44100804
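To see the three issues in isolation, here is a minimal sketch. The sample string and the names `runs`, `single`, `broken`, `padded_runs` and `padded_each` are made up for illustration (they are not from the question), and the pattern is cut down to a single range. The symbols are written as escapes and printed with `ascii()` so the result does not depend on how the terminal renders emoji.

```python
import re

# In a regular (non-raw) string literal, '\1' is the octal escape for U+0001,
# so the replacement passed to re.sub contains no backreference at all.
assert ' \1 ' == ' \x01 '
assert len('\1') == 1 and len(r'\1') == 2

# Hypothetical sample text with a run of two symbols and a single symbol.
sample = "sun\u2600\u2600 and cloud\u2601"

runs = re.compile('[\u2600-\u26FF]+')   # with +: one match per run of symbols
single = re.compile('[\u2600-\u26FF]')  # without +: one match per symbol

# Every match is replaced by the control character U+0001 padded with spaces.
broken = runs.sub(' \1 ', sample)

# \g<0> refers to the whole match, so the symbols survive the substitution.
padded_runs = runs.sub(r' \g<0> ', sample)
padded_each = single.sub(r' \g<0> ', sample)

# A raw r'\1' is a real backreference, but the pattern has no Group 1.
try:
    runs.sub(r' \1 ', sample)
    raw_backref_failed = False
except re.error:
    raw_backref_failed = True

print(ascii(broken))       # 'sun \x01  and cloud \x01 '
print(ascii(padded_runs))  # 'sun \u2600\u2600  and cloud \u2601 '
print(ascii(padded_each))  # 'sun \u2600  \u2600  and cloud \u2601 '
print(raw_backref_failed)  # True
```

The first result is why line c in the question finds nothing to wrap in #: the emoji are already gone after the first substitution. The second and third results show the effect of dropping the +.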