问用Python 3过滤掉文本中的所有非汉字字符
EN

Stack Overflow用户

提问于 2015-10-26 04:55:04

回答 1查看 3.1K关注 0票数 4

我有一个文本，其中有拉丁字母和日语字符(平假名，片假名和汉字)。

我想过滤掉所有的拉丁字符，平假名和片假名，但我不知道如何以一种优雅的方式做到这一点。我的直接方法是，除了每一个平假名/片假名之外，只过滤掉拉丁字母中的每一个字母，但我相信还有一个更好的方法。

我猜我必须使用regex，但我不太确定该如何做。信是用罗马字母，日文，中文等分类的吗?如果是，我可以用这个吗？

以下是一些示例文本：

"Lesson 1:",, "私","わたし","I" "私たち","わたしたち","We" "あ なた","あなた","You" "あの人","あのひと","That person" "あの方","あのかた","That person (polite)" "皆さん","みなさん"

程序只应返回汉字如下：

`私、人，方，皆`

character

回答 1

Stack Overflow用户

回答已采纳

发布于 2015-10-27 12:26:38

多亏了reddit上的Olsgaarddk，我找到了答案。

regex.py

# -*- coding: utf-8 -*-
import re

''' This is a library of functions and variables that are helpful to have handy 
    when manipulating Japanese text in python.
    This is optimized for Python 3.x, and takes advantage of the fact that all strings are unicode.
    Copyright (c) 2014-2015, Mads Sørensen Ølsgaard
    All rights reserved.
    Released under BSD3 License, see http://opensource.org/licenses/BSD-3-Clause or license.txt '''




## UNICODE BLOCKS ##

# Regular expression unicode blocks collected from 
# http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/

hiragana_full = r'[ぁ-ゟ]'
katakana_full = r'[゠-ヿ]'
kanji = r'[㐀-䶵一-鿋豈-頻]'
radicals = r'[⺀-⿕]'
katakana_half_width = r'[｟-ﾟ]'
alphanum_full = r'[！-～]'
symbols_punct = r'[、-〿]'
misc_symbols = r'[ㇰ-ㇿ㈠-㉃㊀-㋾㌀-㍿]'
ascii_char = r'[ -~]'

## FUNCTIONS ##

def extract_unicode_block(unicode_block, string):
    ''' extracts and returns all texts from a unicode block from string argument.
        Note that you must use the unicode blocks defined above, or patterns of similar form '''
    return re.findall( unicode_block, string)

def remove_unicode_block(unicode_block, string):
    ''' removes all chaacters from a unicode block and returns all remaining texts from string argument.
        Note that you must use the unicode blocks defined above, or patterns of similar form '''
    return re.sub( unicode_block, '', string)

## EXAMPLES ## 

text = '初めての駅 自由が丘の駅で、大井町線から降りると、ママは、トットちゃんの手を引っ張って、改札口を出ようとした。ぁゟ゠ヿ㐀䶵一鿋豈頻⺀⿕｟ﾟabc！～、〿ㇰㇿ㈠㉃㊀㋾㌀㍿'

print('Original text string:', text, '\n')
print('All kanji removed:', remove_unicode_block(kanji, text))
print('All hiragana in text:', ''.join(extract_unicode_block(hiragana_full, text)))

票数 5

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/33338713

复制

相似问题

问用Python 3过滤掉文本中的所有非汉字字符
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问用Python 3过滤掉文本中的所有非汉字字符EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问用Python 3过滤掉文本中的所有非汉字字符
EN