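How do I split a string into Tamil characters?
When I use preg_match_all('/./u', $str, $results),
I get the characters "த", "ம", "ி", "ழ" and "்".
How do I get the combined characters "த", "மி" and "ழ்" instead?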
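Posted on 2012-01-10 11:55:24
I think you should be able to use the grapheme_extract function to iterate over the combined characters (technically, these are called "grapheme clusters").
Alternatively, if you prefer a regex approach, I think you can use this:
    preg_match_all('/\pL\pM*|./u', $str, $results)
where \pL means a Unicode "letter" and \pM means a Unicode "mark".
(Disclaimer: I have not tested either of these approaches.)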
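To illustrate what the \pL\pM* pattern is meant to do, here is a minimal sketch in Python rather than PHP. Python's standard re module does not support \p classes, so this sketch checks Unicode general categories with unicodedata instead. The function name split_letter_mark_clusters is made up for this example, and the logic only approximates the regex (for instance, it keeps newlines, which '.' would skip).

```python
import unicodedata


def split_letter_mark_clusters(text):
    """Group each letter with the marks that follow it (hypothetical helper).

    Approximates the pattern \\pL\\pM*|. : a letter plus any trailing marks
    forms one cluster; every other character is a cluster on its own.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        # A cluster that starts with a letter contains only that letter
        # and marks, so a new mark can safely be appended to it.
        prev_starts_with_letter = bool(clusters) and unicodedata.category(
            clusters[-1][0]
        ).startswith("L")
        if is_mark and prev_starts_with_letter:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    # The word from the question: U+0BA4 U+0BAE U+0BBF U+0BB4 U+0BCD
    print(split_letter_mark_clusters("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"))
    # prints ['த', 'மி', 'ழ்']
```

This is not full grapheme-cluster segmentation as defined by Unicode; it only covers the "letter followed by marks" case that the regex targets.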
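Posted on 2013-01-26 00:48:11
If I understand correctly, you have a Unicode string made up of codepoints and you want to convert it into an array of letters?
I am developing an open-source Python library to perform similar tasks for a Tamil Language website.
It has been a while since I used PHP, so I will just describe the logic here. You can look at the code in the amuthaa/TamilWord.py file's split_letters() function.
As ruakh mentioned, Tamil graphemes are built up from codepoints.
So the logic goes something like this: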
    initialize an empty array
    for each codepoint in word:
        if the codepoint is a vowel, an a-combination or aytham, it is also its own grapheme, so add it to the array
        otherwise, the codepoint is a marking such as the pulli (i.e. ்) or one of the combination extensions (e.g. ி or ை), so append it to the end of the last element of the array
Of course, this assumes that your string is well-formed and that you do not have things like two markings in a row.
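The steps above can be sketched as a self-contained function that does not depend on the library's helper classes. The codepoint values below are assumptions taken from the layout of the Unicode Tamil block (vowel signs U+0BBE to U+0BCC, pulli U+0BCD), and the names is_sign and split_tamil_letters are invented for this example; this is not the library's implementation.

```python
PULLI = "\u0bcd"  # the pulli (virama), assumed from the Unicode Tamil block


def is_sign(ch):
    """True for a dependent vowel sign or the pulli (assumed ranges)."""
    return 0x0BBE <= ord(ch) <= 0x0BCC or ch == PULLI


def split_tamil_letters(word):
    """Split a well-formed Tamil word into letters (hypothetical sketch)."""
    letters = []
    for ch in word:
        if is_sign(ch):
            # A sign cannot stand alone; it extends the previous letter.
            if not letters:
                raise ValueError("%r cannot be the first character" % ch)
            letters[-1] += ch
        else:
            # Vowels, a-combinations, aytham and anything else start
            # a new letter.
            letters.append(ch)
    return letters


if __name__ == "__main__":
    print(split_tamil_letters("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"))
    # prints ['த', 'மி', 'ழ்']
```

Like the pseudocode, this does not check for malformed input such as two markings in a row; it simply attaches both to the preceding letter.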
Here is the Python code, in case you find it helpful. If you would like to help us port this to PHP, please let me know as well:
    @staticmethod
    def split_letters(word=u''):
        """ Returns the graphemes (i.e. the Tamil characters) in a given word as a list """

        # ensure that the word is a valid word
        TamilWord.validate(word)

        # list (which will be returned to user)
        letters = []

        # a tuple of all combination endings and of all அ combinations
        combination_endings = TamilLetter.get_combination_endings()
        a_combinations = TamilLetter.get_combination_column(u'அ').values()

        # loop through each codepoint in the input string
        for codepoint in word:

            # if codepoint is an அ combination, a vowel, aytham or a space,
            # add it to the list
            if codepoint in a_combinations or \
                    TamilLetter.is_whitespace(codepoint) or \
                    TamilLetter.is_vowel(codepoint) or \
                    TamilLetter.is_aytham(codepoint):

                letters.append(codepoint)

            # if codepoint is a combination ending or a pulli ('்'), add it
            # to the end of the previously-added codepoint
            elif codepoint in combination_endings or \
                    codepoint == TamilLetter.get_pulli():

                # ensure that at least one character already exists
                if len(letters) > 0:
                    letters[-1] = letters[-1] + codepoint

                # otherwise raise an Error. However, validate_word()
                # should catch this
                else:
                    raise ValueError("""%s cannot be first character of a word""" % (codepoint))

        return letters

https://stackoverflow.com/questions/8798248