Q: Improving on len(sentence.split()) for counting the words in a sentence

Stack Overflow user

Asked 2022-08-10 13:29:37

2 answers · 141 views · 0 followers · 1 vote

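The standard way to find the number of words in a sentence in Python seems to be len(sentence.split(<separator character(s)>)). However, this has always struck me as very inefficient, because you are building a whole list just to count the separator characters that are not preceded by another separator.

I know it is simple to write a function that does the same job, but

(a) it is a bit clunky to do that everywhere you need it, and

(b) since the function would be written in Python rather than being a built-in C function, it probably would not be much more efficient anyway.

So, can anyone think of an efficient one-liner to replace len(string.split())?

Edit: to clarify, string.count(" ") + 1 does not achieve the same thing as len(string.split()), because it also counts leading, trailing and repeated separator characters. It is therefore not a good way to count words: it only works in the idealised case where the string starts with a word, ends with a word and contains no double spaces.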
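To make the difference concrete, here is a minimal sketch comparing the approaches on a few edge cases. The helper names naive_count and split_count are made up for this illustration. Note that split(" ") with an explicit separator keeps empty strings, so it behaves like the count; only split() with no argument collapses runs of whitespace.

```python
def naive_count(s):
    # Counts every space, including leading, trailing and repeated ones.
    return s.count(" ") + 1


def split_count(s):
    # split() with no argument collapses runs of whitespace and
    # ignores leading/trailing whitespace.
    return len(s.split())


for s in ["I am", " I am ", "I  am", ""]:
    # Columns: input, count-based result, split(" ") result, split() result
    print(repr(s), naive_count(s), len(s.split(" ")), split_count(s))
```

Only the first string gives the same answer in every column; for the others the count-based result is too high.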

python

python-3.x

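2 Answers

Stack Overflow user

Accepted answer

Posted 2022-08-11 02:27:24

In terms of time, len(string.split()) is hard to beat, if not impossible. If space is the concern, however, there may be room for improvement.

Let's first look at how the built-in split method is implemented: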

i = j = 0;
while (maxcount-- > 0) {
    while (i < str_len && STRINGLIB_ISSPACE(str[i]))
        i++;
    if (i == str_len) break;
    j = i; i++;
    while (i < str_len && !STRINGLIB_ISSPACE(str[i]))
        i++;
#if !STRINGLIB_MUTABLE
    if (j == 0 && i == str_len && STRINGLIB_CHECK_EXACT(str_obj)) {
        /* No whitespace in str_obj, so just use it as list[0] */
        Py_INCREF(str_obj);
        PyList_SET_ITEM(list, 0, (PyObject *)str_obj);
        count++;
        break;
    }
#endif
    SPLIT_ADD(str, j, i);
}

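maxcount is a parameter that limits how many times the string may be split. If it is not set, there is no limit.

The function loops over the string looking for whitespace and splits the words into a list accordingly. The time complexity is O(n), and the space complexity is also O(n), because it needs auxiliary space for a list whose size grows with the string.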
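At the Python level, maxcount corresponds to the maxsplit argument of str.split. A short sketch of both points above (the sample strings are arbitrary):

```python
text = "1 2 3"

# No limit: every whitespace-separated word becomes a list element.
print(text.split())            # ['1', '2', '3']

# maxsplit=1 stops after one split; the rest of the string stays intact.
print(text.split(maxsplit=1))  # ['1', '2 3']

# The result list has one entry per word, which is where the O(n)
# auxiliary space comes from.
words = ("word " * 1000).split()
print(len(words))              # 1000
```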

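Space can be improved by writing a function that loops over the string and counts spaces, although in my tests this came at the cost of time. My function is as follows: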

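def count_words(s):
    """Count space-separated words in s without building a list."""
    space_prev = True  # treat the start of the string as following a space
    word_count = 0
    index = 0

    while index < len(s):
        if s[index] == ' ' and not space_prev:
            # A space right after a word closes that word.
            word_count += 1
            space_prev = True
        elif space_prev and s[index] != ' ':
            # First character of a new word.
            space_prev = False
        index += 1

    if space_prev:
        return word_count

    # The string ended inside a word that has not been counted yet.
    return word_count + 1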

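This function has O(n) time complexity and O(1) space complexity. Because it is written in Python, however, it runs slower than the built-in C function, even though it has the same asymptotic running time and saves space.

Here are my test results (the first number is time, the second is space):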

String with 69 words:
---------------------
builtin: 0.0009369159999999994, 11.567104 mb
count_words: 0.006606459, 11.583488 mb

String with 101 words:
----------------------
builtin: 0.0009850830000000012, 11.517952 mb
count_words: 0.009976667000000002, 11.567104 mb

String with 1,010 words:
------------------------
builtin: 0.007890958999999996, 11.632640 mb
count_words: 0.101723584, 11.567104 mb

String with 10,100 words:
-------------------------
builtin: 0.044037375, 11.911168 mb
count_words: 1.026273333, 11.747328 mb

In the end, the space savings are also fairly negligible. So unless you have a huge number of words to count, it is not worth the effort.
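If memory really is the concern, one alternative is to let a lazy iterator do the scanning. The sketch below is not from the answer and has not been benchmarked here; count_words_lazy is a hypothetical name. It uses re.finditer, which yields one match at a time, so no list of words is ever built.

```python
import re


def count_words_lazy(s):
    # \S+ matches one run of non-whitespace characters, i.e. one word.
    # finditer yields matches lazily, so only a running total is kept.
    return sum(1 for _ in re.finditer(r"\S+", s))


sample = "  I  am \t here\n"
print(count_words_lazy(sample), len(sample.split()))
```

For ordinary text this should agree with len(s.split()), including for tabs, newlines and repeated spaces.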

Votes 2

Stack Overflow user

Posted 2022-08-10 13:41:49

As mentioned in the comments, you can use

string.count(" ") + 1

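According to a quick test I just ran, this is faster. However, you quickly run into problems if there are consecutive spaces ("I  am" would count as 3 words) or if the string has leading or trailing spaces (" I am " would count as 4 words). Long story short, split() is optimised to separate words accurately, so it takes care of a lot of headaches for you, even if it is slower.

Speed test

Here are some benchmarks: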

from timeit import timeit

string = 'During a combat medical training class, the topic was blast injuries. At one point, our very intimidating instructor pointed at me and said, “There’s been a jeep explosion. What would you do if you came upon an injured man with a steering wheel embedded in his chest?” Nervous and unsure, I blurted out, “Drive him to the hospital?” For some reason, the rest of the room found this hilarious.'

def builtin_split():
    return len(string.split())

def builtin_count():
    return string.count(" ") + 1

def improved_count():
    # Takes care of leading, trailing and double spaces.
    # Doesn't take care of triple and longer
    stripped_string = string.strip()
    return stripped_string.count(" ") + 1 - stripped_string.count("  ")

def custom_implementation():
    # Only uses spaces as separators
    s = string.strip()
    length = len(s)
    words = 0
    i = 0
    while i < length:
        while i < length and s[i] != " ":
            i += 1
        while i < length and s[i] == " ":
            i += 1
        words += 1
    
    return words


print("split:", timeit(builtin_split))
print("count:", timeit(builtin_count))
print("improved count:", timeit(improved_count))
print("custom implementation:", timeit(custom_implementation))

Output:

split: 4.68428400799985
count: 1.250697097000284
improved count: 2.0560090240001045
custom implementation: 123.93436312599988

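Using the plain count alone is roughly 3.7 times faster (1.25 s versus 4.68 s) but, as mentioned above, unreliable. When count is improved to handle leading and trailing spaces as well as double spaces, it is still more than twice as fast as split. That probably covers the vast majority of cases where count goes wrong, but it still cannot give the right result for strings with runs of three or more spaces ("I   am"). I would say the small performance gain is usually nowhere near worth it. As expected, the custom implementation in pure Python is very slow.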
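The remaining failure is easy to reproduce. The sketch below adapts improved_count to take the string as a parameter (an assumption made for illustration; the benchmark version reads a global):

```python
def improved_count(s):
    # Same idea as the benchmark: strip the ends, then subtract one
    # for every non-overlapping double space.
    stripped = s.strip()
    return stripped.count(" ") + 1 - stripped.count("  ")


for s in [" I  am ", "I   am"]:
    # Columns: input, improved_count result, len(split()) result
    print(repr(s), improved_count(s), len(s.split()))
```

The first string is handled correctly (2 words), but the triple space in the second one contains only one non-overlapping double space, so improved_count reports 3 where split() reports 2.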

Votes -1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/73307214
