Q: Counting the occurrences of each word in a text in Java

Stack Overflow user

Asked on 2022-01-13 14:48:39

2 answers · 845 views · 0 followers · score -2

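I'm building a TreeMap from scratch and trying to count the number of occurrences of every word in a text using Java. The text is read from a text file, and reading it is not a problem. What I don't know is how to count each word. Can someone help?

Imagine the text is something like this:

Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.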

Output: 
Over 1
time 1
computer 1
algorithms 5
...

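If possible, I'd like to ignore whether a word is uppercase or lowercase and count both forms together.

Edit: I don't want to use any kind of map (e.g. HashMap) or anything similar.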

java

2 Answers

Stack Overflow user

Accepted answer

Answered on 2022-01-13 16:59:40

Break the problem down as follows (this is a potential solution, not THE solution):

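1. Split the text into words (create a list or array of words).
2. Remove punctuation marks.
3. Create a map to collect the results.
4. Iterate over your list of words and add "1" to the value of each encountered key.
5. Display the results (iterate over the map's entry set).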

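Split the text into words

I prefer to split words using the space character as the delimiter. The reason is that if you split on non-word characters, you will break hyphenated words apart. I know hyphen usage is declining, but plenty of words still fall under this rule, for example "middle-aged". If you encounter such a word, you probably have to treat it as one word, not two.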

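Remove punctuation marks

Because of the decision above, you first need to remove any punctuation attached to your words. Keep in mind that if you use a regular expression to split the words, you may be able to do this step at the same time as the previous one. In fact, that is preferable, because you don't have to iterate twice: do both things in one pass. While you are at it, call toLowerCase() on the input string so that capitalized and lowercase forms of a word are counted together.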
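As a rough illustration of the two steps above, here is a minimal sketch that lowercases the text, strips punctuation while keeping hyphens, and then splits on whitespace. The class name `Tokenizer` and the method `tokenize` are made up for this example, and keeping the hyphen is just one possible policy (a stand-alone dash surrounded by spaces would survive as its own token).

```java
import java.util.Arrays;
import java.util.Locale;

public class Tokenizer {
    // Hypothetical helper: normalises case, removes punctuation except '-',
    // then splits on runs of whitespace.
    public static String[] tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                // \p{P} matches any punctuation; "&&[^-]" excludes the hyphen from the match
                .replaceAll("[\\p{P}&&[^-]]", "")
                .trim();
        if (cleaned.isEmpty()) {
            return new String[0]; // avoid returning a single empty token
        }
        return cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        // prints [a, middle-aged, engineer, over, time]
        System.out.println(Arrays.toString(tokenize("A middle-aged engineer, over time...")));
    }
}
```

Splitting on `\\s+` rather than a single space also avoids empty tokens when the input contains consecutive spaces or line breaks.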

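Create a map to collect the results

This is where you collect your counts. Use the TreeMap implementation of the Java Map interface. One thing to be aware of with this implementation is that the map is sorted by the natural order of its keys. Since the keys here are the words from the input text, they will be in alphabetical order, not ordered by count. If sorting the entries by count matters, there is a technique to "reverse" the map, making the values the keys and the keys the values. However, because two or more words can have the same count, you need to create a new map that groups words with the same count together.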
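The "reversed" map described above could look roughly like the sketch below. It assumes the counts are already in a `Map<String, Integer>`; the class name `InvertCounts` and the method `groupByCount` are hypothetical, and a descending `TreeMap` is only one way to get the highest counts first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertCounts {
    // Builds count -> words, with the largest counts first.
    public static Map<Integer, List<String>> groupByCount(Map<String, Integer> wordCount) {
        Map<Integer, List<String>> byCount = new TreeMap<>(Collections.reverseOrder());
        for (Map.Entry<String, Integer> e : wordCount.entrySet()) {
            // Create the list for this count on first use, then append the word.
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return byCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCount = new TreeMap<>();
        wordCount.put("algorithms", 5);
        wordCount.put("for", 2);
        wordCount.put("other", 2);
        wordCount.put("time", 1);
        // prints {5=[algorithms], 2=[for, other], 1=[time]}
        System.out.println(groupByCount(wordCount));
    }
}
```

Because the input here is a TreeMap, words that share a count stay in alphabetical order inside each list.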

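Iterate over your list of words

At this point you should have a list of words and a map structure to collect the counts. Using a lambda expression, you should be able to perform a count() of your words very easily. But if you are not familiar or comfortable with lambda expressions, you can use a regular loop to iterate over your list: do a containsKey() check to see whether the word was encountered before; if the map already contains the word, get() its value and add "1" to it; finally, put() the new count back into the map.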
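For comparison, the containsKey()/get()/put() sequence can be collapsed into a single `Map.merge` call with a method reference. This is only a sketch of one lambda-style variant, not necessarily the one the paragraph above has in mind; the class name `MergeCount` is made up.

```java
import java.util.Map;
import java.util.TreeMap;

public class MergeCount {
    public static Map<String, Integer> count(String[] words) {
        Map<String, Integer> wordCount = new TreeMap<>();
        for (String word : words) {
            // Inserts 1 for a new word, otherwise adds 1 to the existing count.
            wordCount.merge(word, 1, Integer::sum);
        }
        return wordCount;
    }

    public static void main(String[] args) {
        String[] words = {"to", "be", "or", "not", "to", "be"};
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(count(words));
    }
}
```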

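Display the results

Again, you can use a lambda expression to print the key-value pairs of the entry set, or simply iterate over the entry set to display the results.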
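A lambda-based display step might look like the sketch below, which uses `Map.forEach`. The class `DisplayCounts` and the method `render` are hypothetical; building a string first (instead of printing directly inside the lambda) is just a choice that makes the formatting easy to check.

```java
import java.util.Map;
import java.util.TreeMap;

public class DisplayCounts {
    // Formats each entry as "word: count" on its own line.
    public static String render(Map<String, Integer> wordCount) {
        StringBuilder out = new StringBuilder();
        wordCount.forEach((word, count) ->
                out.append(word).append(": ").append(count).append('\n'));
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCount = new TreeMap<>();
        wordCount.put("be", 2);
        wordCount.put("not", 1);
        System.out.print(render(wordCount));
    }
}
```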

Based on all the points above, a potential solution should look like the following (written without lambdas):

import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;

public class WordCount {
    public static void main(String[] args) {
        String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";

        // Remove all punctuation. Note that \p{P} also strips apostrophes and
        // hyphens, so "other's" becomes "others".
        text = text.replaceAll("\\p{P}", "");
        text = text.toLowerCase(); // turn all words into lowercase
        String[] wordArr = text.split(" "); // create the list of words

        // TreeMap keeps its keys (the words) in alphabetical order
        Map<String, Integer> wordCount = new TreeMap<>();

        // Collect the word count
        for (String word : wordArr) {
            if (!wordCount.containsKey(word)) {
                wordCount.put(word, 1); // first time this word is seen
            } else {
                int count = wordCount.get(word);
                wordCount.put(word, count + 1);
            }
        }

        Iterator<Entry<String, Integer>> iter = wordCount.entrySet().iterator();

        System.out.println("Output: ");
        while (iter.hasNext()) {
            Entry<String, Integer> entry = iter.next();
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}

This produces the following output:

Output: 
advantage: 1
algorithms: 5
and: 1
combine: 1
computer: 1
each: 1
engineers: 1
even: 1
for: 2
in: 1
invent: 1
more: 1
new: 1
of: 2
other: 2
others: 1
over: 1
producing: 1
results: 2
take: 1
the: 1
things: 1
time: 1
to: 1
turn: 1
utilize: 1
with: 1
work: 1

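Why would I break the problem down like this for such a mundane task? Simple. I believe each of these discrete steps should be extracted into a function to improve code reusability. Yes, it is cool to use lambda expressions to do everything at once and make the code look simpler. But what if you need one of the intermediate steps again and again? Most of the time, code ends up duplicated to accomplish this. In reality, a better solution is often to break these tasks down into methods. Some of these tasks, such as transforming the input text, can be done in a single method because the activities are related by nature. (There is such a thing as a method doing "too little".)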

public String[] createWordList(String text) {
    return text.replaceAll("\\p{P}", "").toLowerCase().split(" ");
}

public Map<String, Integer> createWordCountMap(String[] wordArr) {
    Map<String, Integer> wordCountMap = new TreeMap<>();

    for (String word : wordArr) {
        if(!wordCountMap.containsKey(word)){
            wordCountMap.put(word, 1);
        } else {
            int count = wordCountMap.get(word);
            wordCountMap.put(word, count + 1);
        }
    }

    return wordCountMap;
}

public void displayCount(Map<String, Integer> wordCountMap) {
    Iterator<Entry<String, Integer>> iter = wordCountMap.entrySet().iterator();
    
    while(iter.hasNext()) {
        Entry<String, Integer> entry = iter.next();
        System.out.println(entry.getKey() + ": " + entry.getValue());
    }
}

Now, having done this, your main method is more readable and your code is more reusable.

public static void main(String[] args) {
    
    WordCount wc = new WordCount();
    String text = "...";
    
    String[] wordArr = wc.createWordList(text);
    Map<String, Integer> wordCountMap = wc.createWordCountMap(wordArr);
    wc.displayCount(wordCountMap);
}

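Update

One small detail I forgot to mention: when a HashMap is used instead of a TreeMap, the output for this input happens to come out with the highest counts first. Do not rely on this, though. HashMap hashes the keys (here, the words), not the values, and it makes no guarantee about iteration order, so the ordering is incidental to this particular input. If ordering by count matters, you still need to "reverse" the map or sort the entries. After switching to HashMap, the output begins as follows: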

Output: 
algorithms: 5
other: 2
for: 2
turn: 1
computer: 1
producing: 1
...
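HashMap does not guarantee any iteration order, so if the output has to be ordered by count, one option is to sort the entries explicitly. The sketch below does that with a comparator; the class name `SortByCount` and the alphabetical tie-break are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortByCount {
    // Returns the entries ordered by count (descending), ties broken alphabetically.
    public static List<Map.Entry<String, Integer>> sortedByCount(Map<String, Integer> wordCount) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(wordCount.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // higher counts first
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCount = new HashMap<>();
        wordCount.put("time", 1);
        wordCount.put("other", 2);
        wordCount.put("algorithms", 5);
        wordCount.put("for", 2);
        for (Map.Entry<String, Integer> entry : sortedByCount(wordCount)) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

This works the same way for a TreeMap or a HashMap, since the order comes from the comparator rather than from the map.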

Score: 1

Stack Overflow user

Answered on 2022-01-13 15:33:54

My suggestion is to use a regexp, split, and streams with grouping, as in example 3.

EX1: this solution does not use any collection (list or map), only an array. In my view it is not optimal.

@Test
public void testApp2() {
    final String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
    final String lowerText = text.toLowerCase();
    final String[] split = lowerText.split("\\W+");
    System.out.println("Output: ");
    for (String s : split) {
        if (s == null) {
            continue;
        }
        int count = 0;
        for (int i = 0; i < split.length; i++) {
            final boolean sameWord = s.equals(split[i]);
            if (sameWord) {
                count = count + 1;
                split[i] = null;
            }
        }
        System.out.println(s + " " + count);
    }
}
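EX1 rescans the whole array for every word, so its cost grows quadratically with the number of words. Since the question mentions building a TreeMap from scratch, another map-free option is a hand-written binary search tree keyed by word. The sketch below is only an assumption about what that could look like: `WordTree` and `Node` are made-up names, and the tree is not balanced the way java.util.TreeMap (a red-black tree) is.

```java
public class WordTree {
    private static final class Node {
        final String word;
        int count = 1;
        Node left, right;
        Node(String word) { this.word = word; }
    }

    private Node root;

    // Inserts the word, or increments its count if it is already in the tree.
    public void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node current = root;
        while (true) {
            int cmp = word.compareTo(current.word);
            if (cmp == 0) { current.count++; return; }
            if (cmp < 0) {
                if (current.left == null) { current.left = new Node(word); return; }
                current = current.left;
            } else {
                if (current.right == null) { current.right = new Node(word); return; }
                current = current.right;
            }
        }
    }

    // Returns the stored count, or 0 if the word was never added.
    public int count(String word) {
        Node current = root;
        while (current != null) {
            int cmp = word.compareTo(current.word);
            if (cmp == 0) return current.count;
            current = cmp < 0 ? current.left : current.right;
        }
        return 0;
    }

    // In-order traversal prints the words alphabetically.
    public void print() { print(root); }

    private void print(Node node) {
        if (node == null) return;
        print(node.left);
        System.out.println(node.word + " " + node.count);
        print(node.right);
    }

    public static void main(String[] args) {
        WordTree tree = new WordTree();
        for (String w : "to be or not to be".split(" ")) {
            tree.add(w);
        }
        tree.print();
    }
}
```

With already-sorted input an unbalanced tree like this degrades to a linked list, which is the main thing a real TreeMap implementation avoids.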

EX2: I think this is what you meant, but I'm not sure whether I'm using too much here.

@Test
public void testApp() {
    final String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
    final String[] split = text.split("\\W+");
    final List<String> list = new ArrayList<>();
    System.out.println("Output: ");
    for (String s : split) {
        final String key = s.toUpperCase(); // normalise case before the duplicate check
        if (!list.contains(key)) {
            list.add(key);
            final long count = Arrays.stream(split).filter(s::equalsIgnoreCase).count();
            System.out.println(s+" "+count);
        }
    }

}

EX3 below is a test against your example, but using a Map.

@Test
public void test() {
    final String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
    Map<String, Long> result = Arrays.stream(text.split("\\W+")).collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()));
    assertEquals(Long.valueOf(5), result.get("algorithms")); // expected value first; new Long(...) is deprecated
    System.out.println("Output: ");
    result.entrySet().stream().forEach(x -> System.out.println(x.getKey() + " " + x.getValue()));
}

Score: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/70698555
