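Q: Markov text generator program in Python

Code Review user

Asked 2021-08-30 09:32:48

2 answers, 235 views, 0 followers, 2 votes

This is my first non-trivial program in Python. I come from a Java background, so I may have messed up or ignored some conventions. I would like to hear feedback on my code.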

import nltk
import random

file_name = input()
file = open(file_name, "r", encoding="utf-8")
# nltk.trigrams returns a list of 3-tuples
trigrams = list(nltk.trigrams(file.read().split()))
file.close()

model = {}
for trigram in trigrams:
    head = trigram[0] + " " + trigram[1]
    tail = trigram[2]
    model.setdefault(head, {})
    model[head].setdefault(tail, 0)
    model[head][tail] += 1

possible_starting_heads = []
sentence_ending_punctuation = (".", "!", "?")
for key in model.keys():
    if key[0].isupper() and not key.split(" ")[0].endswith(sentence_ending_punctuation):
        possible_starting_heads.append(key)

# Generate 10 pseudo-sentences based on model
for _ in range(10):
    tokens = []
    # Chooses a random starting head from list
    head = random.choice(possible_starting_heads)
    # print("Head: ", head)
    tokens.append(head)
    while True:
        possible_tails = list(model[head].keys())
        weights = list(model[head].values())
        # Randomly select elements from list taking their weights into account
        most_probable_tail = random.choices(possible_tails, weights, k=1)[0]
        # print("Most probable tail: ", most_probable_tail)

        if most_probable_tail.endswith(sentence_ending_punctuation) and len(tokens) >= 5:
            tokens.append(most_probable_tail)
            # print("Chosen tail and ending sentence: ", most_probable_tail)
            break
        elif not most_probable_tail.endswith(sentence_ending_punctuation):
            tokens.append(most_probable_tail)
            # print("Chosen tail: ", most_probable_tail)
            head = head.split(" ")[1] + " " + most_probable_tail
        elif most_probable_tail.endswith(sentence_ending_punctuation) and len(tokens) < 5:
            # print("Ignoring tail: ", most_probable_tail)
            tokens = []
            head = random.choice(possible_starting_heads)
            tokens.append(head)

    pseudo_sentence = " ".join(tokens)
    print(pseudo_sentence)


Tags: python, beginner, random, natural-language-processing, markov-chain

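2 answers

Code Review user

Posted 2021-08-30 20:51:49

Here is a stream-of-consciousness review:

Reading the file

input() takes an argument, usually a prompt or question, so that the user knows what to enter. You could also check sys.argv for a file name given on the command line.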
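The two suggestions could be combined roughly as follows. This is only a sketch: the helper name get_file_name and the prompt text are made up for illustration and do not come from the original program.

```python
import sys


def get_file_name(argv):
    """Return the file name from the command line, or ask for it interactively."""
    if len(argv) > 1:
        # The first command-line argument is taken to be the path of the training text
        return argv[1]
    # No argument given: fall back to a prompt so the user knows what to type
    return input("Path to the training text file: ")


# Typical use at the top of the script:
#     file_name = get_file_name(sys.argv)
```

Passing argv in as a parameter, rather than reading sys.argv inside the function, keeps the helper easy to try out without touching the real command line.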

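Avoid reusing the names of built-ins as variable names (file, for example, was a built-in type in Python 2).

The file object returned by open() is a context manager, so it can be used in a with statement. In the current code, an exception anywhere in list(nltk.trigrams(file.read().split())) could leave the file open. A with statement ensures the file gets closed: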

with open(file_name, "r", encoding="utf-8") as input_file:
    # nltk.trigrams returns a list of 3-tuples
    trigrams = list(nltk.trigrams(input_file.read().split()))

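The `model` data structure

Get to know the collections module in the standard library. Here, defaultdict and Counter would be useful.

Tuples can be dictionary keys, so there is no need to join the first two items of each trigram into a string.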

import collections

model = collections.defaultdict(collections.Counter)

for trigram in trigrams:
    model[trigram[:2]].update(trigram[2:])

Or, more explicitly:

for word1, word2, word3 in trigrams:
    model[(word1, word2)].update((word3,))
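To see what this structure ends up holding, here is a small self-contained sketch. The toy corpus is invented, and zip() is used as a stand-in for nltk.trigrams so the snippet runs without NLTK; `model[head][word3] += 1` is equivalent to the update((word3,)) call shown above.

```python
import collections

# Hypothetical toy corpus; zip() stands in for nltk.trigrams here
words = "the cat sat on the mat the cat sat down".split()
trigrams = list(zip(words, words[1:], words[2:]))

# Each (word1, word2) head maps to a Counter of the words that followed it
model = collections.defaultdict(collections.Counter)
for word1, word2, word3 in trigrams:
    model[(word1, word2)][word3] += 1

print(model[("the", "cat")])
print(model[("cat", "sat")])
```

In this corpus the head ("the", "cat") is followed by "sat" twice, while ("cat", "sat") is followed once each by "on" and "down".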

Then the possible starting heads can be collected in the same loop:

possible_starting_heads = collections.Counter()

for word1, word2, word3 in trigrams:
    model[(word1, word2)].update((word3,))

    if word1[0].isupper() and not word1.endswith(sentence_ending_punctuation):
        # Count the pair as one key; update((word1, word2)) would count each word separately
        possible_starting_heads[(word1, word2)] += 1

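Generating sentences

If you want to generate a lot of sentences, re-creating the list of candidate tails and their weights every time could slow things down. Consider restructuring the model to better suit its final use. This can be done once, up front: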

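from itertools import accumulate

new_model = {}
for key, counter in model.items():
    # Store the tails next to their cumulative counts, ready for random.choices(cum_weights=...)
    new_model[key] = (list(counter.keys()), list(accumulate(counter.values())))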
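As a rough illustration of why cumulative counts are stored: random.choices converts relative weights into cumulative ones internally, so supplying cum_weights directly skips that step on every draw. The counts below are invented for the example.

```python
import collections
import random
from itertools import accumulate

# Hypothetical counts of the words seen after one particular head
counter = collections.Counter({"on": 3, "down": 1})

tails = list(counter.keys())
cum_weights = list(accumulate(counter.values()))  # running totals: 3, then 3 + 1

# "on" is drawn with probability 3/4 and "down" with probability 1/4
chosen_tail = random.choices(tails, cum_weights=cum_weights, k=1)[0]
print(tails, cum_weights, chosen_tail)
```

Note that random.choices always returns a list, even with k=1, hence the [0].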

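most_probable_tail is a misnomer; it is the tail that was chosen, not necessarily the most probable one.

The logic of the if ... elif ... elif statement is hard to follow, and .endswith(sentence_ending_punctuation) may be called up to three times.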

# Keys of the Counter as a list, so that random.choice can index into it
starting_heads = list(possible_starting_heads)

# Choose a random starting head; heads are (word1, word2) tuples
head = random.choice(starting_heads)
tokens = list(head)

while True:
    possible_tails, cum_weights = new_model[head]

    # random.choices returns a list, so take its single element
    chosen_tail = random.choices(possible_tails, cum_weights=cum_weights, k=1)[0]

    if not chosen_tail.endswith(sentence_ending_punctuation):
        tokens.append(chosen_tail)
        head = (head[1], chosen_tail)

    else:
        # The head now contributes two tokens, so 6 matches the original threshold of 5
        if len(tokens) >= 6:
            tokens.append(chosen_tail)
            break

        else:
            # Sentence too short: start over from a new random head
            head = random.choice(starting_heads)
            tokens = list(head)

pseudo_sentence = " ".join(tokens)

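Other thoughts

You could store the first word of each sentence in the model under the key ('', '') and the second word under the key ('', word1). Then they would not need separate handling when generating sentences.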
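A minimal sketch of that idea, under a few assumptions: the names START, build_model and generate_sentence are invented here, the corpus is a toy one, and the minimum-length rule from the original program is left out to keep the example short.

```python
import collections
import random

SENTENCE_END = (".", "!", "?")
START = ("", "")  # sentinel head marking the start of a sentence


def build_model(words):
    """Map each (word1, word2) head to a Counter of the words that follow it."""
    model = collections.defaultdict(collections.Counter)
    head = START
    for word in words:
        model[head][word] += 1
        # After a sentence-ending word, the next word starts a new sentence
        head = START if word.endswith(SENTENCE_END) else (head[1], word)
    return model


def generate_sentence(model):
    """Walk the chain from START until a sentence-ending word is chosen."""
    tokens = []
    head = START
    while head in model:
        tails = model[head]
        word = random.choices(list(tails), weights=list(tails.values()), k=1)[0]
        tokens.append(word)
        if word.endswith(SENTENCE_END):
            break
        head = (head[1], word)
    return " ".join(tokens)


model = build_model("The cat sat. The dog ran. A cat ran.".split())
print(generate_sentence(model))
```

Because sentence starts live under the START key, there is no separate list of starting heads, and starts are automatically weighted by how often each one occurs in the text.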

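NLTK has a function, word_tokenize(), that may be useful. It breaks text into tokens such as words, numbers, and punctuation. It can recognize things like "Dr."

Votes: 2

Code Review user

Posted 2021-08-30 19:28:04

In what follows, I assume you are using Python 3 rather than Python 2, although I can't say for sure that it makes a difference.

First, you can use a context manager when parsing the file: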

file_name = input()
with open(file_name, "r", encoding="utf-8") as file:
    # nltk.trigrams returns a list of 3-tuples
    trigrams = list(nltk.trigrams(file.read().split()))

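This way, even if the assignment to trigrams raises an exception (in any of the four calls), the file stream will still be closed properly.

You can simplify building the model with defaultdict from the collections package. Your approach isn't actually wrong, and I can't say for certain that this option is more Pythonic, but it may be interesting to know.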

import collections
model = collections.defaultdict(lambda : collections.defaultdict(int))
for trigram in trigrams:
    head = trigram[0] + " " + trigram[1]
    tail = trigram[2]
    model[head][tail] += 1

This doesn't change the behavior of your algorithm; I just find it a bit simpler.

But you can also do something more memory-efficient:

import collections
model = collections.defaultdict(lambda : collections.defaultdict(int))
file_name = input()
with open(file_name, "r", encoding="utf-8") as file:
    # nltk.trigrams yields 3-tuples one at a time
    trigrams = nltk.trigrams(file.read().split())
    for trigram in trigrams:
        head = trigram[0] + " " + trigram[1]
        tail = trigram[2]
        model[head][tail] += 1

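Because nltk.trigrams returns an iterator and you only use it once, there is no need to store it in a list before iterating over it (building the list costs time and memory to copy everything out of the iterator). You could even drop the trigrams variable entirely and write for ... in nltk.trigrams... directly.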
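The "only use it once" condition matters, as the following sketch tries to show. iter_trigrams is a hypothetical stand-in for nltk.trigrams so that the snippet runs without NLTK, and the four-word corpus is invented.

```python
import collections
from itertools import islice


def iter_trigrams(words):
    """Hypothetical stand-in for nltk.trigrams: returns an iterator of 3-tuples."""
    return zip(words, islice(words, 1, None), islice(words, 2, None))


words = "a b c d".split()

# Build the model straight from the iterator, with no intermediate list
model = collections.defaultdict(lambda: collections.defaultdict(int))
for word1, word2, word3 in iter_trigrams(words):
    model[word1 + " " + word2][word3] += 1

# Caveat: an iterator can only be consumed once
trigram_iter = iter_trigrams(words)
first_pass = list(trigram_iter)
second_pass = list(trigram_iter)  # empty, the iterator is already exhausted
print(first_pass, second_pass)
```

If the trigrams were needed a second time, for example to collect starting heads in a separate loop, keeping the list (or doing both jobs in one pass) would be the safer choice.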

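Finally, in your while loop, most_probable_tail is misnamed: it is not the most probable tail, but one that the algorithm picks at random from a possibly non-uniform distribution. I would rather call it candidate_tail, or selected_tail, or, better still, simply tail.

Votes: 1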

Original content provided by Code Review.

Original link:

https://codereview.stackexchange.com/questions/266521
