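A few days ago, I did a coding challenge for a potential job. I was quite satisfied with my code, until I got the reaction that it wasn't good enough. :( So apparently I'm still making mistakes. I have asked for feedback, but got no response. I would really like to know what my weak points are, so that I can improve. Can anybody take a quick look and tell me what could be better?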
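Description of the challenge:
Write a Python 3 package which generates the most important key-phrases (or key-words) from a document based on a corpus. Attached you will find a zip archive which includes:
Instructions:
Deliverables
My full submission: https://github.com/GMathyssen/NLP-challenge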
keywords.py
# -*- coding: utf-8 -*-
__author__ = 'Gert'
import string
import pandas as pd
import nltk
import sys
from nltk.corpus import stopwords
nltk.download('stopwords')

def main():
    # Amount of max words in key-word
    number_grams = 3
    number_top_keywords = 20
    save_file = open(sys.argv[1], 'a')
    # Reading in the minimum data
    script = open(sys.argv[2], "r").read()
    total_trans = open(sys.argv[3], "r").read()
    names_trans = [str(sys.argv[3]) + "\n"]
    # Reading in optional extra transcripts
    for tran in sys.argv[4:]:
        total_trans += open(tran, "r").read()
        names_trans.append(str(tran) + "\n")
    # Processing text from the script and group key-words in script dataframe
    script_data = ngrams_to_strings(get_n_grams(text_process(script), number_grams))
    script_df = group_in_dataframe(script_data, "Main script")
    # Taking the top n words from the script dataframe
    script_df_top = script_df.head(number_top_keywords)
    # Processing text from all the transcripts and group key-words in a dataframe
    total_trans_data = ngrams_to_strings(get_n_grams(text_process(total_trans), number_grams))
    total_trans_df = group_in_dataframe(total_trans_data, "Transcripts")
    # Merge script dataframe and transcripts dataframe into one
    script_trans_df = pd.concat([script_df_top, total_trans_df], axis=1, join="inner")
    # Sort merged dataframe to appearance in transcipts
    script_trans_df = script_trans_df.sort_values("Transcripts", ascending=False)
    string1 = "\nMain script:\n%s" % sys.argv[2]
    string2 = "\nTranscripts:\n"
    string3 = "\nThe top %s key-words in the main script:\n" % number_top_keywords
    string4 = "\nThe top %s key-words in the main script, ranked by appearance in the transcripts:\n" % number_top_keywords
    # Print and write to .txt file
    printlist = [string1, string2] + names_trans + [string3, str(script_df_top), string4, str(script_trans_df)]
    for string in printlist:
        print(string)
        save_file.write(string)

def text_process(text):
    # Check characters to see if they are in punctuation
    no_punc = [char for char in text if char not in string.punctuation]
    # Join the characters again to form the string
    no_punc = ''.join(no_punc)
    # Remove any stopwords
    no_stopw = [word for word in no_punc.split() if word.lower() not in stopwords.words('english')]
    # Stemming the words
    stemmer = nltk.stem.snowball.EnglishStemmer(no_stopw)
    return [stemmer.stem(i) for i in no_stopw]

def get_n_grams(word_list, n):
    ngrams = []
    count = 1
    while count <= n:
        for i in range(len(word_list)-(count-1)):
            ngrams.append(word_list[i:i+count])
        count += 1
    return ngrams

def ngrams_to_strings(ngrams):
    # First doing a sort, so that the grams with an other word order are the same
    ngrams_sorted = ([sorted(i) for i in ngrams])
    return [' '.join(i) for i in ngrams_sorted]

def group_in_dataframe(data, column_name):
    df = pd.DataFrame(data=data, columns=["key-word"])
    df = pd.DataFrame(df.groupby("key-word").size().rename(column_name))
    return df.sort_values(column_name, ascending=False)

if __name__ == "__main__":
    main()

test_keywords.py
# -*- coding: utf-8 -*-
import unittest
from keywords import text_process, get_n_grams, ngrams_to_strings

class TestKW(unittest.TestCase):
    def test_text_process(self):
        self.assertEqual(text_process("This is a special test, monkeys like tests!"),
                         ['special', 'test', 'monkey', 'like', 'test'])

    def test_get_n_grams(self):
        self.assertEqual(get_n_grams(['special', 'monkey', 'like'], 2),
                         [['special'], ['monkey'], ['like'], ['special', 'monkey'], ['monkey', 'like']])

    def test_ngrams_to_strings(self):
        self.assertEqual(ngrams_to_strings([["apple"], ["the", "king"]]),
                         ['apple', 'king the'])

if __name__ == '__main__':
    unittest.main()

Posted on 2017-06-09 19:45:50
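I haven't combed through your code in detail, partly because I suspect the people who gave you this task didn't either. I'm guessing that it actually does what it is supposed to do, and performs the requested task correctly and without bugs. In that case, their problems are probably with the overall package structure/design/implementation. In general, though, the answer depends to some extent on the job you applied for. The expectations for a mid-level engineer position are obviously very different from those for a senior engineer, so some hint about which kind of position this was would help. Regarding the structure/design/implementation of the code, and without going into details, I just want to make a few comments that may be part of their concerns: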
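Consider reusability when implementing your package. It should be generic enough that, given a certain input, it provides the required output, although that is hard to judge, since I may be guessing their intent incorrectly. You allow the user to specify different input variables on the command line, but this is also a Python package. An important part of Python packages is that they can be imported by other packages/modules and used as needed. The way you have built the package, it can pretty much only be used from the command line. In my mind, making it more generic and more reusable means you could import it from other Python code and use it to do these calculations without much effort. Currently, your three helper functions can be imported from other modules, but they provide only a small part of the overall functionality. The code that most needs to be reusable is your main logic, which is locked inside main(), and that function isn't reusable at all because it takes its input from the command line. As I said, I don't know what they saw, but these are the things that come to mind (whether stated or not) when I look at your code and their requirements.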
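To make the reusability point concrete, here is a minimal sketch of one way the split could look. It is not the original submission: the names `simple_process`, `count_keywords` and `rank_keywords` are hypothetical, `simple_process` is a standard-library stand-in for the question's `text_process` (no stop-word removal or stemming), and a `Counter` replaces the pandas dataframes. The idea is only that the core logic takes plain strings and returns data, while `main()` keeps nothing but file handling and printing.

```python
import string
import sys
from collections import Counter


def simple_process(text):
    """Hypothetical stand-in for text_process: strip punctuation, lowercase, split."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.lower().split()


def count_keywords(text, max_words=3, process=simple_process):
    """Count every 1..max_words-gram in text, ignoring word order inside a gram."""
    words = process(text)
    counts = Counter()
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            counts[" ".join(sorted(words[start:start + size]))] += 1
    return counts


def rank_keywords(script_text, transcript_texts, max_words=3, top_n=20,
                  process=simple_process):
    """Return (keyword, script_count, transcript_count) tuples for the script's
    top keywords that also occur in the transcripts, ordered by transcript count."""
    script_counts = count_keywords(script_text, max_words, process)
    transcript_counts = count_keywords(" ".join(transcript_texts), max_words, process)
    top = script_counts.most_common(top_n)
    shared = [(kw, n, transcript_counts[kw]) for kw, n in top if kw in transcript_counts]
    return sorted(shared, key=lambda row: row[2], reverse=True)


def main(argv=None):
    """Thin command-line wrapper: only file handling and printing live here."""
    argv = sys.argv[1:] if argv is None else argv
    # Same argument order as the question: output file, script, transcripts.
    out_path, script_path, *transcript_paths = argv
    with open(script_path) as handle:
        script_text = handle.read()
    transcript_texts = []
    for path in transcript_paths:
        with open(path) as handle:
            transcript_texts.append(handle.read())
    rows = rank_keywords(script_text, transcript_texts)
    with open(out_path, "a") as out:
        for keyword, in_script, in_transcripts in rows:
            line = f"{keyword}\t{in_script}\t{in_transcripts}"
            print(line)
            out.write(line + "\n")
```

With a layout like this, other Python code could call `rank_keywords(script_text, transcript_texts)` directly and get a list back, and the usual `if __name__ == "__main__": main()` guard would keep the command-line behaviour. Passing the real `text_process` as the `process` argument would restore the stop-word removal and stemming.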
https://codereview.stackexchange.com/questions/164507