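I have a data frame of tweets. Each row corresponds to one tweet. I can get key phrases using AWS Comprehend batch_detect_key_phrases(). batch_detect_key_phrases() returns a ResultList and an ErrorList in its payload. To merge the key phrase results back into the data frame, they need to line up with the original tweets, so I need to keep ResultList and ErrorList aligned.
The code here, at line 267, processes ErrorList and ResultList separately.
According to the Boto docs, "ErrorList -- A list containing one object for each document that contained an error. The results are sorted in ascending order by the Index field and match the order of the documents in the input list..."
The code I wrote below uses the Index numbers of ResultList and ErrorList to make sure they are merged correctly into the keyPhrases list, which will then be merged back into the original data frame. Essentially, keyPhrases[0] holds the key phrases associated with data frame row 0. If there was an error processing a tweet, a placeholder error message is added for that row instead.
The only other way I can think of to keep ResultList and ErrorList aligned is to combine the two lists into one larger list, sorted in ascending order by Index, and then process that single larger list.
Is there a simpler way to process ResultList and ErrorList so that they stay aligned?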
keyphraseResults = {'ResultList': [
{'Index': 0, 'KeyPhrases': [{'Score': 0.9999997615814209, 'Text': 'financial status', 'BeginOffset': 26, 'EndOffset': 42}, {'Score': 1.0, 'Text': 'my job', 'BeginOffset': 58, 'EndOffset': 64}, {'Score': 1.0, 'Text': 'title', 'BeginOffset': 69, 'EndOffset': 71}, {'Score': 1.0, 'Text': 'a new job', 'BeginOffset': 77, 'EndOffset': 86}]},
{'Index': 1, 'KeyPhrases': [{'Score': 0.9999849796295166, 'Text': 'Holy moley', 'BeginOffset': 0, 'EndOffset': 4}, {'Score': 1.0, 'Text': 'Batman', 'BeginOffset': 27, 'EndOffset': 29}, {'Score': 1.0, 'Text': 'has a jacket', 'BeginOffset': 47, 'EndOffset': 55}]},
{'Index': 3, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'USA', 'BeginOffset': 4, 'EndOffset': 7}]},
{'Index': 5, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'home town', 'BeginOffset': 6, 'EndOffset': 15}]}],
'ErrorList': [{"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
{"ErrorCode": "456", "ErrorMessage": "Second error goes here", "Index": 4}],
'ResponseMetadata': {'RequestId': '123b6c73-45e0-4595-b943-612accdef41b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '123b6c73-e5f7-4b95-b52s-612acc71341d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1125', 'date': 'Sat, 06 Jun 2020 20:38:04 GMT'}, 'RetryAttempts': 0}}
# Holds the ordered list of key phrases that correspond to the data frame.
keyPhrases = []
# Start with an arbitrarily large index so that there is still a number to
# compare against if ErrorList below is empty.
errIndexlist = [9999]
# This will be inserted for the rows corresponding to the ErrorList.
ErrorMessage = "* Error processing keyphrases"
# Since the rows of the response need to be kept in alignment with the rows of the dataframe,
# get the error indices first, if any. These will be compared to the ResultList below.
if 'ErrorList' in keyphraseResults and len(keyphraseResults['ErrorList']) > 0:
    batchErroresults = keyphraseResults["ErrorList"]
    errIndexlist = []
    for entry in batchErroresults:
        errIndexlist.append(entry["Index"])
        print(entry)
    # Sort the indices to ensure they are in ascending order since that order is
    # important for the logic below.
    errIndexlist.sort(reverse = False)
if 'ResultList' in keyphraseResults:
    batchResults = keyphraseResults["ResultList"]
    for entry in batchResults:
        resultDict = entry["KeyPhrases"]
        if len(errIndexlist) > 0:
            if entry['Index'] < errIndexlist[0]:
                results = ""
                for textDict in resultDict:
                    results = results + ", " + textDict['Text']
                # Remove the leading comma.
                if len(results) > 1:
                    results = results[2:]
                keyPhrases.append(results)
            else:
                # Else we have an error to merge from the PRIOR result.
                keyPhrases.append(ErrorMessage)
                errIndexlist.remove(errIndexlist[0])
                # THEN add the key phrase for the current result.
                results = ""
                for textDict in resultDict:
                    results = results + ", " + textDict['Text']
                # Remove the leading comma.
                if len(results) > 1:
                    results = results[2:]
                keyPhrases.append(results)
print("\nFinal results are:")
for text in keyPhrases:
    print(text)

Posted on 2020-06-08 04:17:29
I figured it out based on this SO post.
In summary: merge ResultList and ErrorList, sort the merged list by Index, then process the merged list in order.
from operator import itemgetter
keyphraseResults = {'ResultList': [
{'Index': 0, 'KeyPhrases': [{'Score': 0.9999997615814209, 'Text': 'financial status', 'BeginOffset': 26, 'EndOffset': 42}, {'Score': 1.0, 'Text': 'my job', 'BeginOffset': 58, 'EndOffset': 64}, {'Score': 1.0, 'Text': 'title', 'BeginOffset': 69, 'EndOffset': 71}, {'Score': 1.0, 'Text': 'a new job', 'BeginOffset': 77, 'EndOffset': 86}]},
{'Index': 1, 'KeyPhrases': [{'Score': 0.9999849796295166, 'Text': 'Holy moley', 'BeginOffset': 0, 'EndOffset': 4}, {'Score': 1.0, 'Text': 'Batman', 'BeginOffset': 27, 'EndOffset': 29}, {'Score': 1.0, 'Text': 'has a jacket', 'BeginOffset': 47, 'EndOffset': 55}]},
{'Index': 3, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'USA', 'BeginOffset': 4, 'EndOffset': 7}]},
{'Index': 5, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'home town', 'BeginOffset': 6, 'EndOffset': 15}]}],
'ErrorList': [{"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
{"ErrorCode": "456", "ErrorMessage": "Second error goes here", "Index": 4}],
'ResponseMetadata': {'RequestId': '123b6c73-45e0-4595-b943-612accdef41b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '123b6c73-e5f7-4b95-b52s-612acc71341d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1125', 'date': 'Sat, 06 Jun 2020 20:38:04 GMT'}, 'RetryAttempts': 0}}
keyPhrases = []
# This will be inserted for the rows in ErrorList or just make it empty.
ErrorMessage = "* Error processing keyphrases"
if len(keyphraseResults["ResultList"]) > 0 and len(keyphraseResults["ErrorList"]) > 0:
    processResults = keyphraseResults["ResultList"].copy() + keyphraseResults["ErrorList"].copy()
elif len(keyphraseResults["ResultList"]) > 0:
    processResults = keyphraseResults["ResultList"].copy()
else:
    processResults = keyphraseResults["ErrorList"].copy()
processResults = sorted(processResults, key=itemgetter('Index'), reverse = False)
for entry in processResults:
    if 'ErrorCode' in entry:
        keyPhrases.append(ErrorMessage)
    elif 'KeyPhrases' in entry:
        resultDict = entry["KeyPhrases"]
        results = ""
        for textDict in resultDict:
            results = results + ", " + textDict['Text']
        # Remove the leading comma.
        if len(results) > 2:
            results = results[2:]
        keyPhrases.append(results)
print("\nFinal results are:")
for text in keyPhrases:
    print(text)

https://stackoverflow.com/questions/62239071
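
As a possible variation on the merge-and-sort approach, the alignment can also be done with a lookup keyed on Index, which avoids sorting altogether. The following is only a minimal sketch under a few assumptions: the function name `align_key_phrases` and its `doc_count` and `placeholder` parameters are hypothetical names introduced here (they are not part of boto3), `doc_count` is assumed to be the number of documents sent in the batch, and the sample response is trimmed to the fields the sketch actually reads.

```python
ERROR_PLACEHOLDER = "* Error processing keyphrases"


def align_key_phrases(response, doc_count, placeholder=ERROR_PLACEHOLDER):
    """Return one string per input document, in input order.

    Hypothetical helper for illustration only; not a boto3 function.
    """
    # Map each successful document's Index to its comma-joined phrase texts.
    phrases_by_index = {
        entry["Index"]: ", ".join(p["Text"] for p in entry["KeyPhrases"])
        for entry in response.get("ResultList", [])
    }
    # Any index without a result (listed in ErrorList or otherwise absent)
    # falls back to the placeholder, so the output length is always doc_count.
    return [phrases_by_index.get(i, placeholder) for i in range(doc_count)]


# Trimmed-down response in the same shape as the example data above.
sample = {
    "ResultList": [
        {"Index": 0, "KeyPhrases": [{"Text": "financial status"}, {"Text": "my job"}]},
        {"Index": 1, "KeyPhrases": [{"Text": "Batman"}]},
        {"Index": 3, "KeyPhrases": [{"Text": "USA"}]},
    ],
    "ErrorList": [
        {"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
    ],
}

aligned = align_key_phrases(sample, doc_count=4)
for row, text in enumerate(aligned):
    print(row, text)
```

Because the returned list always has exactly `doc_count` entries, it should line up row for row with a data frame built from the same batch of tweets, assuming the batch was sent in row order.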