How to merge the ResultList and ErrorList returned by AWS Comprehend batch_detect_key_phrases()

Stack Overflow user
Asked 2020-06-07 07:10:36
1 answer, 163 views, 0 followers, 0 votes

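I have a data frame of tweets, one tweet per row. I can get the key phrases with AWS Comprehend batch_detect_key_phrases(), which returns a ResultList and an ErrorList in its payload. To merge the key phrase results back into the data frame, they have to line up with the original tweets, so I need to keep the ResultList and ErrorList aligned.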
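
For readers who have not seen the call, here is a minimal sketch of fetching the two lists. The helper name fetch_key_phrases and the FakeComprehend class are hypothetical; the fake client only exists so the sketch runs without AWS credentials, and its rule for producing errors (empty text) is invented for the demo. The keyword arguments TextList and LanguageCode are the ones the boto3 method takes.

```python
def fetch_key_phrases(client, texts, language_code="en"):
    """Call batch_detect_key_phrases once and return (ResultList, ErrorList)."""
    response = client.batch_detect_key_phrases(
        TextList=texts, LanguageCode=language_code
    )
    # Every entry in either list carries an "Index" pointing back into `texts`.
    return response.get("ResultList", []), response.get("ErrorList", [])


class FakeComprehend:
    """Hypothetical offline stand-in for boto3.client("comprehend")."""

    def batch_detect_key_phrases(self, TextList, LanguageCode):
        results, errors = [], []
        for i, text in enumerate(TextList):
            if text.strip():
                results.append({"Index": i, "KeyPhrases": [{"Text": text}]})
            else:
                # Invented failure rule, for demonstration only.
                errors.append({"Index": i, "ErrorCode": "demo",
                               "ErrorMessage": "empty document"})
        return {"ResultList": results, "ErrorList": errors}


# With real credentials you would pass boto3.client("comprehend") instead.
tweets = ["my new job", "", "home town"]
result_list, error_list = fetch_key_phrases(FakeComprehend(), tweets)
print([r["Index"] for r in result_list])  # [0, 2]
print([e["Index"] for e in error_list])   # [1]
```

The gap at index 1 in the first list is exactly the alignment problem described above: a failed document is absent from ResultList and shows up only in ErrorList.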

The code here, at line 267, processes the ErrorList and ResultList separately.

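According to the Boto docs, "ErrorList -- A list containing one object for each document that contained an error. The results are sorted in ascending order by the Index field and match the order of the documents in the input list…"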

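The code I wrote below uses the Index values in the ResultList and ErrorList to make sure the entries are merged correctly into the keyPhrases list, which is then merged back into the original data frame. Essentially, keyPhrases[0] holds the key phrases for data frame row 0. If an error occurred while processing a tweet, a placeholder error message is added for that row instead.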

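The only other way I can think of to keep the ResultList and ErrorList aligned is to combine the two lists into one larger list, sorted in ascending order by Index, and then process that single list.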

Is there a simpler way to process the ResultList and ErrorList so that they stay aligned?

keyphraseResults = {'ResultList': [
            {'Index': 0, 'KeyPhrases': [{'Score': 0.9999997615814209, 'Text': 'financial status', 'BeginOffset': 26, 'EndOffset': 42}, {'Score': 1.0, 'Text': 'my job', 'BeginOffset': 58, 'EndOffset': 64}, {'Score': 1.0, 'Text': 'title', 'BeginOffset': 69, 'EndOffset': 71}, {'Score': 1.0, 'Text': 'a new job', 'BeginOffset': 77, 'EndOffset': 86}]}, 
            {'Index': 1, 'KeyPhrases': [{'Score': 0.9999849796295166, 'Text': 'Holy moley', 'BeginOffset': 0, 'EndOffset': 4}, {'Score': 1.0, 'Text': 'Batman', 'BeginOffset': 27, 'EndOffset': 29}, {'Score': 1.0, 'Text': 'has a jacket', 'BeginOffset': 47, 'EndOffset': 55}]},                 
            {'Index': 3, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'USA', 'BeginOffset': 4, 'EndOffset': 7}]}, 
            {'Index': 5, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'home town', 'BeginOffset': 6, 'EndOffset': 15}]}], 
'ErrorList': [{"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
              {"ErrorCode": "456", "ErrorMessage": "Second error goes here", "Index": 4}], 
'ResponseMetadata': {'RequestId': '123b6c73-45e0-4595-b943-612accdef41b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '123b6c73-e5f7-4b95-b52s-612acc71341d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1125', 'date': 'Sat, 06 Jun 2020 20:38:04 GMT'}, 'RetryAttempts': 0}}

# Holds the ordered list of key phrases that correspond to the data frame. 
keyPhrases = []

# Default to an arbitrarily large index so that, if ErrorList below is empty,
# there is still a value to compare against.
errIndexlist = [9999]

# This will be inserted for the rows corresponding to the ErrorList. 
ErrorMessage = "* Error processing keyphrases"

# Since the rows of the response need to be kept in alignment with the rows of the dataframe, 
# get the error indices first, if any. These will be compared to the ResultList below.
if 'ErrorList' in keyphraseResults and len(keyphraseResults['ErrorList']) > 0:
    batchErroresults = keyphraseResults["ErrorList"]
    errIndexlist = []

    for entry in batchErroresults:
        errIndexlist.append(entry["Index"])
        print(entry)

# Sort the indices to ensure they are in ascending order, since that order is 
# important for the logic below. 
errIndexlist.sort()

if 'ResultList' in keyphraseResults:

    batchResults = keyphraseResults["ResultList"]

    for entry in batchResults:

        # Insert a placeholder for every error whose index precedes this result.
        # A loop (rather than a single check) also covers consecutive errors, and
        # results are still appended once the error list has been used up.
        while len(errIndexlist) > 0 and errIndexlist[0] < entry['Index']:
            keyPhrases.append(ErrorMessage)
            errIndexlist.pop(0)

        # Join the key phrase texts for the current result into one string.
        results = ", ".join(textDict['Text'] for textDict in entry["KeyPhrases"])

        keyPhrases.append(results)

print("\nFinal results are:")
for text in keyPhrases:
    print(text)

1 Answer

Stack Overflow user

Accepted answer

Posted on 2020-06-08 04:17:29

I figured it out based on this SO post.

In short: combine the ResultList and ErrorList, sort the combined list by Index, then process the combined list in order.

from operator import itemgetter

keyphraseResults = {'ResultList': [
        {'Index': 0, 'KeyPhrases': [{'Score': 0.9999997615814209, 'Text': 'financial status', 'BeginOffset': 26, 'EndOffset': 42}, {'Score': 1.0, 'Text': 'my job', 'BeginOffset': 58, 'EndOffset': 64}, {'Score': 1.0, 'Text': 'title', 'BeginOffset': 69, 'EndOffset': 71}, {'Score': 1.0, 'Text': 'a new job', 'BeginOffset': 77, 'EndOffset': 86}]}, 
        {'Index': 1, 'KeyPhrases': [{'Score': 0.9999849796295166, 'Text': 'Holy moley', 'BeginOffset': 0, 'EndOffset': 4}, {'Score': 1.0, 'Text': 'Batman', 'BeginOffset': 27, 'EndOffset': 29}, {'Score': 1.0, 'Text': 'has a jacket', 'BeginOffset': 47, 'EndOffset': 55}]},                 
        {'Index': 3, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'USA', 'BeginOffset': 4, 'EndOffset': 7}]}, 
        {'Index': 5, 'KeyPhrases': [{'Score': 0.9999970197677612, 'Text': 'home town', 'BeginOffset': 6, 'EndOffset': 15}]}], 
        'ErrorList': [{"ErrorCode": "123", "ErrorMessage": "First error goes here", "Index": 2},
          {"ErrorCode": "456", "ErrorMessage": "Second error goes here", "Index": 4}], 
        'ResponseMetadata': {'RequestId': '123b6c73-45e0-4595-b943-612accdef41b',   'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '123b6c73-e5f7-4b95-b52s-612acc71341d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '1125', 'date': 'Sat, 06 Jun 2020 20:38:04 GMT'}, 'RetryAttempts': 0}}

keyPhrases = []

# Placeholder inserted for the rows listed in ErrorList (use "" to leave them empty).
ErrorMessage = "* Error processing keyphrases"

# Combine both lists. .get() tolerates a missing key, and + builds a new list,
# so the original response is left untouched.
processResults = keyphraseResults.get("ResultList", []) + keyphraseResults.get("ErrorList", [])

# Sort by Index so the entries line up with the data frame rows.
processResults = sorted(processResults, key=itemgetter('Index'))

for entry in processResults:

    if 'ErrorCode' in entry:
        keyPhrases.append(ErrorMessage)

    elif 'KeyPhrases' in entry:
        # Join the phrase texts into one comma-separated string.
        results = ", ".join(textDict['Text'] for textDict in entry["KeyPhrases"])

        keyPhrases.append(results)

print("\nFinal results are:")
for text in keyPhrases:
    print(text)
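
As a possible alternative to sorting, here is a small sketch that fills a pre-sized list by Index. This is a different technique from the merge-and-sort above, and it assumes you know how many documents were sent in the batch; align_key_phrases, n_docs, and the trimmed-down sample response are hypothetical names and data used only for illustration.

```python
def align_key_phrases(response, n_docs, placeholder="* Error processing keyphrases"):
    """Return one string per input document, in input order."""
    # Start with the placeholder in every slot; any index that never shows up
    # in ResultList (for example, one reported in ErrorList) keeps it.
    aligned = [placeholder] * n_docs
    for entry in response.get("ResultList", []):
        aligned[entry["Index"]] = ", ".join(p["Text"] for p in entry["KeyPhrases"])
    return aligned


sample = {
    "ResultList": [
        {"Index": 0, "KeyPhrases": [{"Text": "financial status"}, {"Text": "my job"}]},
        {"Index": 2, "KeyPhrases": [{"Text": "USA"}]},
    ],
    "ErrorList": [{"Index": 1, "ErrorCode": "123", "ErrorMessage": "First error goes here"}],
}
print(align_key_phrases(sample, n_docs=3))
# ['financial status, my job', '* Error processing keyphrases', 'USA']
```

Because the returned list has exactly one entry per input row, it could be assigned straight to a new data frame column. The trade-off is that ErrorList is never read, so the individual error messages are dropped unless you handle them separately.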
Votes: 0
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/62239071
