文章/答案/技术大牛

发布

社区首页 >问答首页 >围绕特定单词的句子选择

问围绕特定单词的句子选择
EN

Stack Overflow用户

提问于 2021-01-26 19:19:12

回答 2查看 86关注 0票数 0

假设我有一段话：

Str_wrds ="Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression. The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. This study proposes two approaches, namely, pointwise CIs and simultaneous CIs, to measure the uncertainty associated with an SVM-based power curve model. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models. The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs."

并具有以下test_wrds：

Test_wrds = ['Power curve', 'data-driven','wind turbines']

只要Test_wrds在段落中找到1句话，我就会选择它之前和之后，并将它们作为单独的字符串列出。例如，Test_wrds Power curve首先出现在第一个句子中，因此当我们选择第二个句子时，还有另一个Power curve单词，因此输出将如下所示

Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods.

同样，我想为data-driven和wind turbines分割句子，并将它们保存在单独的字符串中。

我如何使用Python以一种简单的方式实现这一点？

到目前为止，我发现只要有任何Text_wrds，代码基本上就会删除整个句子。

def remove_sentence(Str_wrds , Test_wrds):
    return ".".join((sentence for sentence in input.split(".")
                    if Test_wrds not in sentence))

但是我不知道如何使用它来解决我的问题。

更新问题:基本上，只要段落中存在test_wrds，我就会将该句子以及前后的句子切片，并将其保存在单个字符串中。例如，对于三个text_wrds，我希望得到三个字符串，它们基本上分别包含了带有text_wrds的句子。我附上了pdf，例如，输出，我正在寻找

scapy

python

text

nlp

nltk

回答 2

Stack Overflow用户

发布于 2021-01-26 19:45:13

您可以定义一个如下所示的函数

def find_sentences( word, text ):
    sentences = text.split('.')
    findings = []
    for i in range(len(sentences)):
        if word.lower() in sentences[i].lower():
            if i==0:
                findings.append( sentences[i+1]+'.' )
            elif i==len(sentences)-1:
                findings.append( sentences[i-1]+'.' )
            else:
                findings.append( sentences[i-1]+'.' + sentences[i+1]+'.' )
    return findings

然后，可以将其调用为

findings = find_sentences( 'Power curve', Str_wrds )

一些漂亮的印刷品

for finding in findings:
print( finding +'\n')

我们得到了结果

However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height.

Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. Data-driven model accuracy is significantly affected by uncertainty.

The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models.

The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines..

我希望这就是你想要的:)

票数 0

Stack Overflow用户

发布于 2021-01-26 20:50:22

当你说，

只要Test_wrds在段落中找到一句话，我就会选择它之前和之后，并将它们作为单独的字符串列出。

我猜你的意思是，所有包含Test_wrds中的一个单词的句子，在它们之前和之后的句子，也应该被选中。

函数

def remove_sentence(Str_wrds: str, Test_wrds):
    # store all selected sentences
    all_selected_sentences = {}
    # initialize empty dictionary
    for k in Test_wrds:
        # one element for each occurrence
        all_selected_sentences[k] = [''] * Str_wrds.lower().count(k.lower())

    # list of sentences
    sentences = Str_wrds.split(".")

    word_counter = {}.fromkeys(Test_wrds,0)

    for i, sentence in enumerate(sentences):
        for j, word in enumerate(Test_wrds):
            # case insensitive
            if word.lower() in sentence.lower():
                if i == 0:  # first sentence
                    chosen_sentences = sentences[0:2]
                elif i == len(sentences) - 1:  # last sentence
                    chosen_sentences = sentences[-2:]
                else:
                    chosen_sentences = sentences[i - 1:i + 2]

                # get which occurrence of the word is it
                k = word_counter[word]

                all_selected_sentences[word][k] += '.'.join(
                    [s for s in chosen_sentences
                        if s not in all_selected_sentences[word][k]]) + "."

                word_counter[word] += 1  # increment the word counter

    return all_selected_sentences

运行以下代码

answer = remove_sentence(Str_wrds, Test_wrds)
print(answer)

使用为Str_wrds和Test_wrds提供的值，返回以下输出

{
    'Power curve': [
        'Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height.',
        'Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty.',
        ' The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. This study proposes two approaches, namely, pointwise CIs and simultaneous CIs, to measure the uncertainty associated with an SVM-based power curve model. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models.',
        ' The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs.'
    ],
    'data-driven': [
        ' However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods.',
        ' Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression.',
        ' Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression. The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated.'
    ],
    'wind turbines': [
        ' A radial basis function is taken as the kernel function to improve the accuracy of the SVM models. The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs.'
    ]
}

备注：

此函数返回一个lists
dict每个键都是Test_wrds中的一个单词，列表元素是该单词的一个匹配项。例如，
，因为单词‘幂曲线’在整个文本中出现了4次，所以输出中‘幂曲线’的值是一个包含4个元素的列表。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/65900364

复制

相似问题

问围绕特定单词的句子选择
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问围绕特定单词的句子选择EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问围绕特定单词的句子选择
EN