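I am using SpeechKit to convert speech to text and add the result to a list, for example a grocery list. However, I always get multiple values and I cannot tell what is wrong. I know the function is only called once, yet it produces multiple values. Below is a gif of how it looks, along with the code. Any guidance would be appreciated.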

func prepareAudioEngine() {
    let node = audioEngine.inputNode
    let recordingFormat = node.outputFormat(forBus: 0)
    node.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer, _) in
        self.request.append(buffer)
    }
    audioEngine.prepare()
    do {
        try audioEngine.start()
    } catch {
        return print(error)
    }
    guard let myRecogizer = speechRecognizer else { return }
    if !myRecogizer.isAvailable {
        return
    }
    recordandRecognizeSpeech()
}
func recordandRecognizeSpeech() {
    recognitionTask = speechRecognizer?.recognitionTask(with: request, resultHandler: { (result, error) in
        if let result = result {
            let stringArray = result.bestTranscription.formattedString
            let size = stringArray.reversed().firstIndex(of: " ") ?? stringArray.count
            let startWord = stringArray.index(stringArray.endIndex, offsetBy: -size)
            let last = stringArray[startWord...]
            self.detectedTextLabel.text = String(last).capitalized
        } else if let error = error {
            print("There was an error",error)
        }
        self.ingredients.append(Ingredient(name: self.detectedTextLabel.text ?? "Default", imageName: ""))
        let indexPath = IndexPath(item: self.ingredients.count - 1, section: 0)
        self.tableView.insertRows(at: [indexPath], with: .automatic)
    })
}

Answered 2020-06-08 08:51:23
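I think what is happening here is that, with request.shouldReportPartialResults = true, multiple SFSpeechRecognitionResult objects are returned while the speechRecognizer is still parsing. To see this, replace your recognitionTask code with the following: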
let df = DateFormatter()
df.dateFormat = "y-MM-dd H:m:ss.SSSS"
recognitionTask = speechRecognizer?.recognitionTask(with: request, resultHandler: { (result, error) in
    if let result = result {
        // Timestamp each callback so the sequence of partial results is visible.
        let d = Date()
        print(df.string(from: d)) // -> "2016-11-17 17:51:15.1720"
        print("isFinal: \(result.isFinal)")
        // Dump every segment of the current best transcription.
        for (isegment, segment) in result.bestTranscription.segments.enumerated() {
            print("\(isegment): \(segment.substring) (ts \(segment.timestamp), dur \(segment.duration), conf \(segment.confidence))")
        }
    } else if let error = error {
        print("There was an error",error)
    }
})

I tested this on my iPad and checked the Xcode console output.
This is what I see when I start the audio engine, say "Hey, do", pause briefly (~1 second), and then say "you hear me":
2020-06-07 17:31:14.0330
isFinal: false
0: Hey (ts 0.0, dur 0.0, conf 0.0)
2020-06-07 17:31:14.2190
isFinal: false
0: Hey (ts 0.0, dur 0.0, conf 0.0)
1: do (ts 0.0, dur 0.0, conf 0.0)
2020-06-07 17:31:14.2560
isFinal: false
0: Hey (ts 0.0, dur 0.0, conf 0.0)
2020-06-07 17:31:14.5600
isFinal: false
0: Hey (ts 0.0, dur 0.0, conf 0.0)
1: do (ts 0.0, dur 0.0, conf 0.0)
2020-06-07 17:31:14.6690
isFinal: false
0: Hey (ts 0.0, dur 0.0, conf 0.0)
1: do (ts 0.0, dur 0.0, conf 0.0)
2020-06-07 17:31:14.7930
isFinal: false
0: Hey (ts 0.55, dur -0.55, conf 0.745)
1: do (ts 0.84, dur -0.84, conf 0.816)
2020-06-07 17:31:15.6900
isFinal: false
0: Hey (ts 0.0, dur 0.0, conf 0.0)
1: do (ts 0.0, dur 0.0, conf 0.0)
2: you (ts 0.0, dur 0.0, conf 0.0)
2020-06-07 17:31:15.8630
isFinal: false
0: Hey (ts 0.0, dur 0.0, conf 0.0)
1: do (ts 0.0, dur 0.0, conf 0.0)
2: you (ts 0.0, dur 0.0, conf 0.0)
3: hear (ts 0.0, dur 0.0, conf 0.0)
2020-06-07 17:31:16.1120
isFinal: false
0: Hey (ts 0.0, dur 0.0, conf 0.0)
1: do (ts 0.0, dur 0.0, conf 0.0)
2: you (ts 0.0, dur 0.0, conf 0.0)
3: hear (ts 0.0, dur 0.0, conf 0.0)
4: me (ts 0.0, dur 0.0, conf 0.0)
2020-06-07 17:31:16.1950
isFinal: false
0: Hey (ts 0.55, dur -0.55, conf 0.93)
1: do (ts 0.84, dur -0.84, conf 0.915)
2: you (ts 1.92, dur -1.92, conf 0.927)
3: hear (ts 2.2800000000000002, dur -2.2800000000000002, conf 0.932)
4: me (ts 2.5100000000000002, dur -2.5100000000000002, conf 0.923)

Then I waited a short while (~2 seconds) and stopped the audioEngine:
2020-06-07 17:31:18.4710
isFinal: false
0: Hey (ts 0.55, dur -0.55, conf 0.93)
1: do (ts 0.84, dur -0.84, conf 0.915)
2: you (ts 1.92, dur -1.92, conf 0.927)
3: hear (ts 2.2800000000000002, dur -2.2800000000000002, conf 0.932)
4: me (ts 2.5100000000000002, dur -2.5100000000000002, conf 0.923)
2020-06-07 17:31:18.5200
isFinal: true
0: Hey (ts 0.55, dur 0.2899999999999999, conf 0.93)
1: do (ts 0.84, dur 0.42000000000000004, conf 0.915)
2: you (ts 1.92, dur 0.3600000000000003, conf 0.927)
3: hear (ts 2.2800000000000002, dur 0.22999999999999998, conf 0.932)
4: me (ts 2.5100000000000002, dur 0.3099999999999996, conf 0.923)

Note the following:
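- In most partial results, every segment reports 0.0 for its timestamp, duration, and confidence, as if the parser were still working on them.
- At intervals, while the result is still "being parsed" (or whatever it is doing), all the timestamps settle to reasonable values, although the durations are still negative.
- The timestamps eventually become valid and, as you would expect, we get a list of tokens with monotonically increasing timestamps.
- Timestamps reported for previously parsed portions of the audio stay the same.
- The isFinal flag on the result is true only for the last result, after the audioEngine has been stopped. Unfortunately, I don't see any other flag in the SFTranscriptionSegment or SFSpeechRecognitionResult classes that helps distinguish "still parsing" from "OK, I'm done".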
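Because only the last callback in the log above has isFinal set to true, one way to avoid adding a list row per partial result is to refresh a preview on every callback and append only on the final one. The following is a minimal sketch of that idea, not the asker's actual code: RecognitionUpdate and ItemCollector are hypothetical names that stand in for the two SFSpeechRecognitionResult values used here, so the logic can run without the Speech framework.

```swift
import Foundation

// Hypothetical stand-in for the two values read from SFSpeechRecognitionResult.
struct RecognitionUpdate {
    let text: String   // models result.bestTranscription.formattedString
    let isFinal: Bool  // models result.isFinal
}

// Hypothetical collector: the preview changes on every update, but an item
// is appended only once, when the final result arrives.
struct ItemCollector {
    private(set) var preview = ""
    private(set) var items: [String] = []

    mutating func handle(_ update: RecognitionUpdate) {
        // Always show the latest transcription, partial or not.
        preview = update.text
        // Ignore partial results (and empty finals) for the list itself.
        guard update.isFinal, !update.text.isEmpty else { return }
        items.append(update.text.capitalized)
    }
}
```

In the result handler, this would correspond to moving the ingredients.append and insertRows calls inside a check on result.isFinal, while still updating detectedTextLabel on every callback.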
Recommendation
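To process the transcription in real time, I suggest filtering the results by timestamp. Keep an output list of tokens and timestamps (or something similar), and append to it only when a token arrives with a larger timestamp than the last one you kept.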
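The timestamp filter described above could look roughly like this. It is a sketch under assumptions: Token and TokenAccumulator are hypothetical types that mirror the substring and timestamp fields printed in the log, and a timestamp of 0.0 is treated as "not settled yet", which matches the log but is not documented behavior.

```swift
import Foundation

// Hypothetical mirror of the SFTranscriptionSegment fields printed in the log.
struct Token: Equatable {
    let text: String
    let timestamp: TimeInterval
}

// Hypothetical accumulator that keeps only tokens with increasing timestamps.
struct TokenAccumulator {
    private(set) var tokens: [Token] = []

    // Appends tokens that are newer than the last accepted one and
    // returns just the tokens that were added by this call.
    @discardableResult
    mutating func ingest(_ segments: [Token]) -> [Token] {
        var added: [Token] = []
        for segment in segments {
            // Assumption: unsettled segments report 0.0, so skip them.
            // A word that really starts at 0.0 would be skipped as well.
            guard segment.timestamp > 0 else { continue }
            let lastTimestamp = tokens.last?.timestamp ?? 0
            if segment.timestamp > lastTimestamp {
                tokens.append(segment)
                added.append(segment)
            }
        }
        return added
    }
}
```

Each callback would map result.bestTranscription.segments to Token values and pass them to ingest; only the returned tokens need new rows in the list.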
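I expect this to break when recognitionTask decides, at the end of an audio segment, to change words from earlier in that segment. You would then receive changed tokens and timestamps for the earlier part of the segment.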
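One possible way to notice such a revision is to compare the tokens already kept against each newly settled result and find the first position where they disagree. This is only an illustrative sketch: TimedWord and firstRevisedIndex are hypothetical names, and the caller is assumed to pass in only results whose timestamps are already valid (non-zero).

```swift
import Foundation

// Hypothetical token type: a word plus the timestamp it was reported with.
struct TimedWord: Equatable {
    let text: String
    let timestamp: TimeInterval
}

// Returns the index of the first kept word that the incoming result
// contradicts (different text or timestamp), or nil if the overlapping
// part of the two lists agrees.
func firstRevisedIndex(kept: [TimedWord], incoming: [TimedWord]) -> Int? {
    for (index, pair) in zip(kept, incoming).enumerated() {
        if pair.0 != pair.1 {
            return index
        }
    }
    return nil
}
```

When an index is returned, the rows from that index onward could be replaced with the incoming tokens instead of being appended to.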
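For your specific case, if there is a period of silence in the audio, stopping the audioEngine and then restarting it may help. This forces recognitionTask to finish parsing the audio and ensures that the segments, timestamps, and tokens no longer change.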
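Deciding when "a period of silence" has occurred is left open above. One simple heuristic, sketched here as an assumption rather than a tested recipe, is to treat the transcription as finished once its text has not changed for some threshold. SilenceDetector, its method names, and the threshold are all hypothetical; times are plain seconds so the logic can be exercised without audio.

```swift
import Foundation

// Hypothetical helper: tracks when the transcription text last changed.
struct SilenceDetector {
    let threshold: TimeInterval
    private var lastText = ""
    private var lastChange: TimeInterval

    init(threshold: TimeInterval, startTime: TimeInterval) {
        self.threshold = threshold
        self.lastChange = startTime
    }

    // Call from the result handler with the current transcription text.
    mutating func noteTranscription(_ text: String, at time: TimeInterval) {
        if text != lastText {
            lastText = text
            lastChange = time
        }
    }

    // True once something has been heard and the text has been stable
    // for at least `threshold` seconds.
    func shouldFinish(at time: TimeInterval) -> Bool {
        return !lastText.isEmpty && time - lastChange >= threshold
    }
}
```

When shouldFinish returns true (checked, say, from a repeating timer), the caller would stop the audioEngine, remove the tap from the input node, and call endAudio() on the request so the final result is delivered. Restarting afterwards most likely needs a fresh request and recognitionTask.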
https://stackoverflow.com/questions/58795543