
How to run wake word detection with PocketSphinx on iOS?

Stack Overflow user
Asked on 2021-07-05 13:48:02
1 answer · 158 views · 0 followers · 0 votes

I'm trying to run wake word detection with PocketSphinx on iOS. As a starting point I used TLSphinx, where speech-to-text works (the STT quality isn't great, but it does recognize words).

I extended decoder.swift with a new function:

Code language: swift
public func detectWakeWord (complete: @escaping (Bool?) -> ()) throws {

    ps_set_keyphrase(psDecoder, "keyphrase_search", "ZWEI")
    ps_set_search(psDecoder, "keyphrase_search")
            
    do {
      if #available(iOS 10.0, *) {
          try AVAudioSession.sharedInstance().setCategory(.playAndRecord, mode: .voiceChat, options: [])
      } else {
          try AVAudioSession.sharedInstance().setCategory(.playAndRecord)
      }
    } catch let error as NSError {
        print("Error setting the shared AVAudioSession: \(error)")
        throw DecodeErrors.CantSetAudioSession(error)
    }

    engine = AVAudioEngine()

    let input = engine.inputNode
    let mixer = AVAudioMixerNode()
    let output = engine.outputNode
    engine.attach(mixer)
    engine.connect(input, to: mixer, format: input.outputFormat(forBus: 0))
    engine.connect(mixer, to: output, format: input.outputFormat(forBus: 0))

    // We force-unwrap this because the docs for AVAudioFormat specify that this constructor returns nil
    // when the channel count is greater than 2.
    let formatIn = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: 44100, channels: 1, interleaved: false)!
    let formatOut = AVAudioFormat(commonFormat: .pcmFormatInt16, sampleRate: 16000, channels: 1, interleaved: false)!
    guard let bufferMapper = AVAudioConverter(from: formatIn, to: formatOut) else {
        // Returns nil if the format conversion is not possible.
        throw DecodeErrors.CantConvertAudioFormat
    }

    mixer.installTap(onBus: 0, bufferSize: 2048, format: formatIn, block: {
        [unowned self] (buffer: AVAudioPCMBuffer!, time: AVAudioTime!) in

        guard let sphinxBuffer = AVAudioPCMBuffer(pcmFormat: formatOut, frameCapacity: buffer.frameCapacity) else {
            // Returns nil in the following cases:
            //    - if the format has zero bytes per frame (format.streamDescription->mBytesPerFrame == 0)
            //    - if the buffer byte capacity (frameCapacity * format.streamDescription->mBytesPerFrame)
            //    cannot be represented by an uint32_t
            print("Can't create PCM buffer")
            return
        }

        // This is needed because the 'frameLength' default value is 0 (since iOS 10), which causes the 'convert'
        // call to fail with an error (Error Domain=NSOSStatusErrorDomain Code=-50 "(null)")
        // More here: http://stackoverflow.com/questions/39714244/avaudioconverter-is-broken-in-ios-10
        sphinxBuffer.frameLength = sphinxBuffer.frameCapacity

        var error : NSError?
        let inputBlock : AVAudioConverterInputBlock = {
              inNumPackets, outStatus in
              outStatus.pointee = AVAudioConverterInputStatus.haveData
              return buffer
          }
        bufferMapper.convert(to: sphinxBuffer, error: &error, withInputFrom: inputBlock)
        print("Error? ", error as Any);
      
        let audioData = sphinxBuffer.toData()
        self.process_raw(audioData)

        print("Process: \(buffer.frameLength) frames - \(audioData.count) bytes - sample time: \(time.sampleTime)")

        self.end_utt()
        
        let hypothesis = self.get_hyp()
          
        print("HYPOTHESIS: ", hypothesis)

        DispatchQueue.main.async {
          complete(hypothesis != nil)
        }
      
        self.start_utt()
    })

    start_utt()

    do {
        try engine.start()
    } catch let error as NSError {
        end_utt()
        print("Can't start AVAudioEngine: \(error)")
        throw DecodeErrors.CantStartAudioEngine(error)
    }
  }
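
As a sanity check on the conversion above (plain arithmetic, not from the post): each 2048-frame tap buffer at 44.1 kHz yields only a fraction of that many frames at 16 kHz, which is worth keeping in mind since the code allocates the output buffer with the input's frameCapacity:

```swift
// Sketch: expected output frame count when the AVAudioConverter above
// resamples a 2048-frame, 44.1 kHz buffer down to 16 kHz for PocketSphinx.
let inputSampleRate = 44_100.0
let outputSampleRate = 16_000.0
let tapBufferFrames = 2_048.0  // bufferSize passed to installTap

let expectedOutputFrames = Int((tapBufferFrames * outputSampleRate / inputSampleRate).rounded(.down))
print(expectedOutputFrames)  // ≈ 743
```

So of the 2048 frames the output buffer can hold, the converter only needs roughly 743 for one tap callback's worth of audio.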

There are no errors, but hypothesis is always nil. My dictionary maps everything to "ZWEI", so if anything is recognized at all, the wake word should be detected.

Dictionary (excerpt):
ZWEI AH P Z EH TS B AAH EX
ZWEI(2) HH IH T
ZWEI(3) F EH EX Q OE F EH N T L IH CC T
ZWEI(4) G AX V AH EX T AX T
...
ZWEI(12113) N AY NZWO B IIH T AX N

Does anyone know why the hypothesis is always nil?
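
One knob the post doesn't touch, but which matters when a keyphrase search never (or always) fires: PocketSphinx applies a detection threshold to keyphrase search. Instead of ps_set_keyphrase, a kws file registered via ps_set_kws can carry a per-phrase threshold. The syntax below is PocketSphinx's kws file format; the threshold value is an illustrative starting point, not a tuned one:

```
ZWEI /1e-20/
```

Raising the threshold (e.g. toward 1e-5) reduces false positives; lowering it makes detection more sensitive.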


1 Answer

Stack Overflow user

Accepted answer

Answered on 2021-07-06 15:21:13

I had to call self.end_utt() before self.get_hyp().

I don't know why, but the call order differs from the speech-to-text case.
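
The fix can be sketched with a toy decoder (hypothetical stand-in, not TLSphinx's real implementation) that only yields a hypothesis once the utterance has been closed, which is exactly the ordering the answer describes:

```swift
// Toy sketch (hypothetical, not TLSphinx's actual decoder): records the
// call order to illustrate why get_hyp() must follow end_utt() in
// keyphrase search.
final class ToyKeyphraseDecoder {
    private var utteranceOpen = false
    private(set) var calls: [String] = []

    func start_utt() { utteranceOpen = true; calls.append("start_utt") }
    func process_raw() { calls.append("process_raw") }
    func end_utt() { utteranceOpen = false; calls.append("end_utt") }

    // Mirrors the accepted answer: nil while the utterance is still open.
    func get_hyp() -> String? {
        calls.append("get_hyp")
        return utteranceOpen ? nil : "ZWEI"
    }
}

let decoder = ToyKeyphraseDecoder()
decoder.start_utt()
decoder.process_raw()
decoder.end_utt()                  // end the utterance first...
let hypothesis = decoder.get_hyp() // ...then the hypothesis is non-nil
print(hypothesis ?? "nil")         // prints "ZWEI"
```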

EDIT

Another tip: for better wake-word detection quality, increase the buffer size of the microphone input. For example:

Code language: swift
mixer.installTap(onBus: 0, bufferSize: 8192, format: formatIn, block: [...]
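
A larger tap buffer hands the keyword search more audio context per callback, at the cost of latency. Assuming the 44.1 kHz tap format from the question, the trade-off works out roughly as follows (plain arithmetic, not from the post):

```swift
// Sketch: duration of each tap callback's audio chunk at 44.1 kHz for
// the original and the suggested buffer sizes, i.e. the latency/context
// trade-off behind the tip above.
let sampleRate = 44_100.0
let smallChunkMs = 2_048.0 / sampleRate * 1_000  // original bufferSize
let largeChunkMs = 8_192.0 / sampleRate * 1_000  // suggested bufferSize
print(Int(smallChunkMs.rounded()), Int(largeChunkMs.rounded()))  // ≈ 46 ms vs ≈ 186 ms
```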
Votes: 0

Content on this page provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/68257260
