Speech recognition - Kaldi - Online Audio Server (server/client setup, old-style online decoding)

Kaldi's toolset includes several programs that can be used for online recognition. They live in the src/onlinebin folder and are built from the code in the src/online folder (they can now be compiled with the make ext command). Most of these programs also depend on the portaudio library in the tools folder, which can be downloaded and installed with the corresponding script in tools.

 

# Installing portaudio
yum -y install *alsa*
cd kaldi/tools/
./install_portaudio.sh

# Compile the online recognition tools
cd src/
make ext
Alternatively, enter kaldi/src/online and kaldi/src/onlinebin separately and run make clean followed by make; that also works.
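
Once the build finishes, it may be worth confirming that the online binaries used in the rest of this article actually exist. This is just an optional sanity check (run from kaldi/src; paths assume the standard source layout):

# Optional sanity check: the online tools should have been built into onlinebin/
ls onlinebin/online-audio-server-decode-faster onlinebin/online-audio-client \
   onlinebin/online-server-gmm-decode-faster onlinebin/online-net-client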

1, Setting up the server/client recognition system

Setting up the whole online recognition system requires:

  • Prepare two machines and install Kaldi on both;
  • On the server machine, prepare the acoustic model, dictionary, decoding graph and feature-transform matrix (I haven't used the transform matrix yet);
  • Start the server first, and only start the client connection once the server is running.

1. Command line to start the server:

Use the following command to start the online-audio-server-decode-faster server:

online-audio-server-decode-faster --verbose=1 --rt-min=0.5 --rt-max=3.0 --max-active=6000 \
--beam=72.0 --acoustic-scale=0.0769 final.mdl graph/HCLG.fst graph/words.txt '1:2:3:4:5' \
graph/word_boundary.int 5010 final.mat

1.1 Arguments are as follows:

  • final.mdl - the acoustic model
  • HCLG.fst - the complete FST
  • words.txt - word symbol table (mapping word ids to their textual representation)
  • '1:2:3:4:5' - colon-separated list of silence phone ids (see the sketch after this list)
  • 5010 - port the server is listening on
  • word_boundary.int - phone boundary information required for word alignment
  • final.mat - feature LDA matrix
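
If you are not sure which phone ids are the silence phones, they can normally be read from the lang directory generated by prepare_lang.sh; the path below is an assumption based on the standard data layout:

# Print the colon-separated list of silence phone ids (assumed standard lang layout)
cat data/lang/phones/silence.csl
# Example output: 1:2:3:4:5 - this string can be passed directly as the silence-phones argument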

Configuration options:

Options:
  --acoustic-scale  : Scaling factor for acoustic likelihoods (float, default = 0.1)
  --batch-size      : Number of feature vectors processed without interruption (int, default = 27)
  --beam            : Decoding beam. Larger -> slower but more accurate. (float, default = 16)
  --beam-delta      : Increment used in the decoder [obscure setting] (float, default = 0.5)
  --beam-update     : Beam update rate (float, default = 0.01)
  --cmn-window      : Number of feature vectors used in the running-average CMN calculation (int, default = 600)
  --delta-order     : Order of delta computation (int, default = 2)
  --delta-window    : Parameter controlling the window for delta computation (the actual window size for each delta order is 1 + 2*delta-window-size) (int, default = 2)
  --hash-ratio      : Setting used in the decoder to control hash behavior (float, default = 2)
  --inter-utt-sil   : Maximum number of silence frames that triggers a new utterance (int, default = 50)
  --left-context    : Number of frames of left context (int, default = 4)
  --max-active      : Decoder maximum number of active states. Larger -> slower but more accurate (int, default = 2147483647)
  --max-beam-update : Maximum beam update rate (float, default = 0.05)
  --max-utt-length  : If the utterance becomes longer than this number of frames, a shorter silence is accepted as the utterance separator (int, default = 1500)
  --min-active      : Decoder minimum number of active states (do not prune if the number of active states is below this). (int, default = 20)
  --min-cmn-window  : Minimum CMN window used at the start of decoding (adds latency only at startup) (int, default = 100)
  --num-attempts    : Number of successive timeout repetitions before the stream is terminated (int, default = 5)
  --right-context   : Number of frames of right context (int, default = 4)
  --rt-max          : Approximate maximum decoding run-time factor (float, default = 0.75)
  --rt-min          : Approximate minimum decoding run-time factor (float, default = 0.7)
  --update-interval : Beam update interval in frames (int, default = 3)

Standard options:
  --config          : Configuration file to read (this option may be repeated) (string, default = "")
  --help            : Print out usage message (bool, default = false)
  --print-args      : Print the command line arguments (to stderr) (bool, default = true)
  --verbose         : Verbose level (higher -> more logging) (int, default = 0)
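
When the command line gets long, the decoder options can also be collected in a file and passed with --config; the file name and option values below are purely illustrative:

# Illustrative config file: one option per line, same syntax as on the command line
cat > online_decode.conf <<EOF
--beam=72.0
--acoustic-scale=0.0769
--max-active=6000
--rt-min=0.5
--rt-max=3.0
EOF

online-audio-server-decode-faster --config=online_decode.conf \
final.mdl graph/HCLG.fst graph/words.txt '1:2:3:4:5' \
graph/word_boundary.int 5010 final.mat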

Note: if word_boundary.int does not exist, you need to re-run prepare_lang.sh to generate it. Amend the command as follows:

# Original command:
utils/prepare_lang.sh --position-dependent-phones false data/local/dict "<SPOKEN_NOISE>" \
data/local/lang data/lang
# Replace with:
utils/prepare_lang.sh data/local/dict "<SPOKEN_NOISE>" data/local/lang data/lang
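
After re-running prepare_lang.sh with position-dependent phones, the word-boundary files should appear under the lang directory; the path below assumes the standard data layout used above:

# The file to pass to the server as the word-boundary argument
ls data/lang/phones/word_boundary.int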

The results after startup are as follows:

2. Command line to start the client:

Directly run the following command to start the client:

 online-audio-client --htk --vtt localhost 5010 scp:test.scp

2.1 Arguments are as follows:

  • --htk - save results as an HTK label file
  • --vtt - save results as a WebVTT file
  • localhost - server to connect to
  • 5010 - port to connect to
  • scp:test.scp - list of WAV files to send (see the sketch after this list)
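
The scp file is a plain-text list in the usual Kaldi format, one "utterance-id wav-path" pair per line; the utterance ids and paths below are made up for illustration:

# Illustrative test.scp: each line is "<utterance-id> <path-to-wav>"
cat > test.scp <<EOF
utt1 /path/to/audio/utt1.wav
utt2 /path/to/audio/utt2.wav
EOF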
After startup, the client transmits data continuously and the server decodes it in real time; recognition results are produced while the audio is still being transmitted:

3. Command line to start the Java client:

I haven't tried this Java client yet:

java -jar online-audio-client.jar

Or simply double-click the JAR file in the graphical interface.

2, Using a microphone for real-time decoding between client and server

Kaldi provides a decoding tool that reads microphone data on the client. The client captures audio from the microphone and sends it over the network, and the server returns the decoding results in real time.

1. Use online-server-gmm-decode-faster to start the server:

  • Features received over the network are decoded. Utterance segmentation is done on the fly. If the optional (last) argument is given, feature splicing and an LDA transform are used; otherwise delta/delta-delta (2nd-order) features are used by default.
Usage: online-server-gmm-decode-faster [options] model-in fst-in word-symbol-table silence-phones udp-port [lda-matrix-in]
Example: online-server-gmm-decode-faster --rt-min=0.3 --rt-max=0.5 --max-active=4000 --beam=12.0 --acoustic-scale=0.0769 model HCLG.fst words.txt '1:2:3:4:5' 1234 lda-matrix

Configuration options: the same as those listed above for online-audio-server-decode-faster.

2. Use online-net-client to start the client:

  • The online-net-client tool takes microphone input (via portaudio), extracts features and sends them to the server over a network connection. Usage is as follows (a concrete example follows the option list):
Usage: online-net-client server-address server-port

Options:
  --batch-size                : The number of feature vectors to be extracted and sent in one go (int, default = 27)
  
Standard options:
  --config                    : Configuration file to read (this option may be repeated) (string, default = "")
  --help                      : Print out usage message (bool, default = false)
  --print-args                : Print the command line arguments (to stderr) (bool, default = true)
  --verbose                   : Verbose level (higher->more logging) (int, default = 0)
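
A hedged example invocation, assuming the GMM server above was started on a machine reachable as 192.168.1.10 with UDP port 1234 (both values are placeholders):

# Stream microphone features to the decoding server (host/port are placeholders)
online-net-client 192.168.1.10 1234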

Reference: kaldi-asr

Tags: asr kaldi

Posted by lesliesathish on Fri, 20 May 2022 05:53:25 +0300