Speech recognition test using Kaldi CVTE v2 model

Initial files

First, install Kaldi; refer to the official documentation.

Then download http://kaldi-asr.org/models/m2 and extract it under egs/cvte, so that the directory kaldi/egs/cvte/s5 exists.
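As a sketch, the download and extraction might look like the following; the saved archive name and its internal layout are assumptions, so adjust them to what the server actually delivers.

```shell
# Sketch: fetch and unpack the CVTE v2 model.
# The local archive name cvte_v2.tar.gz is an assumption, as is the
# assumption that the archive unpacks into cvte/.
cd kaldi/egs
wget http://kaldi-asr.org/models/m2 -O cvte_v2.tar.gz
tar -xzf cvte_v2.tar.gz
ls cvte/s5    # this directory should exist after extraction
```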

The following describes how to add a new voice file and perform a recognition test.

data/wav/chat001

Stores the voice files:

data/wav/chat001
├── 001.wav
└── 002.wav

For recording voice files, refer to "Common Toolset for Voice Processing: Commands".

voice file format

$ sox --info data/wav/chat001/001.wav

Input File     : 'data/wav/chat001/001.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:06.25 = 100000 samples ~ 468.75 CDDA sectors
File Size      : 200k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
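If a recording does not already match this format, it can be converted with sox; `input.wav` below is a placeholder for the source file.

```shell
# Convert a recording to mono, 16 kHz, 16-bit signed PCM --
# the format shown by sox --info above. input.wav is a placeholder name.
sox input.wav -c 1 -r 16000 -b 16 -e signed-integer data/wav/chat001/001.wav
```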

data/chat001/test

cd egs/cvte/s5
data/chat001/test
├── conf
│   └── fbank.conf
├── frame_shift
├── spk2utt
├── text
├── utt2spk
└── wav.scp

The conf directory and the frame_shift file are copied from data/fbank/test.

wav.scp, the list of voice files:

CHAT001_20200801_001 data/wav/chat001/001.wav
CHAT001_20200801_002 data/wav/chat001/002.wav

The separator between the first and second columns is a tab character; it must not be replaced with four spaces. The same applies to the files below.
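To avoid typing the tab by hand, wav.scp can be generated with printf, whose `\t` emits a literal tab. This is only a sketch that assumes the ID scheme used above (CHAT001_20200801_NNN for NNN.wav):

```shell
# Generate wav.scp with a literal tab between the two columns.
# The utterance-ID prefix CHAT001_20200801_ follows the example above.
mkdir -p data/wav/chat001 data/chat001/test
touch data/wav/chat001/001.wav data/wav/chat001/002.wav   # stand-ins for real recordings
for f in data/wav/chat001/*.wav; do
  id="CHAT001_20200801_$(basename "$f" .wav)"
  printf '%s\t%s\n' "$id" "$f"
done > data/chat001/test/wav.scp
```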

text, the transcript corresponding to each voice file:

CHAT001_20200801_001	Shanghai Pudong Airport Entry Prevention Entry Full Closed Loop Management 
CHAT001_20200801_002	Beijing Subway Xuanwumen Station Comprehensive Renovation New Transfer Channel

In the text file, the second column contains the transcript words separated by spaces; the vocabulary is defined in exp/chain/tdnn/graph/words.txt.

The corresponding phone (morpheme) file is exp/chain/tdnn/graph/phones.txt.

The mapping between words and phones is defined in exp/chain/tdnn/graph/phones/align_lexicon.int.

In that file, IDs such as 149 and 133 are phones, defined in exp/chain/tdnn/graph/phones.txt.

spk2utt and utt2spk define the mapping between speakers and speech files.

$ cat data/chat001/test/utt2spk
CHAT001_20200801_001 CHAT001_20200801_001
CHAT001_20200801_002 CHAT001_20200801_002

$ cat data/chat001/test/spk2utt
CHAT001_20200801_001 CHAT001_20200801_001
CHAT001_20200801_002 CHAT001_20200801_002

Above, the utterance ID is also used as the speaker ID. In Kaldi, "speaker" is a broad concept; ideally, each independent speaker is given its own ID.
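Since each utterance is its own speaker here, both files can be derived mechanically from wav.scp. Kaldi ships utils/utt2spk_to_spk2utt.pl for the utt2spk-to-spk2utt step; the sketch below uses awk instead so that it stands alone, and wav.scp.sample is a hypothetical stand-in for the real file:

```shell
# Build utt2spk (utterance -> speaker) and spk2utt (speaker -> utterances)
# for the one-speaker-per-utterance case. wav.scp.sample mimics the file above.
printf 'CHAT001_20200801_001\tdata/wav/chat001/001.wav\n'  > wav.scp.sample
printf 'CHAT001_20200801_002\tdata/wav/chat001/002.wav\n' >> wav.scp.sample
cut -f1 wav.scp.sample | awk '{print $1, $1}' > utt2spk
# Invert the map: group the utterances belonging to each speaker.
awk '{utts[$2] = utts[$2] " " $1} END {for (s in utts) print s utts[s]}' utt2spk | sort > spk2utt
```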

Check the initial files

utils/validate_data_dir.sh data/chat001/test

Automatically resolve errors

utils/fix_data_dir.sh data/chat001/test
fix_data_dir.sh automatically repairs common problems, such as files that are not sorted correctly.
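One of the most common problems these scripts catch is sort order: Kaldi requires every data file to be sorted in the C locale. The check can be reproduced by hand; text.sample below is a hypothetical file name used only for illustration:

```shell
# Kaldi data files must be sorted in C (byte) order.
export LC_ALL=C
printf 'CHAT001_20200801_001 first line\nCHAT001_20200801_002 second line\n' > text.sample
sort -c text.sample && echo "text.sample is correctly sorted"
```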

Run decoding and view the WER

run script

kaldi/egs/cvte/s5/run_chat001.sh

#!/bin/bash


. ./cmd.sh
. ./path.sh

# step 1: generate fbank features
obj_dir=data/chat001

for x in test; do
  rm -rf fbank/$x
  mkdir -p fbank/$x

  # compute fbank without pitch
  steps/make_fbank.sh --nj 1 --cmd "run.pl" $obj_dir/$x exp/make_fbank/$x fbank/$x || exit 1;
  # compute cmvn
  steps/compute_cmvn_stats.sh $obj_dir/$x exp/fbank_cmvn/$x fbank/$x || exit 1;
done

# #step 2: offline-decoding
test_data=data/chat001/test
dir=exp/chain/tdnn

steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 \
  --nj 1 --num-threads 1 \
  --cmd "$decode_cmd" --iter final \
  --frames-per-chunk 50 \
  $dir/graph $test_data $dir/decode_chat001_test

# # note: the model was trained using "apply-cmvn-online",
# # so you can modify the corresponding code in steps/nnet3/decode.sh to obtain the best performance.
# # Running steps/nnet3/decode.sh directly also works well,
# # but is slightly worse than the "apply-cmvn-online" method.

The script execution is divided into the following steps:

Step1 - Generate Features for Test Data

$ tree data/chat001/test
data/chat001/test
├── cmvn.scp
├── conf
│   └── fbank.conf
├── feats.scp
├── frame_shift
├── spk2utt
├── split1
│   └── 1
│       ├── cmvn.scp
│       ├── feats.scp
│       ├── spk2utt
│       ├── text
│       ├── utt2dur
│       ├── utt2num_frames
│       ├── utt2spk
│       └── wav.scp
├── text
├── utt2dur
├── utt2num_frames
├── utt2spk
└── wav.scp

feats.scp, utt2dur, and utt2num_frames are generated by make_fbank.sh; the corresponding archive files are written under fbank/test.
cmvn.scp, the normalization file, is generated by steps/compute_cmvn_stats.sh.
The splitN directories are per-job subdirectories created when the data is split for concurrent processing of large datasets; the results are merged again afterward.

fbank/test directory

fbank/test
├── cmvn_test.ark
├── cmvn_test.scp
├── raw_fbank_test.1.ark
└── raw_fbank_test.1.scp

exp/make_fbank directory

exp/make_fbank
└── test
    ├── make_fbank_test.1.log
    └── wav.1.scp

Step2 - Decoding

steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 \
  --nj 1 --num-threads 1 \
  --cmd "$decode_cmd" --iter final \
  --frames-per-chunk 50 \
  $dir/graph $test_data $dir/decode_chat001_test

Decoding also computes the WER, and can be configured to output an n-best list.
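As a sketch of the n-best side, Kaldi's lattice-to-nbest and nbest-to-linear tools can pull alternative transcripts out of the lattices written by decode.sh; the output file name and the choice of n=5 below are assumptions.

```shell
# Extract a 5-best list of word sequences from the decoding lattices.
. ./path.sh
lattice-to-nbest --acoustic-scale=0.1 --n=5 \
  "ark:gunzip -c exp/chain/tdnn/decode_chat001_test/lat.1.gz |" ark:- |
nbest-to-linear ark:- ark:/dev/null ark,t:nbest_words.txt
# nbest_words.txt holds word IDs; map them back to words with, e.g.:
# utils/int2sym.pl -f 2- exp/chain/tdnn/graph/words.txt nbest_words.txt
```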

View decoding information

cat exp/chain/tdnn/decode_chat001_test/log/decode.1.log

Best CER result

$ cat exp/chain/tdnn/decode_chat001_test/scoring_kaldi/best_cer
%WER 2.94 [ 1 / 34, 0 ins, 0 del, 1 sub ] exp/chain/tdnn/decode_chat001_test/cer_7_0.0

This test contains 34 words in total; the recognition result has an edit distance of 0 insertions, 0 deletions, and 1 substitution.
However, the substituted word "closed-loop" does not exist in the pronunciation dictionary and was recognized as the separate words "closed" and "loop", so in practice the result can also be considered accurate.

other logs

The WER n-best output is under exp/chain/tdnn/decode_chat001_test/scoring_kaldi

Decoding command interpretation

During the decoding phase, the script executed is as follows:

# nnet3-latgen-faster --frame-subsampling-factor=3 --frames-per-chunk=50 --extra-left-context=0 --extra-right-context=0 --extra-left-context-initial=-1 --extra-right-context-final=-1 --minimize=false --max-active=7000 --min-active=200 --beam=15.0 --lattice-beam=8.0 --acoustic-scale=1.0 --allow-partial=true --word-symbol-table=exp/chain/tdnn/graph/words.txt exp/chain/tdnn/final.mdl exp/chain/tdnn/graph/HCLG.fst "ark,s,cs:apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/chat001/test/split1/1/utt2spk scp:data/chat001/test/split1/1/cmvn.scp scp:data/chat001/test/split1/1/feats.scp ark:- |" "ark:|lattice-scale --acoustic-scale=10.0 ark:- ark:- | gzip -c >exp/chain/tdnn/decode_chat001_test/lat.1.gz"
# Started at Sat Aug  1 16:21:14 CST 2020
#
nnet3-latgen-faster --frame-subsampling-factor=3 --frames-per-chunk=50 --extra-left-context=0 --extra-right-context=0 --extra-left-context-initial=-1 --extra-right-context-final=-1 --minimize=false --max-active=7000 --min-active=200 --beam=15.0 --lattice-beam=8.0 --acoustic-scale=1.0 --allow-partial=true --word-symbol-table=exp/chain/tdnn/graph/words.txt exp/chain/tdnn/final.mdl exp/chain/tdnn/graph/HCLG.fst 'ark,s,cs:apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/chat001/test/split1/1/utt2spk scp:data/chat001/test/split1/1/cmvn.scp scp:data/chat001/test/split1/1/feats.scp ark:- |' 'ark:|lattice-scale --acoustic-scale=10.0 ark:- ark:- | gzip -c >exp/chain/tdnn/decode_chat001_test/lat.1.gz'
LOG (nnet3-latgen-faster[5.5.765-f88d5]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 1 orphan nodes.
LOG (nnet3-latgen-faster[5.5.765-f88d5]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 2 orphan components.
LOG (nnet3-latgen-faster[5.5.765-f88d5]:Collapse():nnet-utils.cc:1472) Added 1 components, removed 2
lattice-scale --acoustic-scale=10.0 ark:- ark:-
apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/chat001/test/split1/1/utt2spk scp:data/chat001/test/split1/1/cmvn.scp scp:data/chat001/test/split1/1/feats.scp ark:-
LOG (nnet3-latgen-faster[5.5.765-f88d5]:CheckAndFixConfigs():nnet-am-decodable-simple.cc:294) Increasing --frames-per-chunk from 50 to 51 to make it a multiple of --frame-subsampling-factor=3
CHAT001_20200801_001 Shanghai Pudong Airport Entry Room Input Full Closed Loop Management
LOG (nnet3-latgen-faster[5.5.765-f88d5]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:375) Log-like per frame for utterance CHAT001_20200801_001 is 2.19918 over 208 frames.
LOG (apply-cmvn[5.5.765-f88d5]:main():apply-cmvn.cc:162) Applied cepstral mean normalization to 2 utterances, errors on 0
CHAT001_20200801_002 Beijing Subway Xuanwumen Station Comprehensive Renovation New Transfer Channel
LOG (nnet3-latgen-faster[5.5.765-f88d5]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:375) Log-like per frame for utterance CHAT001_20200801_002 is 2.19511 over 333 frames.
LOG (nnet3-latgen-faster[5.5.765-f88d5]:main():nnet3-latgen-faster.cc:256) Time taken 10.9386s: real-time factor assuming 100 frames/sec is 0.673972
LOG (nnet3-latgen-faster[5.5.765-f88d5]:main():nnet3-latgen-faster.cc:259) Done 2 utterances, failed for 0
LOG (nnet3-latgen-faster[5.5.765-f88d5]:main():nnet3-latgen-faster.cc:261) Overall log-likelihood per frame is 2.19668 over 541 frames.
LOG (nnet3-latgen-faster[5.5.765-f88d5]:~CachingOptimizingCompiler():nnet-optimize.cc:710) 0.00447 seconds taken in nnet3 compilation total (breakdown: 0.00219 compilation, 0.00168 optimization, 0 shortcut expansion, 0.000385 checking, 1.1e-05 computing indexes, 0.000209 misc.) + 0 I/O.
LOG (lattice-scale[5.5.765-f88d5]:main():lattice-scale.cc:107) Done 2 lattices.
# Accounting: time=53 threads=1
# Ended (code 0) at Sat Aug  1 16:22:07 CST 2020, elapsed time 53 seconds

Let's take a closer look at the parameter list:

nnet3-latgen-faster \
    --frame-subsampling-factor=3 \
    --frames-per-chunk=50 \
    --extra-left-context=0 \
    --extra-right-context=0 \
    --extra-left-context-initial=-1 \
    --extra-right-context-final=-1 \
    --minimize=false \
    --max-active=7000 \
    --min-active=200 \
    --beam=15.0 \
    --lattice-beam=8.0 \
    --acoustic-scale=1.0 \
    --allow-partial=true \
    --word-symbol-table=exp/chain/tdnn/graph/words.txt \
    exp/chain/tdnn/final.mdl \
    exp/chain/tdnn/graph/HCLG.fst \
    "ark,s,cs:apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/chat001/test/split1/1/utt2spk scp:data/chat001/test/split1/1/cmvn.scp scp:data/chat001/test/split1/1/feats.scp ark:- |" \
    "ark:|lattice-scale --acoustic-scale=10.0 ark:- ark:- | gzip -c >exp/chain/tdnn/decode_chat001_test/lat.1.gz"

The nnet3-latgen-faster command generates lattices with the LatticeFasterDecoder; the acoustic scores come from an nnet3 model.

There are also similar nnet3-latgen-faster-parallel, nnet3-latgen-faster-batch commands.

nnet3-latgen-faster prints the following help:

Generate lattices using nnet3 neural net model.
Usage: nnet3-latgen-faster [options] <nnet-in> <fst-in|fsts-rspecifier> <features-rspecifier> <lattice-wspecifier> [ <words-wspecifier> [<alignments-wspecifier>] ]
See also: nnet3-latgen-faster-parallel, nnet3-latgen-faster-batch
 
Options:
  --acoustic-scale            : Scaling factor for acoustic log-likelihoods (caution: is a no-op if set in the program nnet3-compute (float, default = 0.1)
  --allow-partial             : If true, produce output even if end state was not reached. (bool, default = false)
  --beam                      : Decoding beam.  Larger->slower, more accurate. (float, default = 16)
  --beam-delta                : Increment used in decoding-- this parameter is obscure and relates to a speedup in the way the max-active constraint is applied.  Larger is more accurate. (float, default = 0.5)
  --computation.debug         : If true, turn on debug for the neural net computation (very verbose!) Will be turned on regardless if --verbose >= 5 (bool, default = false)
  --debug-computation         : If true, turn on debug for the actual computation (very verbose!) (bool, default = false)
  --delta                     : Tolerance used in determinization (float, default = 0.000976562)
  --determinize-lattice       : If true, determinize the lattice (lattice-determinization, keeping only best pdf-sequence for each word-sequence). (bool, default = true)
  --extra-left-context        : Number of frames of additional left-context to add on top of the neural net's inherent left context (may be useful in recurrent setups (int, default = 0)
  --extra-left-context-initial : If >= 0, overrides the --extra-left-context value at the start of an utterance. (int, default = -1)
  --extra-right-context       : Number of frames of additional right-context to add on top of the neural net's inherent right context (may be useful in recurrent setups (int, default = 0)
  --extra-right-context-final : If >= 0, overrides the --extra-right-context value at the end of an utterance. (int, default = -1)
  --frame-subsampling-factor  : Required if the frame-rate of the output (e.g. in 'chain' models) is less than the frame-rate of the original alignment. (int, default = 1)
  --frames-per-chunk          : Number of frames in each chunk that is separately evaluated by the neural net.  Measured before any subsampling, if the --frame-subsampling-factor options is used (i.e. counts input frames (int, default = 50)
  --hash-ratio                : Setting used in decoder to control hash behavior (float, default = 2)
  --ivectors                  : Rspecifier for iVectors as vectors (i.e. not estimated online); per utterance by default, or per speaker if you provide the --utt2spk option. (string, default = "")
  --lattice-beam              : Lattice generation beam.  Larger->slower, and deeper lattices (float, default = 10)
  --max-active                : Decoder max active states.  Larger->slower; more accurate (int, default = 2147483647)
  --max-mem                   : Maximum approximate memory usage in determinization (real usage might be many times this). (int, default = 50000000)
  --min-active                : Decoder minimum #active states. (int, default = 200)
  --minimize                  : If true, push and minimize after determinization. (bool, default = false)
  --online-ivector-period     : Number of frames between iVectors in matrices supplied to the --online-ivectors option (int, default = 0)
  --online-ivectors           : Rspecifier for iVectors estimated online, as matrices.  If you supply this, you must set the --online-ivector-period option. (string, default = "")
  --optimization.allocate-from-other : Instead of deleting a matrix of a given size and then allocating a matrix of the same size, allow re-use of that memory (bool, default = true)
  --optimization.allow-left-merge : Set to false to disable left-merging of variables in remove-assignments (obscure option) (bool, default = true)
  --optimization.allow-right-merge : Set to false to disable right-merging of variables in remove-assignments (obscure option) (bool, default = true)
  --optimization.backprop-in-place : Set to false to disable optimization that allows in-place backprop (bool, default = true)
  --optimization.consolidate-model-update : Set to false to disable optimization that consolidates the model-update phase of backprop (e.g. for recurrent architectures (bool, default = true)
  --optimization.convert-addition : Set to false to disable the optimization that converts Add commands into Copy commands wherever possible. (bool, default = true)
  --optimization.extend-matrices : This optimization can reduce memory requirements for TDNNs when applied together with --convert-addition=true (bool, default = true)
  --optimization.initialize-undefined : Set to false to disable optimization that avoids redundant zeroing (bool, default = true)
  --optimization.max-deriv-time : You can set this to the maximum t value that you want derivatives to be computed at when updating the model.  This is an optimization that saves time in the backprop phase for recurrent frameworks (int, default = 2147483647)
  --optimization.max-deriv-time-relative : An alternative mechanism for setting the --max-deriv-time, suitable for situations where the length of the egs is variable.  If set, it is equivalent to setting the --max-deriv-time to this value plus the largest 't' value in any 'output' node of the computation request. (int, default = 2147483647)
  --optimization.memory-compression-level : This is only relevant to training, not decoding.  Set this to 0,1,2; higher levels are more aggressive at reducing memory by compressing quantities needed for backprop, potentially at the expense of speed and the accuracy of derivatives.  0 means no compression at all; 1 means compression that shouldn't affect results at all. (int, default = 1)
  --optimization.min-deriv-time : You can set this to the minimum t value that you want derivatives to be computed at when updating the model.  This is an optimization that saves time in the backprop phase for recurrent frameworks (int, default = -2147483648)
  --optimization.move-sizing-commands : Set to false to disable optimization that moves matrix allocation and deallocation commands to conserve memory. (bool, default = true)
  --optimization.optimize     : Set this to false to turn off all optimizations (bool, default = true)
  --optimization.optimize-row-ops : Set to false to disable certain optimizations that act on operations of type *Row*. (bool, default = true)
  --optimization.propagate-in-place : Set to false to disable optimization that allows in-place propagation (bool, default = true)
  --optimization.remove-assignments : Set to false to disable optimization that removes redundant assignments (bool, default = true)
  --optimization.snip-row-ops : Set this to false to disable an optimization that reduces the size of certain per-row operations (bool, default = true)
  --optimization.split-row-ops : Set to false to disable an optimization that may replace some operations of type kCopyRowsMulti or kAddRowsMulti with up to two simpler operations. (bool, default = true)
  --phone-determinize         : If true, do an initial pass of determinization on both phones and words (see also --word-determinize) (bool, default = true)
  --prune-interval            : Interval (in frames) at which to prune tokens (int, default = 25)
  --utt2spk                   : Rspecifier for utt2spk option used to get ivectors per speaker (string, default = "")
  --word-determinize          : If true, do a second pass of determinization on words only (see also --phone-determinize) (bool, default = true)
  --word-symbol-table         : Symbol table for words [for debug output] (string, default = "")
 
Standard options:
  --config                    : Configuration file to read (this option may be repeated) (string, default = "")
  --help                      : Print out usage message (bool, default = false)
  --print-args                : Print the command line arguments (to stderr) (bool, default = true)
  --verbose                   : Verbose level (higher->more logging) (int, default = 0)

reference reading

https://blog.csdn.net/qq_25750561/article/details/81070092

https://www.cnblogs.com/yszd/p/12192769.html

https://github.com/naxingyu/kaldi_cvte_model_test

Tags: asr kaldi Voice recognition

Posted by gothica on Wed, 25 May 2022 20:30:35 +0300