Terminology in Speech Recognition

LM: Language Model

MFCC: Mel-Frequency Cepstral Coefficients (Mel cepstral features)

PLP: Perceptual Linear Prediction, PLP features

fBank: filter bank features (log Mel filterbank energies)

CMVN: Cepstral Mean and Variance Normalization
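A minimal sketch of per-utterance CMVN in Python, assuming features are given as a list of frames (one list of floats per frame). It subtracts each dimension's mean and, optionally, divides by that dimension's standard deviation. The function name `apply_cmvn` is hypothetical; Kaldi's own tools (`compute-cmvn-stats`, `apply-cmvn`) accumulate these statistics and often apply them per speaker rather than per utterance.

```python
import math
from typing import List

Matrix = List[List[float]]


def apply_cmvn(frames: Matrix, norm_vars: bool = True, eps: float = 1e-10) -> Matrix:
    """Per-utterance CMVN on a T x D list of cepstral frames (hypothetical helper)."""
    num_frames = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral dimension over all frames of the utterance.
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # Standard deviation of each dimension (population variance).
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    normalized = []
    for f in frames:
        row = []
        for d in range(dim):
            value = f[d] - means[d]  # mean normalization
            if norm_vars:
                value /= max(stds[d], eps)  # variance normalization, guarded against 0
            row.append(value)
        normalized.append(row)
    return normalized


feats = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
normed = apply_cmvn(feats)
```

After normalization every dimension has zero mean and, with `norm_vars=True`, unit variance over the utterance.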

Mono: monophone, monophone model training

Triphone: triphone model training; typically tri1: deltas; tri2: delta + delta-delta; tri3a: LDA+MLLT
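The delta features used in triphone training can be sketched with the standard regression formula d_t = sum_{n=1..N} n (c_{t+n} - c_{t-n}) / (2 sum_{n=1..N} n^2). This sketch assumes a window of N = 2 and repeats the first and last frames at the edges; applying the same function to the deltas gives delta-delta features. The function names are hypothetical, and Kaldi's `add-deltas` may differ in edge-handling details.

```python
from typing import List

Matrix = List[List[float]]


def compute_deltas(feats: Matrix, window: int = 2) -> Matrix:
    """Regression-based deltas; frames past either edge are clamped to the edge frame."""
    num_frames = len(feats)
    dim = len(feats[0])
    denom = 2 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for n in range(1, window + 1):
                nxt = feats[min(t + n, num_frames - 1)][d]
                prv = feats[max(t - n, 0)][d]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


def add_deltas(feats: Matrix, window: int = 2) -> Matrix:
    """Append delta and delta-delta features, turning D dims into 3D dims."""
    d1 = compute_deltas(feats, window)
    d2 = compute_deltas(d1, window)
    return [f + a + b for f, a, b in zip(feats, d1, d2)]


# A 1-dimensional feature that rises by 1 per frame.
ramp = [[float(t)] for t in range(6)]
full = add_deltas(ramp)
```

Away from the edges the delta of a linear ramp is exactly its slope (1.0 here); near the edges the clamping pulls it lower.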

GMM: Gaussian Mixture Model
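A GMM scores a feature vector as a weighted sum of Gaussians. The sketch below assumes diagonal covariances (as in typical GMM-HMM acoustic models) and computes log p(x) = log sum_k w_k N(x; mu_k, diag(sigma_k^2)) with the log-sum-exp trick to avoid underflow. The function names are illustrative only.

```python
import math
from typing import List, Sequence


def log_gaussian_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x: Sequence[float], weights: List[float],
                       means: List[List[float]], variances: List[List[float]]) -> float:
    """log sum_k w_k N(x; mean_k, diag(var_k)), computed with log-sum-exp."""
    comps = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(comps)
    return top + math.log(sum(math.exp(c - top) for c in comps))
```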

HMM: Hidden Markov Model
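The core HMM computation, the probability of an observation sequence, can be sketched with the forward algorithm. This toy version assumes a discrete HMM with small, made-up transition and emission tables; acoustic models use GMM or neural-network emission scores instead, and real implementations work in the log domain to avoid underflow on long utterances.

```python
from typing import List


def forward(init: List[float], trans: List[List[float]],
            emit: List[List[float]], observations: List[int]) -> float:
    """P(observations) for a discrete HMM.

    init[i]: P(first state = i); trans[i][j]: P(next = j | current = i);
    emit[i][o]: P(observation o | state i).
    """
    num_states = len(init)
    # alpha[i] = P(observations so far, current state = i)
    alpha = [init[i] * emit[i][observations[0]] for i in range(num_states)]
    for o in observations[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(num_states)) * emit[j][o]
            for j in range(num_states)
        ]
    return sum(alpha)


# Hypothetical 2-state, 2-symbol model.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.5], [0.1, 0.9]]
prob = forward(init, trans, emit, [0, 1])
```

Because every row of the tables sums to 1, the probabilities of all observation sequences of a fixed length also sum to 1, which is a handy sanity check.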

sGMM: Subspace Gaussian Mixture Model (subspace GMM), which can effectively reduce the number of GMM parameters

GMM-HMM: a typical pipeline of MFCC features + monophone training + triphone training

MLLT: Maximum Likelihood Linear Transform, used in the training stage

CMLLR/fMLLR: Constrained Maximum Likelihood Linear Regression / feature-space Maximum Likelihood Linear Regression; improves robustness to differences between speakers; used in the alignment stage

SAT: Speaker Adaptive Training

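VTLN: Vocal Tract Length Normalisation. Mainly used in speech recognition to remove the difference in vocal tract length between male and female speakers. HTK includes source code for it, and the HTK Book describes it. It works by modifying the center frequencies of the Mel filterbank.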
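A simplified sketch of the idea: place filter center frequencies evenly on the Mel scale (using the common mel(f) = 2595 log10(1 + f/700) formula), then warp them by a factor alpha. It assumes a piecewise-linear warp that scales frequencies below a cutoff by alpha and maps the rest of the range linearly so the Nyquist frequency stays fixed. The bin count, cutoff, and alpha are made-up values, and the warping functions in HTK and Kaldi have more parameters than this.

```python
import math
from typing import List


def hz_to_mel(f: float) -> float:
    """Common (HTK-style) Mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_freqs(num_bins: int, low_hz: float, high_hz: float) -> List[float]:
    """Center frequencies (Hz) of num_bins triangular filters spaced evenly in Mel."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_bins + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_bins)]


def vtln_warp(f: float, alpha: float, cutoff_hz: float, nyquist_hz: float) -> float:
    """Piecewise-linear frequency warp (simplified, assumed form).

    Below cutoff_hz, frequencies are scaled by alpha; above it, a straight line
    joins (cutoff_hz, alpha * cutoff_hz) to (nyquist_hz, nyquist_hz).
    """
    if f <= cutoff_hz:
        return alpha * f
    slope = (nyquist_hz - alpha * cutoff_hz) / (nyquist_hz - cutoff_hz)
    return alpha * cutoff_hz + slope * (f - cutoff_hz)


# Hypothetical setup: 16 kHz audio (Nyquist 8 kHz), 23 Mel bins, warp factor 0.9.
centers = mel_center_freqs(num_bins=23, low_hz=20.0, high_hz=8000.0)
warped = [vtln_warp(f, alpha=0.9, cutoff_hz=6800.0, nyquist_hz=8000.0) for f in centers]
```

An alpha below 1 shifts the filters down in frequency, and an alpha above 1 shifts them up; per-speaker alphas are usually chosen by maximizing likelihood.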

LDA: Linear Discriminant Analysis

PLDA: Probabilistic Linear Discriminant Analysis

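MMI/BMMI: Maximum Mutual Information / Boosted MMI; a sequence-level criterion that maximizes the posterior probability of the reference transcription (often described as roughly minimizing sentence error rate); steps/train_mmi.sh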
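The (B)MMI objective for one utterance can be sketched over an N-best list, used here as a stand-in for the lattices that real implementations use: the log of p(O|W_ref)^kappa P(W_ref) divided by the sum of p(O|W)^kappa P(W) over all hypotheses W. For BMMI, each denominator term is additionally multiplied by exp(-b A(W, W_ref)), where A is an accuracy against the reference, so hypotheses with more errors get relatively more weight. The tuple layout and the acoustic scale of 0.1 are assumptions.

```python
import math
from typing import List, Tuple

# (acoustic log-likelihood, LM log-probability, accuracy vs. reference)
Hypothesis = Tuple[float, float, float]


def log_sum_exp(values: List[float]) -> float:
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def mmi_objective(hyps: List[Hypothesis], ref_index: int,
                  acoustic_scale: float = 0.1, boost: float = 0.0) -> float:
    """(B)MMI objective of one utterance over an N-best list.

    boost = 0 gives plain MMI; boost > 0 gives boosted MMI, where each
    denominator term is scaled by exp(-boost * accuracy).
    The reference must be in hyps, since it also appears in the denominator.
    """
    ref_am, ref_lm, _ = hyps[ref_index]
    numerator = acoustic_scale * ref_am + ref_lm
    denominator = log_sum_exp(
        [acoustic_scale * am + lm - boost * acc for am, lm, acc in hyps]
    )
    return numerator - denominator


# Two hypotheses with identical scores: the reference (accuracy 3) and a competitor (accuracy 1).
hyps = [(-100.0, -5.0, 3.0), (-100.0, -5.0, 1.0)]
```

Plain MMI here gives log(1/2), since the reference holds half the posterior mass; boosting lowers the denominator and raises the objective.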

LF-MMI: Lattice-Free Maximum Mutual Information

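MPE: Minimum Phone Error; minimizes the expected error rate at the phone level (related criteria target other granularities, such as words or states); steps/train_mpe.sh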

sMBR: state-level Minimum Bayes Risk; minimizes the expected state error rate

lattice: word lattice; used by lmrescore (language model rescoring)

EM: Expectation Maximization
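EM alternates an E-step (compute each component's posterior "responsibility" for every data point) with an M-step (re-estimate the parameters from those soft counts). Below is a minimal sketch for a one-dimensional Gaussian mixture with made-up data and initial values. The accumulate/estimate split loosely mirrors Kaldi tool pairs such as `gmm-acc-stats-ali` and `gmm-est`, but this is not their implementation.

```python
import math
from typing import List, Tuple


def normal_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data: List[float], means: List[float], variances: List[float],
              weights: List[float], iterations: int = 50,
              var_floor: float = 1e-3) -> Tuple[List[float], List[float], List[float]]:
    """Fit a 1-D Gaussian mixture with Expectation Maximization."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(iterations):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            p = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate from the soft counts (occupancies).
        for j in range(k):
            occ = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / occ
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / occ,
                var_floor,  # keep variances from collapsing to zero
            )
            weights[j] = occ / len(data)
    return means, variances, weights


data = [-2.1, -1.9, -2.0, -2.2, -1.8, 3.0, 3.1, 2.9, 3.2, 2.8]
means, variances, weights = em_gmm_1d(
    data, means=[-1.0, 1.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

With two well-separated clusters, the components settle on the cluster means (-2.0 and 3.0) with equal weights.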

LMWT: language model weight

acwt: acoustic weight (acoustic scale), the weight applied to acoustic model scores
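How the two weights interact can be sketched by picking the best entry of an N-best list. This assumes each hypothesis carries an acoustic log-probability and an LM log-probability, and ranks by acoustic + LMWT * LM. Dividing that score by LMWT gives acwt * acoustic + LM with acwt = 1/LMWT, and the ranking does not change, which is roughly why Kaldi's scoring scripts can pass LMWT as an inverse acoustic scale. The hypotheses and scores are invented for illustration.

```python
from typing import List, Tuple

# (words, acoustic log-probability, LM log-probability)
NBestEntry = Tuple[str, float, float]


def best_hypothesis(nbest: List[NBestEntry], lm_weight: float) -> str:
    """Return the words maximizing acoustic_logprob + lm_weight * lm_logprob."""
    return max(nbest, key=lambda h: h[1] + lm_weight * h[2])[0]


nbest = [
    ("recognize speech", -120.0, -8.0),
    ("wreck a nice beach", -118.0, -12.0),
]
```

A small LM weight lets the acoustically better but less fluent "wreck a nice beach" win; a larger one favors "recognize speech".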


Below are some terms and abbreviations I ran into while reading Kaldi scripts

hires: hi-res, high resolution; used to describe high-resolution MFCC features

scp: script file; each line is a pair of [utterance id] and [location of the data], e.g. a wav file or a compressed wav file
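A minimal parser for scp-style text, assuming the first whitespace-delimited token is the key and the rest of the line is the location (a plain path, or a command ending in `|` whose output is read). The paths and the `flac` command in the example are hypothetical.

```python
from typing import Dict


def read_scp(text: str) -> Dict[str, str]:
    """Parse scp-style lines into {key: location}."""
    table: Dict[str, str] = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        # Split only on the first run of whitespace: the location may contain spaces.
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"line {lineno}: expected '<key> <location>'")
        key, location = parts
        table[key] = location.strip()
    return table


example = """utt001 /data/wav/utt001.wav
utt002 flac -c -d -s /data/wav/utt002.flac |
"""
table = read_scp(example)
```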

ark: archive file; a sequence of key-object pairs: token1 [something] token2 [something] token3 [something] ...
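Archives are usually binary, but Kaldi can also write them as text (for example `copy-feats ark:feats.ark ark,t:-`, where `feats.ark` is a placeholder name). In that form each matrix appears as `key [`, then one row per line, then a closing `]`. The sketch below parses the text form of a matrix archive into a dictionary; it assumes well-formed input and ignores vector and binary archives.

```python
from typing import Dict, List, Optional


def read_text_matrix_ark(text: str) -> Dict[str, List[List[float]]]:
    """Parse a text-mode matrix archive: '<key> [' rows... ']' repeated."""
    archive: Dict[str, List[List[float]]] = {}
    key: Optional[str] = None
    rows: List[List[float]] = []
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue
        if key is None:
            # Start of a new entry: "<key> [" possibly followed by numbers.
            if len(tokens) < 2 or tokens[1] != "[":
                raise ValueError(f"expected '<key> [', got: {raw!r}")
            key, tokens = tokens[0], tokens[2:]
            rows = []
        closed = bool(tokens) and tokens[-1] == "]"
        if closed:
            tokens = tokens[:-1]
        if tokens:
            rows.append([float(t) for t in tokens])
        if closed:
            archive[key] = rows
            key = None
    if key is not None:
        raise ValueError(f"unterminated matrix for key {key!r}")
    return archive


example = """utt1  [
  1.0 2.0 3.0
  4.0 5.0 6.0 ]
utt2  [
  7.0 8.0 9.0 ]
"""
ark = read_text_matrix_ark(example)
```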

dur: duration; for example, the utt2dur file lists pairs of [utterance id] and [duration]

feats: features; e.g. feats.scp, where each line pairs an [utterance id] with the location of its MFCC features in an ark file

phones: phonemes, like phones.txt

int and txt: file extensions; .txt files hold symbols such as #1, #2, #3, while .int files hold the corresponding integer ids, e.g. disambig.txt and disambig.int
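The relationship between the two forms is a lookup in a symbol table such as phones.txt, which has one `symbol integer-id` pair per line; Kaldi's `utils/sym2int.pl` performs this conversion. A small Python equivalent, with a made-up symbol table:

```python
from typing import Dict, List


def read_symbol_table(text: str) -> Dict[str, int]:
    """Parse a symbol table with one '<symbol> <integer-id>' pair per line."""
    table: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError(f"malformed symbol-table line: {line!r}")
        table[fields[0]] = int(fields[1])
    return table


def sym2int(symbols: List[str], table: Dict[str, int]) -> List[int]:
    """Map textual symbols (the .txt form) to integer ids (the .int form)."""
    return [table[s] for s in symbols]


def int2sym(ids: List[int], table: Dict[str, int]) -> List[str]:
    """Inverse mapping, from integer ids back to symbols."""
    inverse = {i: s for s, i in table.items()}
    return [inverse[i] for i in ids]


phones_txt = """<eps> 0
SIL 1
a 2
b 3
#0 4
#1 5
#2 6
"""
table = read_symbol_table(phones_txt)
```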

disambig: short for disambiguation; disambiguation symbols are added so that the FST can be determinized and minimized

lat: lattice, e.g. lat.1.gz

CTM: stands for time-marked conversation file; it contains a time-aligned phoneme transcription of the utterances. Its format is:
utt_id channel_num start_time phone_dur phone_id
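A sketch that reads CTM lines in the five-column format above and derives each segment's end time (start + duration). It assumes whitespace-separated fields and ignores any extra trailing columns, such as a confidence score; the utterance id and phone ids are invented.

```python
from typing import List, NamedTuple


class CtmEntry(NamedTuple):
    utt_id: str
    channel: int
    start: float     # seconds
    duration: float  # seconds
    phone_id: int

    @property
    def end(self) -> float:
        return self.start + self.duration


def read_ctm(text: str) -> List[CtmEntry]:
    """Parse 'utt_id channel_num start_time phone_dur phone_id' lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        utt_id, channel, start, dur, phone = fields[:5]
        entries.append(CtmEntry(utt_id, int(channel), float(start), float(dur), int(phone)))
    return entries


example = """utt001 1 0.00 0.30 1
utt001 1 0.30 0.12 17
utt001 1 0.42 0.08 23
"""
entries = read_ctm(example)
```

For a full alignment, each segment should end where the next one starts, which is easy to check with the `end` property.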

egs: Examples

rm: Resource Management

wsj: Wall Street Journal

s5: Script version 5

exp: Experiments

acc: Accumulate

accs: Accumulated statistics

ali: Alignment

mdl: Model

occs: Occurrence counts/occupancy

am: Acoustic model

csl: colon-separated list files
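A colon-separated list is simply integer ids joined with `:` (Kaldi's `data/lang/phones/silence.csl` is one such file). Converting in both directions in Python:

```python
from typing import Iterable, List


def to_csl(ids: Iterable[int]) -> str:
    """Join integer ids with colons, e.g. [1, 2, 3] -> '1:2:3'."""
    return ":".join(str(i) for i in ids)


def from_csl(text: str) -> List[int]:
    """Split a colon-separated list (such as a .csl file's contents) into ints."""
    text = text.strip()
    return [int(t) for t in text.split(":")] if text else []
```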
