Julius accepts waveform input and extracted feature vector input. Waveform data can be given either as an audio file containing recorded speech or as a live audio stream from a capture device. You can also use feature vector input in HTK format.
This chapter describes the specification of audio input in Julius and related tools. For more details about runtime options relating to audio input, see the "Audio input" section of the reference manual.
The input speech should be quantized at 16 bits. 8-bit and 24-bit input are currently not supported.
The recorded data should have one channel. For live recognition via microphone, the device should support 1-channel recording. The exception is the OSS interface on Linux (-input oss): if only 2-channel (stereo) recording is available, Julius tries to record both channels and uses only the left channel data.
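The left-channel fallback can be pictured with a minimal sketch. It is not Julius's own code: it assumes interleaved 16-bit stereo samples (left, right, left, right, ...) in the machine's native byte order, and the helper name left_channel is hypothetical.

```python
from array import array


def left_channel(stereo_pcm: bytes) -> bytes:
    """Return only the left channel of interleaved 16-bit stereo PCM.

    Hypothetical helper for illustration; assumes native byte order.
    """
    samples = array("h")            # signed 16-bit samples
    samples.frombytes(stereo_pcm)   # ValueError if length is not a multiple of 2
    if len(samples) % 2 != 0:
        raise ValueError("stereo data must hold an even number of samples")
    # Samples alternate left, right, left, right, ...; keep every second one.
    return samples[0::2].tobytes()
```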
The sampling rate of the input should be given explicitly. If no option is given, the default sampling rate is 16,000 Hz. The option -smpFreq or -smpPeriod can be used to specify the rate either as a frequency in Hz or as a period in units of 100 ns, respectively. Alternatively, use the -htkconf option to give Julius the HTK Config file you used for AM training, in which case the value of SOURCERATE in the Config file is used.
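Since the two options express the same quantity in different units, the conversion is simple arithmetic: one second contains 10,000,000 units of 100 ns. The sketch below only illustrates that relation; the function names are hypothetical and are not part of Julius.

```python
UNITS_PER_SECOND = 10_000_000  # number of 100 ns units in one second


def freq_to_period(freq_hz: int) -> int:
    """Sampling frequency in Hz -> sampling period in 100 ns units."""
    return round(UNITS_PER_SECOND / freq_hz)


def period_to_freq(period_100ns: int) -> int:
    """Sampling period in 100 ns units -> sampling frequency in Hz."""
    return round(UNITS_PER_SECOND / period_100ns)


print(freq_to_period(16000))  # 625
```

Under this relation, a frequency of 16000 Hz and a period of 625 describe the same sampling rate.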
You should give the correct sampling rate based on the acoustic model you are going to use for recognition. The sampling rate of the input should be the same as the training condition of the acoustic model.
If you are going to use multiple acoustic models with different acoustic conditions, their sampling rates should be the same. You should give the same sampling rate parameters for all the acoustic models and for GMMs, if you use any. For more details, see the next chapter about feature extraction.
The given sampling rate acts as a requirement on the input. With live input such as microphone capture, the given sampling rate is set on the device and capturing begins at that rate. Julius exits with an error when the device does not support the sampling rate. When recognizing an audio file, on the other hand, the sampling frequency of the input file is checked against the given sampling rate, and the file is rejected if they do not match. [1]
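Before handing a WAV file to Julius, the same requirements (16 bit, one channel, matching sampling rate) can be checked by hand. The following is a minimal sketch using Python's standard wave module; it is not how Julius performs the check, and the function name wav_problems is hypothetical.

```python
import wave


def wav_problems(path: str, expected_rate: int = 16000) -> list:
    """List the ways a WAV file deviates from 16-bit mono at expected_rate."""
    problems = []
    with wave.open(path, "rb") as wav:
        bits = 8 * wav.getsampwidth()
        if bits != 16:
            problems.append(f"{bits}-bit samples, expected 16-bit")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected 1")
        if wav.getframerate() != expected_rate:
            problems.append(
                f"sampling rate {wav.getframerate()} Hz, expected {expected_rate} Hz"
            )
    return problems
```

An empty list means the header is consistent with the requirements. A headerless RAW file cannot be checked this way.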
The option -input rawfile tells Julius to read audio input from a file. You give the name of the file to be processed on the standard input of Julius. Multiple files can be processed one by one by listing the file names in a text file and specifying it with -filelist.
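A file list can be generated by a short script. This sketch assumes the list holds one file name per line; the helper name write_filelist is hypothetical, and the returned arguments contain only the options named above, so the model options for a real run still have to be added.

```python
from pathlib import Path


def write_filelist(audio_paths, list_path):
    """Write one audio file name per line and return the matching options."""
    text = "".join(f"{path}\n" for path in audio_paths)
    Path(list_path).write_text(text)
    # Only the input-related options; acoustic/language model options
    # must be supplied separately when actually starting Julius.
    return ["-input", "rawfile", "-filelist", str(list_path)]
```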
By default, Julius assumes that one file contains one sentence utterance, with a silent part at the beginning and end of the file. However, you can apply voice activity detection, silence cutting and other functions normally used for microphone input by specifying the corresponding options. You can also use a Julius function called "short-pause segmentation" to perform successive recognition of a long audio stream. See the corresponding chapter of this book for details.
Julius can read the following audio file formats by default:
Microsoft WAVE format WAV file (16bit, PCM (no compression), monaural)
RAW file: no header, signed short (16bit), Big Endian, monaural
If you use libsndfile with Julius, you can use additional formats such as AU, NIST, ADPCM and so on. libsndfile is used by Julius if the libsndfile development files (headers and libraries) are present on your system when you compile Julius from source.
Pay attention to the RAW file format. Julius accepts only Big Endian format. If you give it a Little Endian RAW file, Julius cannot detect this and outputs wrong results with no warning. You can convert the endianness using sox like this:
% sox -t .raw -s -w -c 1 infile -t .raw -s -w -c 1 -x outfile
You should also make sure that the RAW file has the correct properties (sampling rate etc.) for the acoustic model you use, since a RAW file carries no header information and Julius cannot check them automatically.
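When sox is not at hand, the byte order of 16-bit samples can also be swapped with a few lines of Python. This is a minimal sketch of the swap alone, assuming headerless 16-bit data; the function name is hypothetical. Like Julius, it cannot tell which byte order the input actually has, so apply it only to files known to be Little Endian.

```python
from array import array


def swap_16bit_endianness(raw_pcm: bytes) -> bytes:
    """Swap the two bytes of every 16-bit sample in headerless PCM data."""
    samples = array("h")           # 16-bit items
    samples.frombytes(raw_pcm)     # ValueError if length is not a multiple of 2
    samples.byteswap()             # reverse the byte order of each item in place
    return samples.tobytes()
```

Applying the function twice returns the original data.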
The option -input mic tells Julius to get audio input from a raw audio device such as a microphone or line input. This feature is OS dependent, and is supported on Linux, Windows, Mac OS X, FreeBSD and Solaris. [2]
Spoken regions are detected in the continuous input prior to the main recognition task. By default, sound input is detected by a simple level-based detection (level and zero-cross thresholds), and real-time recognition is then performed for each detected region. See the chapter on voice activity detection to tune the detection or to use advanced features such as GMM-based detection. Also see the notes on real-time recognition.
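The idea behind detection by level and zero-cross thresholds can be sketched as follows. This is a simplified illustration and not Julius's actual algorithm: it counts how often the signal swings from beyond one side of the level threshold to beyond the other, and treats a window as speech-like when that happens often enough per second. The function names and the threshold values 2000 and 60 are assumptions chosen for the example.

```python
def count_level_crossings(samples, level):
    """Count sign changes between excursions beyond +level and -level."""
    count = 0
    last_sign = 0
    for sample in samples:
        if sample >= level:
            sign = 1
        elif sample <= -level:
            sign = -1
        else:
            continue                 # below the level threshold: ignored
        if last_sign != 0 and sign != last_sign:
            count += 1
        last_sign = sign
    return count


def is_speech_like(samples, sample_rate, level=2000, crossings_per_sec=60):
    """True if the window crosses the level threshold often enough."""
    if not samples:
        return False
    duration_sec = len(samples) / sample_rate
    return count_level_crossings(samples, level) / duration_sec >= crossings_per_sec
```

Low-amplitude background noise never exceeds the level threshold and is therefore ignored, which is why the recording volume matters for this kind of detection.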
Julius does not handle any mixer settings of the machine. You should set the mixer properly yourself, including the recording volume and the capture device (microphone or line).
The recording quality GREATLY affects the recognition performance. Less distortion and less noise will improve the accuracy. You should also set a proper volume to avoid clipping on loud voices.
You can check how Julius hears the input audio. If you have a running Julius, the best way is to specify the option -record dir to save the processed audio data into files, one per sentence. Another way is to record audio with the tools in the Julius distribution, adinrec and adintool. They use the same functions as Julius, so what they record is what Julius will hear.
Julius has two sound API interfaces for Linux:
ALSA
OSS
When -input mic is specified, Julius uses the ALSA interface to capture audio. You can still explicitly specify which API to use with the option -input alsa or -input oss.
The sound card should support 16-bit recording. Julius uses monaural (1-channel) recording by default, but if you are using the OSS interface and only stereo recording is available, Julius will recognize the left channel. You can also use USB audio devices.
Another device can be selected by defining environment variables. When using the ALSA interface (the default), the default device name string is "default". The device name can be altered by the environment variable ALSADEV. For example, if you have multiple audio devices and set ALSADEV="plughw:1,0", Julius will listen to the second sound card. When using the OSS interface, the default device name is /dev/dsp, and it can be changed by the environment variable AUDIODEV.
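When Julius is started from a script, the device variables can be set for the child process only, leaving the parent environment untouched. This sketch only builds the environment; the helper name julius_environment is hypothetical.

```python
import os


def julius_environment(alsa_device=None, oss_device=None, base_env=None):
    """Return a copy of the environment with the capture device variables set."""
    env = dict(os.environ if base_env is None else base_env)
    if alsa_device is not None:
        env["ALSADEV"] = alsa_device    # read when the ALSA interface is used
    if oss_device is not None:
        env["AUDIODEV"] = oss_device    # read when the OSS interface is used
    return env
```

The result can be passed as the env argument of subprocess.run or subprocess.Popen together with the Julius command line, for example julius_environment(alsa_device="plughw:1,0") to select the second sound card.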
On Windows, Julius uses the DirectSound API via the PortAudio library. When using PortAudio V19, devices are searched in the order ASIO, DirectSound and MME. The recording device can be specified by the environment variable PORTAUDIO_DEV. When using PortAudio V19, the instructions for this are written to the log at audio initialization.
On Mac OS X, Julius uses CoreAudio API. It is confirmed to run on Mac OS X v10.3.9 and v10.4.1.
On FreeBSD, Julius uses the standard snd driver. If compilation fails, try --with-mictype=oss.
You may encounter a time delay in audio input and may want to minimize it. This section describes the reason for the delay and shows a method to reduce it.
Since most operating systems are not real-time systems, audio input is often buffered in small chunks (or fragments) on the kernel side. When a chunk has been filled by the capture device, it is passed to the user process, so the input is delayed by the length of the chunk. You can set the size of a chunk with the environment variable LATENCY_MSEC (the value is in milliseconds, not the byte size!). The default value depends on the OS, and is output to the tty at startup. Setting a smaller value decreases the delay, but the CPU load gets higher and may slow down the whole system.
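To relate a latency in milliseconds to the amount of audio data it represents, multiply out the sampling rate and sample size. The sketch below is plain arithmetic for 16-bit monaural input, not a description of how any particular driver sizes its fragments; the function name is hypothetical.

```python
def chunk_size_bytes(latency_msec, sample_rate=16000, bytes_per_sample=2, channels=1):
    """Bytes of audio captured during latency_msec milliseconds."""
    return latency_msec * sample_rate * bytes_per_sample * channels // 1000


print(chunk_size_bytes(50))  # 1600
```

At 16,000 Hz, a 50 ms chunk therefore holds 800 samples, or 1600 bytes.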
-input adinnet makes Julius receive an audio stream via a network socket. The protocol is a specific one that just sends a sequence of audio samples in small packets. There is no detailed documentation for the protocol, but it is basic and very simple, since it has no encryption or encode/decode features.
adintool implements the protocol. You can test it with adintool like this:
Run Julius with network input (it will stop and wait for a connection):
% julius .... -input adinnet -freq srate
Run adintool to send audio to Julius (server_hostname should be the host where the above Julius is running):
% adintool -in mic -out adinnet -server server_hostname -freq srate
-input esd tells Julius to get audio input via the EsounD daemon (esd). This is supported on Linux. esd is an audio daemon used to share audio I/O among multiple applications. For more details, see the esd manual.
The option -input stdin makes Julius read input from standard input. Only the RAW file format is supported with this option.
Julius can read a feature vector file that has already been extracted from speech data by other applications such as HTK. You can use this to recognize with acoustic features that Julius itself cannot extract. The supported file format is the HTK feature file format.
-input htkparam or -input mfcfile tells Julius to read the file as a feature vector file. As with audio file input, multiple file names can be given by listing them one per line in a text file and specifying that file with -filelist.
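To see what a feature file claims to contain before giving it to Julius, its header can be inspected. The sketch below assumes the standard HTK parameter file layout as described in the HTK Book: a 12-byte big-endian header holding the number of frames, the frame period in 100 ns units, the bytes per frame and the parameter kind code. The function name and the returned keys are hypothetical, and the tables should be verified against the HTK documentation.

```python
import struct

# Base parameter kinds, indexed by the low 6 bits of the kind code.
BASE_KINDS = ["WAVEFORM", "LPC", "LPREFC", "LPCEPSTRA", "LPDELCEP", "IREFC",
              "MFCC", "FBANK", "MELSPEC", "USER", "DISCRETE", "PLP"]
# Qualifier bits (octal) and their suffix letters.
QUALIFIERS = [(0o100, "E"), (0o200, "N"), (0o400, "D"), (0o1000, "A"),
              (0o2000, "C"), (0o4000, "Z"), (0o10000, "K"), (0o20000, "0")]


def read_htk_header(path):
    """Read the 12-byte header of an HTK parameter file."""
    with open(path, "rb") as f:
        header = f.read(12)
    if len(header) != 12:
        raise ValueError("file too short for an HTK header")
    n_frames, period, frame_bytes, kind_code = struct.unpack(">iiHH", header)
    base = kind_code & 0o77
    if base >= len(BASE_KINDS):
        raise ValueError(f"unknown base parameter kind {base}")
    name = BASE_KINDS[base] + "".join(
        f"_{letter}" for bit, letter in QUALIFIERS if kind_code & bit)
    return {"n_frames": n_frames, "frame_period_100ns": period,
            "bytes_per_frame": frame_bytes, "kind": name}
```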
When the input is a feature vector file, its feature type is checked against the acoustic model you are going to use. When they do not match, Julius first examines the difference. If their base forms (basic types) are the same and only the qualifiers below differ, Julius modifies the input vectors to match the acoustic model and uses them. Otherwise, Julius outputs an error and ignores the input.
addition / removal of delta coef. (_D)
addition / removal of acceleration coef. (_A)
suppression of energy (_N)
Please note that this check can be disabled with the option -notypecheck.
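The decision rule of the type check can be modeled on parameter kind names such as MFCC_E_D_A. The sketch below is a simplified reading of the rule as stated in this section, not Julius's implementation, which may be more restrictive about which conversions are actually possible. The function name and the return values "match", "adapt" and "error" are hypothetical.

```python
ADAPTABLE_QUALIFIERS = {"D", "A", "N"}  # the qualifiers listed in this section


def classify_kind_mismatch(input_kind: str, model_kind: str) -> str:
    """Classify how an input feature kind relates to the model's kind."""
    in_base, *in_quals = input_kind.upper().split("_")
    model_base, *model_quals = model_kind.upper().split("_")
    if in_base != model_base:
        return "error"                      # different base forms
    difference = set(in_quals) ^ set(model_quals)
    if not difference:
        return "match"                      # identical types
    if difference <= ADAPTABLE_QUALIFIERS:
        return "adapt"                      # only _D, _A or _N differ
    return "error"                          # some other qualifier differs
```

For example, an MFCC_E_D_A input for an MFCC_E_D model yields "adapt", while an FBANK_D input for the same model yields "error".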
Julius ver. 4.1 and later can extend its audio interface through external plugins. When your target OS is not supported by Julius, or you want to add some network-based input to Julius, you can develop a plugin to enable it. See the chapter describing plugin development for more details.
[1] Please note that this sampling rate check does not work for RAW file input, since a RAW file has no header information.