SignTutor: An Interactive System for Sign Language Tutoring

Oya Aran, Ismail Ari, Lale Akarun, and Bülent Sankur
Boğaziçi University
Alexandre Benoit and Alice Caplier
GIPSA-Lab
Pavel Campr
University of West Bohemia in Pilsen
Ana Huerta Carrillo
Technical University of Madrid
François-Xavier Fanard
Université Catholique de Louvain
Language learning can only advance with practice and corrective feedback. The interactive system, SignTutor, evaluates users' signing and gives multimodal feedback to help improve signing.
Sign language recognition (SLR) is a multidisciplinary research area involving pattern recognition, computer vision, natural language processing, and linguistics. The concept presents a multifaceted problem not only because of the complexity of the visual analysis of hand gestures, but also due to the highly multimodal nature of sign languages. Although sign languages are well-structured languages with a phonology, morphology, syntax, and grammar, they are different from spoken languages. The structure of a spoken language makes use of words sequentially, whereas a sign language makes use of several body movements in parallel. The linguistic characteristics of sign language are different from those of spoken languages due to the existence of several context-affecting components, such as facial expressions and head movements in addition to hand movements.1,2
Practice can significantly enhance the learning
of a language when there is validation and feedback. This is true for both spoken and sign languages. For spoken languages, students can
evaluate their own pronunciation and improve
to some extent by listening to themselves. Similarly, sign language teachers suggest that their students practice in front of a mirror. An SLR system
lets students practice, validate, and evaluate their
signing. Such systems, including our SignTutor,
can be instrumental in assisting sign language
education, especially for non-native signers.
With our SignTutor system, we aim to teach the basics of sign language interactively. SignTutor automatically evaluates the student's signing and provides visual feedback along with information about the success of the performed sign.
The SignTutor interactive platform enables
users to watch and learn new signs, practice
them, and validate their performance. SignTutor communicates to students with various
feedback modalities: text message, recorded
video of the user, video of the segmented
hands, or avatar animation. A demonstration
video of SignTutor can be downloaded at
http://www.cmpe.boun.edu.tr/pilab/pilabfiles/
demos/signtutor_demo_DIVX.avi.
Automatic sign language recognition
A brief survey of sign language grammars
illustrates the challenges faced in developing
automated learning and tutoring systems. Sign
language phonology makes use of hand shape,
place of articulation, and movement. The morphology uses directionality, aspect, and numeral
incorporation, while the syntax uses spatial
localization and agreement as well as facial
expressions. The whole message is contained
not only in hand gestures and shapes (manual
signs) but also in facial expressions and head/
shoulder motion (nonmanual signs). As a consequence, the language is intrinsically multimodal. Hence, SLR is a complex task that uses
hand-shape recognition, gesture recognition,
face-and-body-part detection, and facial-expression recognition as basic building blocks.3
Pioneering research on hand-gesture recognition and on SLR has mainly used instrumented
gloves, which provide accurate data for hand
position and finger configuration. These systems
require users to wear cumbersome devices on
their hands. However, humans would prefer systems that operate in their natural environment.
Since the mid 1990s, improvements in camera
hardware have enabled real-time, vision-based,
hand-gesture recognition,4 which does not use
instrumented gloves and instead requires one
or more cameras connected to the computer.
While vision-based SLR systems provide a more
user-friendly environment, they also introduce
several new challenges, such as the detection
and segmentation of the hand and finger configuration, or handling occlusions.
Because signs consist of dynamic elements, a
recognition system that focuses only on the
static aspects of the signs has a limited vocabulary. Hence, for recognizing hand gestures and
signs, we must use methods capable of modeling the gestures’ inherent temporal characteristics. Researchers have used several methods
such as neural networks, hidden Markov models (HMMs), dynamic Bayesian networks, finite-state machines, or template matching.5 Among
these methods, HMMs have been used the
most extensively and proven successful in several kinds of SLR systems.
Initial studies on vision-based SLR focused on
limited-vocabulary systems, which could recognize 40 to 50 signs. These systems were capable
of recognizing isolated signs and also continuous sentences, but with constrained sentence
structure.6 One study addressed the scalability problem by adopting an approach based on identifying sign phonemes and components rather than the whole sign.7 The advantage of
identifying components is the decrease in the
number of subunits that the system needs to
model; subunits are then used to represent all
the signs in the vocabulary. With component-based systems, large-vocabulary SLR can be
achieved to a certain degree. For example, in
an instrumented glove-based system, research
has recognized a database of 5,119 signs.8
Another important aspect of sign language
is the use of nonmanual components concomitant with hand gestures. Most of the SLR systems in the literature concentrate on hand-gesture analysis only. However, without integrating nonmanual signs, it's not possible to
interpret the meaning of all these signs. There
are only a limited number of studies that integrate nonmanual and manual cues for SLR.3
Current multimodal SLR systems either integrate lip motion and hand gestures, or only
classify and incorporate either the facial expression9 or the head motion.10
Recognizing unconstrained, continuous sign
sentences is another challenging problem in
SLR. The significance of coarticulation effects
necessitates the use of language models similar
to those used in speech recognition. These
complex problems require adequate expertise
and technology, such as high-speed cameras
that have higher resolution than those commercially available. An ideal automatic SLR system should accurately recognize a large
vocabulary set enacted in continuous sentences. Such a system should operate in real time
and be able to handle different lighting and
environmental conditions. It should exploit
both the manual and nonmanual signs and
the grammar and syntax of the sign language.
There are many potential application areas
of SLR systems that require sign-to-text translation. These applications include human-computer interaction; public information
access, such as kiosks; or translation and dialog
systems for human-to-human communication.
An interesting application area is communicating sign data over a channel. The sign is captured on the sender side and the SLR output is
sent to the receiving side where it’s synthesized
and displayed on an avatar. This scheme is
more flexible and more bandwidth efficient
than simply sending the captured sign videos.
SignTutor overview
One of the key factors of SignTutor is that it
integrates hand-motion and hand-shape analysis with head-motion analysis to recognize
signs that include both hand gestures and
head movements. This factor is important
because head movements are one aspect of
sign language that most students find hard to
perform simultaneously with hand gestures.
To offer students a learning advantage, in SignTutor we focus mostly on the signs that have
similar hand gestures and that are mainly differentiated by head movements. In view of
the prospective users and use environments of
SignTutor, we have opted for a vision-based,
user-friendly system compatible with easy-to-obtain equipment, such as webcams. The system operates with a single camera focused on
the user’s upper body.
Figure 1a shows the graphical user interface of SignTutor. The system follows three steps for teaching a new sign: training, practice, and feedback. For the training phase, SignTutor provides a prerecorded reference video for each sign. The users select a sign from the list of possible signs and watch the prerecorded instruction video of that sign until they are ready to practice. In the practice phase, users are asked to perform the selected sign; their performance is recorded by the webcam. For an accurate analysis, users are expected to face the camera, with the full upper body in the camera's field of view. The recording continues until both hands move beyond the camera's field of view. SignTutor analyzes the hand motion, hand shape, and head motion in the recorded video and compares it with the selected sign.

Figure 1. SignTutor GUI. (a) Training, practice, information, and synthesis panels. (b) Feedback examples.
We have integrated several ways for the system to give users feedback regarding the quality
of the enacted sign. The system offers feedback
on student success with two components—
manual (hand motion, shape, and position) and
nonmanual (head and facial feature motion)—
along with the sign name (see Figure 1b).
Users can watch the video of their original performance. If they performed the sign properly,
they can watch an avatar perform an animated
version of their performance.
SignTutor modules
SignTutor consists of a face and hand detection stage, followed by the analysis stage and the final sign classification, as Figure 2
shows. The critical part of SignTutor is the analysis and recognition subsystem, which receives
the camera input, detects and tracks the hand,
extracts features, and classifies the sign.
Hand detection and segmentation
Although skin color features can be applied
for hand segmentation in controlled illumination, segmentation becomes problematic
when skin regions overlap and occlude each
other. In sign language, hand positions are
often near the face and sometimes have to be
in front of the face. Hand detection, segmentation, and occlusion problems are simplified
when users wear colored gloves. The use of a simple marker such as a colored glove makes the system adaptable to changing background and illumination conditions.
For each glove color, we train a histogram of color components using several images. We chose the hue, saturation, and value (HSV) color space for its robustness to changing illumination conditions.11
Figure 2. SignTutor system flow: detection, analysis, and classification steps. The sign language video feeds hand detection and segmentation as well as face detection; the analysis stage extracts hand motion (center of mass, velocity), hand shape (area, orientation, ...), hand position (with respect to the face and the other hand), head motion (neutral, no, yes, turn, ...), and eye, eyebrow, and mouth motion (open, close, up, down, ...); a hidden Markov model classifier makes the final decision.
The HSV components are divided into bins. For each histogram bin, we count the number of pixels falling into that bin and normalize the histogram. To detect hand pixels in a scene, we look up the histogram bin each pixel belongs to and apply thresholding. To ensure connectivity, we apply double thresholding with a low and a high value: a pixel whose histogram value is higher than the high threshold is labeled a hand pixel, and a pixel whose value lies between the two thresholds is still labeled a glove (hand) pixel, provided one or more of its neighbors is labeled a glove pixel. We take the final hand region to be the largest connected component of the detected pixels.
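To make the segmentation procedure concrete, here is a small sketch of histogram back-projection with a two-level (hysteresis-style) threshold. It is only an illustration of the idea described above, not the authors' implementation: the bin counts, threshold values, and function names are our own, and we apply the weak/strong test per connected component rather than per neighboring pixel.

```python
import numpy as np
from scipy import ndimage

def train_glove_histogram(glove_pixels_hsv, bins=(16, 16, 16)):
    """Build a normalized HSV histogram from sample glove pixels (N x 3, values in [0, 1])."""
    hist, edges = np.histogramdd(glove_pixels_hsv, bins=bins, range=[(0.0, 1.0)] * 3)
    return hist / max(hist.sum(), 1.0), edges

def detect_hand(hsv_image, hist, edges, low=0.002, high=0.01):
    """Back-project the glove histogram and apply a two-level (hysteresis) threshold."""
    # Find the histogram bin of every pixel and read back the bin's normalized count.
    idx = [np.clip(np.digitize(hsv_image[..., c], edges[c][1:-1]), 0, hist.shape[c] - 1)
           for c in range(3)]
    backproj = hist[idx[0], idx[1], idx[2]]
    strong = backproj > high          # confident glove pixels
    weak = backproj > low             # candidate glove pixels
    # Keep weak pixels only if their connected component contains a strong pixel.
    labels, _ = ndimage.label(weak)
    keep = np.unique(labels[strong])
    mask = np.isin(labels, keep[keep > 0])
    # The final hand region is the largest connected component of the kept pixels.
    labels, n = ndimage.label(mask)
    if n == 0:
        return np.zeros(hsv_image.shape[:2], dtype=bool)
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    return labels == (int(np.argmax(sizes)) + 1)
```

The component-level hysteresis used here is a slight generalization of the per-pixel neighbor test described in the text; both keep mid-confidence pixels only when they are attached to confidently detected glove pixels.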
Analysis of hand motion
The system processes the motion of each
hand by tracking its center of mass and estimating in every frame the position and velocity of
each segmented hand. However, the hand trajectories can be corrupted by segmentation
noise. Moreover, hands might occlude each
other or there might be sudden changes in
lighting (for example, a room light turned on
or off), which could result in detection and segmentation failures. Thus, we use two independent Kalman filters, one for each hand, to
smooth the estimated trajectories. We use a
constant velocity motion model to approximate
each hand’s motion, neglecting acceleration.
When the system detects a hand in the video,
it initializes the corresponding Kalman filter.
For each new frame, the Kalman filter predicts the new hand position and then updates the filter parameters with the hand-position measurements found by the hand-segmentation step. We calculate the hand-motion features
from the posterior states of the corresponding
Kalman filter, using x- and y-coordinates of center of mass and velocity.12 When there is a
detection failure due to occlusion or bad lighting, we only use the Kalman filter prediction
without updating the filter parameters. The system assumes that the hand is out of the camera’s
view if it detects no hand segment for a number
of consecutive frames.
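The constant-velocity tracker described above can be sketched as follows; the state layout, noise covariances, and missed-frame limit are illustrative assumptions rather than the paper's actual parameter values.

```python
import numpy as np

class HandTracker:
    """Constant-velocity Kalman filter for one hand; state is [x, y, vx, vy]."""

    def __init__(self, first_position, q=1e-2, r=1.0, max_missed=10):
        self.x = np.array([first_position[0], first_position[1], 0.0, 0.0])
        self.P = np.eye(4)
        self.F = np.array([[1, 0, 1, 0],      # x += vx per frame
                           [0, 1, 0, 1],      # y += vy per frame
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # only position is measured
        self.Q = q * np.eye(4)                # process noise (assumed value)
        self.R = r * np.eye(2)                # measurement noise (assumed value)
        self.missed = 0
        self.max_missed = max_missed

    def step(self, measurement=None):
        # Prediction with the constant-velocity model (acceleration neglected).
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        if measurement is None:
            self.missed += 1                  # occlusion or failure: keep the prediction only
        else:
            self.missed = 0
            innovation = np.asarray(measurement, dtype=float) - self.H @ self.x
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ innovation
            self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x.copy()                  # posterior [x, y, vx, vy] used as motion features

    def hand_left_view(self):
        return self.missed >= self.max_missed
```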
We further normalize the trajectories to obtain
translation and scale invariance. To do so, we use
a normalization strategy similar to that used in
other work.12 We calculate the normalized trajectory coordinates via minimum-maximum normalization. We handle the translation normalization by calculating the midpoints of the range of the x- and y-coordinates, denoted $x_m$ and $y_m$. We select the scaling factor $d$ as the maximum of the spreads in the x- and y-coordinates, because scaling horizontal and vertical displacements with different factors would disturb the shape. The normalized trajectory coordinates $(\langle x'_1, y'_1 \rangle, \ldots, \langle x'_t, y'_t \rangle, \ldots, \langle x'_N, y'_N \rangle)$, with $0 \le x'_t, y'_t \le 1$, are then calculated as follows:

$$x'_t = 0.5 + 0.5\,(x_t - x_m)/d$$
$$y'_t = 0.5 + 0.5\,(y_t - y_m)/d$$
Because signs can be two-handed, we must normalize both hand trajectories. However, normalizing the trajectory of the two hands
independently might result in data loss. To
solve this problem, we jointly calculate the midpoint and the scaling factor of the left and right
hand trajectories. Following the trajectory normalization, we translate the left and right hand trajectories so that each starts at (0, 0).
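Under our reading of the formulas above, the joint normalization of a two-handed sign could look like this minimal sketch; the function name and array layout are our own.

```python
import numpy as np

def normalize_trajectories(left, right):
    """Jointly min-max normalize two hand trajectories (each an N x 2 array of x, y).

    A common midpoint and a single scale are used for both hands, and each
    trajectory is finally translated so that it starts at (0, 0).
    """
    both = np.vstack([left, right])
    lo, hi = both.min(axis=0), both.max(axis=0)
    mid = (lo + hi) / 2.0                      # joint x_m, y_m
    d = max(float((hi - lo).max()), 1e-9)      # the larger of the x and y spreads

    def norm(t):
        return 0.5 + 0.5 * (t - mid) / d

    left_n, right_n = norm(left), norm(right)
    return left_n - left_n[0], right_n - right_n[0]
```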
Extracting features from a 2D hand shape
Hand shape and finger configuration contribute significantly to sign identification
because each sign has a specific related movement of the head, hands, and hand postures.
Moreover, there are signs that solely depend
on hand shape. We intend our system to
work with a single, low-resolution camera
whose field of view covers the user’s upper
body and is not directly focused on the
hands. In this setup, we face several difficulties:
• low resolution (less than 80 × 80 pixels) of the hand images,
• segmentation errors due to blurring caused by fast movement of the hands, and
• the same binary image, that is, a silhouette, resulting from more than one hand posture.
These problems constrain us to using only
low-level features that are resistant to segmentation errors and work well with low-resolution
images. To accommodate these requirements,
we use simple appearance-based shape features
calculated from the hand silhouettes and select
the features to reflect differences in hand shape
and finger postures. The shape features need to
be scale-invariant so that hands with similar
shape but a different size result in the same
feature values. However, because our system uses a single camera, we don't have depth information, except for the foreshortening due to perspective. To maintain some information about the z-coordinate (depth), we don't scale-normalize five of the 19 features. Prior to calculating the hand-shape features, we take a mirror reflection of the right hand and analyze both hands in the same geometry, with the thumb to the right. Table 1 lists all 19 features. These features are not invariant to viewpoint; users are required to sign facing the camera for an accurate analysis. The classifier is tolerant of the small rotations that can naturally occur while signing.

Seven of the features (1, 2, 4, 5, 6, 7, and 8) are based on the ellipse that best fits the hand silhouette (in the least-squares sense). The inclination angle assumes values in the range [0, 360]. To represent the four-fold symmetry of the ellipse, we use sin(2α) and cos(2α) as features, where α is in the range [0, 180]. Features 9 through 16 are based on area filters: we divide the hand's bounding box into eight areas and calculate the percentage of hand pixels in each. We normalize all 19 hand-shape features into values between 0 and 1. We obtain this normalization by dividing the percentage features (9 through 16) by 100, and the cardinal-number features by their range, that is, by using $F_n = (F - \min)/(\max - \min)$, where min is the minimum value of the feature in the training dataset and max is the maximum value. Any value exceeding the [0, 1] interval is truncated.

Table 1. Hand-shape features (19 in total).

| Feature group | Features |
|---|---|
| Silhouette | Total area (pixels); bounding-box width; bounding-box height; compactness (perimeter²/area) |
| Best-fitting ellipse | Ellipse width; ellipse height; elongation (ratio of ellipse major/minor axis length); sin(2α) and cos(2α), where α is the angle of the ellipse major axis; ratio of hand pixels outside/inside the ellipse; ratio of hand/background pixels inside the ellipse |
| Area filters (bounding box divided into eight areas) | Percentage of the NW, N, NE, E, SE, S, SW, and W areas filled by the hand |

Head-movement analysis

Once we detect the face, we determine rigid head motions, such as head rotations and head nods, by using an algorithm inspired by the human visual system.13 First, we apply a filter following the model of the human retina.14 This filter enhances moving contours with an outer plexiform layer and cancels static ones with an inner plexiform layer. This prefiltering mitigates illumination changes and noise. Second, we compute the fast Fourier transform of the filtered image in the log-polar domain as a model of the primary visual cortex.15 This step extracts two types of features: the quantity of motion and motion event alerts. In parallel, an optic flow algorithm extracts orientation and velocity information only on the motion event alerts issued by the visual-cortex stage.16 Thus, after each motion alert, we estimate the head velocity at each frame. Figure 3 shows a block diagram of the algorithm. After these three steps, the head analyzer outputs three features per frame: the quantity of motion and the vertical and horizontal velocities, as Figure 4 shows.

Figure 3. Algorithm for rigid head-motion data extraction: retina filtering (outer and inner plexiform layers) computed on all frames, face detection, virtual-cortex processing on the face area (fast Fourier transform and total-energy computation), and optic-flow computation, yielding the temporal motion, vertical-velocity, and horizontal-velocity outputs.

Figure 4. Head-analyzer output for two example videos: (a) with horizontal head motion and (b) with vertical head motion. The first row shows the inner-plexiform-layer energy, that is, the quantity of head motion at each frame; the second row shows the velocity at each frame, where the red line indicates the horizontal velocity and the blue line indicates the vertical velocity.
These three features provided by the head-motion analyzer can vary with different performances of the same sign. Moreover, head motion is not directly synchronized with hand motion. To handle the inter- and intrasubject differences, we apply weighted-average smoothing to the head-motion features, with $a = 0.5$: the smoothed head-motion feature vector at time $i$, $F_i$, is calculated as $F_i = a F_i + (1 - a) F_{i-1}$. This smoothing has the effect of mitigating the noise between different performances of a sign and creating a slightly smoother pattern.

Preprocessing of sign sequences

The video sequences obtained contain frames where the signer is not performing any sign (the beginning and terminating parts) as well as transition frames. We eliminate these frames by considering the hand-segmentation step's result as follows (a sketch of this pruning appears after the list):

• Eliminate all frames at the beginning of the sequence until a hand is detected.
• If no hand is detected for fewer than N consecutive frames, the shape information is copied from the last frame in which a hand was still detected to the current frame.
• If no hand is detected for more than N frames, we assume the sign is finished and eliminate the remaining frames, including the last N frames.

After these prunings, we delete transition frames from the start and end of the sequence.
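The pruning rules above can be sketched as follows; the frame representation, the helper name, and the value of N (max_gap) are our own assumptions, and the final removal of transition frames is omitted.

```python
def prune_sequence(frames, max_gap):
    """Trim non-signing frames using per-frame hand-detection results.

    frames: list of dicts with keys 'hand_detected' (bool) and 'shape' (feature vector).
    max_gap: the N of the text, the longest tolerated run of missed detections.
    """
    # Drop leading frames until a hand is first detected.
    start = next((i for i, f in enumerate(frames) if f['hand_detected']), len(frames))
    frames = frames[start:]

    pruned, gap = [], 0
    for frame in frames:
        if frame['hand_detected']:
            gap = 0
            pruned.append(frame)
        elif gap < max_gap:
            # Short gap: reuse the shape features of the last detected frame.
            gap += 1
            pruned.append({**frame, 'shape': pruned[-1]['shape'] if pruned else None})
        else:
            # Gap longer than N frames: the sign is over; drop the copied gap frames too.
            pruned = pruned[:len(pruned) - gap]
            break
    return pruned
```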
Cluster-based classification
At the classification phase, we use HMMs to
model each sign and classify with maximum
likelihood criterion. HMMs are used in many
areas, such as speech recognition and bioinformatics, for modeling variable-length sequences
and dealing with temporal variability within
similar sequences.17 HMMs are therefore preferred in gesture and sign recognition to successfully handle changes in the speed of the
performed sign or slight changes in the spatial
domain.
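As an illustration of this modeling choice, the sketch below builds one continuous, left-to-right HMM per sign and classifies by maximum likelihood, mirroring the four-state models described later in this section. The article does not name a toolkit; hmmlearn and all function names here are our assumptions.

```python
import numpy as np
from hmmlearn import hmm   # assumed toolkit; the article does not name one

def make_sign_hmm(n_states=4):
    """One continuous, left-to-right HMM per sign."""
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=20, init_params="mc", params="mct")
    model.startprob_ = np.eye(n_states)[0]          # always start in the first state
    transmat = np.zeros((n_states, n_states))
    for s in range(n_states - 1):
        transmat[s, s] = transmat[s, s + 1] = 0.5   # stay, or move one state forward
    transmat[-1, -1] = 1.0
    model.transmat_ = transmat
    return model

def train_sign_models(training_data):
    """training_data: {sign_name: list of (T_i x D) feature sequences}."""
    models = {}
    for sign, sequences in training_data.items():
        stacked = np.vstack(sequences)
        lengths = [len(seq) for seq in sequences]
        model = make_sign_hmm()
        model.fit(stacked, lengths)                 # Baum-Welch re-estimation
        models[sign] = model
    return models

def classify(models, sequence):
    """Maximum-likelihood decision over the per-sign HMMs."""
    return max(models, key=lambda sign: models[sign].score(sequence))
```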
We use a sequential-fusion method for combining the sign’s manual and nonmanual parts.
The strategy relies on the fact that there might
be similar signs that differ slightly and cannot
be classified accurately. These signs form a cluster; their intracluster classification must be
handled specially. Thus, we based our sequential fusion method on two successive classification steps: intercluster and intracluster
classifications. Because we want our system to
be as general as possible and open to new
signs, we don’t use any prior knowledge about
the sign clusters. We let the system discover
potential sign clusters that are similar in manual gestures but that differ in nonmanual
motions. Instead of using rule-based programming, we extract the cluster information as a
part of the recognition system.
We make the base decision using a general model, HMM_Manual&Nonmanual (abbreviated HMM_M&N), which uses both hand and head features in the same feature vector. However, this approach suffers from high dimensionality and does not effectively use the head information. We therefore also use the head information in a dedicated model, HMM_N, which determines sign likelihoods and gives the final decision.
The system's training is two-fold. First, the system trains the HMM models, both HMM_M&N and HMM_N. Next, it extracts the cluster information via the joint confusion matrix of HMM_M&N; summing the validation sets' confusion matrices in a cross-validation stage forms this joint confusion matrix. The system then investigates the misclassifications using the joint confusion matrix. If the system correctly classifies all samples of a sign class, that sign's cluster contains only itself. Otherwise, for each misclassification, the system marks the confused sign as potentially belonging to the cluster. A marked sign becomes a cluster member when its number of misclassifications exceeds a given threshold; we use 10 percent of the total number of validation examples as the threshold value.
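A sketch of this cluster-discovery step might look like the following; the data layout and function name are ours, and we read "misclassifications" as the off-diagonal counts of the summed confusion matrix.

```python
def discover_clusters(joint_confusion, n_validation_per_class, threshold_ratio=0.10):
    """Derive a cluster of confusable signs for every sign class.

    joint_confusion: 2D NumPy array; joint_confusion[i, j] is the number of
    validation samples of sign i that HMM_M&N classified as sign j, summed
    over the cross-validation folds.
    """
    n_signs = joint_confusion.shape[0]
    threshold = threshold_ratio * n_validation_per_class
    clusters = {}
    for i in range(n_signs):
        cluster = {i}                      # a correctly classified sign keeps only itself
        for j in range(n_signs):
            if j != i and joint_confusion[i, j] > threshold:
                cluster.add(j)             # sign j is frequently confused with sign i
        clusters[i] = cluster
    return clusters
```

At test time, clusters[k] tells the second stage which competing signs to rescore when the base decision is sign k.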
The fusion strategy for a test sample is as follows (a code sketch of this rule appears at the end of this section):

• Calculate the likelihoods of HMM_M&N for each sign class and select the sign class with the maximum likelihood as the base decision.
• Send the selected sign and its cluster information to HMM_N.
• Combine the likelihoods of HMM_M&N and HMM_N with the sum rule and select the sign with the maximum likelihood as the final decision.

The classification module receives all the features calculated in the hand- and head-analysis modules as input. For each hand, there are four hand-motion features (position and velocity in the vertical and horizontal coordinates), 19 hand-shape features, and three head-motion features per frame. In addition to these features, we use each hand's relative position with respect to the face's center of mass, yielding two additional features (distance in the vertical and horizontal coordinates) per hand, per frame. We normalize these distances by the face height and width, and we use the normalized and preprocessed sequences to train the HMM models. Each HMM is a continuous, four-state, left-to-right model and is trained for each sign using the Baum-Welch algorithm.
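Putting the two stages together, the decision rule referenced in the list above might be sketched as follows. We assume per-sign scoring functions for the two model sets and apply the sum rule directly to their scores; whether the original system sums likelihoods, normalized likelihoods, or log-likelihoods is not specified, so this is one possible reading.

```python
import numpy as np

def classify_sign(features, score_mn, score_n, clusters):
    """Two-step, cluster-based fusion of the manual+nonmanual and nonmanual models.

    score_mn, score_n: functions mapping a feature sequence to an array of
    per-sign scores for HMM_M&N and HMM_N, respectively.
    clusters: mapping from a sign index to its set of confusable sign indices.
    """
    mn = score_mn(features)
    base = int(np.argmax(mn))                # step 1: base decision from HMM_M&N
    cluster = sorted(clusters[base])
    if len(cluster) == 1:
        return base                          # nothing to disambiguate
    n = score_n(features)
    # Step 2: within the cluster, combine the two models with the sum rule.
    combined = {sign: mn[sign] + n[sign] for sign in cluster}
    return max(combined, key=combined.get)
```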
Visual feedback
As one of the feedback modalities, SignTutor
provides a simple animated avatar that mimics
the users’ performance for the selected sign.
The synthesis is based on the features extracted
in the analysis part. Currently the avatar only
mimics the users’ head motion and animates
the hand motions from a library.
Our head-synthesis system is based on the
MPEG-4 Facial Animation Standard.18 As
input, it receives the motion energy, vertical
and horizontal head-motion velocities, and target sign. It then filters and normalizes this data
to compute the head motion during the considered sequence. The processing result is
expressed in terms of facial action points,
which the system feeds into the animation
player. For head animation, we use XFace, a
3D-talking-head-animation player compliant
with MPEG-4.19 We based the hands- and
arms-synthesis system on OpenGL and prepared an animation explicitly for each sign.
We merge the head animation with the hands
and arms animation to form an avatar with a
full upper body.
System accuracy evaluation
We evaluate the recognition performance of our system on a dataset of 19 signs from ASL. We selected these signs to emphasize nonmanual signs, the effect of small modifications of manual signs, and hand-head coordination. There are eight base signs that represent words, together with their systematic variations in the form of nonmanual signs, or inflections, in the signing of the same manual sign.3 Table 2 lists the signs we used for evaluation; for each sign, we recorded five repetitions from eight subjects. In Table 2, we show the base signs and their variants with the type of variation: a hand-motion variation, a nonmanual sign, or both. For the rest of this article, we assume that a base sign and its variants form a semantic cluster; we call these clusters base sign clusters.
Table 2. ASL signs used in SignTutor: the eight base signs and their variants (19 signs in total). Each variant differs from its base sign by a hand-motion variation, a head-motion (nonmanual) variation, or both.

| Base sign | Variants |
|---|---|
| Clean | Clean; Very clean |
| Afraid | Afraid; Very afraid |
| Fast | Fast; Very fast |
| Drink | To drink; Drink (noun) |
| Open (door) | To open door; Open door (noun) |
| Here | [Somebody] is here; Is [somebody] here?; [Somebody] is not here |
| Study | Study; Study continuously; Study regularly |
| Look at | Look at; Look at continuously; Look at regularly |
Classifiers
We compare three different classifiers to
show the effect of using the nonmanual information with different fusion techniques on
the classification accuracy: classification using
only manual information, feature fusion of
manual and nonmanual information, and
sequential fusion of manual and nonmanual
information. For classification using only manual information, we perform this classification
via HMMs that are trained only with the
hand gesture information related to the signs.
Because hands form the basis of the signs,
these models need to be very accurate in terms
of classification. However, absence of the
head-motion information precludes correct
classification when signs differ only in head
motion. We denote these models as HMM_M.
We can also combine the manual and nonmanual information in a single feature vector to jointly model the head and hand information. However, because there is no direct synchronization between hand and head motions, these models do not exploit the head information effectively, and using it in this way yields only a slight performance increase. We denote these models as HMM_M&N.
With sequential fusion of manual and nonmanual information, our aim, as explained in the previous section, is to apply a two-tier, cluster-based, sequential-fusion strategy. The first step identifies the performed sign's cluster with the general model, HMM_M&N; the second step resolves the confusion inside the cluster with the dedicated model, HMM_N, which uses only head information. The head motion is complementary to the sign and thus can't be used alone to classify the signs. However, in the sequential-fusion methodology, denoted HMM_M&N → HMM_N, the head information is used to perform the intracluster classification. We report the results of our sequential-fusion approach with different clusters, first on base sign clusters and then on automatically identified clusters based on the joint confusion matrix.
The merit of a recognition system is its
power to generalize a learned concept to new
instances. In SLR, there are two kinds of new
instances for any given sign: signing of somebody whose videos are in the training set
and signing of a new person. For experimental
purposes, the former is signer-dependent and
the latter is signer-independent. To determine the real performance of an SLR system,
we must measure it in a signer-independent
experiment.
To report the base sign accuracy, we assume
that a classification decision is correct if the classified sign and the correct sign are in the same
base sign cluster. The base sign accuracy is
important for the success of our sequential
fusion method based on sign clusters. The
overall accuracy reports the actual accuracy
of the classifier over all the signs in the dataset.
These accuracies are reported on two protocols: the signer-independent protocol and the
signer-dependent protocol.
Signer-independent protocol
In the signer-independent protocol, we
make up the test sets from performance instances of a group of subjects in the dataset, and
train the system with sample signs from the
rest of the signers. For this purpose, we apply
an eight-fold cross-validation, where each fold
test set consists of instances from one of the
signers and the training set consists of instances from the other signers. In each fold, there are 665 training instances and 95 test instances. Table 3 gives the results of the signer-independent protocol.

Table 3. Signer-independent test results (accuracy in percent).

| Classifier / accuracy type | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5 | Subject 6 | Subject 7 | Subject 8 | Average |
|---|---|---|---|---|---|---|---|---|---|
| HMM_M / base sign | 100.0 | 100.0 | 100.0 | 100.0 | 96.84 | 100.0 | 100.0 | 100.0 | 99.61 |
| HMM_M&N / base sign | 100.0 | 100.0 | 100.0 | 100.0 | 97.89 | 100.0 | 100.0 | 100.0 | 99.74 |
| HMM_M&N → HMM_N / base sign | 100.0 | 100.0 | 100.0 | 100.0 | 97.89 | 100.0 | 100.0 | 100.0 | 99.74 |
| HMM_M / overall | 66.32 | 77.90 | 60.00 | 71.58 | 57.90 | 81.05 | 52.63 | 70.53 | 67.24 |
| HMM_M&N / overall | 73.68 | 91.58 | 71.58 | 81.05 | 62.11 | 81.05 | 65.26 | 77.89 | 75.53 |
| HMM_M&N → HMM_N / overall, base sign clusters | 82.11 | 72.63 | 73.68 | 88.42 | 56.84 | 80.00 | 81.05 | 71.58 | 75.79 |
| HMM_M&N → HMM_N / overall, automatically identified clusters | 85.26 | 76.84 | 77.89 | 89.47 | 63.16 | 80.00 | 88.42 | 75.79 | 79.61 |
The base sign accuracies of each of the three
classifiers in each fold are 100 percent except
for the fifth signer, which is slightly lower.
This performance result shows that a sequential
classification strategy is appropriate, where specialized models are used in a second step to
handle any intracluster classification problem.
We can deduce the need for head-feature data
from the high increase in overall accuracy with
the contributions of nonmanual features. With
HMM_M&N, the accuracy increases to 75.5 percent, which contrasts with the accuracy of HMM_M at 67.2 percent. A further increase in
accuracy is obtained by using our sequential
fusion methodology with automatically defined
clusters. We report the accuracy of the sequential
fusion with the base sign clusters to show that
using those semantic clusters results in an accuracy 4 percent lower than automatic clustering.
Signer-dependent protocol
In the signer-dependent protocol, we put
examples from each subject in both of the test
and training sets, although they never share
the same sign instantiation. For this purpose,
we apply a five-fold cross-validation where at
each fold and for each sign we place into the
training set four out of the five repetitions of
each subject and the remaining one into the
test set. In each fold there are 608 training examples and 152 test examples. Table 4 shows the
results of the signer-dependent protocol.
The base sign accuracies of each of the three
classifiers are similar to the signer-independent
results, except the overall accuracies are much
higher. Additionally, the sequential fusion
technique does not contribute significantly
any longer, probably as a result of the ceiling
effect and the already high accuracies.
Table 4. Signer-dependent test results (accuracy in percent).

| Classifier / accuracy type | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Average |
|---|---|---|---|---|---|---|
| HMM_M / base sign | 99.34 | 100.0 | 98.68 | 100.0 | 100.0 | 99.61 |
| HMM_M&N / base sign | 100.0 | 100.0 | 98.68 | 100.0 | 100.0 | 99.74 |
| HMM_M&N → HMM_N / base sign | 100.0 | 100.0 | 98.68 | 100.0 | 100.0 | 99.74 |
| HMM_M / overall | 92.76 | 89.47 | 89.47 | 94.08 | 92.76 | 91.71 |
| HMM_M&N / overall | 92.76 | 95.39 | 93.42 | 96.05 | 94.08 | 94.34 |
| HMM_M&N → HMM_N / overall, base sign clusters | 92.76 | 95.39 | 92.76 | 95.39 | 94.08 | 94.08 |
| HMM_M&N → HMM_N / overall, automatically identified clusters | 92.76 | 95.39 | 92.76 | 96.05 | 94.08 | 94.21 |
Figure 5. Usability questionnaire results: each subject's (S1-S6) and the average score, on a scale of 1 (strongly disagree) to 5 (strongly agree), for computer-use level and for the statements "I would like to use SignTutor to assist sign language learning," "I found SignTutor easy to use," and "Most people could learn to use SignTutor quickly."
User study
We have conducted a user study to measure
SignTutor’s real-life performance and to assess
the overall system’s usability. Our subjects were
volunteer students, two males and four females,
taking an introductory Turkish Sign Language
course given at Boğaziçi University. Two of the
students (one male and one female) were from
the computer-engineering department and the
rest were from the foreign language education
department. All subjects were highly motivated
to take part in the experiment and were excited
about SignTutor when they were first told
about the properties of the system.
We performed the experiments in two sessions. In each session, we asked subjects to
practice three signs. We conducted the second
session after a time lapse of at least one week
following the first session. Before the first session, we presented the SignTutor interface to
the subjects in the form of a small demonstration. At the second session, we expected the
users to start using the SignTutor without any
introductory presentation. For each subject we
measured the time spent on each task, with
each task being defined as the learning, practicing, and evaluating of one of the three signs.
At the end of the experiment, we interviewed the subjects and asked them to fill out
a short questionnaire. We asked three questions
in the questionnaire and had the subjects score
their responses using five levels, from strongly
disagree (1) to strongly agree (5). As Figure 5
shows, the average score for all questions is
more than four, indicating the subjects’ favorable views regarding the system’s usability.
In session one, we asked the subjects to practice three signs: afraid, very afraid, and drink (as
a noun). We selected these three signs because
the first two are variants of the same base sign
performed by two hands; the third is a completely different sign performed by a single
hand. We measured the number of seconds for
each task; Figure 6a shows the results.
The subjects practiced each sign until they received positive feedback. The average number of attempts was around two. The average time was 85 seconds for the first task and decreased to 60 seconds for the second and third tasks.
These results show that the subjects were more
comfortable using the system after the first task.
In session two, we asked the subjects to practice another set of signs: clean, very clean, and
to drink. As in the first session, the first two
signs are variants of the same base sign performed by two hands; the third is a completely
different sign performed by a single hand. Out
of the six subjects who participated in the first
session, only five participated in the second.
In this session, the subjects started using the
system without any help or explanation about
usage. Figure 6b shows the results, which reflect
that the subjects recalled the system usage
without any difficulty and that the average
time spent completing a task decreased with
respect to the first session. The average number
of attempts was again around two for the first
two signs, similar to the first session, which
shows that the subjects performed a new sign
correctly after roughly two attempts. For the
third sign, all the subjects succeeded in their first attempt.

Figure 6. Task analysis for (a) session one and (b) session two. For each session, the number of trials per task and the average time spent on a task (in seconds) are plotted for each subject and on average.
During the interviews, the subjects made
positive comments about the system in general.
They found the interface user friendly and the
system easy to use. All of the subjects indicated
that using SignTutor during sign language
learning helped them to learn and recall the
signs. The subjects found the practice feature,
especially its ability to analyze both hand and
head gestures, important. They commented
that the possibility of watching the training
videos and their own performance together
with the animation made it easier to understand and perform the signs.
They noted, however, that it would be nice
to have more constructive feedback about
their own performance, explaining what is wrong and what should be corrected. One of the subjects found the decisions of SignTutor too strict, and noted that the parameters for determining a correct sign could be relaxed. The subjects had no difficulty in using the glove but commented that it would be better to use the system without any gloves. One of the subjects suggested adding a picture, possibly a 3D image, of the hand shape used during the sign, in addition to the sign video. In theory, this picture could help students understand the exact hand shape performed during signing.

During the experiments, we were able to observe the system performance in a real-life setting. An important system requirement is that the camera view should contain the user's upper body. In addition, the distance between the camera and the user should be at least one meter. We performed the experiments in an office environment without using any special lighting device. The hand segmentation requires a reasonably illuminated environment, and the user's clothes should be of a different color than that of the gloves. We found that when these conditions are met, the system works accurately and efficiently.

Conclusion

There are several avenues through which this proof-of-concept tutor system can be improved. Both signer-independent and signer-dependent performance results indicate that the invariance properties of the features must be further investigated. The analysis of nonmanual signals can be improved, and other nonmanual signals, such as facial expressions and body posture, can be incorporated into the system. We can't expect much aid from signer adaptation because students are novices,20 but we can help students in the use of SignTutor by providing them with sign videos from other teachers, text-based descriptions of the sign, pictures of the hand shapes used during the sign, and an interactive 3D animated avatar. The 3D avatar, an advanced version of the one used in this study, lets users watch the sign from different viewpoints. Finally, we are considering giving more instructive feedback to trainees by offering hints, in addition to the present error message, about the correct position of the hands with respect to the body and each other.
Acknowledgments
We developed most of this work in a joint
project during the Enterface 2006 Workshop
on Multimodal Interfaces, which is supported
by EU FP6 Network of Excellence SIMILAR, the
European Taskforce Creating Human-Machine Interfaces SIMILAR to Human-Human Communication. This work has also been supported
by Tubitak project 107E021 and Boğaziçi University project BAP-03S106.
References

1. S.K. Liddell, Grammar, Gesture, and Meaning in American Sign Language, Cambridge Univ. Press, 2003.
2. W.C. Stokoe, "Sign Language Structure: An Outline of the Visual Communication Systems of the American Deaf," J. Deaf Studies and Deaf Education, vol. 10, no. 1, 2005, pp. 3-37.
3. S.C.W. Ong and S. Ranganath, "Automatic Sign Language Analysis: A Survey and the Future Beyond Lexical Meaning," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 6, 2005, pp. 873-891.
4. F.K.H. Quek, "Toward a Vision-Based Hand Gesture Interface," Proc. Virtual Reality Software and Technology Conf., G. Singh, S.K. Feiner, and D. Thalmann, eds., World Scientific, 1994, pp. 17-31.
5. Y. Wu and T.S. Huang, "Hand Modeling, Analysis, and Recognition for Vision-Based Human Computer Interaction," IEEE Signal Processing Mag., vol. 18, no. 3, 2001, pp. 51-60.
6. T. Starner and A. Pentland, Realtime American Sign Language Recognition from Video Using Hidden Markov Models, tech. report, MIT Media Laboratory, 1996.
7. C. Vogler and D. Metaxas, "ASL Recognition Based on a Coupling between HMMs and 3D Motion Analysis," Proc. Int'l Conf. Computer Vision (ICCV), IEEE CS Press, 1998, p. 363.
8. G. Fang, W. Gao, and D. Zhao, "Large-Vocabulary Continuous Sign Language Recognition Based on Transition-Movement Models," IEEE Trans. Systems, Man, and Cybernetics, pt. A, vol. 37, no. 1, 2007, pp. 1-9.
9. K.W. Ming and S. Ranganath, "Representations for Facial Expressions," Proc. Int'l Conf. Control, Automation, Robotics, and Vision, vol. 2, IEEE Press, 2002, pp. 716-721.
10. U.M. Erdem and S. Sclaroff, "Automatic Detection of Relevant Head Gestures in American Sign Language Communication," Proc. Int'l Conf. Pattern Recognition, vol. 1, IEEE CS Press, 2002, pp. 460-463.
11. S. Jayaram et al., "Effect of Color Space Transformation, the Illuminance Component, and Color Modeling on Skin Detection," IEEE Conf. Computer Vision and Pattern Recognition (CVPR), IEEE CS Press, 2004, pp. 813-818.
12. O. Aran and L. Akarun, "Recognizing Two Handed Gestures with Generative, Discriminative and Ensemble Methods via Fisher Kernels," Proc. Int'l Workshop Multimedia Content Representation, Classification, and Security, LNCS 4105, Springer-Verlag, 2006, pp. 159-166.
13. I. Fasel et al., Machine Perception Toolbox, API Specification for MPT-1.0, UCSD Machine Perception Lab; http://mplab.ucsd.edu/grants/project1/free-software/MPTWebSite/API/.
14. W. Beaudot, The Neural Information Processing in the Vertebrate Retina: A Melting Pot of Ideas for Artificial Vision, doctoral dissertation, Dept. Computer Science, INPG (France), 1994.
15. A. Benoit and A. Caplier, "Head Nods Analysis: Interpretation of Non Verbal Communication Gestures," Proc. IEEE Int'l Conf. Image Processing (ICIP), vol. 3, IEEE Press, 2005, pp. 425-428.
16. A.B. Torralba and J. Herault, "An Efficient Neuromorphic Analog Network for Motion Estimation," IEEE Trans. Circuits and Systems-I, special issue on bio-inspired processors and CNNs for vision, vol. 46, no. 2, 1999, pp. 269-280.
17. L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, vol. 77, no. 2, 1989, pp. 257-285.
18. I.S. Pandzic and R. Forchheimer, MPEG-4 Facial Animation: The Standard, Implementation and Applications, Wiley, 2002.
19. K. Balci, "Xface: Open Source Toolkit for Creating 3D Faces of an Embodied Conversational Agent," Proc. 5th Int'l Symp. Smart Graphics, LNCS 3638, Springer-Verlag, 2005, pp. 263-266.
20. S.C. Ong, S. Ranganath, and Y. Venkatesha, "Understanding Gestures with Systematic Variations in Movement Dynamics," Pattern Recognition, vol. 39, no. 9, 2006, pp. 1633-1648.

Oya Aran is a PhD candidate at Boğaziçi University working on dynamic-hand-gesture and sign-language recognition. Her research interests include computer vision, pattern recognition, and machine learning. Aran has an MS in computer engineering from Boğaziçi University, Istanbul. She is a student member of the IEEE. Contact her at [email protected].

Ismail Ari is an MS student in computer engineering at Boğaziçi University working on facial-feature tracking and recognition for sign language. His research interests include computer vision, image processing, and computer graphics. Ari has a BS in computer engineering from Boğaziçi University. Contact him at [email protected].
Lale Akarun is a professor of computer engineering at Boğaziçi University. Her research interests include face recognition, modeling, animation, human activity, and gesture analysis. Akarun has a PhD in electrical engineering from Polytechnic University, New York. She is a senior member of the IEEE. Contact her at [email protected].

Bülent Sankur is a professor in the department of electric and electronic engineering at Boğaziçi University. His research interests include digital signal processing, image and video compression, biometry, cognition, and multimedia systems. Sankur has a PhD in electrical and systems engineering from Rensselaer Polytechnic Institute, New York. He is on the editorial boards of three journals on signal processing and a member of the European Association for Signal Processing's Administrative Committee. Contact him at [email protected].

Alexandre Benoit is a post-doctoral member of the Image Video Communication team at the Institut de Recherche en Communications et en Cybernétique de Nantes (IRCCYN), France, where he works on image quality perception. His research interests include visual perception and image processing, retina and cortex models, and their use in computer vision. Benoit has a PhD from the Images and Signals Department of the Grenoble Image, Parole, Signal, Automatique (GIPSA)-Lab, France. Contact him at [email protected].

Alice Caplier is a permanent researcher in the Image and Signal Department of the GIPSA-Lab. Her research interests include human gesture analysis and interpretation based on video data. Caplier has a PhD in image processing from the Institut National Polytechnique de Grenoble. Contact her at [email protected].

Pavel Campr is a PhD candidate and teaching assistant in the Department of Cybernetics, University of West Bohemia, Pilsen, Czech Republic. His research interests include sign language recognition and computer vision. Campr has an MS in cybernetics from the University of West Bohemia. Contact him at [email protected].

Ana Huerta Carrillo works as an engineer on mobile systems at Telefónica. Her research interests include computer graphics and animation. Huerta Carrillo has an MS in electrical engineering from the Speech Technology Group at the Technical University of Madrid. Contact her at [email protected].

François-Xavier Fanard is a PhD candidate in the communication and remote sensing laboratory (TELE) at the Université Catholique de Louvain, Louvain-la-Neuve, Belgium. His research interests include facial synthesis using the MPEG-4 facial animation standard and MPEG-4 Advanced Video Coding video compression. Fanard has a BS from the Université Catholique de Louvain. Contact him at [email protected].
