Continuous Gestural Interaction with Mobile Devices



Introduction
In many current mobile devices, such as mobile telephones and personal digital assistants (PDAs), interface designs and interaction techniques have been taken straight from standard desktop graphical interfaces, where screen space and other resources are not a problem. This has resulted in devices that are hard to use, with small text that is hard to read, cramped graphics and little contextual information. Sound and gesture are an important way of solving these problems [4]. Another problem is that users performing tasks whilst walking, running or driving cannot devote all of their visual attention to the mobile device [3]. Visual attention must remain with the main task for safety. It is therefore hard to design a visual interface that works well under these circumstances. An alternative sound and gestural interface would require less visual attention and would therefore potentially interfere less with the main activity in which the user is engaged.

Much of the interface work on wearable computers (more complex and fully equipped computers than PDAs) tends to focus again on visual displays, often presented through head-mounted displays [1]. These are often heavy and hard to use in bright daylight, and they occupy the users’ visual attention [5]. Our novel aim here is to create a system that uses little of our users’ visual attention and to see how effective such a system can be. Initial work has shown non-speech audio to be very effective in improving interaction on mobiles [6, 7]. It allows users to keep their visual attention on navigating the world around them while information is presented to their ears. Our aim is to develop this further.

The user will wear a pair of lightweight open-backed headphones (so as not to obscure the sounds of the real world) to hear the sounds, which will be spatialised in a plane around the user’s head. The user will hold a PDA, which will act as the screen of the wearable if information must be displayed visually. The PDA will be connected to the wearable via a cable or wireless network connection. There will also be an accelerometer on top of the PDA so that it can be used for pointing or gesturing. The user might also wear a tracker on a finger to allow further pointing or gesturing.

How would a system such as the one we are suggesting work? Whilst walking, a user might point towards an audio source indicating a menu by tilting the PDA. The user might select the audio source of an MP3/WAV file and enter the options of that audio menu in audio space. Sonification methods such as the Doppler effect and changes in volume, pitch and timbre help users to know their position in audio space and to select targets.
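
As an illustration of the tilt-to-point idea above, the following is a minimal sketch, not the project's implementation. It assumes a hypothetical axis convention (accelerometer x to the device's right, y forward, readings in g), four hypothetical audio sources placed in a plane around the head, and a simple linear volume cue. All function names and thresholds are assumptions made for the example.

```python
import math


def pointing_azimuth(ax, ay, min_tilt=0.15):
    """Azimuth in degrees (0 = forward, 90 = right) implied by tilting the device.

    ax, ay are accelerometer readings in g along the (assumed) right and
    forward axes. Returns None when the device is held close to level.
    """
    if math.hypot(ax, ay) < min_tilt:
        return None
    return math.degrees(math.atan2(ax, ay)) % 360.0


def angular_distance(a, b):
    """Smallest angle in degrees between two azimuths."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def select_source(azimuth, sources, tolerance=30.0):
    """Return the name of the nearest source within `tolerance` degrees, else None."""
    if azimuth is None:
        return None
    name = min(sources, key=lambda n: angular_distance(azimuth, sources[n]))
    if angular_distance(azimuth, sources[name]) <= tolerance:
        return name
    return None


def feedback_gain(azimuth, source_azimuth, width=45.0):
    """Volume cue: 1.0 when pointing at the source, falling linearly to 0.0."""
    return max(0.0, 1.0 - angular_distance(azimuth, source_azimuth) / width)


# Hypothetical layout of audio menu items around the user's head.
SOURCES = {"menu": 0.0, "music": 90.0, "mail": 180.0, "phone": 270.0}
```

With this layout, tilting the device to the right would select the "music" source, and the gain function could drive the volume change that tells the user how close the pointing direction is to a target.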
The other novel aspect of this proposal is to use gestures for input to the mobile device. Input is difficult on mobiles, as there is no space for a full keyboard and mouse. Many handheld devices use a stylus to write characters on a touch screen. This can be difficult to do when mobile, as the device and the stylus are both moving, making accurate positioning difficult. There has been little use of physical hand and body gestures as a solution to input on the move. Such gestures are advantageous because users do not need to look at a display to make them (as they must when clicking a button on a PDA with a stylus). Harrison et al. [2] showed that simple, natural gestures can be used in a range of different situations in mobile devices to simplify input (however, they never tested their use on the move). Hinckley et al. [3] created a system for a handheld computer that allowed a user to tilt the device for scrolling and rotate it for changing display orientation. They give some initial models for the gestures, and we can use these in our own algorithms. None of these systems combined the wide range of hand and body gestures we propose with tightly coupled audio feedback. We believe there are significant usability benefits to be gained from doing this.

Much work has gone into gesture recognition in static situations. For example, hand gestures are often used for control in virtual environments and in sign language recognition [8]; this is often done wearing an instrumented glove [12]. Recognition is also often done using video cameras [9]. Neither gloves nor camera-based systems are effective for the types of fully mobile applications in which we are interested. The image-processing approach has also had the disadvantage that much of the research effort has gone into the image processing, and not enough into how to model and recognise gestures, an area which is far from well understood. If we have a good modelling framework for gestures, this can be used with any sensing equipment. We therefore plan to use standard motion trackers from InterTrax and Polhemus, data gloves from Essential Reality and MEMS accelerometers from Memsic. These are not all usable in completely mobile settings, but we can track within a large enough space to allow users to move around freely (we will also use a GPS receiver with compass for calibration to help reduce sensor drift). We will investigate three basic types of gestures:

  • Head gestures: head movements such as nods or shakes will be used to make selections in the audio space.
  • Hand gestures: using a tracker attached to a finger/hand, users could make pointing gestures in the audio space to select items, move them around, etc.
  • Device gestures: a user might hold a device such as a handheld computer or phone with a tracker on it in his/her hand and use this to point at sound sources as before. It could also be used for simple writing in space in front of the user for basic text input (using a simple Graffiti-like language).

Figure: Examples of simple gestures for interacting with mobile phones by physically gesturing with the device.

There are many approaches to advanced gesture recognition, such as artificial neural networks, principal components analysis (PCA), Hidden Markov Models (HMMs) [10] and prototype trajectories [11]. There are currently no good solutions to gesture recognition on the move, and this project will make a strong novel contribution in this area. The approach will be to view the gesture (see the figure above) not as an observed image to be decoded, but as the result of a running dynamic system. This approach has been used in the modelling of cursive handwriting [12] and seems more likely to lead to insight and sustained development of theory and algorithms than the pattern recognition approach. This is expected to be especially true in gestural interaction with mobiles, where we have to understand and ignore the effect of disturbances on the measured gestures that come from movement of the user through the environment. The approach was inspired by Murray-Smith's previous work with helicopter aerodynamics [13, 14]. Learning the motion of an aircraft through space is closely related to the problem of characterising hand motion during a gesture: there is a trajectory through a state space including yaw, pitch and roll, with accelerations and velocities along the x, y and z axes. The models will be mostly data-driven rather than first-principles models of human neuromuscular behaviour and motor control. First-principles modelling would be too involved and not in keeping with the essentially software engineering and design aims of this research: we want to learn how to build better interactive systems, not learn more about the human gesture-generating process. This project will make strong use of the latest approaches to modelling complex non-linear systems. We plan to use recent developments in nonparametric statistical inference (Gaussian Process (GP) priors and Functional Data Analysis (FDA) [15]) to represent complex gestures.
FDA is a general framework that is especially promising for performing inference based on functional information from a number of correlated occurrences, which is identical to the gesture-modelling problem. It has also already been used specifically for dynamic handwriting analysis. Some related approaches based on mixtures of GPs, which have been developed in project GR/M76379 as models of paraplegic patients’ standing-up trajectories, will also be tested [16]. The adaptability provided by data-driven, nonparametric models has many advantages. We can:
  • Learn models for individual users, and track changes in their behaviour over time. This can also be used to identify individual users.
  • Learn models for different types of gestures (head, hand, etc.) in different contexts of use/disturbance (standing still, walking, etc.).
  • Recognise unintended motion, which can provide further information about context and could be used to control the level of interface complexity, e.g. simplifying the interface if the user is running.
More sophisticated gesture recognition provides flexibility, and is likely to lead to smaller body movements being needed, and thus to more social acceptance of the interaction approach.
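
To make the idea of a GP prior over a gesture trajectory more concrete, here is a minimal sketch of plain GP regression on a single channel (for example, roll angle as a function of time). It is an illustration only: the squared-exponential kernel, the hyperparameter values and all function names are assumptions for the example, and it does not implement the GP mixtures or the FDA models mentioned above. It uses only the Python standard library.

```python
import math

LENGTH = 0.2     # assumed kernel length-scale (seconds)
VARIANCE = 1.0   # assumed prior signal variance


def rbf(t1, t2):
    """Squared-exponential covariance between two time points."""
    return VARIANCE * math.exp(-0.5 * ((t1 - t2) / LENGTH) ** 2)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def forward_sub(low, b):
    """Solve L z = b for lower-triangular L."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    return z


def back_sub(low, z):
    """Solve L^T x = z for lower-triangular L."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(low[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (z[i] - s) / low[i][i]
    return x


def gp_fit(times, values, noise=0.05):
    """Condition the GP on one observed gesture (sample times and sensor values)."""
    k = [[rbf(a, b) + (noise ** 2 if i == j else 0.0)
          for j, b in enumerate(times)] for i, a in enumerate(times)]
    low = cholesky(k)
    alpha = back_sub(low, forward_sub(low, values))
    return times, low, alpha


def gp_predict(model, t_star):
    """Posterior mean and variance of the trajectory at time t_star."""
    times, low, alpha = model
    k_star = [rbf(t_star, t) for t in times]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = forward_sub(low, k_star)
    var = rbf(t_star, t_star) - sum(vi * vi for vi in v)
    return mean, max(var, 0.0)
```

The predictive variance is the useful by-product here: it is small close to the observed samples and returns to the prior variance far from them, which is the kind of uncertainty information a feedback mechanism could draw on.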
One reason for the lack of use of gesture recognition systems in the past is that they were not reliable enough. We believe that improved recognition software will help, but that a major breakthrough will be achieved by coupling this with improved feedback. If the user immediately and in a natural manner realises that the gesture has been misunderstood, then regenerating the gesture has a low cost. The key question is how to generate the natural feedback, given the problems of visual display discussed earlier. Our initial work with audio feedback on gestures drawn on the screen of a mobile device to control a music player was very effective when users were on the move [17]. Hermann’s [18] Principal Curve Sonification is similar in ethos, although based on static assumptions, rather than the dynamic models used here. We will look at two novel approaches to the feedback issue:
  • Dynamic systems models allow natural transformations of the state information (e.g. use of the slowly varying parameters of a second-order linear local approximation to the modelled system). This could provide the basis for auditory feedback, feedback on the dynamics of the gesture itself, so that experienced users would hear if a gesture ‘didn't sound right’ and could repeat or correct the gesture immediately. It can also be used to aid learning of the gestures.
  • Probability models can also give immediate feedback on dissonance between possible interpretations of a gesture, before that gesture is complete (e.g. different symbols are allocated different frequencies and the amplitude of each frequency is proportional to the probability of that symbol, conditioned on the gesture data seen so far). We can therefore give feedback on whether a gesture was recognised, and if so, which one, and how uncertain it is. We plan to do this using Markov-Chain Monte Carlo sampling coupled with granular synthesis mechanisms for audio generation.
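
The second approach above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed system: it swaps in a closed-form Gaussian likelihood around hypothetical one-dimensional prototype trajectories instead of Markov chain Monte Carlo sampling, and plain sinusoids instead of granular synthesis. The prototypes, frequencies, noise level and function names are all invented for the example.

```python
import math


def uniform_prior(prototypes):
    """Equal log-probability for every symbol before any data is seen."""
    return {s: -math.log(len(prototypes)) for s in prototypes}


def update_posterior(log_post, observation, prototypes, step, sigma=0.3):
    """One Bayes update: fold the sample observed at `step` into the posterior.

    Assumes Gaussian noise of standard deviation `sigma` around each prototype.
    """
    new = {}
    for symbol, proto in prototypes.items():
        err = observation - proto[step]
        new[symbol] = log_post[symbol] - 0.5 * (err / sigma) ** 2
    # Normalise in log space for numerical stability.
    peak = max(new.values())
    norm = peak + math.log(sum(math.exp(v - peak) for v in new.values()))
    return {s: v - norm for s, v in new.items()}


def amplitudes(log_post):
    """Amplitude of each symbol's tone equals its posterior probability."""
    return {s: math.exp(v) for s, v in log_post.items()}


def mix_sample(amps, freqs, t):
    """One audio sample at time t (seconds): a sum of probability-weighted sinusoids."""
    return sum(amps[s] * math.sin(2 * math.pi * freqs[s] * t) for s in amps)


# Hypothetical prototype trajectories (e.g. normalised tilt over four time steps)
# and the tone frequency (Hz) allocated to each symbol.
PROTOTYPES = {
    "left": [0.0, -0.5, -1.0, -1.0],
    "right": [0.0, 0.5, 1.0, 1.0],
    "circle": [0.0, 0.5, 0.0, -0.5],
}
FREQS = {"left": 330.0, "right": 440.0, "circle": 550.0}
```

Feeding in a partial gesture one sample at a time shows the intended behaviour: while two prototypes still agree with the data, their tones sound at equal amplitude, and as soon as the trajectories diverge one tone dominates.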
The methods proposed in this project for deriving sound from gestural interaction are based on fundamental statistical and dynamic features of the mathematical models of gestures and classification, which are then transformed and mapped onto audio display. We believe this makes this proposal unique in terms of the new, interdisciplinary approach to the problem. The most recent publication on this type of work is Williamson & Murray-Smith (2002) [19]. In it, Williamson & Murray-Smith use granular synthesis approaches to link a range of probabilistic models which map observable variables to belief states. This provides an extremely flexible mapping from scientific and engineering models to audio display, and generates audio feedback with much more interesting texture and responsiveness.
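
The first feedback approach listed above refers to the slowly varying parameters of a second-order linear local approximation. One way such parameters might be extracted from a short window of sensor data is sketched below. It assumes a discrete-time model x[t] = a1*x[t-1] + a2*x[t-2] fitted by least squares, and a purely hypothetical mapping from the fitted oscillation frequency to audio pitch. None of the names or constants come from the project itself.

```python
import math


def fit_second_order(x):
    """Least-squares fit of x[t] = a1*x[t-1] + a2*x[t-2] over one window."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for t in range(2, len(x)):
        p1, p2 = x[t - 1], x[t - 2]
        s11 += p1 * p1
        s12 += p1 * p2
        s22 += p2 * p2
        b1 += p1 * x[t]
        b2 += p2 * x[t]
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("window is too short or too flat to fit")
    a1 = (b1 * s22 - b2 * s12) / det
    a2 = (s11 * b2 - s12 * b1) / det
    return a1, a2


def oscillation_features(a1, a2, dt):
    """Return (frequency in Hz, decay rate per second) of the fitted mode.

    Returns None when the fitted model has real roots (no oscillation).
    """
    if a1 * a1 + 4.0 * a2 >= 0.0:
        return None
    r = math.sqrt(-a2)                                   # modulus of the complex roots
    theta = math.acos(max(-1.0, min(1.0, a1 / (2.0 * r))))  # angle per sample
    return theta / (2.0 * math.pi * dt), -math.log(r) / dt


def to_pitch(frequency_hz, base_hz=220.0, octaves_per_hz=0.5):
    """Hypothetical mapping from gesture frequency to an audible pitch."""
    return base_hz * 2.0 ** (octaves_per_hz * frequency_hz)
```

Because the frequency and decay rate change slowly compared with the raw signal, they are plausible candidates for driving continuous audio parameters, so that an unusually fast or poorly damped gesture would sound different from a well-formed one.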

Acknowledgments

This project is supported by SFI BRG project Continuous Gestural Interaction with Mobile devices, Science Foundation Ireland grant 00/PI.1/C067, the Multi-Agent Control Research Training Network - EC TMR grant HPRN-CT-1999-00107, and EPSRC grant Audioclouds: three-dimensional auditory and gestural interfaces for mobile and wearable computers GR/R98105/01.

References

[1] Barfield, W. and Caudell, T., Eds. Fundamentals of wearable computers and augmented reality. Lawrence Erlbaum Associates, Mahwah, New Jersey, 2001.
[2] Harrison, B.L., Fishkin, K.P., Gujar, A., Mochon, C. and Want, R. Squeeze me, hold me, tilt me! An exploration of manipulative user interfaces. In Proceedings of ACM CHI'98 (Los Angeles, CA), ACM Press, Addison-Wesley, 1998, pp. 17-24.
[3] Hinckley, K., Pierce, J., Sinclair, M. and Horvitz, E. Sensing techniques for mobile interaction. In Proceedings of ACM UIST 2000, ACM Press, 2000, pp. 91-100.
[4] Hindus, D., Arons, B., Stifelman, L., Gaver, W., Mynatt, E. and Back, M. Designing auditory interactions for PDAs. In Proceedings of ACM UIST'95, ACM Press, 1995, pp. 143-146.
[5] Geelhoed, E., Falahee, M. and Latham, K. Safety and comfort of eyeglass displays. In Handheld and Ubiquitous Computing, Thomas, P. and Gellersen, H.W. (Eds.), Springer, Berlin, 2000, pp. 236-247.
[6] Pirhonen, A., Brewster, S.A. and Holguin, C. Gestural and Audio Metaphors as a Means of Control for Mobile Devices. Accepted for publication at ACM CHI 2002 (Minneapolis, MN), ACM Press, Addison Wesley, 2002.
[7] Sawhney, N. and Schmandt, C. Nomadic Radio: speech and audio interaction for contextual messaging in nomadic environments. ACM Transactions on Computer-Human Interaction 7, 3 (2000), 353-383.
[8] Braffort, A. A gesture recognition architecture for sign language. In Proceedings of ACM ASSETS'96 (Vancouver, Canada), ACM Press, 1996, pp. 102-109.
[9] Segen, J. and Kumar, S. Gesture VR: vision-based 3D hand interace for spatial interaction. In Proceedings of ACM Multimedia'98 (Bristol, UK), ACM Press, 1998, pp. 455-464.
[10] Bregler, C., Omohundro, S., Covell, M., Slaney, M., Ahmad, S., Forsyth, D. and Feldman, J. Probabilistic Models of Verbal and Body Gestures. In Computer Vision in Man-Machine Interfaces, Cipolla, R. and Pentland, A. (Eds.), Cambridge University Press, Cambridge, 1996.
[11] Wilson, A. and Bobick, A. Using configuration states for the representation and recognition of gesture. MIT Media Lab, 1995, Technical Report 308.
[12] Singer, Y. and Tishby, N. Dynamical encoding of cursive handwriting. 1994.
[13] Murray-Smith, R. Modelling human control behaviour with a Markov-chain switched bank of control laws. In Proceedings of the IFAC Symposium on Man-Machine Systems (Kyoto, Japan), 1998.
[14] Murray-Smith, R., Johansen, T.A. and Murray-Smith, D.J. Modelling Human Control Behaviour and Cooperative Control Systems. Daimler-Benz Research, 1996, Daimler-Benz Technical Report.
[15] Ramsay, J.O. and Silverman, B.W. Functional Data Analysis. Springer, 1997.
[16] Shi, J.Q., Murray-Smith, R. and Titterington, D.M. Hierarchical Gaussian Process Mixtures for Regression. In Proceedings of 5th ICSA International Conference (Hong Kong), 2001.
[17] Pirhonen, A., Brewster, S.A. and Holguin, C. Gestural and Audio Metaphors as a Means of Control for Mobile Devices. Accepted for publication at ACM CHI 2002 (Minneapolis, MN), ACM Press, Addison Wesley, 2002.
[18] Hermann, T., Meinicke, P. and Ritter, H. Principal Curve Sonification. In Proceedings of ICAD 2000 (Atlanta, GA), ICAD, 2000.
[19] Williamson, J. and Murray-Smith, R. Audio feedback for gesture recognition. DCS Technical Report TR-2002-127, 2002.


