Recent Posts

Summer Recap

2 minute read

During the summer it became increasingly clear that I had yet again underestimated this project. While I am more knowledgeable about the subject this year than last, there have been setbacks where I did not quite expect them.

The curse of Timm and Barth

A search at the time of writing the project proposal turned up heaps of interesting material on GitHub and YouTube. I figured I would have a wide and diverse solution space to learn from, but going through it all this summer, I realized that most of it fell into one of the following two categories:

  • Solutions that claimed to do one thing but in reality did another, in many cases explicitly leaving the important part for later:

    TODO Pupil tracking. Using the midpoint between eye corners for now.
    

    In other cases, like Apple's new Vision API, data points for the pupils are returned as if they were part of the landmark detection, when they are actually just derived from other detected features (so not useful in this context).

  • Rewrites of Tristan Hume’s implementation of Timm and Barth abound. I didn’t see it at first because the variable names were different or they were written in MATLAB, Python or Java instead of C++. Now, this is not a bad solution in itself, but as implemented it has quadratic time complexity in the number of pixels, so it is slow unless one really narrows down the area to apply it to (which might not be as easy as it sounds); a sketch of the core computation follows below.
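
For reference, here is roughly what that core computation looks like: a sketch based on my reading of the approach, with my own function and variable names rather than code from any particular repository. The four nested loops over candidate centers and gradient locations are what make it quadratic in the number of pixels:

    #include <algorithm>
    #include <cmath>
    #include <opencv2/opencv.hpp>

    // Sketch of the Timm & Barth objective: the pupil center is the point that
    // maximizes the sum of squared dot products between normalized displacement
    // vectors and normalized image gradients. Names are illustrative only.
    cv::Point findEyeCenter(const cv::Mat& eyeGray) {
        cv::Mat gradX, gradY;
        cv::Sobel(eyeGray, gradX, CV_64F, 1, 0, 3);  // horizontal gradients
        cv::Sobel(eyeGray, gradY, CV_64F, 0, 1, 3);  // vertical gradients

        cv::Mat score = cv::Mat::zeros(eyeGray.size(), CV_64F);
        // Outer loops: every candidate center. Inner loops: every gradient
        // location. Together O(N^2) in the number of pixels N, which is why
        // the eye region must be kept small for this to run in real time.
        for (int cy = 0; cy < eyeGray.rows; ++cy) {
            for (int cx = 0; cx < eyeGray.cols; ++cx) {
                double sum = 0.0;
                for (int y = 0; y < eyeGray.rows; ++y) {
                    for (int x = 0; x < eyeGray.cols; ++x) {
                        double gx = gradX.at<double>(y, x);
                        double gy = gradY.at<double>(y, x);
                        double gMag = std::sqrt(gx * gx + gy * gy);
                        if (gMag < 1e-3) continue;  // skip flat areas
                        double dx = x - cx, dy = y - cy;
                        double dMag = std::sqrt(dx * dx + dy * dy);
                        if (dMag < 1e-3) continue;  // skip the candidate itself
                        double dot = (dx * gx + dy * gy) / (dMag * gMag);
                        sum += std::max(0.0, dot) * std::max(0.0, dot);
                    }
                }
                score.at<double>(cy, cx) = sum;
            }
        }
        cv::Point best;
        cv::minMaxLoc(score, nullptr, nullptr, nullptr, &best);
        return best;  // candidate center with the highest score
    }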

BioID Face Database

There are a few standard datasets that are widely used for comparing the performance of an algorithm described in a paper against other algorithms for the same problem, e.g. BioID and Gi4E (the latter of which I have not looked into yet, as it is not publicly available).

The BioID database has 1521 grayscale photos with a resolution of 384 x 286 pixels, such as the examples below:

Source: BioID face 893.
Source: BioID face 1170.

For some reason I had assumed that algorithms were designed on some dataset but also generalized beyond the reference datasets, when in reality they are optimized for, say, BioID and could generalize poorly to other data. At the very least, one would have to adjust parameters like kernel and window sizes for images of higher resolution.

What is less obvious is that BioID almost exclusively contains front-facing faces with pupils centered, which means that good results on BioID are no guarantee of localizing pupils in images where the person is looking to the side. That is, people might do pupil localization for purposes other than what I have in mind (which is obvious in hindsight).

Since I want to make as few assumptions as I can about the distance to the camera and the alignment of the face, coming up with something that is invariant to, say, scale seems trickier than I first thought.

Kalman filters to the rescue (?)

1 minute read

With the Pursuits method, only relative pupil movements are needed, rather than everything required for full gaze tracking. At first glance this seems a lot simpler, but relative to what?

As an example, if a person is nodding while looking straight ahead, how do we distinguish that from the person looking up and down with a fixed head, if we are only looking at the eye region? We can't, at least not without complicating things. It's not really a problem unless the person keeps changing head alignment (which, on the other hand, might very well be the case with infants; I don't know yet).

So, assuming the user is not moving their head, we still need some feature to relate pupil movements to. A common choice is tracking the eye corners, but then we have three points per eye to find and track accurately. Are there perhaps other choices that can provide stable reference points?

One idea is to build on the eye regions detected by OpenCV. Only one problem: they are not very stable, as can be seen in the video of the previous post.

Now, I had heard about Kalman filters before but did not really know what they were and had never used them. It looked like they could be really useful here, so I experimented and finally managed to implement filters for the face and eye rectangles (position as well as width and height). Afterwards, the results look a lot better:

Again, the difficulty here is finding a good set of parameters, weighing responsiveness to actual movements against noise invariance (see video below).

Kalman filtering also comes with the benefit of handling spurious missed detections. The question is whether this is good enough to provide a reference point for us, or if eye corner detection or similar is needed.
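
For reference, a minimal sketch of how one such per-rectangle filter can be set up with OpenCV's cv::KalmanFilter. The noise covariances below are placeholder values; tuning them is exactly the responsiveness-versus-noise trade-off mentioned above.

    #include <opencv2/opencv.hpp>

    // Sketch of a Kalman filter smoothing a detected rectangle (x, y, w, h).
    // The state also carries velocities so the filter can follow real movement.
    // The noise covariances are placeholder values, not tuned ones.
    cv::KalmanFilter makeRectFilter(const cv::Rect& initial) {
        cv::KalmanFilter kf(8, 4, 0, CV_32F);  // 8 state variables, 4 measured

        // Constant-velocity model: position/size += velocity each frame.
        kf.transitionMatrix = cv::Mat::eye(8, 8, CV_32F);
        for (int i = 0; i < 4; ++i)
            kf.transitionMatrix.at<float>(i, i + 4) = 1.0f;

        // Only x, y, w, h are measured directly.
        kf.measurementMatrix = cv::Mat::zeros(4, 8, CV_32F);
        for (int i = 0; i < 4; ++i)
            kf.measurementMatrix.at<float>(i, i) = 1.0f;

        // Small process noise = smooth but sluggish; large = responsive but jittery.
        cv::setIdentity(kf.processNoiseCov, cv::Scalar::all(1e-3));
        // Measurement noise reflects how much the raw detections jump around.
        cv::setIdentity(kf.measurementNoiseCov, cv::Scalar::all(1e-1));
        cv::setIdentity(kf.errorCovPost, cv::Scalar::all(1.0));

        kf.statePost = (cv::Mat_<float>(8, 1) << initial.x, initial.y,
                        initial.width, initial.height, 0, 0, 0, 0);
        return kf;
    }

    // Per frame: predict, then correct with the latest detection. On a frame
    // with no detection, one can use the prediction alone and skip correct().
    cv::Rect smoothRect(cv::KalmanFilter& kf, const cv::Rect& detected) {
        kf.predict();
        cv::Mat measurement = (cv::Mat_<float>(4, 1) << detected.x, detected.y,
                               detected.width, detected.height);
        cv::Mat state = kf.correct(measurement);
        return cv::Rect(cvRound(state.at<float>(0)), cvRound(state.at<float>(1)),
                        cvRound(state.at<float>(2)), cvRound(state.at<float>(3)));
    }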

Using OpenCV to add face and eye detection to the app

1 minute read

Since I have used OpenCV before, I will start with that, even though it looks like other options could be better. While it is not trivial to integrate with a Swift iOS app, it is not that difficult either. OpenCV is C++, and C++ cannot be called directly from Swift, but it can from Objective-C, so Objective-C is used as a bridge.

OpenCV has a number of pretrained Haar cascade models suitable for face and eye detection that can be used directly. After testing a lot of different things, I settled on using the haarcascade_frontalface_alt2 model for detecting faces and then, within the detected area, using individual detectors for each eye (haarcascade_lefteye_2splits and haarcascade_righteye_2splits). These detectors seem to be faster but tend to fail more often than the more general models, so if detection fails, I use a fallback model (haarcascade_eye_tree_eyeglasses; not that infants wear glasses, it is just practical for testing on myself). The overall logic is sketched below.
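
Roughly, the detection chain looks like this sketch (illustrative file paths and structure, not the exact app code):

    #include <opencv2/opencv.hpp>
    #include <string>
    #include <vector>

    // Sketch of the detection chain: face first, then one detector per eye
    // inside the face rectangle, with a more general cascade as fallback.
    struct FaceEyeDetector {
        cv::CascadeClassifier face, leftEye, rightEye, fallbackEye;

        bool load(const std::string& dir) {
            return face.load(dir + "/haarcascade_frontalface_alt2.xml") &&
                   leftEye.load(dir + "/haarcascade_lefteye_2splits.xml") &&
                   rightEye.load(dir + "/haarcascade_righteye_2splits.xml") &&
                   fallbackEye.load(dir + "/haarcascade_eye_tree_eyeglasses.xml");
        }

        // Use the eye-specific cascade, falling back to the more general
        // (glasses-tolerant) one if it finds nothing.
        std::vector<cv::Rect> detectEye(cv::CascadeClassifier& eye,
                                        const cv::Mat& faceROI) {
            std::vector<cv::Rect> hits;
            eye.detectMultiScale(faceROI, hits);
            if (hits.empty())
                fallbackEye.detectMultiScale(faceROI, hits);
            return hits;
        }

        void detect(const cv::Mat& gray) {
            std::vector<cv::Rect> faces;
            face.detectMultiScale(gray, faces);
            for (const cv::Rect& f : faces) {
                cv::Mat roi = gray(f);  // search for eyes only inside the face
                std::vector<cv::Rect> left = detectEye(leftEye, roi);
                std::vector<cv::Rect> right = detectEye(rightEye, roi);
                // ... hand the best left/right rectangles on to pupil localization
            }
        }
    };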

As can be seen, face and eye detection is pretty good, but a slightly different area is found every frame, so there is a lot of jitter. Also, the eye regions found seem unnecessarily large vertically, although I've noticed that some pupil localization algorithms use the eyebrows, so maybe they are not too large after all.

The most difficult part here is setting good cascade parameters. First, detection can be slow (especially eye detection), so limiting the size range of the detected features, particularly at the lower end, should help a lot. However, it is not obvious from what distance the app will be used. Perhaps the maximum distance (minimum feature size) will be limited by pupil detection accuracy?

Second, there are some parameters that essentially give better accuracy but fewer detections (see the sketch below). They also seem correlated with detection speed, so I am not sure it is worth it, especially since it does not seem to solve the issue of the detected area being unstable.
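
For context, these are the detectMultiScale parameters in question; the concrete values below are placeholders for illustration, not settings I have settled on.

    #include <opencv2/opencv.hpp>
    #include <vector>

    // The detectMultiScale arguments are where the speed/accuracy trade-offs live.
    std::vector<cv::Rect> detectFaces(cv::CascadeClassifier& cascade,
                                      const cv::Mat& gray) {
        std::vector<cv::Rect> faces;
        cascade.detectMultiScale(
            gray, faces,
            1.1,                 // scaleFactor: smaller steps search more scales, slower
            3,                   // minNeighbors: higher gives fewer but more reliable hits
            0,                   // flags (unused by new-style cascades)
            cv::Size(120, 120),  // minSize: skipping small scales speeds things up a lot
            cv::Size());         // maxSize: left unbounded since the distance is unknown
        return faces;
    }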

App with camera and a little lag

less than 1 minute read

I’ve set up an Xcode project for an iPad app, coded in Swift, that takes video input from the front camera and displays the individual frames on screen.

As can be seen in the video above, there is a little lag that I do not (yet) know the source of. There are a few potential experiments one could do here:

  1. Is this display lag, image-processing lag (at the moment, only conversions between sensor data and displayable images), or a combination?
  2. How much lag between eye and on-screen movement can the statistical correlation handle before the two drift apart and cause problems?

Regarding (1), since the front camera cannot provide more than 30 fps, there will be some lag, but this looks like more than that. I have double-checked that the profiler reports the app updating the image view at 30 fps, so that part seems to work as it should.

Regarding (2), eye movements should be small and slow compared to me waving my hand in front of the camera. My initial guess is that this lag is not significant enough to pursue, at least not at this point.

General algorithm for pupil tracking

1 minute read

Given an image as input, the traditional sequence is to detect faces, resulting in a bounding area for each face (1). Given this bound and general anatomical proportions, detect the areas around the eyes (2). Within each eye area, find reference points on the palpebral fissure (3) and localize the pupil at the center of the iris (4).

Illustration of pupil detection
Face detection (1), eye detection (2), palpebral fissure reference points (3), and pupil localization (4). Personal photograph/illustration by author. May 27, 2018.

To estimate the direction of the gaze, the algorithm must approximate the configuration and position of the head in 3D space, as well as the position of the pupil relative to the palpebral fissure.

Tracking the gaze then requires robustly and very precisely handling changes in all the above variables (as well as others, such as lighting) over time. As it sounds, this is a difficult problem, which is why most applications where gaze direction matters use special hardware, such as head-mounted infrared gaze trackers calibrated for each person using them.

Pursuits, instead of using the direction of gaze to deduce which object is being looked at, identifies the object by the statistical correlation between the movement of the pupils and the movement of on-screen objects. This way, the configuration and position of the head are no longer needed, and individual differences in the dimensions of the palpebral fissure can be ignored, making for a much simpler algorithm.
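
As a sketch of the idea (my own simplification, not code from the Pursuits authors), the core is a windowed correlation between the pupil trajectory and each object's trajectory; the 0.8 threshold is a made-up number and only the x coordinate is used here:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Pearson correlation between two equally long position sequences.
    double pearson(const std::vector<double>& a, const std::vector<double>& b) {
        const std::size_t n = a.size();
        double meanA = 0, meanB = 0;
        for (std::size_t i = 0; i < n; ++i) { meanA += a[i]; meanB += b[i]; }
        meanA /= n; meanB /= n;
        double cov = 0, varA = 0, varB = 0;
        for (std::size_t i = 0; i < n; ++i) {
            cov  += (a[i] - meanA) * (b[i] - meanB);
            varA += (a[i] - meanA) * (a[i] - meanA);
            varB += (b[i] - meanB) * (b[i] - meanB);
        }
        if (varA == 0 || varB == 0) return 0;  // no movement, no correlation
        return cov / std::sqrt(varA * varB);
    }

    // Given recent pupil positions and the corresponding positions of each
    // on-screen object over the same window, return the index of the most
    // correlated object, or -1 if nothing correlates strongly enough.
    int matchObject(const std::vector<double>& pupilX,
                    const std::vector<std::vector<double>>& objectX,
                    double threshold = 0.8) {
        int best = -1;
        double bestCorr = threshold;
        for (std::size_t i = 0; i < objectX.size(); ++i) {
            double c = pearson(pupilX, objectX[i]);
            if (c > bestCorr) { bestCorr = c; best = static_cast<int>(i); }
        }
        return best;
    }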

Modern approach

Since deep neural networks became feasible to use in real-time applications, parts of the traditional approach are no longer needed, as facial landmarks can be found directly. With that said, since face detection is both robust and fast, it is often useful to limit facial landmark detection to the areas where a face has been found. Similarly, for pupil and palpebral fissure localization, it is useful to limit the search to the smaller areas around the eyes.
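
As an illustration of what that could look like, here is a sketch using the opencv_contrib face module (an assumption on my part; any landmark detector would slot in the same way), with hard-coded file names as placeholders:

    #include <opencv2/face.hpp>
    #include <opencv2/opencv.hpp>
    #include <vector>

    // Face detection narrows the search, then a pretrained landmark model
    // finds the facial landmarks, including the points outlining the eyes.
    int main() {
        cv::CascadeClassifier faceCascade("haarcascade_frontalface_alt2.xml");
        cv::Ptr<cv::face::Facemark> facemark = cv::face::FacemarkLBF::create();
        facemark->loadModel("lbfmodel.yaml");  // pretrained 68-point LBF model

        cv::Mat gray = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);  // placeholder input
        std::vector<cv::Rect> faces;
        faceCascade.detectMultiScale(gray, faces);

        std::vector<std::vector<cv::Point2f>> landmarks;
        if (facemark->fit(gray, faces, landmarks)) {
            // Assuming the usual 68-point annotation, indices 36-41 outline one
            // eye and 42-47 the other, so eye regions (and their corners) fall
            // out of the landmarks directly.
            for (const auto& pts : landmarks) {
                cv::Rect eyeA = cv::boundingRect(
                    std::vector<cv::Point2f>(pts.begin() + 36, pts.begin() + 42));
                cv::Rect eyeB = cv::boundingRect(
                    std::vector<cv::Point2f>(pts.begin() + 42, pts.begin() + 48));
                // ... pass the eye regions on to pupil localization
            }
        }
        return 0;
    }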