Understanding Seervision's Limitations

A primer on what Seervision can, and can't do

We often get questions from customers and prospective users about which use-cases work well with Seervision, and about some of its unexpected failure modes. In this article, we walk you through how Seervision works behind the scenes, and what that means for your use case.

Seervision's Computer Vision

When we started out with Seervision, our goal was to make the system as flexible as possible. We wanted to avoid intrusive hardware where reasonable, while retaining the flexibility of a human operator. Computer Vision is a great way to achieve that goal. Our Computer Vision only needs a single cue: does the subject look human? Regardless of the angle, and whether or not we see a face, if the subject of interest “looks” human, we can track them.

With Seervision, provided there is enough visual information available (see below for what that means), we can track any VIP that looks like a human. We don’t require seeing the face (we can track a VIP who has their back turned to us, or who is at an angle), nor do we require seeing the entire body.

We can’t track arbitrary objects. From a scientific point of view, Computer Vision is flexible enough to track almost anything you want it to. However, Seervision’s Computer Vision networks have been trained to look only for humans, which makes them faster and more specialised at that task.

How is the Computer Vision implemented? Without going into too much gory technical detail, here is a rough overview:

  1. We take the live video signal and convert it down to a lower resolution and framerate, so that we can fit more images per second into the GPU (where our analysis runs).
  2. We analyse the image frame and detect any humans that may be present.
  3. If we’re tracking, we locate the VIP and compare their position to the reference point the user has set.
  4. We send commands to the PTU to update its position if necessary.
  5. Repeat from step 1. This happens multiple times per second.
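
To make this loop a little more tangible, below is a minimal, illustrative sketch of one detect-compare-command iteration. It is not Seervision’s actual code: every name in it (Detection, detect_humans, ptu.send_command) and every constant is a hypothetical placeholder.

```python
# Illustrative sketch of the tracking loop described above -- not
# Seervision's actual code. All names (Detection, detect_humans,
# ptu.send_command) and constants are hypothetical placeholders.

from dataclasses import dataclass

ANALYSIS_SIZE = (640, 360)   # step 1: frames are downscaled before analysis
DEADBAND_PX = 10             # offsets smaller than this are ignored


@dataclass
class Detection:
    center_x: float
    center_y: float
    height_px: float         # on-screen height of the person


def detect_humans(frame) -> list[Detection]:
    """Placeholder for the neural-network person detector (step 2)."""
    return []                # a real detector returns one entry per visible person


def track_one_iteration(frame, reference_point, ptu) -> None:
    """Steps 2-4: detect people, compare the VIP to the reference, command the PTU."""
    detections = detect_humans(frame)
    if not detections:
        return                                     # nobody visible, nothing to do

    vip = detections[0]                            # placeholder VIP selection
    offset_x = vip.center_x - reference_point[0]   # step 3: drift from the reference point
    offset_y = vip.center_y - reference_point[1]

    # Step 4: only move the PTU if the subject has drifted noticeably.
    if abs(offset_x) > DEADBAND_PX or abs(offset_y) > DEADBAND_PX:
        ptu.send_command(pan=offset_x, tilt=offset_y)
```

The essential idea is the feedback loop: each iteration works on a downscaled frame, measures how far the VIP has drifted from the reference point, and only moves the PTU when that drift exceeds a small deadband.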

Where Computer Vision struggles

Of course, Computer Vision isn’t a silver bullet for tracking (not yet, anyway!), and there are a few limitations that are important to understand.

  • Seervision does not “recognise” or “remember” talents by default. Customers often ask whether talents can be remembered, in case they walk out of the video feed and back into it. For GDPR reasons, Seervision does not do this by default. You can enable the ‘person specific automation’ (beta) option, whereby you store a person’s likeness or simply upload a photo. In doing so, we can remember talents and always associate the same ID with a specific talent. This way you can track only specific people, or enable person-specific workflows.
  • So how do we distinguish talents? Seervision uses aggregate cues – think of facial features, hair color, person size, the color of their clothes, etc. These cues help us distinguish talents from one another. This also means that if all talents wear the same uniform, distinguishing them becomes harder (the first sketch after this list illustrates the idea).
  • Seervision is not suitable for extremely wide shots. In these shots, the talent to be tracked will often be smaller than approximately 1/3rd of the frame in height. That means our Computer Vision has very few pixels, or very little ‘visually identifying information’, to go on. In other words, when a person is that small in the frame, there isn’t enough visual information available for us to differentiate between people robustly or to track them consistently and smoothly (the second sketch after this list shows this as simple pixel arithmetic).
  • Seervision detects up to a maximum of 8 people in the frame. Strictly speaking, we detect every human-like object in the frame, but you’ll notice that from approximately 8 talents onwards the system can start to slow down, as it has to analyse the state of each detected talent. Note that with some smart usage of the Tracking Zone / Exclusion Zone (see our manual), you can exclude areas of the shot to reduce the load on the system (the third sketch after this list illustrates this).
  • Visually noisy environments can be problematic. By “visually noisy” we mean situations such as excessively flashing lights, smoke, busy backgrounds, or a combination of these, as often found on stage at music festivals. While our algorithms have grown robust, these kinds of effects make VIPs look temporarily “different”, which impedes tracking.
  • Seervision is not suitable for sports. As an extension of everything above, most sports are too fast, involve too many crossings between players (especially in team sports), and all talents “look the same” (same uniform, size, and behaviour).
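
To make the “aggregate cues” idea above a little more concrete, here is a deliberately simplified, hypothetical sketch: it collapses a few cues into a vector per talent and compares talents by distance. The chosen cues and their scaling are invented for illustration; the real system relies on richer, learned features.

```python
# Toy illustration of distinguishing talents by aggregate appearance cues.
# The cues and their scaling are invented for this example; the real system
# relies on learned features rather than hand-picked ones.

import math

def cue_vector(talent: dict) -> list[float]:
    """Collapse a few appearance cues into one numeric vector."""
    r, g, b = talent["clothing_rgb"]
    return [
        talent["height_px"] / 1000.0,     # apparent size in the frame
        r / 255.0, g / 255.0, b / 255.0,  # dominant clothing color
    ]

def cue_distance(a: dict, b: dict) -> float:
    """Euclidean distance between two talents' cue vectors."""
    return math.dist(cue_vector(a), cue_vector(b))

speaker = {"height_px": 620, "clothing_rgb": (20, 40, 160)}   # blue jacket
guest   = {"height_px": 600, "clothing_rgb": (180, 30, 30)}   # red shirt
crew    = {"height_px": 615, "clothing_rgb": (25, 45, 155)}   # same blue uniform

print(cue_distance(speaker, guest))  # large distance: easy to tell apart
print(cue_distance(speaker, crew))   # small distance: same uniform, harder to separate
```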
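
The wide-shot limitation boils down to simple pixel arithmetic. The sketch below applies the rough 1/3-of-frame-height rule of thumb mentioned above to an assumed 1080p frame; the exact threshold used by the product may differ.

```python
# Rule-of-thumb check for whether a subject is large enough in the frame to
# track reliably. The 1/3-of-frame-height threshold comes from this article;
# the exact behaviour of the product may differ.

FRAME_HEIGHT_PX = 1080   # assuming a 1080p frame for this example
MIN_FRACTION = 1 / 3     # the subject should fill roughly a third of the height

def enough_visual_information(subject_height_px: float) -> bool:
    return subject_height_px >= FRAME_HEIGHT_PX * MIN_FRACTION

print(enough_visual_information(540))   # True: half the frame height, plenty of pixels
print(enough_visual_information(200))   # False: too small for robust tracking
```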
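
Finally, the load reduction from the Tracking Zone / Exclusion Zone can be pictured as a simple filter on detections before any per-talent analysis. The rectangular zone and helper names below are illustrative only, not how the product is implemented.

```python
# Simplified picture of how a Tracking Zone reduces load: detections whose
# centre falls outside the zone are dropped before any per-talent analysis.
# Rectangular zones and these helper names are illustrative only.

from dataclasses import dataclass

@dataclass
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1

def filter_by_zone(centres, tracking_zone):
    """Keep only detection centres that fall inside the tracking zone."""
    return [(x, y) for (x, y) in centres if tracking_zone.contains(x, y)]

# Example: the stage occupies the left half of a 640x360 analysis frame.
stage = Rect(0, 0, 320, 360)
centres = [(100, 200), (300, 180), (500, 190)]   # detected people (x, y)
print(filter_by_zone(centres, stage))            # the person at x=500 is excluded
```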

The key to success

Knowing these limitations, how can you make sure you’re successful when running a Seervision-powered system? We’ve seen quite a few successful deployments at this stage, and the same pattern emerges each time: start out simple with the recommended use-cases, and as you grow more comfortable, explore further. Concretely:

  • Make your first few deployments with Seervision centered around straightforward presentation use-cases, such as keynotes, lectures, and conferences. This will allow you to get comfortable with the features and get to know the system’s robustness.
  • As you grow more confident, start experimenting and see what Seervision can enable for your specific usage scenario!