Blog: Remove unwanted objects from photos using an automated macOS solution

Have you ever missed a photo op because you were in a crowded place and random people wouldn’t get out of your way? Sometimes it can be almost impossible to take a clear photo in the real world, and if you decide to wait for a busy area to clear out, you might be standing there forever. But no more waiting: we have a much more time-effective solution. Read on to learn about the approach we used to remove unwanted objects from your photos.

How does it work?

Through our research, we’ve discovered that we can produce a crystal-clear still picture from several busy or crowded shots of the same scene. Imagine taking a number of pictures and then combining their best parts into one. Doing this manually doesn’t make much sense, but with the latest Apple technologies we can do it automatically. You won’t even need to edit your pictures in Photoshop!

All we need is a short video; ten seconds or even five will do. However, we’ll use only a certain number of frames from this video. Five frames should do the trick, but you may opt for seven or more to get an even crisper outcome. 🙂

We created a basic demo macOS application that leverages this algorithm to remove unwanted moving objects from your pictures. The application uses standard Apple frameworks to deliver some amazing results: it takes a short video and creates a photo without any moving objects in it!

Technological background

First of all, we use the AVFoundation framework to extract five frames from the source video. By default, these frames are evenly distributed across the source video, but you may consider picking them manually.
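Here is a minimal sketch of this step (the extractFrames function name and the default frame count are our own illustration, not part of any fixed API). It uses AVAssetImageGenerator to grab evenly spaced frames from the video:

import AVFoundation
import CoreImage

func extractFrames(from url: URL, count: Int = 5) throws -> [CIImage] {
    let asset = AVAsset(url: url)
    let generator = AVAssetImageGenerator(asset: asset)
    generator.appliesPreferredTrackTransform = true
    // Ask for exact frame times instead of the nearest keyframes
    generator.requestedTimeToleranceBefore = .zero
    generator.requestedTimeToleranceAfter = .zero
    let duration = asset.duration.seconds
    return try (0..<count).map { index in
        // Sample at the midpoints of `count` equal time slices
        let seconds = duration * (Double(index) + 0.5) / Double(count)
        let time = CMTime(seconds: seconds, preferredTimescale: 600)
        let cgImage = try generator.copyCGImage(at: time, actualTime: nil)
        return CIImage(cgImage: cgImage)
    }
}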


The first real challenge for our application is to align the frames. This may not be required if you film the source video on a tripod or with optical stabilization on your camera. However, very often this is not the case, so the frame-alignment step becomes vitally important.

Aligning images

We created a homographicTransform function, which takes two images and tries to compute a homography matrix using the built-in functions of the Vision framework. This is possible only if the images contain the same objects; if you provide two completely different images as input, the function returns nil. If the operation succeeds, the function returns a 3×3 matrix that describes the transformation needed to align the source frame with the reference frame. We use the third (middle) frame as the reference because this choice requires minimal transformation of the other frames.

import CoreImage
import Vision
import simd

func homographicTransform(from image: CIImage, to reference: CIImage) -> matrix_float3x3? {
    // Create the registration request and its handler
    let request = VNHomographicImageRegistrationRequest(targetedCIImage: image)
    let requestHandler = VNImageRequestHandler(ciImage: reference, options: [:])
    // Send the request to the handler
    try? requestHandler.perform([request])
    // Get the observation; there is none if Vision could not register the images
    guard let results = request.results,
          let observation = results.first as? VNImageHomographicAlignmentObservation
    else {
        return nil
    }
    return observation.warpTransform
}

In the next step, we use Core Image to transform frames 1, 2, 4, and 5, aligning them with the third frame. Now we’ve got five aligned images, where the static objects sit in the same spots.
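Here is a sketch of how this warping step might look (the align and apply helpers are our own illustration; homographicTransform is the function shown above). Core Image has no direct “multiply by a 3×3 matrix” operation, so one way is to push the four corners of the image extent through the homography and hand the results to a CIPerspectiveTransform filter:

import CoreImage
import CoreImage.CIFilterBuiltins
import simd

// Warps every frame onto the reference frame at the given index
func align(_ frames: [CIImage], toReferenceAt index: Int) -> [CIImage] {
    let reference = frames[index]
    return frames.enumerated().map { offset, frame in
        guard offset != index,
              let homography = homographicTransform(from: frame, to: reference)
        else { return frame }
        return apply(homography, to: frame)
    }
}

// Applies a 3x3 homography by warping the corners of the image extent
func apply(_ homography: matrix_float3x3, to image: CIImage) -> CIImage {
    func warp(_ x: CGFloat, _ y: CGFloat) -> CGPoint {
        // Projective transform: multiply, then divide by the z component
        let p = homography * SIMD3<Float>(Float(x), Float(y), 1)
        return CGPoint(x: CGFloat(p.x / p.z), y: CGFloat(p.y / p.z))
    }
    let r = image.extent
    let filter = CIFilter.perspectiveTransform()
    filter.inputImage = image
    filter.topLeft = warp(r.minX, r.maxY)
    filter.topRight = warp(r.maxX, r.maxY)
    filter.bottomLeft = warp(r.minX, r.minY)
    filter.bottomRight = warp(r.maxX, r.minY)
    return filter.outputImage ?? image
}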

Delivering the result

The second major challenge is to create one output image from this set of images.

We build the resulting photo from the per-pixel median of the aligned frames. To determine the median, we use a Bose-Nelson sorting network. This algorithm is pretty fast and can deliver near real-time results.

#include <metal_stdlib>
using namespace metal;
#include <CoreImage/CoreImage.h> // defines sample_t for Core Image kernels

// Compare-exchange: leaves the smaller value in a, the larger in b
inline void swap(thread float4 &a, thread float4 &b) {
    float4 tmp = a; a = min(a, b); b = max(tmp, b);
}

extern "C" { namespace coreimage {
    // Per-channel median of five samples via a Bose-Nelson sorting network
    float4 medianReduction5(sample_t v0, sample_t v1, sample_t v2, sample_t v3, sample_t v4)
    {
        swap(v0, v1); swap(v3, v4); swap(v2, v4); swap(v2, v3); swap(v0, v3);
        swap(v0, v2); swap(v1, v4); swap(v1, v3); swap(v1, v2);
        return v2;
    }
}}

As you can see, the function returns the middle value of the sorted set, which is the median. We can’t simply average the frames in this case: averaging blends moving objects into semi-transparent ghosts, while the median discards outlier pixel values, so anything that occupies a given spot in only one or two of the five frames vanishes completely.
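To give an idea of how the kernel plugs into the pipeline, here is a sketch of the Swift side (the medianImage function and the “default.metallib” resource name are assumptions; the kernel itself must be compiled with the -fcikernel flags). Core Image loads the compiled Metal library and runs the kernel over the five aligned frames:

import CoreImage
import Foundation

func medianImage(from frames: [CIImage]) throws -> CIImage? {
    precondition(frames.count == 5, "the kernel expects exactly five samples")
    // Load the compiled Metal library that contains medianReduction5
    guard let url = Bundle.main.url(forResource: "default", withExtension: "metallib") else {
        return nil
    }
    let data = try Data(contentsOf: url)
    let kernel = try CIColorKernel(functionName: "medianReduction5", fromMetalLibraryData: data)
    // Each CIImage argument feeds one sample_t parameter of the kernel
    return kernel.apply(extent: frames[0].extent, arguments: frames)
}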

Main obstacles that we faced

When testing our application, we discovered three main obstacles. Here is how we approached them.

  1. Shaky camera — our app has to align the frames before the algorithm starts removing unwanted objects. You can achieve the best results by using a camera tripod.
  2. Slowly moving objects — an object that barely moves may end up in the same spot in most frames and survive the median, so try to film when such objects are moving out of the frame.
  3. Relatively big objects too close to the camera — large objects have proven to be a major obstacle to getting clean results. Once again, try to choose another spot to film from, or film when the objects stop moving.

Contact us!

As a result of our research, we created a macOS application leveraging built-in Apple frameworks. We focused on getting better performance and functionality with less effort, so that the algorithm can deliver results in near real time. We can implement the same algorithm on mobile devices to make it compatible with iPhone and iPad.