Photo Editing with the Power of Words – By Jakub

Have you ever wondered how you would look with a different hair colour? With your head rotated just a little higher, with that nice smirk? Or maybe you would like a different lipstick in your favourite photo with your best friend? Say no more, AI has got you covered!

Figure 1: Generated face (left), modified beard (centre), modified hair colour (right). (Note: these images are artificially generated.)

So how does this even work? By taming the power of GANs (Generative Adversarial Networks). The project consists of several state-of-the-art neural networks. At its heart is the StyleGAN3 architecture, developed by NVIDIA, which generates realistic images. These images can be faces, but not only – they can be anything from landscapes to cars to abstract art! When StyleGAN generates an image, it starts from completely random noise; you can imagine it like a static TV screen, but in all colours. As this noise passes through the network, it is transformed little by little until it starts to look like the image we want. By the end we get a high-quality image – in this case, a photo of a person who doesn't exist.
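Curious what that looks like in practice? Here is a minimal sketch of sampling a brand-new face with a pretrained StyleGAN3 generator, following the example from NVIDIA's official repository (the pickle file name is just a placeholder for one of their published checkpoints):

```python
import pickle
import torch

# Load a pretrained generator (placeholder file name; NVIDIA publishes
# ready-made checkpoints, e.g. a face model trained on the FFHQ dataset).
with open('stylegan3-ffhq.pkl', 'rb') as f:
    G = pickle.load(f)['G_ema'].cuda()   # torch.nn.Module

z = torch.randn([1, G.z_dim]).cuda()     # the "TV static": a random latent vector
c = None                                 # class labels (not used for faces)
img = G(z, c)                            # output image, NCHW, values in [-1, 1]
```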

But wait, we want to get our own photos into the network! There is a technique that lets us map our own images into the network's latent space. It's called latent inversion: we start with an existing photo and search for the closest point to it in the network's hidden representation. This is possible because the network has learnt how faces work and can generate almost any face, real or artificial.
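A common way to do this is plain optimisation: start from an average latent code and nudge it until the generator's output matches the photo. The sketch below shows the idea; it assumes a StyleGAN-style generator `G` with `mapping` and `synthesis` sub-networks (as in NVIDIA's code) and a `target` photo given as a tensor in [-1, 1] at the generator's resolution. Real pipelines typically add a perceptual loss such as LPIPS and other tricks.

```python
import torch
import torch.nn.functional as F

def invert(G, target, steps=500, lr=0.05):
    with torch.no_grad():
        # Start from the average latent code, a good neutral initialisation.
        z = torch.randn(1000, G.z_dim, device=target.device)
        w_avg = G.mapping(z, None).mean(dim=0, keepdim=True)
    w = w_avg.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = G.synthesis(w)             # render the current guess
        loss = F.mse_loss(img, target)   # pixel loss; a perceptual loss helps a lot in practice
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()                    # latent code that best reconstructs the photo
```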

So now we have put our photo into the network – what next? StyleGAN only generates new images, it doesn't edit existing ones! This is where the CLIP network comes to the rescue. Developed by OpenAI, CLIP makes connections between images and text. It can produce hidden representations (embeddings) of images, text, or a mix of both. This lets us measure how close a description is to an image, and it also lets us do some maths on those hidden representations.
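To make that concrete, here is a tiny example of scoring an image against a few text prompts with OpenAI's `clip` package (the file name and the prompts are just placeholders):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("face.png")).unsqueeze(0).to(device)
text = clip.tokenize(["blonde hair", "red lipstick", "a black cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity tells us which description matches the image best.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)

print(similarity)  # higher value = closer match
```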

With CLIP and StyleGAN together, using some mathematical tricks and optimisation, we can generate images that match our description. What's more, if we use our own photo, we can modify it simply by describing the desired appearance in words, such as "blonde hair" or "red lipstick".
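One way to combine them, in the spirit of StyleCLIP-style latent optimisation, is to keep adjusting the inverted latent so that CLIP thinks the rendered image matches the prompt, while staying close to the original face. The sketch below assumes the generator `G` and the inverted latent `w_inverted` from the previous steps; the learning rate and loss weights are illustrative, and real code would also apply CLIP's input normalisation.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()               # keep everything in float32 for the optimisation

with torch.no_grad():
    text = clip.tokenize(["blonde hair"]).to(device)
    text_feat = clip_model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

w = w_inverted.clone().requires_grad_(True)   # latent from the inversion step
opt = torch.optim.Adam([w], lr=0.01)

for _ in range(200):
    img = G.synthesis(w)                                             # render the current edit
    img = F.interpolate((img + 1) / 2, size=224, mode="bilinear")    # CLIP expects 224x224 inputs
    img_feat = clip_model.encode_image(img)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    clip_loss = 1 - (img_feat * text_feat).sum()                     # make the image match the text
    id_loss = (w - w_inverted).pow(2).mean()                         # stay close to the original person
    loss = clip_loss + 0.1 * id_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

edited = G.synthesis(w.detach())              # the same photo, hopefully with blonde hair
```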

The last piece is the DragGAN network. It lets us edit generated images in a very intuitive way: the user selects points on the generated image and then chooses destinations for those points. The network then generates a new image that reflects the requested changes. In practice, it lets users make advanced edits, like rotating a face in a photo, with ease!

But wait, there is more. The presented approach is very flexible and allows for a multitude of use cases. Have you ever wondered how your beloved cat would look as a Persian longhair? Or maybe your favourite sofa needs a new look? The possibilities seem limitless. One thing is for sure: in the coming years…

Figure 2: Generated cat (left), modified cat colour, ears and perspective (right)

Figure 3: Generated anime-style girl (left), modified hair colour and perspective (right)

If you understand Polish, here is a video about the project, made by the post's author and his team members: