TrueFace: the research community against deepfakes

We asked Giulia Boato, associate professor at the Department of Information Engineering and Computer Science (DISI) of the University of Trento, and Sebastiano Verde, colleague and researcher, what this dataset consists of, how it was built and why it is so important.

Eleonora (U-Hopper): Hi Giulia, hello Sebastiano. Let’s go straight to the point: what is TrueFace all about?

Sebastiano: TrueFace is a dataset of images portraying human faces. What makes it distinctive is that some of these images portray real people while others are synthetically generated, i.e. they have been created by means of Artificial Intelligence (AI) methods. The dataset marks an important turning point for the scientific community that deals with digital media forensics. Its purpose is to train new algorithms that are capable of distinguishing authentic photos from artificially generated images, not only under laboratory conditions, but also in realistic and complex scenarios such as images shared on social media. For this reason, part of the dataset’s images have also been shared - uploaded and downloaded - on various social networks.

Let’s take a step back: can you briefly explain to us what is meant by the term media forensics?

Giulia: Media forensics refers to the forensic analysis of multimedia content, such as images or videos. Specifically, the research field deals with both reconstructing additional information on the history of such content (for example, the acquisition device or the author) and verifying its authenticity. Nowadays, it represents a key research area for determining whether content shared online faithfully describes reality or is instead a source of disinformation.

Recently, this research field has focused on tackling the problem of identifying computer-generated data (such as deepfakes), which stand out for their photorealism and their ability to trick existing detection algorithms.

How did you build the dataset? How long did it take?

Sebastiano: We started with a set of images generated by GANs - a cutting-edge type of neural network used in Artificial Intelligence - that reproduce incredibly realistic human faces. We first generated about 70,000 artificial faces and then gathered a set of real photos, for a total of 150,000 images. In this way, we built the first part of the dataset, which we named Pre-Social TrueFace. Finally, we shared part of this dataset on a few social media platforms (Facebook, Telegram and Twitter) to build a second version, called Post-Social TrueFace, containing 60,000 images. Collecting the dataset took a few months, especially for the second version, where the images had to be uploaded to and downloaded from social media.

What are the implications of this new dataset for the scientific field?

Giulia: TrueFace is the first dataset available to the scientific community that contains AI-generated images that have been shared on social media. Therefore, it not only opens up the possibility of training and testing new algorithms for the identification of artificial content, but it also allows, for the first time, the development of effective ‘detection methods’ even on data that we call ‘post-social data’. Sharing data online, in fact, involves a series of processes (compression, resizing, etc.) which compromise the possibility of analysing it from a forensic perspective. At the same time, however, content shared on social networks is nowadays one of the major sources of information; ensuring its reliability is therefore of primary importance. We hope that TrueFace will become a reference dataset in the international research community for this issue.
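Editor's note: the effect Giulia describes can be illustrated with a toy sketch. The Python snippet below is not the TrueFace pipeline, nor the processing applied by any real platform (those typically involve JPEG compression and undisclosed steps). It assumes a hypothetical 8x8 grayscale patch carrying a faint pixel-level pattern as a stand-in for a forensic trace, and uses block averaging and coarse rounding as crude stand-ins for resizing and lossy compression.

```python
# Toy illustration only: all names and values here are hypothetical.

def downscale_2x(img):
    """Average each 2x2 block (crude stand-in for platform resizing)."""
    h, w = len(img), len(img[0])
    return [
        [(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) // 4
         for x in range(0, w - 1, 2)]
        for y in range(0, h - 1, 2)
    ]

def quantize(img, step=8):
    """Round pixels to multiples of `step` (crude stand-in for lossy compression)."""
    return [[min(255, round(p / step) * step) for p in row] for row in img]

def simulate_sharing(img):
    """Apply the two toy processing steps in sequence."""
    return quantize(downscale_2x(img))

def mean_abs_diff(a, b):
    """Average absolute pixel difference between two equally sized images."""
    diffs = [abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Smooth gradient (the "content") with and without a faint +/-2 checkerboard
# pattern (the assumed "trace" a detector might rely on).
clean = [[100 + 4 * x for x in range(8)] for y in range(8)]
traced = [[100 + 4 * x + (2 if (x + y) % 2 == 0 else -2) for x in range(8)]
          for y in range(8)]

trace_before = mean_abs_diff(traced, clean)
trace_after = mean_abs_diff(simulate_sharing(traced), simulate_sharing(clean))
print(trace_before, trace_after)
```

In this contrived case the pattern is clearly measurable before processing and indistinguishable afterwards, which is the kind of loss that makes ‘post-social’ images harder to analyse. Real traces and real platform pipelines are far more complex, so the sketch only conveys the intuition.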

And in more ‘practical’ terms? What types of applications can this dataset contribute to?

Sebastiano: TrueFace is part of the TrueBees project, whose goal is to develop a system for authenticating online content by exploiting the synergy between media forensics and blockchain technology. On the one hand, forensic analysis makes it possible to verify the authenticity of an image, in our case images portraying human faces that have been shared on some social media (please bear in mind that an algorithm that works on all types of images and on any online site does not exist!); on the other, blockchain provides a secure data structure in which to store the analysed images.

The dataset plays a key role within the TrueBees project, as it provides the data needed both to train the forensic analysis algorithms and to perform the final validation of the system. This is a first and fundamental step in addressing the problem, although extending the dataset would also ensure greater accuracy. We hope the research community will appreciate the effort made and will contribute to its extension for the common good of being able to identify fake content online.

Are you planning an extension / an update of this dataset in the future? What are the new challenges in the digital forensics field?

Giulia: One of the biggest challenges is indeed moving forensic analysis from the laboratory to real conditions, and TrueFace is a first step in this direction. However, this is just a starting point, and extending it would certainly make it possible to further improve the performance of the algorithms.

Note that this first version of the dataset focuses only on images portraying human faces; it could be interesting, in the future, to extend the study to different categories of content and different subjects. Furthermore, we took into consideration only three social networks; a greater number and variety of platforms could increase the dataset’s robustness.

Looking even further, another challenge will be translating what we have learned about images into the analysis of video-based content. ‘High-quality’ fake videos are, luckily, not as common and widespread as fake images, but the situation may get worse in the next few years. It is therefore essential to have authentication tools that go hand in hand with modern technologies.

TrueBees is a project carried out by the University of Trento and U-Hopper. It is funded by Trublo under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 957228), which aims at promoting new technologies, such as blockchain, to revolutionise the media sector.