Deep dive into ML: data annotation and learning approaches

What data annotation is and why it matters

Machine Learning (ML) systems work by learning from data: they identify recurring correlations and patterns within a dataset and derive the “rules” needed to complete a given task.

For this reason, access to a large amount of data is an important starting point for designing and implementing an ML system: in general, the more data is available, the more accurate the learning process and the resulting performance tend to be. However, even more important than quantity is quality, that is, data that is clean and, where possible, annotated.

Having an annotated dataset means that each piece of data is associated with a label carrying important information. For example, the label may describe what the data means in a given context, which class it belongs to, or how it should be interpreted; in this way, the machine can analyze the data more easily and use it as a learning base.

Data annotation also determines which learning methodology the ML system will use: when an annotated dataset is available, an approach called supervised learning is typically used; conversely, when the dataset is not annotated, the technique to apply is called unsupervised learning.

What does supervised learning mean?

Let’s imagine we have an annotated dataset of photos: some of them portray dogs and carry the label “dog”, while others show cats and carry the label “cat”. The final objective is to teach an ML system to distinguish the two species on its own. To do this, data scientists show the annotated dataset to the ML model, which looks for patterns (similarities) among all the pictures labelled “dog”, as well as among all those labelled “cat”. In this way, the system learns to predict which category a new photo of a dog or cat belongs to.
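To make the idea concrete, here is a minimal sketch of supervised learning in plain Python. It is not how a real image classifier works: instead of photos, it assumes each animal has already been reduced to two hypothetical features (body weight and ear length), and it uses a very simple "nearest centroid" rule chosen only for illustration. The names `train_centroids` and `predict` are made up for this example.

```python
import math
from collections import defaultdict


def train_centroids(samples, labels):
    """Learn one 'pattern' per label: the average feature vector of its examples."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Assign a new example to the label whose centroid is closest."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Hypothetical annotated dataset: (body weight in kg, ear length in cm) plus a label.
train_x = [(30.0, 12.0), (25.0, 11.0), (35.0, 13.0), (4.0, 6.0), (5.0, 7.0), (3.5, 6.5)]
train_y = ["dog", "dog", "dog", "cat", "cat", "cat"]

centroids = train_centroids(train_x, train_y)

# Two new, unlabelled animals the model has never seen.
prediction_big = predict(centroids, (28.0, 12.5))
prediction_small = predict(centroids, (4.5, 6.0))
print(prediction_big, prediction_small)
```

The labels do all the work here: without `train_y`, the model would have no way of knowing which group of examples is called “dog” and which is called “cat”.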

And what about unsupervised learning?

If, on the other hand, we don’t have an annotated dataset, we can no longer speak of supervised learning, but of unsupervised learning. In this approach, the ML model has to find possible annotations by itself, i.e. by grouping similar data together into “clusters” whose members share common patterns.
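The grouping step can be sketched with a tiny version of k-means, one well-known clustering technique among many. This is an illustrative sketch only: the two features are the same hypothetical ones a dog/cat example might use (body weight and ear length), the number of clusters is assumed to be known in advance, and the initial centroids are naively taken from the first points so that the result is reproducible. The function name `kmeans` is our own.

```python
import math


def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point to its
    nearest centroid and then moving each centroid to the mean of its cluster."""
    centroids = list(points[:k])  # naive, deterministic start: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        centroids = [
            tuple(sum(column) / len(cluster) for column in zip(*cluster))
            if cluster else centroids[i]  # keep the old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabelled data: (body weight in kg, ear length in cm), no labels at all.
points = [(30.0, 12.0), (4.0, 6.0), (25.0, 11.0), (5.0, 7.0), (35.0, 13.0), (3.5, 6.5)]

centroids, clusters = kmeans(points, k=2)
for index, cluster in enumerate(clusters):
    print(f"cluster {index}: {sorted(cluster)}")
```

Note that the algorithm only produces "cluster 0" and "cluster 1": it separates heavy animals from light ones, but it is still up to a human to decide what those groups mean.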

If we have alternatives to supervised learning, why is annotated data so important?

The answer is simple: ML algorithms based on a supervised learning approach generally achieve better performance in less time.

However, in certain situations, unsupervised learning is the better choice. One example is credit card fraud detection: fraudulent transactions are generally (and fortunately!) far fewer than normal ones. With a supervised approach, the ML system would struggle to classify new frauds because there are too few examples labelled “fraud” to learn from. An unsupervised approach, instead, allows the ML system to look at the data as a whole and to group the most similar cases within the dataset on its own, so that it can define its own “rules” for analyzing and classifying future transactions.

Although results may be less accurate, in this case the unsupervised approach is more appropriate than the supervised one because it does not depend on known examples: it can also flag brand-new types of fraud that have never been recorded before.

To better understand the difference between the two concepts, let’s imagine that we humans are the teachers and the algorithm is the student, who learns the definition of “fraud” from us and uses it to assess whether a transaction is fraudulent or not (supervised learning). However, as soon as the student comes across a new kind of fraudulent transaction (never seen before), (s)he is unlikely to recognize it, because that fraud does not match the definition learned. When unsupervised learning is applied, instead, the student has no definition to rely on, so (s)he naturally groups similar transactions together and treats as potential frauds those transactions that simply differ from the ones that occur most frequently.
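The student's strategy of "flag whatever looks different from the usual" can be sketched in a few lines. This is a deliberately simplified illustration, not a real fraud detection system: it looks at a single hypothetical feature (the transaction amount), and it uses a common outlier heuristic based on the median and the median absolute deviation, with a threshold of 3.5 that is a conventional rule of thumb rather than anything specific to fraud. The function name `flag_unusual` and the sample amounts are invented for the example.

```python
import statistics


def flag_unusual(amounts, threshold=3.5):
    """Return the amounts that lie far from the bulk of the data.

    No labels are used: a transaction is flagged only because it differs
    strongly from what is typical in the dataset itself.
    """
    median = statistics.median(amounts)
    deviations = [abs(amount - median) for amount in amounts]
    mad = statistics.median(deviations)  # median absolute deviation
    if mad == 0:
        return []  # all amounts are (nearly) identical, nothing stands out
    # 0.6745 rescales the score so it is roughly comparable to a z-score.
    return [
        amount for amount in amounts
        if 0.6745 * abs(amount - median) / mad > threshold
    ]


# Hypothetical card transactions (amounts in euros); none of them is labelled.
amounts = [12.5, 40.0, 23.9, 18.0, 35.2, 27.4, 15.0, 31.1, 2500.0, 22.3]

suspicious = flag_unusual(amounts)
print(suspicious)
```

A flagged transaction is only "unusual", not proven fraud: a legitimate large purchase would be flagged too, which is one reason why results from this kind of approach may be less accurate.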

The best of both

As you may have noticed, both approaches have pros and cons. While unsupervised learning comes in handy in certain situations, supervised learning is still preferable in most cases. Yet we live in a world where the production of data increases exponentially, and annotation is a time-consuming operation (the larger the dataset, the longer the annotation process). So, what if a semi-supervised approach is used? New techniques are emerging that combine the advantages of both approaches and require the annotation of only a part of the dataset. In this way, data scientists can bring their sector-specific knowledge to bear on a small annotated portion, while the ML system still receives a learning base that can ensure better accuracy and control over results.
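One simple way to picture a semi-supervised approach is "self-training": start from a handful of annotated examples, then let the model label the remaining data itself, one confident guess at a time. The sketch below is only one possible illustration of the idea, under strong assumptions: the data is reduced to two hypothetical features (body weight and ear length), "confidence" is approximated by distance to the nearest class average, and the names `centroid` and `self_train` are our own.

```python
import math


def centroid(rows):
    """Average feature vector of a list of points."""
    return tuple(sum(column) / len(rows) for column in zip(*rows))


def self_train(labelled, unlabelled):
    """Grow a small labelled set by repeatedly pseudo-labelling the unlabelled
    point that lies closest to one of the current class centroids."""
    labelled = {label: list(rows) for label, rows in labelled.items()}
    remaining = list(unlabelled)
    while remaining:
        centroids = {label: centroid(rows) for label, rows in labelled.items()}
        # The closest (point, label) pair is treated as the most confident guess.
        _, point, label = min(
            ((math.dist(p, c), p, name) for p in remaining for name, c in centroids.items()),
            key=lambda candidate: candidate[0],
        )
        labelled[label].append(point)
        remaining.remove(point)
    return labelled


# Only two examples were annotated by hand; the other four have no label.
labelled = {"dog": [(30.0, 12.0)], "cat": [(4.0, 6.0)]}
unlabelled = [(25.0, 11.0), (5.0, 7.0), (35.0, 13.0), (3.5, 6.5)]

result = self_train(labelled, unlabelled)
for label, rows in result.items():
    print(label, sorted(rows))
```

Here a person annotates two examples instead of six, and the remaining labels are inferred. The trade-off is that an early wrong guess can propagate, so in practice the pseudo-labels would need to be checked.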