Jan Cychnerski, Tomasz Dziubich
https://link.springer.com/chapter/10.1007/978-3-030-85082-1_20
The acquisition of high-quality data and annotations is essential for training efficient machine learning algorithms, yet it is an expensive and time-consuming process. Although data processing and the training and testing of machine learning models are well studied in the literature, the actual procedures for obtaining data and annotations in collaboration with physicians are in most cases based on the personal intuition and suppositions of the researchers.
This article investigates practical aspects of medical data acquisition and annotation, as well as methods of collaboration between IT and medical teams for building datasets that meet the desired quality, quantity, and time requirements. Based on five projects undertaken by the authors in diverse medical fields, in which the dataset construction procedure was iteratively optimized, a set of guidelines and good practices for building new medical datasets was developed and is described.