adaptation, and federated learning methods. In this paper, we address the challenges of data sharing, privacy, and computational resource constraints.

3. METHODOLOGY

Our approach to enhancing oncology care through federated learning and transformer-based foundation models involves four key components: data processing and preparation, domain adaptation, federated learning, and comprehensive evaluation. In this section, we provide a detailed explanation of each component, along with the underlying techniques and methodologies.

3.1 Data Processing and Preparation
To create a domain-specific language model for oncology, we pre-train the BioBERT model on different kinds of oncology-related datasets using the masked language modelling (MLM) and next sentence prediction (NSP) tasks. The datasets for this task include:
1. Cancer-related trials: This dataset encompasses 100,000 cancer trial samples, providing comprehensive information on cancer clinical trials, including trial descriptions, eligibility criteria, and treatments.¹
2. PubMed Hallmarks of Cancer Dataset: This dataset comprises 1,852 publication abstracts related to the hallmarks of cancer.²
3. Cancer Document Classification: This dataset consists of 7,569 cancer document samples; the Research Paper Text field in this dataset was used for training.³
4. Oncology Patient Medical Reports: To further enhance the model's understanding of oncology-specific language, we incorporated 19,253 anonymized medical reports belonging to cancer patients. This dataset provides valuable insights into the language and structure of clinical documentation in oncology (see the corpus-assembly sketch after this list).

¹ ClinicalTrials.gov
² huggingface.co/datasets/qanastek/HoC
³ kaggle.com/datasets/falgunipatel19/biomedical-text-publication
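To make the data preparation concrete, the sketch below shows one way the four sources could be gathered into a single pre-training corpus. It is illustrative only: the Hallmarks of Cancer set is loaded from its public Hugging Face repository (qanastek/HoC), while the file names and most column names for the other sources are placeholder assumptions rather than the exact pipeline used in this work.

```python
# Illustrative corpus assembly; file names and column names (other than the
# "Research Paper Text" field mentioned above) are placeholders.
from datasets import load_dataset
import pandas as pd

corpus = []  # one entry per document (trial description, abstract, report, ...)

# 1. Cancer-related clinical trials exported from ClinicalTrials.gov (hypothetical path/column).
trials = pd.read_csv("cancer_trials.csv")
corpus += trials["description"].dropna().tolist()

# 2. PubMed Hallmarks of Cancer abstracts from the Hugging Face Hub ("text" field assumed).
hoc = load_dataset("qanastek/HoC", split="train")
corpus += [row["text"] for row in hoc]

# 3. Kaggle cancer document classification set; the Research Paper Text field is used, as above.
docs = pd.read_csv("biomedical_text_publication.csv")
corpus += docs["Research Paper Text"].dropna().tolist()

# 4. Anonymized oncology patient reports (institution-internal; hypothetical path/column).
reports = pd.read_csv("oncology_reports.csv")
corpus += reports["report_text"].dropna().tolist()

print(f"{len(corpus)} documents in the oncology pre-training corpus")
```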
The MLM task involved randomly masking 15% of the input tokens in each sentence and replacing them with the [MASK] token, without masking special tokens such as [CLS] and [SEP]. The goal of the MLM task was to predict the original masked tokens from the surrounding context, which allowed the model to learn domain-specific representations of the oncology domain [2]. For the NSP task, sentence pairs were generated by sampling consecutive sentences (positive examples) or non-consecutive sentences (negative examples) from the dataset. The NSP task taught the model the sequential nature of oncology-related texts and improved its understanding of document structure [16].
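A minimal sketch of how such MLM and NSP training examples can be built with a Hugging Face tokenizer is shown below. The public BioBERT checkpoint name, the toy sentences, and the 50/50 positive/negative sampling are assumptions for illustration; the 15% masking rate, the exclusion of special tokens, and the 512-token limit follow the description above.

```python
import random
import torch
from transformers import BertTokenizerFast

# Publicly available BioBERT checkpoint (assumed here; the paper's exact weights may differ).
tokenizer = BertTokenizerFast.from_pretrained("dmis-lab/biobert-v1.1")

# Toy oncology sentences standing in for the assembled corpus.
oncology_sentences = [
    "The patient was diagnosed with stage II breast cancer.",
    "Adjuvant chemotherapy with doxorubicin was initiated.",
    "Follow-up imaging showed no evidence of metastasis.",
    "The trial enrols adults with non-small cell lung cancer.",
]

def make_nsp_pairs(sentences):
    """Consecutive sentences form positive pairs (label 0); a randomly drawn
    sentence forms a negative pair (label 1), matching the NSP setup above."""
    pairs, labels = [], []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1]))
            labels.append(0)
        else:
            pairs.append((sentences[i], random.choice(sentences)))
            labels.append(1)
    return pairs, labels

def mask_tokens(batch, mlm_probability=0.15):
    """Replace 15% of non-special, non-padding tokens with [MASK]; positions
    that are not masked get label -100 so the MLM loss ignores them."""
    labels = batch["input_ids"].clone()
    special = torch.tensor(
        [tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
         for ids in labels.tolist()], dtype=torch.bool)
    maskable = ~special & batch["attention_mask"].bool()
    mask = (torch.rand(labels.shape) < mlm_probability) & maskable
    labels[~mask] = -100
    batch["input_ids"][mask] = tokenizer.mask_token_id
    batch["labels"] = labels
    return batch

pairs, nsp_labels = make_nsp_pairs(oncology_sentences)
batch = tokenizer([a for a, b in pairs], [b for a, b in pairs],
                  truncation=True, padding="max_length",
                  max_length=512, return_tensors="pt")
batch = mask_tokens(batch)
batch["next_sentence_label"] = torch.tensor(nsp_labels)
```

The resulting batch holds exactly the tensors described in the next paragraph: input IDs, token type IDs, attention masks, MLM labels, and next sentence labels.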
We used the BioBERT tokenizer to tokenize sentence pairs and build input tensors for the model. The tokenizer truncated or padded the sequences to a length of 512 tokens. The input tensors consisted of input IDs, token type IDs, attention masks, next sentence labels, and the labels for the MLM task. To enhance the BioBERT model, we loaded the pre-trained weights and trained it with the AdamW optimizer using a learning rate of 5e-5 for 3 epochs. BioBERT was pre-trained on the oncology-specific dataset using the MLM and NSP tasks, which resulted in a domain-focused language model able to capture the subtleties and details of oncology-related language. The resulting model can more accurately represent domain-specific concepts, terms, and relationships. The MLM task enabled the model to acquire contextual representations, while the NSP task aided comprehension of the coherence and sequential order of oncology-related text. The domain adaptation is intended to improve the model's performance on oncology-specific natural language processing tasks such as named entity recognition, relation extraction, and text classification.
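Under the same assumptions, the domain-adaptation step can be sketched as a standard joint MLM + NSP training loop. BertForPreTraining combines both heads and sums their losses; the optimizer choice (AdamW), learning rate (5e-5), and 3 epochs follow the values stated above, while the batch size, data loading, and output path are illustrative.

```python
# Joint MLM + NSP pre-training loop (a sketch; `batch` is the tensor dict built
# in the previous snippet, and the checkpoint/output names are illustrative).
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForPreTraining

model = BertForPreTraining.from_pretrained("dmis-lab/biobert-v1.1")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

dataset = TensorDataset(batch["input_ids"], batch["token_type_ids"],
                        batch["attention_mask"], batch["labels"],
                        batch["next_sentence_label"])
loader = DataLoader(dataset, batch_size=8, shuffle=True)

model.train()
for epoch in range(3):
    for input_ids, token_type_ids, attention_mask, mlm_labels, nsp_labels in loader:
        outputs = model(input_ids=input_ids,
                        token_type_ids=token_type_ids,
                        attention_mask=attention_mask,
                        labels=mlm_labels,               # MLM targets (-100 = ignored)
                        next_sentence_label=nsp_labels)  # 0 = consecutive, 1 = random pair
        outputs.loss.backward()                          # summed MLM + NSP loss
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("biobert-oncology")                # domain-adapted checkpoint
```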
Figure 1 – BERT model training process

3.2 BERT Model Adaptation and Fine-tuning

The overall flow of the BERT model for domain adaptation on an oncology corpus and fine-tuning for the NER task is illustrated in Figure 2. The process starts with the pre-trained BioBERT model, which was trained on a large corpus of biomedical text. To adapt the model to the oncology domain, we perform additional pre-training on oncology-related data, which helps the model acquire domain-specific language patterns and vocabulary. This domain adaptation step refines the model's weights so that they better capture the nuances and characteristics of oncology. Subsequently, we fine-tune the domain-adapted model on a labelled NER dataset specific to the oncology domain. At this stage, we further adjust the model's weights to capture the patterns and features necessary for accurately identifying named entities within the oncology context. The fine-tuning process leverages the information obtained from both the general pre-training (BioBERT) and the domain-specific pre-training (oncology-related data). The resulting fine-tuned model can be used to automatically extract and annotate named entities from new, unseen oncology text data, enabling efficient information extraction and analysis in oncology.
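The section above does not spell out the exact NER head, but a common realisation is to load the domain-adapted weights into a token-classification model and train it on BIO-tagged oncology entities. The entity label set, the single-sentence example, and the label alignment below are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative NER fine-tuning step on top of the domain-adapted checkpoint
# saved in the previous snippet; the label set is hypothetical.
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

labels = ["O", "B-CANCER", "I-CANCER", "B-TREATMENT", "I-TREATMENT"]  # assumed entity types
model = BertForTokenClassification.from_pretrained(
    "biobert-oncology",              # domain-adapted weights; the NER head is newly initialised
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)
tokenizer = BertTokenizerFast.from_pretrained("dmis-lab/biobert-v1.1")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One training step on a single annotated sentence; real code would align gold
# BIO tags to word pieces instead of the all-"O" placeholder labels used here.
enc = tokenizer("Paclitaxel was given for ovarian cancer.",
                return_tensors="pt", truncation=True, max_length=512)
token_labels = torch.zeros_like(enc["input_ids"])   # placeholder: every token tagged "O"
outputs = model(**enc, labels=token_labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```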
3.3 Federated Learning

To enhance the domain-specific language model and address the challenges of data privacy and centralised training, we