adaptation, and federated learning methods. In this paper, we address the challenges concerning data sharing, privacy concerns, and computational resource constraints.

3. METHODOLOGY

Our approach to enhancing oncology care through federated learning and transformer-based foundation models involves four key components: data processing and preparation, domain adaptation, federated learning, and comprehensive evaluation. In this section, we provide a detailed explanation of each component, along with the underlying techniques and methodologies.

3.1 Data Processing and Preparation

To create a domain-specific language model for oncology, we pre-train the BioBERT model on several kinds of oncology-related datasets using masked language modelling (MLM) and next sentence prediction (NSP) tasks. The datasets for this task include:
1. Cancer-related trials: This dataset encompasses 100,000 cancer trial samples, providing comprehensive information on cancer clinical trials, including trial descriptions, eligibility criteria, and treatments.¹
2. PubMed Hallmarks of Cancer Dataset: This dataset comprises 1,852 publication abstracts related to the hallmarks of cancer.²
3. Cancer Document Classification: This dataset consists of 7,569 cancer document samples; the Research Paper Text field of this dataset was used for training.³
4. Oncology Patient Medical Reports: To further enhance the model's understanding of oncology-specific language, we incorporated 19,253 anonymized medical reports belonging to cancer patients. This dataset provides valuable insights into the language and structure of clinical documentation in oncology.

¹ ClinicalTrials.gov
² huggingface.co/datasets/qanastek/HoC
³ kaggle.com/datasets/falgunipatel19/biomedical-text-publication
                                                              the weights of the model more precise and better at capturing
The MLM task involved randomly masking 15% of the input tokens in each sentence and replacing them with the [MASK] token, without masking special tokens such as [CLS] and [SEP]. The goal of the MLM task was to recover the original masked tokens from their context, which allowed the model to learn domain-specific representations of oncology text [2]. The NSP task required sentence pairs to be generated by sampling consecutive sentences (positive examples) or non-consecutive sentences (negative examples) from the dataset. The NSP task made the model learn the sequential nature of oncology-related texts and improved its understanding of document structure [16].

We used the BioBERT tokenizer to tokenize sentence pairs and built input tensors for the model. The tokenizer truncated or padded the sequences to a length of 512 tokens. The input tensors consisted of input IDs, token type IDs, attention masks, next sentence labels, and the labels for the MLM task. To enhance the BioBERT model, we loaded the pre-trained weights and trained it with the AdamW optimizer using a learning rate of 5e-5 for 3 epochs. BioBERT was pre-trained on the oncology-specific dataset using the MLM and NSP tasks, which resulted in a domain-focused language model able to capture the subtleties and details of oncology-related language. The model can now more accurately represent domain-specific concepts, terms, and relationships. The MLM task enabled the model to acquire contextual representations, while the NSP task assisted in comprehending the coherence and sequential order of oncology-related text. This domain adaptation is intended to improve the model's performance on oncology-specific natural language processing tasks such as named entity recognition, relation extraction, and text classification.

Figure 1 – BERT model training process
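To make this step concrete, the listing below gives a minimal sketch of how the MLM and NSP pre-training described above could be set up. It assumes the Hugging Face transformers and PyTorch APIs; the BioBERT checkpoint name, the sentence_pairs iterable of (sentence A, sentence B, is-next) triples, and the single-example batches are illustrative placeholders rather than the exact pipeline used here.

    # Minimal sketch of the MLM + NSP pre-training step described above.
    # Assumes the Hugging Face transformers / PyTorch APIs; the checkpoint name,
    # the sentence_pairs iterable, and single-example batches are illustrative.
    import torch
    from transformers import BertTokenizerFast, BertForPreTraining

    checkpoint = "dmis-lab/biobert-base-cased-v1.1"     # assumed BioBERT checkpoint
    tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
    model = BertForPreTraining.from_pretrained(checkpoint)

    def make_example(sent_a, sent_b, is_next):
        # Tokenize a sentence pair and truncate/pad to 512 tokens.
        enc = tokenizer(sent_a, sent_b, truncation=True, padding="max_length",
                        max_length=512, return_tensors="pt")
        labels = enc["input_ids"].clone()
        # Mask 15% of the tokens, skipping special tokens ([CLS], [SEP], [PAD]).
        special = torch.tensor(tokenizer.get_special_tokens_mask(
            enc["input_ids"][0].tolist(), already_has_special_tokens=True), dtype=torch.bool)
        probs = torch.full(labels.shape, 0.15)
        probs[0][special] = 0.0
        masked = torch.bernoulli(probs).bool()
        labels[~masked] = -100                           # compute loss only on masked positions
        enc["input_ids"][masked] = tokenizer.mask_token_id
        enc["labels"] = labels
        enc["next_sentence_label"] = torch.tensor([0 if is_next else 1])
        return enc

    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model.train()
    for epoch in range(3):                               # 3 epochs, as described above
        for sent_a, sent_b, is_next in sentence_pairs:   # hypothetical (sent_a, sent_b, is_next) triples
            batch = make_example(sent_a, sent_b, is_next)
            loss = model(**batch).loss                   # combined MLM + NSP loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()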
3.2 BERT Model Adaptation and Fine-tuning

The overall flow of the BERT model for domain adaptation on an oncology corpus and fine-tuning for the NER task is illustrated in Figure 2. The process starts with the pre-trained BioBERT model, which was trained on a large corpus of biomedical text. To adapt the model to the oncology domain, we perform additional pre-training on oncology-related data, which helps the model acquire domain-specific language patterns and vocabulary. This domain adaptation step refines the model's weights so that they better capture the nuances and characteristics of oncology. Subsequently, we fine-tune the domain-adapted model on a labelled NER dataset specific to the oncology domain. At this stage, we further adjust the model's weights to capture the specific patterns and features necessary for accurately identifying named entities within the oncology context. The fine-tuning process leverages the information obtained from both the general pre-training (BioBERT) and the domain-specific pre-training (oncology-related data). The resulting fine-tuned model can be used to automatically extract and annotate named entities from new, unseen oncology text, enabling efficient information extraction and analysis in oncology.
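The listing below sketches this NER fine-tuning step, again assuming the Hugging Face transformers and PyTorch APIs; the oncology-adapted-biobert checkpoint path, the BIO tag set, and the ner_examples iterable are hypothetical placeholders used only for illustration.

    # Minimal sketch of the NER fine-tuning step shown in Figure 2.
    # Assumes the Hugging Face transformers / PyTorch APIs; the checkpoint path,
    # the BIO tag set, and the ner_examples iterable are hypothetical placeholders.
    import torch
    from transformers import BertTokenizerFast, BertForTokenClassification

    tag_set = ["O", "B-CANCER", "I-CANCER", "B-TREATMENT", "I-TREATMENT"]   # illustrative labels
    checkpoint = "oncology-adapted-biobert"          # the domain-adapted model from Section 3.1
    tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
    model = BertForTokenClassification.from_pretrained(checkpoint, num_labels=len(tag_set))

    def encode(words, tags):
        # Tokenize pre-split words and align word-level BIO tags to word pieces;
        # sub-word continuations and special tokens get -100 so the loss ignores them.
        enc = tokenizer(words, is_split_into_words=True, truncation=True,
                        padding="max_length", max_length=512, return_tensors="pt")
        aligned, prev = [], None
        for wid in enc.word_ids(0):
            aligned.append(-100 if wid is None or wid == prev else tag_set.index(tags[wid]))
            prev = wid
        enc["labels"] = torch.tensor([aligned])
        return enc

    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model.train()
    for words, tags in ner_examples:                 # hypothetical labelled NER corpus
        batch = encode(words, tags)
        loss = model(**batch).loss                   # token-level cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()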



3.3 Federated Learning

To enhance the domain-specific language model and address the challenges of data privacy and centralised training, we