Page 161 - Kaleidoscope Academic Conference Proceedings 2024
Innovation and Digital Transformation for a Sustainable World
Figure 1 presents an updated overview of our proposed system architecture for AI-driven personalized health services. The system consists of three main components: (1) a user interaction layer, (2) a generative AI model, and (3) a knowledge retrieval engine [28]. The user interaction layer provides natural language interfaces, such as chatbots, voice assistants, or mobile apps, for users to input their health queries, symptoms, or goals. These inputs are translated into structured prompts that specify the desired output format and any relevant patient context. The prompts are then augmented with relevant medical knowledge retrieved from the knowledge base. The augmented prompts are fed into the generative AI model, which is a large language model pre-trained on general-purpose text data and fine-tuned on domain-specific health corpora [29]. The model generates personalized health information or recommendations as output, tailored to the user's specific prompt and retrieved context [30]. Techniques for safe and controllable generation, such as domain-adaptive pretraining, content filtering, and human feedback, are applied to ensure outputs align with verified health guidelines. The knowledge retrieval engine consists of a knowledge base that stores structured health data (e.g., ontologies, clinical guidelines, drug databases), and a retrieval module that finds relevant information based on the user prompt and generated output. The retriever uses semantic search techniques (e.g., entity linking, embedding similarity) to map natural language to knowledge base entries. Retrieved context is passed back to the generative model to inform and ground its outputs [31][32].

3.2 Data and Knowledge Sources

Our system leverages a combination of large-scale unstructured text corpora and structured knowledge bases to train the generative model and retrieval engine. For pre-training the base language model, we use general-purpose text datasets containing billions of tokens, such as Common Crawl [33] and The Pile. For fine-tuning, we curate a health-specific corpus containing millions of documents from authoritative sources such as PubMed [34], UpToDate, Merck Manuals, and MedlinePlus. We apply data cleaning, deduplication, and quality control techniques to ensure the fine-tuning data is relevant, reliable, and representative of the target health domains. To build the knowledge base for retrieval, we integrate existing health ontologies and knowledge graphs, such as ICD-11 [35], SNOMED-CT, DrugBank, and UMLS. We also create custom knowledge bases by extracting structured information from semi-structured health content, such as clinical practice guidelines, drug package inserts, and patient FAQs. Knowledge entries are stored as subject-relation-object triples and indexed using efficient retrieval algorithms.

3.3 Model Training and Inference

The base language model is pre-trained on the general text corpus using self-supervised objectives, such as masked language modeling [36] or permutation language modeling [37]. Pre-training allows the model to learn generalizable language patterns and representations that can be transferred to downstream health tasks. The pre-trained model is then fine-tuned on the curated health corpus using supervised training objectives, such as next-token prediction or sequence-to-sequence translation. We experiment with various fine-tuning approaches, including continued pre-training on in-domain data, multi-task learning on related health tasks, and instruction-based fine-tuning using prompt templates. Fine-tuning adapts the model to the target health domain and improves its ability to generate relevant, accurate health content. We also explore techniques for safe, controllable generation, such as:

• Controlled decoding methods that constrain model outputs to align with specified attributes or styles

• Safety classifiers that filter or mask potentially unsafe or offensive content

• Reinforcement learning from human feedback to reward desirable behaviors and outputs

For model serving, we use a retrieval-augmented generation (RAG) approach that combines the strengths of the generative model and knowledge retrieval. Given a user prompt, the retriever first searches the knowledge base for relevant context, such as definitions of medical terms, clinical guidelines for mentioned conditions, or drug information for queried medications. The retrieved context is appended to the user prompt to create an augmented input for the generator. The generative model then produces a contextually appropriate response that is both personalized to the user's specific query and grounded in the retrieved medical knowledge [39]. The generated output can optionally be fed back into the retriever for additional fact-checking and refinement.

3.4 Evaluation Framework

We conduct extensive evaluations of our system using both automated metrics and human judgments. For automated evaluation, we measure the quality of generated outputs using standard language modeling metrics such as perplexity, BLEU [40], and ROUGE. We also assess the factual accuracy of outputs by cross-referencing them against ground-truth health information using textual entailment models or medical fact-checking APIs [41]. To understand our system's practical utility and usability, we carry out user studies with target stakeholders, including patients, caregivers, and healthcare providers. Study designs include controlled experiments comparing our system to existing baselines, longitudinal field studies examining user engagement and behavior change, and qualitative interviews probing user attitudes, needs, and concerns. Participants perform representative health-related tasks using our system, such as seeking information about specific conditions, interpreting lab results, or managing chronic illnesses. We collect both objective usage metrics (e.g., task completion time, error rate, interaction logs) and subjective user feedback through surveys and interviews. Experienced medical professionals also review a sample of generated outputs to rate their
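The subject-relation-object storage described in Section 3.2 can be illustrated with a minimal in-memory index. This is only a sketch of the idea: the class name, fields, and example drug-database entries below are our own illustrative assumptions, not the system's actual implementation.

```python
from collections import defaultdict

class TripleStore:
    """Minimal in-memory store for (subject, relation, object) health
    facts, indexed by subject and object for fast lookup (illustrative)."""

    def __init__(self):
        self.triples = []
        self.by_subject = defaultdict(list)
        self.by_object = defaultdict(list)

    def add(self, subj, rel, obj):
        idx = len(self.triples)
        self.triples.append((subj, rel, obj))
        self.by_subject[subj.lower()].append(idx)
        self.by_object[obj.lower()].append(idx)

    def lookup(self, term):
        """Return all triples whose subject or object matches the term."""
        ids = set(self.by_subject[term.lower()]) | set(self.by_object[term.lower()])
        return [self.triples[i] for i in sorted(ids)]

# Hypothetical entries extracted from a drug database
kb = TripleStore()
kb.add("metformin", "treats", "type 2 diabetes")
kb.add("metformin", "has_side_effect", "nausea")
kb.add("lisinopril", "treats", "hypertension")
```

A production system would back this with a disk-based index over millions of triples, but the lookup interface stays the same.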
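The retrieve-then-augment serving flow of Section 3.3 can be sketched end to end. The sketch substitutes a toy bag-of-words embedding for the system's learned embeddings; all function names and the sample knowledge entries are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a deployed system would use learned
    # sentence embeddings, but the retrieval logic is identical.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, knowledge, k=2):
    """Rank knowledge entries by embedding similarity to the query."""
    q = embed(query)
    return sorted(knowledge, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

def augment_prompt(query, knowledge, k=2):
    """Append retrieved context to the user prompt for the generator."""
    context = retrieve(query, knowledge, k)
    return "Context:\n" + "\n".join(f"- {c}" for c in context) + f"\n\nQuestion: {query}"

knowledge = [
    "Metformin is a first-line medication for type 2 diabetes.",
    "Hypertension is persistently elevated blood pressure.",
    "Ibuprofen is a nonsteroidal anti-inflammatory drug.",
]
prompt = augment_prompt("What does metformin treat?", knowledge, k=1)
```

The augmented `prompt` is what the generator actually sees, which is how retrieved guideline text ends up grounding the response.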
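Of the safe-generation techniques listed in Section 3.3, controlled decoding is the simplest to sketch: at each step, candidate scores are masked to an allowed set before selection. The dict-of-scores interface below is our assumption for illustration, not the paper's actual decoder API.

```python
def controlled_decode_step(scores, allowed):
    """One greedy decoding step restricted to an allowed vocabulary.
    scores:  dict mapping candidate token -> model score.
    allowed: set of tokens permitted by the control constraint."""
    candidates = {tok: s for tok, s in scores.items() if tok in allowed}
    if not candidates:
        raise ValueError("no candidate satisfies the constraint")
    return max(candidates, key=candidates.get)
```

Even when a disallowed token scores highest, the constraint forces the decoder onto the best permitted alternative.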
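Among the automated metrics in Section 3.4, perplexity is computable directly from the model's per-token log-probabilities. A minimal helper (ours, for illustration):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).
    token_logprobs: natural-log probabilities the model assigned
    to each token of a held-out sequence."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```

A model that assigns each token probability 1/4 has perplexity 4; lower values indicate the model fits the held-out health text more closely.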