Page 162 - Kaleidoscope Academic Conference Proceedings 2024
2024 ITU Kaleidoscope Academic Conference
relevance, clarity, and clinical validity. In addition to
evaluating system performance, we place a strong emphasis
on safety and robustness under real-world conditions. This
includes:
• Stress testing on adversarial or edge-case inputs to
identify potential failure modes
• Bias and fairness audits to detect and mitigate
performance gaps across different user subgroups
• Security and privacy assessments to protect user
data and prevent misuse
• Expert oversight and approval from clinical and
ethics review boards
• Staged rollouts with continuous monitoring and
improvement based on user feedback

We engage closely with domain experts and stakeholders
throughout the development lifecycle to inform our research
roadmap, evaluation criteria, and deployment decisions. This
helps ensure our work remains grounded in real healthcare
needs and contexts.

4. RESULTS

Table 1 presents the results of automated evaluation on a
held-out test set of 1,000 user health queries. We report
perplexity, BLEU-4, and ROUGE-L scores to measure the
quality and fluency of generated responses, along with
factual accuracy assessed using a medical textual entailment
model.

Table 1 – Automated Evaluation Results
Metric            Score
Perplexity        8.5
BLEU-4            0.38
ROUGE-L           0.45
Factual Accuracy  92%

Our model achieves strong results, with low perplexity (8.5)
and high BLEU/ROUGE scores, indicating its generated text
is fluent, coherent, and closely matches reference responses.
The factual accuracy of 92% is also promising, showing the
knowledge retrieval augmentations help keep the model
grounded in reliable medical information.

Figure 2: Bar chart of average user ratings

To further validate practical usability, we conducted a user
study with 50 participants (40 patients, 10 providers) who
used the system for health queries and tasks over one week.
After each interaction, they rated the responses on several
key attributes using a 1-5 Likert scale.

Figure 2 shows the distribution of user ratings broken out by
patients and providers. Patients reported high overall
satisfaction, with relevance, usefulness, and clarity scores all
above 4.2 on average. Providers also viewed the system
positively, with medical accuracy rated 4.0 on average.
These results demonstrate our approach can provide valuable,
user-friendly support for real health needs.

Table 2 – Example User Queries and Generated Responses
Query: What are some lifestyle changes I can make to manage
my high blood pressure?
Response: To help manage your hypertension, consider the
following lifestyle changes: (1) Adopt a healthy diet rich in
fruits, vegetables, whole grains, and low-fat dairy, while
limiting saturated fats and sodium. The DASH diet is often
recommended for blood pressure control. (2) Engage in
regular physical activity, aiming for at least 150 minutes per
week of moderate-intensity aerobic exercise, such as brisk
walking. (3) Manage stress through relaxation techniques like
deep breathing, meditation, or yoga. (4) Limit alcohol
consumption and quit smoking, as both can raise blood
pressure. Be sure to work with your healthcare provider to
monitor your blood pressure and adjust your treatment plan
as needed.
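
For readers less familiar with the automated metrics in Table 1, the sketch below illustrates how two of them are commonly defined. It is not the evaluation code used in this work, which the paper does not describe. The function names (`perplexity`, `lcs_length`, `rouge_l_f1`) are hypothetical, whitespace tokenization is a simplifying assumption, and the balanced F1 form of ROUGE-L is assumed rather than confirmed by the text.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity as exp of the negative mean natural-log token probability.

    Assumes the caller supplies per-token log-probabilities from the model.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall.

    Assumption: lowercased whitespace tokens and equal weight on
    precision and recall.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, a model that assigns probability 0.5 to every token has a perplexity of 2, so lower values mean the model is less surprised by the reference text. ROUGE-L rewards a candidate for sharing a long in-order word sequence with the reference, even when the shared words are not contiguous.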