
2024 ITU Kaleidoscope Academic Conference




relevance, clarity, and clinical validity. In addition to evaluating system performance, we place a strong emphasis on safety and robustness under real-world conditions. This includes:

    •   Stress testing on adversarial or edge-case inputs to identify potential failure modes

    •   Bias and fairness audits to detect and mitigate performance gaps across different user subgroups

    •   Security and privacy assessments to protect user data and prevent misuse

    •   Expert oversight and approval from clinical and ethics review boards

    •   Staged rollouts with continuous monitoring and improvement based on user feedback

We engage closely with domain experts and stakeholders throughout the development lifecycle to inform our research roadmap, evaluation criteria, and deployment decisions. This helps ensure our work remains grounded in real healthcare needs and contexts.

4. RESULTS

Table 1 presents the results of automated evaluation on a held-out test set of 1,000 user health queries. We report perplexity, BLEU-4, and ROUGE-L scores to measure the quality and fluency of generated responses, along with factual accuracy assessed using a medical textual entailment model.

Table 1 – Automated Evaluation Results

    Metric              Score
    Perplexity          8.5
    BLEU-4              0.38
    ROUGE-L             0.45
    Factual Accuracy    92%

Our model achieves strong results, with low perplexity (8.5) and high BLEU-4 (0.38) and ROUGE-L (0.45) scores, indicating that its generated text is fluent, coherent, and closely matches reference responses. The factual accuracy of 92% is also promising, suggesting that the knowledge retrieval augmentations help keep the model grounded in reliable medical information.

To further validate practical usability, we conducted a user study with 50 participants (40 patients, 10 providers) who used the system for health queries and tasks over one week. After each interaction, they rated the responses on several key attributes using a 1-5 Likert scale.

Figure 2 shows the average user ratings, broken out by patients and providers. Patients reported high overall satisfaction, with relevance, usefulness, and clarity scores all above 4.2 on average. Providers also viewed the system positively, with medical accuracy rated 4.0 on average. These results suggest that our approach can provide valuable, user-friendly support for real health needs.

Figure 2: Bar chart of average user ratings

Table 2 – Example User Queries and Generated Responses

    Query:      What are some lifestyle changes I can make to manage my high blood pressure?

    Response:   To help manage your hypertension, consider the following lifestyle changes: (1) Adopt a healthy diet rich in fruits, vegetables, whole grains, and low-fat dairy, while limiting saturated fats and sodium. The DASH diet is often recommended for blood pressure control. (2) Engage in regular physical activity, aiming for at least 150 minutes per week of moderate-intensity aerobic exercise, such as brisk walking. (3) Manage stress through relaxation techniques like deep breathing, meditation, or yoga. (4) Limit alcohol consumption and quit smoking, as both can raise blood pressure. Be sure to work with your healthcare provider to monitor your blood pressure and adjust your treatment plan as needed.
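
For readers less familiar with the automated metrics in Table 1, the sketch below illustrates how perplexity and ROUGE-L are conventionally defined. It is a minimal illustration and not the evaluation code used in this work. The function names are hypothetical, and several choices are assumptions: natural-log token probabilities, lowercased whitespace tokenization, and a balanced F1 variant of ROUGE-L.

```python
import math
from typing import List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity: exp of the mean negative natural-log probability per token.

    Assumes the language model has already produced one log-probability
    for each token of the response being scored.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as a balanced F1 over LCS-based precision and recall.

    Assumption: simple lowercased whitespace tokenization.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions, a lower perplexity means the model assigns higher probability to the observed tokens, and a ROUGE-L score closer to 1 means the generated response shares a longer in-order word sequence with the reference.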

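
The per-group averages summarized in Figure 2 can be obtained by grouping the 1-5 Likert ratings by participant type and attribute and taking the mean of each group. The following is a minimal sketch of that aggregation. The record layout and the `mean_ratings` helper are hypothetical, and the example ratings are made up for illustration; they are not data from the user study.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (participant group, attribute, score on the 1-5 scale)
Rating = Tuple[str, str, int]


def mean_ratings(ratings: Iterable[Rating]) -> Dict[Tuple[str, str], float]:
    """Average Likert score for each (group, attribute) pair."""
    buckets = defaultdict(list)
    for group, attribute, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of 1-5 range: {score}")
        buckets[(group, attribute)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Made-up ratings, for illustration only; NOT results from the study.
example = [
    ("patient", "clarity", 5),
    ("patient", "clarity", 4),
    ("provider", "medical accuracy", 4),
    ("provider", "medical accuracy", 3),
]
averages = mean_ratings(example)
```

Because the study has unequal group sizes (40 patients, 10 providers), reporting the averages separately per group, as above, avoids the patient ratings dominating a pooled mean.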









