Paper Title
Evaluation of a Multimodal Custom Fine-Tuned LLM for Virtual Healthcare Consultations
Abstract
We present a modular, privacy-conscious prototype of a multimodal agent with retrieval-augmented generation (RAG) that serves as a virtual medical assistant in healthcare consultations. The system runs a locally deployed LLaMA 3.2 11B model with 4-bit quantization to reduce its memory footprint. The model accepts both images and text directly and was fine-tuned on 50,000 image-label pairs from the MedTrinity dataset, which covers a wide variety of medical image-text pairs, to improve multimodal question answering in medical contexts. The system supports text, image, and speech inputs; speech is transcribed via the AssemblyAI transcription API. For retrieval, ChromaDB stores semantically indexed medical documents sourced from the MedQuAD dataset, comprising 41,000 medicine-related question–answer pairs.
We evaluate the fine-tuned model against the base model, each with and without RAG support. Responses are assessed with an LLM-as-a-judge approach using OpenAI’s GPT-4.1, under both strict and non-strict evaluation, on the MMMU benchmark. From MMMU, we select the fields of basic medical science, clinical medicine, and diagnostic & laboratory medicine. Each field was evaluated with 30 questions per model variant, with and without RAG support.
Keywords - Multimodal, Retrieval-Augmented Generation (RAG), ChromaDB, LLaMA 3.2 11B, 4-bit Quantization, GPT-4.1
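The evaluation design described in the abstract can be laid out as a small grid. The sketch below is illustrative only: it assumes that each of the four configurations (base or fine-tuned model, with or without RAG) answers 30 questions in each of the three MMMU fields, and it does not model the strict and non-strict scoring, whose definitions are not given here. All names (`MODELS`, `RAG_SETTINGS`, `FIELDS`, `build_evaluation_grid`) are hypothetical, and the field strings are labels taken from the text rather than MMMU configuration names.

```python
from itertools import product

# Hypothetical labels for the two model variants compared in the abstract.
MODELS = ("base", "finetuned")
# Each variant is run without and with retrieval-augmented generation.
RAG_SETTINGS = (False, True)
# Field labels as worded in the abstract (not official dataset identifiers).
FIELDS = (
    "basic medical science",
    "clinical medicine",
    "diagnostic & laboratory medicine",
)
# Assumption: 30 questions per field for every (model, RAG) configuration.
QUESTIONS_PER_FIELD = 30


def build_evaluation_grid():
    """Return one dict per (model, RAG setting, field) cell of the design."""
    return [
        {"model": m, "rag": r, "field": f, "n_questions": QUESTIONS_PER_FIELD}
        for m, r, f in product(MODELS, RAG_SETTINGS, FIELDS)
    ]


grid = build_evaluation_grid()
total_responses = sum(cell["n_questions"] for cell in grid)
print(f"{len(grid)} cells, {total_responses} judged responses")
```

Under these assumptions the design has 12 cells and 360 model responses for the judge to score; if strict and non-strict scoring are applied as separate passes, the number of judge calls would be higher.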