July 29, 2025 · 8 min read
Grounding PAiChat in Reality: Building a Reliable Dataset for Histopathology Assistants


Bhavanik Kanani
AI Master's Student
Digital Pathology · PAiChat · Visual Language Assistant

Recent advancements in visual language models have shown promising results even when trained on relatively small publicly available datasets. For example, models like LLaVA-Med have demonstrated encouraging capabilities in medical image understanding, despite limited training data and minimal domain supervision [1].

While these early findings are exciting, applying such models to specialized fields like histopathology presents greater challenges. These tasks demand domain-specific expertise and suffer from a scarcity of large, diverse, high-quality datasets. Without such datasets, models are more likely to produce hallucinated outputs, misinterpret clinical details, or fail to generalize to real diagnostic scenarios.

At PAICON, we are addressing this challenge by building a domain-specific assistant called PAiChat, trained on expert-level medical content. This blog explains how we prepared a large and diverse dataset of histopathology images paired with text descriptions, including diagnostic explanations, structured responses, multiple-choice questions, summarised findings, and Q&A pairs. These were collected and aligned from biomedical articles, online forums, open medical sources, and educational videos, helping the model learn to understand and respond more like a pathologist.

The Problem Statement

Training a reliable assistant for histopathology demands a large and diverse visual instruction dataset - one that teaches the model how to answer questions, explain findings, and reason like a pathologist. This kind of dataset should include not only image-text pairs, but also:

  • Diagnostic explanations

  • Structured conversations

  • Multiple-choice questions (MCQs)

  • Visual question-answer (QnA) pairs

Such instruction data helps a model move beyond captioning and into real diagnostic dialogue. However, building this type of dataset is currently limited by several challenges in the available data landscape.


  1. Lack of large and diverse training data for histopathology

Current medical VLMs are trained on relatively small instruction-tuning datasets. For example, LLaVA-Med was trained on fewer than 60,000 image–conversation pairs in total, of which only 17K were histopathology-specific [1]. More recent efforts like PA-LLAVA (35K) [2], Quilt-LLaVA (107K) [3], and PathGen (200K) [4] have scaled up visual instruction tuning. Nonetheless, these datasets still lack the granularity and pathology-specific depth needed to generalize across diagnostic settings, magnification levels, or disease subtypes.


  2. Limited language capabilities in histopathology-specific models

Models like Conch and PLIP are trained on over 1 million histopathology image-caption pairs and perform well in tasks like image-text retrieval or image summarization. However, their language output is typically limited to short captions or patch-level overviews. These models are not designed to generate structured answers, handle medical dialogues, or follow diagnostic reasoning steps, which are essential for instruction-style supervision [5, 6].


  3. Existing datasets lack diagnostic structure or contain noisy instructions

Popular open datasets such as LAION-5B, PMC-OA, and Quilt contain hundreds of thousands to billions of image-text pairs but are not optimized for clinical training. Captions are often generic (e.g., “H&E stained image at 40x magnification”) or include metadata such as age or sex that cannot be inferred from the image. They rarely contain patch-level descriptions of cell morphology, disease features, or structured Q&A. This leads to poor supervision quality and risks misleading the model [7, 8, 9, 10].


  4. Lack of annotations across slide magnification levels

Pathologists routinely examine whole-slide images (WSIs) at varying magnifications (e.g., 5×, 10×, 20×, 40×) to assess both tissue structure and cellular detail. This multiscale analysis helps identify features such as abnormal cell shapes, mitotic activity, disorganized architecture, and variations in staining intensity that indicate disease. However, most datasets lack annotations tied to specific magnification levels, limiting a model’s ability to learn how diagnostic focus shifts across scales and weakening clinical relevance.
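For context, magnification-aware patches can be derived directly from the WSI pyramid itself. The sketch below, which assumes the openslide-python library and uses placeholder values for the slide path, region coordinates, and patch size, reads the same region at every pyramid level so each patch could be labelled with its approximate magnification. It illustrates the idea rather than any specific production pipeline.

```python
# Minimal sketch (assumptions: openslide-python is installed; slide path, region
# coordinates, and patch size are placeholders). Reads the same region at every
# pyramid level so each patch can be tagged with an approximate magnification.
import openslide

SLIDE_PATH = "example_slide.svs"       # placeholder whole-slide image
PATCH_SIZE = 512                       # patch edge length in pixels at each level
REGION_L0 = (20_000, 18_000)           # top-left corner in level-0 coordinates (placeholder)

slide = openslide.OpenSlide(SLIDE_PATH)
# Objective power of the scan (e.g. 40x); fall back to 40 if the property is missing.
base_power = float(slide.properties.get(openslide.PROPERTY_NAME_OBJECTIVE_POWER, 40))

for level in range(slide.level_count):
    downsample = slide.level_downsamples[level]
    approx_mag = base_power / downsample                 # e.g. 40x, 10x, 2.5x ...
    patch = slide.read_region(REGION_L0, level, (PATCH_SIZE, PATCH_SIZE)).convert("RGB")
    patch.save(f"patch_level{level}_{approx_mag:.1f}x.png")
```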

How We Tackled This at PAICON

To build a reliable visual assistant for histopathology, our goal at PAICON was to create a large, diverse, and instruction-rich dataset that reflects how real pathologists observe, interpret, and discuss medical slides. We focused on collecting data that includes not just captions, but also diagnostic explanations, Q&A pairs, structured responses, multiple-choice questions, and summarised findings, all tied to visual inputs.
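To make the target format concrete, the snippet below shows one possible JSON-style layout for a single visual instruction sample. All field names and values are illustrative placeholders, not the actual PAiChat schema.

```python
# Illustrative only: one possible layout for a visual instruction sample.
# All field names and values are hypothetical, not the actual PAiChat schema.
sample = {
    "image": "panels/case_0421_he_40x.png",      # placeholder image path
    "magnification": "40x",
    "conversations": [
        {"from": "human", "value": "Describe the cellular features visible in this H&E patch."},
        {"from": "assistant", "value": "The patch shows pleomorphic nuclei with prominent nucleoli ..."},
    ],
    "mcq": {
        "question": "Which feature in this image is most suggestive of malignancy?",
        "options": ["Uniform nuclei", "High mitotic activity",
                    "Intact glandular architecture", "Pale eosinophilic staining"],
        "answer": "High mitotic activity",
    },
    "summary": "High-grade tumour region with brisk mitotic activity.",
}
```

Pairing each image with several instruction types (dialogue, MCQ, summary) is what lets a single patch supervise diagnostic reasoning rather than captioning alone.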

We sourced this data from a wide range of platforms:

  • PubMed Central Open Access (PMC-OA) biomedical articles

  • Pathologist forums used for second opinions and case discussions

  • Educational YouTube videos with expert narration

  • Public datasets like QUILT, LAION, and Multi-Conversation

All data was passed through a multi-stage filtering and structuring pipeline to ensure clinical relevance, proper alignment, and diagnostic value.

PMC-OA Data Curation Pipeline

To build high-quality visual-language instruction data from PMC-OA, we created a tailored pipeline focused on histopathology figures.

  • Keyword and journal filtering: We selected articles related to pathology terms like “histology,” “biopsy,” “H&E,” and prioritized oncology and diagnostic pathology journals.

  • Figure and subpanel extraction: We used YOLOv11 to detect sub-panels within multi-panel figures and crop them into individual sub-images [11].

  • Image-text alignment: Unlike CLIP-based methods that score image-caption similarity, we assigned unique IDs during extraction, ensuring 100% alignment accuracy between images and full captions at the article level [9].

  • Caption splitting: We used a GPT-based model to break down long captions into panel-specific descriptions and reformat them into Q&A and instructional forms (see the simplified sketch after this list).

  • Cleaning and filtering: We removed irrelevant text (e.g., citations, abbreviations) and kept only clinically useful content. Low-quality or off-topic pairs were filtered out based on length and keyword relevance.
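The sketch below illustrates the general shape of the sub-panel extraction, ID-based alignment, and caption-splitting steps. It makes several assumptions: "panel_detector.pt" is a placeholder for a fine-tuned panel-detection model, detection goes through the Ultralytics YOLO interface [11], and an OpenAI-style chat endpoint (model name illustrative) stands in for the GPT-based caption splitter. It is a simplified illustration, not the production pipeline.

```python
# Simplified illustration of the figure-processing steps, not the production code.
# Assumptions: "panel_detector.pt" is a placeholder for a fine-tuned panel-detection
# model, detection uses the Ultralytics YOLO interface, and an OpenAI-style chat
# endpoint (model name illustrative) stands in for the GPT-based caption splitter.
from pathlib import Path
from PIL import Image
from ultralytics import YOLO
from openai import OpenAI

detector = YOLO("panel_detector.pt")   # placeholder weights for sub-panel detection
client = OpenAI()                      # expects OPENAI_API_KEY in the environment
Path("panels").mkdir(exist_ok=True)

def extract_subpanels(figure_path: str, article_id: str, figure_id: str) -> list[dict]:
    """Crop each detected sub-panel and tag it with a unique ID used for alignment."""
    image = Image.open(figure_path).convert("RGB")
    result = detector(figure_path)[0]
    panels = []
    for i, box in enumerate(result.boxes.xyxy.tolist()):
        x1, y1, x2, y2 = map(int, box)
        panel_id = f"{article_id}_{figure_id}_panel{i}"   # same ID links crop and caption
        out_path = f"panels/{panel_id}.png"
        image.crop((x1, y1, x2, y2)).save(out_path)
        panels.append({"panel_id": panel_id, "path": out_path})
    return panels

def split_caption(full_caption: str, n_panels: int) -> list[str]:
    """Ask a GPT-style model to split a multi-panel caption into per-panel descriptions."""
    prompt = (
        f"The following figure caption describes {n_panels} sub-panels. "
        f"Return one concise description per panel, one per line:\n\n{full_caption}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",           # assumed model name, illustrative only
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().splitlines()
```

Keeping the article and figure IDs attached to every crop is what makes the downstream image-caption alignment exact rather than similarity-based.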

Other Sources: A Shared Curation Logic

We followed a similar structure across all data sources with a few adaptations:

  • Online forums: Since these already contain well-described cases, we skipped cropping and alignment. Instead, we focused on cleaning the language, formatting Q&A, and removing irrelevant replies.

  • Existing datasets: For sources like QUILT, LAION, and Multi-Conversation, we filtered out non-histopathology samples using keywords and removed hallucinated or generic captions (e.g., unrelated mentions of age or imaging modality) [7, 10].

At every stage, we applied filtering rules to eliminate non-relevant modalities such as CT, MRI, and ultrasound, as well as animal model studies, ensuring we retained only human histology images with diagnostic value. Most of the data we used was already anonymized due to its origin from indirect sources and open-source licensing. For other sources, we ensured that appropriate anonymization was applied before processing.
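A minimal sketch of this kind of keyword-based relevance filter is shown below; the term lists are illustrative and far shorter than the rules actually applied.

```python
# Minimal sketch of a keyword-based relevance filter; the term lists here are
# illustrative and far shorter than the rules actually applied.
EXCLUDE_TERMS = {"ct scan", "mri", "ultrasound", "x-ray", "mouse", "murine", "rat model"}
REQUIRE_TERMS = {"h&e", "histology", "histopathology", "biopsy", "stain"}

def keep_sample(caption: str) -> bool:
    """Keep only captions that look like human histopathology content."""
    text = caption.lower()
    if any(term in text for term in EXCLUDE_TERMS):
        return False        # drop other imaging modalities and animal-model studies
    return any(term in text for term in REQUIRE_TERMS)

# Example: keep_sample("H&E stained biopsy showing tumour infiltration")  -> True
#          keep_sample("Axial CT scan of the thorax")                     -> False
```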

A VLM is only as good as the data it learns from. Our multi-source, instruction-focused approach enables PAiChat to learn not only what is in a medical slide, but also how a pathologist would describe it, question it, and reason through it.

Practical Application at PAICON

How This Enabled PAiChat

The curated dataset became the foundation of PAiChat, a visual language assistant fine-tuned to:

  • Understand diagnostic visual features in histopathology images

  • Respond with grounded medical reasoning instead of relying on generic language patterns

  • Assist pathologists on online medical forums, supporting faster and more accurate decision-making

Conclusion

While large language models are powerful, their success in medicine depends on grounding, and that grounding starts with high-quality data. At PAICON, we focused on building a carefully curated and medically relevant dataset to train PAiChat, our visual-language assistant for histopathology.

This work lays the foundation for a tool that not only sees what pathologists see but also understands the language they use. PAiChat is designed to support second opinions, accelerate diagnostic workflows, and build trust in AI-assisted clinical decision-making.

Want to explore PAiChat in depth?

In an earlier blog, we shared how we developed PAiChat, our visual-language assistant built specifically for pathologists, and what sets it apart in supporting clinical diagnostics.

Discover PAiChat

References

  1. Liu, X., Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., & Gao, J. (2024). LLaVA-Med: Training a Strong Medical Visual Language Assistant with Limited Data. arXiv preprint arXiv:2402.00838. https://arxiv.org/abs/2402.00838

  2. Dai, D., Zhang, Y., Xu, L., Yang, Q., Shen, X., Xia, S., & Wang, G. (2024). PA-LLAVA: A Large Language-Vision Assistant for Human Pathology Image Understanding. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 3138–3143). https://doi.org/10.1109/BIBM62325.2024.10821785

  3. Seyfioglu, M. S., Ikezogwo, W. O., Ghezloo, F., Krishna, R., & Shapiro, L. (2023). Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos. arXiv preprint arXiv:2312.00000. https://arxiv.org/abs/2312.00000

  4. Sun, Y., Zhang, Y., Si, Y., Zhu, C., Shui, Z., Zhang, K., Li, J., Lyu, X., Lin, T., & Yang, L. (2024). PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration. arXiv preprint arXiv:2407.00203. https://arxiv.org/abs/2407.00203

  5. Zhao, C., Zhu, C., Lin, H., et al. (2023). Conch: A Foundation Model for Pathology. Nature Medicine, 29, 1771–1781. https://www.nature.com/articles/s41591-023-02504-3

  6. Zhang, Y., Zhang, S., Xie, X., et al. (2023). PLIP: Pathology Language-Image Pretraining for Visual Recognition. arXiv preprint arXiv:2307.12914. https://arxiv.org/pdf/2307.12914

  7. Schuhmann, C., Beaumont, R., Vencu, R., et al. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402. https://arxiv.org/pdf/2210.08402

  8. Xiong, A. (2023). PMC-OA: Open biomedical image-text dataset. Hugging Face. https://huggingface.co/datasets/axiong/pmc_oa

  9. Abid, A., Ren, S., Krishnan, R., et al. (2023). PMC-CLIP: Contrastive pretraining on biomedical literature. arXiv preprint arXiv:2303.07240. https://arxiv.org/abs/2303.07240

  10. Biswas, T., Ghosal, S., & Panda, R. (2023). Quilt-1M: A multimodal dataset for pretraining biomedical vision-language models. arXiv preprint arXiv:2306.11207. https://arxiv.org/abs/2306.11207

  11. Jocher, G., Qiu, J., & Chaurasia, A. (2023). Ultralytics YOLO (Version 8.0.0) [Computer software]. https://github.com/ultralytics/ultralytics
