Publications

Integration of background knowledge for automatic detection of inconsistencies in gene ontology annotation

Published in Bioinformatics, 2024

Motivation: Biological background knowledge plays an important role in the manual quality assurance (QA) of biological database records. One such QA task is the detection of inconsistencies in literature-based Gene Ontology Annotation (GOA). This manual verification ensures the accuracy of the GO annotations based on a comprehensive review of the literature used as evidence, Gene Ontology (GO) terms, and annotated genes in GOA records. While automatic approaches for the detection of semantic inconsistencies in GOA have been developed, they operate within predetermined contexts, lacking the ability to leverage broader evidence, especially relevant domain-specific background knowledge. This paper investigates various types of background knowledge that could improve the detection of prevalent inconsistencies in GOA. In addition, the paper proposes several approaches to integrate background knowledge into the automatic GOA inconsistency detection process. Results: We have extended a previously developed GOA inconsistency dataset with several kinds of GOA-related background knowledge, including GeneRIF statements, biological concepts mentioned within evidence texts, GO hierarchy and existing GO annotations of the specific gene. We have proposed several effective approaches to integrate background knowledge as part of the automatic GOA inconsistency detection process. The proposed approaches can improve automatic detection of self-consistency and several of the most prevalent types of inconsistencies. This is the first study to explore the advantages of utilizing background knowledge and to propose a practical approach to incorporate knowledge in automatic GOA inconsistency detection. We establish a new benchmark for performance on this task. Our methods may be applicable to various tasks that involve incorporating biological background knowledge.

Recommended citation: Jiyu Chen, Benjamin Goudey, Nicholas Geard, Karin Verspoor, Integration of background knowledge for automatic detection of inconsistencies in gene ontology annotation, Bioinformatics, Volume 40, Issue Supplement_1, July 2024, Pages i390–i400 https://doi.org/10.1093/bioinformatics/btae246

Exploring Instructive Prompts for Large Language Models in the Extraction of Evidence for Supporting Assigned Suicidal Risk Levels

Published in Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024), pages 197–202, St. Julians, Malta. Association for Computational Linguistics, 2024

Monitoring and predicting the expression of suicidal risk in individuals’ social media posts is a central focus in clinical NLP. Yet existing approaches frequently lack the explainability component needed to extract evidence related to an individual’s mental health state. We describe the CSIRO Data61 team’s evidence extraction system submitted to the CLPsych 2024 shared task. The task investigates the zero-shot capabilities of open-source LLMs in extracting evidence regarding an individual’s assigned suicide risk level from social media discourse. The results are assessed against ground truth evidence annotated by psychological experts, with an achieved recall-oriented BERTScore of 0.919. Our findings suggest that using LLMs to extract information supporting the evaluation of suicidal risk in social media discourse is highly feasible. Opportunities for refinement exist, notably in crafting concise and effective instructions to guide the extraction process.

Recommended citation: Jiyu Chen, Vincent Nguyen, Xiang Dai, Diego Molla-Aliod, Cecile Paris, and Sarvnaz Karimi. 2024. Exploring Instructive Prompts for Large Language Models in the Extraction of Evidence for Supporting Assigned Suicidal Risk Levels. In Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024), pages 197–202, St. Julians, Malta. Association for Computational Linguistics. https://aclanthology.org/2024.clpsych-1.17.pdf

Exploring Automatic Inconsistency Detection for Literature-based Gene Ontology Annotation

Published in Bioinformatics, 2022

Motivation: Literature-based Gene Ontology Annotations (GOA) are biological database records that use controlled vocabulary to uniformly represent gene function information that is described in primary literature. Assurance of the quality of GOA is crucial for supporting biological research. However, a range of different kinds of inconsistencies between literature as evidence and annotated GO terms can be identified; these have not been systematically studied at record level. The existing manual-curation approach to GOA consistency assurance is inefficient and is unable to keep pace with the rate of updates to gene function knowledge. Automatic tools are therefore needed to assist with GOA consistency assurance. This paper presents an exploration of different GOA inconsistencies and an early feasibility study of automatic inconsistency detection. Results: We have created a reliable synthetic dataset to simulate four realistic types of GOA inconsistency in biological databases. Three automatic approaches are proposed. They provide reasonable performance on the task of distinguishing the four types of inconsistency and are directly applicable to detect inconsistencies in real-world GOA database records. Major challenges resulting from such inconsistencies in the context of several specific application settings are reported. Conclusion: This is the first study to introduce automatic approaches that are designed to address the challenges in current GOA quality assurance workflows.

Recommended citation: Jiyu Chen, Benjamin Goudey, Justin Zobel, Nicholas Geard, Karin Verspoor, Exploring automatic inconsistency detection for literature-based gene ontology annotation, Bioinformatics, Volume 38, Issue Supplement_1, July 2022, Pages i273–i281 https://doi.org/10.1093/bioinformatics/btac230

Automatic Consistency Assurance for Literature-based Gene Ontology Annotation

Published in BMC Bioinformatics 22:565, 2021

Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between GO annotations and the associated evidence texts identified by expert curators is reliable but time-consuming, and is infeasible in the context of rapidly growing biological literature. A key challenge is maintaining consistency of existing GO annotations as new studies are published and the GO vocabulary is updated. In this work, we introduce a formalisation of biological database annotation inconsistencies, identifying four distinct types of inconsistency. We propose a novel and efficient method using state-of-the-art text mining models to automatically distinguish between consistent GO annotation and the different types of inconsistent GO annotation. We evaluate this method using a synthetic dataset generated by directed manipulation of instances in an existing corpus, BC4GO. Two models built using our method for distinct annotation consistency identification tasks achieved high precision and were robust to updates in the GO vocabulary. We provide a detailed error analysis demonstrating that the method achieves high precision on more confident predictions. Our approach demonstrates clear value for human-in-the-loop curation scenarios.

A Bag-of-concepts Model Improves Relation Extraction in a Narrow Knowledge Domain with Limited Data

Published in Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Student Research Workshop (NAACL-HLT SRW), Minneapolis, 2019

This paper focuses on a traditional relation extraction task in the context of limited annotated data and a narrow knowledge domain. We explore this task with a clinical corpus consisting of 200 breast cancer follow-up treatment letters in which 16 distinct types of relations are annotated. We experiment with an approach to extracting typed relations called window-bounded co-occurrence (WBC), which uses an adjustable context window around entity mentions of a relevant type, and compare its performance with a more typical intra-sentential co-occurrence baseline. We further introduce a new bag-of-concepts (BoC) approach to feature engineering based on state-of-the-art word embeddings and word synonyms. We demonstrate the competitiveness of BoC by comparing with methods of higher complexity, and explore its effectiveness on this small dataset.
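The window-bounded co-occurrence idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `Mention` class, the `window_bounded_pairs` function, the token-gap definition of the window, and the example entity types and sentence are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    """Hypothetical entity mention; start/end are token offsets (end exclusive)."""
    text: str
    etype: str
    start: int
    end: int


def window_bounded_pairs(mentions, allowed_types, window):
    """Propose a relation candidate for every pair of mentions whose types form
    an allowed (head_type, tail_type) pair and that are separated by at most
    `window` tokens. The gap definition is an assumption of this sketch."""
    candidates = []
    ordered = sorted(mentions, key=lambda m: m.start)
    for left, right in combinations(ordered, 2):
        gap = max(0, right.start - left.end)  # tokens between the two mentions
        if gap > window:
            continue
        if (left.etype, right.etype) in allowed_types:
            candidates.append((left.text, right.text))
        elif (right.etype, left.etype) in allowed_types:
            candidates.append((right.text, left.text))
    return candidates


# Invented example: "Tamoxifen was ceased due to hot flushes ; continue Letrozole daily"
MENTIONS = [
    Mention("Tamoxifen", "Drug", 0, 1),
    Mention("hot flushes", "SideEffect", 5, 7),
    Mention("Letrozole", "Drug", 9, 10),
]
ALLOWED = {("Drug", "SideEffect")}

wide = window_bounded_pairs(MENTIONS, ALLOWED, window=4)
narrow = window_bounded_pairs(MENTIONS, ALLOWED, window=2)
```

In this toy example the wider window proposes both drugs as candidates for the side effect, while the narrower window keeps only the nearer mention. Which window size is appropriate is exactly the kind of trade-off an adjustable window exposes.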

Recommended citation: Jiyu Chen, Karin Verspoor, and Zenan Zhai. "A Bag-of-concepts Model Improves Relation Extraction in a Narrow Knowledge Domain with Limited Data." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. 2019. https://www.aclweb.org/anthology/N19-3007.pdf