Posts by Collection



A bag-of-concepts model improves relation extraction in a narrow knowledge domain with limited data.

Published in Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Student Research Workshop (NAACL-HLT SRW), Minneapolis, 2019

This paper focuses on a traditional relation extraction task in the context of limited annotated data and a narrow knowledge domain. We explore this task with a clinical corpus consisting of 200 breast cancer follow-up treatment letters in which 16 distinct types of relations are annotated. We experiment with an approach to extracting typed relations called window-bounded co-occurrence (WBC), which uses an adjustable context window around entity mentions of a relevant type, and compare its performance with a more typical intra-sentential co-occurrence baseline. We further introduce a new bag-of-concepts (BoC) approach to feature engineering based on the state-of-the-art word embeddings and word synonyms. We demonstrate the competitiveness of BoC by comparing with methods of higher complexity, and explore its effectiveness on this small dataset.

Recommended citation: Jiyu Chen, Karin Verspoor, and Zenan Zhai. "A Bag-of-concepts Model Improves Relation Extraction in a Narrow Knowledge Domain with Limited Data." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. 2019.
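
The window-bounded co-occurrence idea described in the abstract above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the use of token distance as the window measure, the entity types, the relation label, and the example mentions are all assumptions made for illustration.

```python
from itertools import product


def window_bounded_cooccurrence(mentions, type_pairs, window=5):
    """Propose typed relation candidates between nearby entity mentions.

    mentions:   list of (token_position, entity_type, text) tuples
    type_pairs: dict mapping (head_type, tail_type) -> relation label
    window:     maximum token distance allowed between two mentions
    """
    candidates = []
    for (i, type_a, text_a), (j, type_b, text_b) in product(mentions, repeat=2):
        if i == j:
            continue  # skip pairing a mention with itself
        relation = type_pairs.get((type_a, type_b))
        # Keep the pair only if its types are relevant and it fits the window.
        if relation is not None and abs(i - j) <= window:
            candidates.append((text_a, relation, text_b))
    return candidates


# Hypothetical mentions and relation schema, invented for this example.
mentions = [(0, "Drug", "tamoxifen"), (3, "Dose", "20mg"), (12, "Dose", "5mg")]
type_pairs = {("Drug", "Dose"): "has_dose"}

narrow = window_bounded_cooccurrence(mentions, type_pairs, window=5)
wide = window_bounded_cooccurrence(mentions, type_pairs, window=15)
```

With the narrow window only the nearby dose is paired with the drug; widening the window also admits the more distant mention, which is the trade-off an adjustable window exposes.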

Automatic Consistency Assurance for Literature-based Gene Ontology Annotation

Published in BMC Bioinformatics 22:565, 2021

Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between GO annotations and the associated evidence texts identified by expert curators is reliable but time-consuming, and is infeasible in the context of rapidly growing biological literature. A key challenge is maintaining consistency of existing GO annotations as new studies are published and the GO vocabulary is updated. In this work, we introduce a formalisation of biological database annotation inconsistencies, identifying four distinct types of inconsistency. We propose a novel and efficient method using state-of-the-art text mining models to automatically distinguish between consistent GO annotation and the different types of inconsistent GO annotation. We evaluate this method using a synthetic dataset generated by directed manipulation of instances in an existing corpus, BC4GO. Two models built using our method for distinct annotation consistency identification tasks achieved high precision and were robust to updates in the GO vocabulary. We provide detailed error analysis for demonstrating that the method achieves high precision on more confident predictions. Our approach demonstrates clear value for human-in-the-loop curation scenarios.


Exploring Automatic Inconsistency Detection for Literature-based Gene Ontology Annotation

Published in Bioinformatics, 2022

Motivation: Literature-based Gene Ontology Annotations (GOA) are biological database records that use controlled vocabulary to uniformly represent gene function information that is described in primary literature. Assurance of the quality of GOA is crucial for supporting biological research. However, a range of different kinds of inconsistencies between the literature cited as evidence and the annotated GO terms can be identified; these have not been systematically studied at record level. The existing manual-curation approach to GOA consistency assurance is inefficient and is unable to keep pace with the rate of updates to gene function knowledge. Automatic tools are therefore needed to assist with GOA consistency assurance. This paper presents an exploration of different GOA inconsistencies and an early feasibility study of automatic inconsistency detection. Results: We have created a reliable synthetic dataset to simulate four realistic types of GOA inconsistency in biological databases. Three automatic approaches are proposed. They provide reasonable performance on the task of distinguishing the four types of inconsistency and are directly applicable to detect inconsistencies in real-world GOA database records. Major challenges resulting from such inconsistencies in the context of several specific application settings are reported. Conclusion: This is the first study to introduce automatic approaches that are designed to address the challenges in current GOA quality assurance workflows.

Recommended citation: Jiyu Chen, Benjamin Goudey, Justin Zobel, Nicholas Geard, Karin Verspoor, Exploring automatic inconsistency detection for literature-based gene ontology annotation, Bioinformatics, Volume 38, Issue Supplement_1, July 2022, Pages i273–i281

Exploring Instructive Prompts for Large Language Models in the Extraction of Evidence for Supporting Assigned Suicidal Risk Levels

Published in Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024), pages 197–202, St. Julians, Malta. Association for Computational Linguistics, 2024

Monitoring and predicting the expression of suicidal risk in individuals’ social media posts is a central focus in clinical NLP. Yet, existing approaches frequently lack a crucial explainability component necessary for extracting evidence related to an individual’s mental health state. We describe the CSIRO Data61 team’s evidence extraction system submitted to the CLPsych 2024 shared task. The task aims to investigate the zero-shot capabilities of open-source LLMs in extracting evidence regarding an individual’s assigned suicide risk level from social media discourse. The results are assessed against ground truth evidence annotated by psychological experts, with an achieved recall-oriented BERTScore of 0.919. Our findings suggest that LLMs showcase strong feasibility in the extraction of information supporting the evaluation of suicidal risk in social media discourse. Opportunities for refinement exist, notably in crafting concise and effective instructions to guide the extraction process.

Recommended citation: Jiyu Chen, Vincent Nguyen, Xiang Dai, Diego Molla-Aliod, Cecile Paris, and Sarvnaz Karimi. 2024. Exploring Instructive Prompts for Large Language Models in the Extraction of Evidence for Supporting Assigned Suicidal Risk Levels. In Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024), pages 197–202, St. Julians, Malta. Association for Computational Linguistics.



Software Developer Intern

Internship, Chengdu Research Centre, Huawei Technologies Co., Ltd, 2016

Studied and applied object-oriented design patterns; configured and monitored hardware status on Linux systems. My team focused on the implementation of SDH communication networks.


Master of Business Analytics Course, Melbourne Business School, University of Melbourne, 2020

Help students gain familiarity with techniques for analyzing textual data. Help students develop an understanding of key algorithms used in NLP and learn to apply them in a diverse range of contexts including search engines, multilingual information retrieval, machine translation, text mining, question answering, summarization, and grammar correction.

Academic Tutor

COMP10001 Foundations of Computing, School of Computing and Information Systems, University of Melbourne, 2022

TA for the subject COMP10001 Foundations of Computing at the University of Melbourne. Help students solve problems in areas that often require manipulating, analyzing, and visualizing data through computer programming. Teach students with little or no background in computer programming how to design and write basic programs using Python 3, and to solve simple problems using these skills. The delivered content includes fundamental programming constructs; data structures; abstraction; basic program structures; algorithmic problem solving; testing and debugging; and an introduction to the Web, multimedia and visualization.