Exploration of automatic relation extraction in a narrow knowledge domain with limited data
March 2018 – June 2019
March 2018 – June 2019
March 2018 – June 2018
February 2020 – August 2023
Published in Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Student Research Workshop (NAACL-HLT SRW), Minneapolis, 2019
This paper focuses on a traditional relation extraction task in the context of limited annotated data and a narrow knowledge domain. We explore this task with a clinical corpus consisting of 200 breast cancer follow-up treatment letters in which 16 distinct types of relations are annotated. We experiment with an approach to extracting typed relations called window-bounded co-occurrence (WBC), which uses an adjustable context window around entity mentions of a relevant type, and compare its performance with a more typical intra-sentential co-occurrence baseline. We further introduce a new bag-of-concepts (BoC) approach to feature engineering based on state-of-the-art word embeddings and word synonyms. We demonstrate the competitiveness of BoC by comparing with methods of higher complexity, and explore its effectiveness on this small dataset.
Recommended citation: Jiyu Chen, Karin Verspoor, and Zenan Zhai. "A Bag-of-concepts Model Improves Relation Extraction in a Narrow Knowledge Domain with Limited Data." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. 2019. https://www.aclweb.org/anthology/N19-3007.pdf
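As an illustration of the window-bounded co-occurrence idea mentioned in the abstract above, here is a minimal sketch rather than the paper's implementation. It assumes that the window is measured in tokens between two entity mentions and that candidate relations are formed by pairing mentions of two given entity types; the `Mention` class, the `wbc_candidates` function, and the "Drug"/"Dose" type names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Mention:
    text: str
    etype: str   # entity type label (hypothetical names, e.g. "Drug")
    start: int   # index of the first token of the mention
    end: int     # index one past the last token of the mention


def wbc_candidates(mentions, head_type, tail_type, window):
    """Pair each head_type mention with each tail_type mention that lies
    within `window` tokens of it (assumed definition of the window)."""
    heads = [m for m in mentions if m.etype == head_type]
    tails = [m for m in mentions if m.etype == tail_type]
    pairs = []
    for head, tail in product(heads, tails):
        # Number of tokens strictly between the two mentions;
        # 0 when they are adjacent or overlapping.
        gap = max(tail.start - head.end, head.start - tail.end, 0)
        if gap <= window:
            pairs.append((head, tail))
    return pairs


# Toy example with made-up mentions and token offsets.
example = [
    Mention("tamoxifen", "Drug", 0, 1),
    Mention("20 mg", "Dose", 1, 3),
    Mention("letrozole", "Drug", 10, 11),
]
close_pairs = wbc_candidates(example, "Drug", "Dose", window=3)
# close_pairs contains only the (tamoxifen, 20 mg) pair;
# letrozole is 7 tokens away from "20 mg" and falls outside the window.
```

Widening `window` trades precision for recall: with `window=7` the sketch would also pair "letrozole" with "20 mg". An intra-sentential baseline would instead pair mentions whenever they share a sentence, regardless of distance.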
Published in BMC Bioinformatics 22:565, 2021
Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between GO annotations and the associated evidence texts identified by expert curators is reliable but time-consuming, and is infeasible in the context of rapidly growing biological literature. A key challenge is maintaining consistency of existing GO annotations as new studies are published and the GO vocabulary is updated. In this work, we introduce a formalisation of biological database annotation inconsistencies, identifying four distinct types of inconsistency. We propose a novel and efficient method using state-of-the-art text mining models to automatically distinguish between consistent GO annotation and the different types of inconsistent GO annotation. We evaluate this method using a synthetic dataset generated by directed manipulation of instances in an existing corpus, BC4GO. Two models built using our method for distinct annotation consistency identification tasks achieved high precision and were robust to updates in the GO vocabulary. We provide a detailed error analysis demonstrating that the method achieves high precision on more confident predictions. Our approach demonstrates clear value for human-in-the-loop curation scenarios.
Published in Bioinformatics, 2022
Motivation: Literature-based Gene Ontology Annotations (GOA) are biological database records that use controlled vocabulary to uniformly represent gene function information that is described in primary literature. Assurance of the quality of GOA is crucial for supporting biological research. However, a range of different kinds of inconsistencies between the literature evidence and the annotated GO terms can be identified; these have not been systematically studied at the record level. The existing manual-curation approach to GOA consistency assurance is inefficient and is unable to keep pace with the rate of updates to gene function knowledge. Automatic tools are therefore needed to assist with GOA consistency assurance. This paper presents an exploration of different GOA inconsistencies and an early feasibility study of automatic inconsistency detection. Results: We have created a reliable synthetic dataset to simulate four realistic types of GOA inconsistency in biological databases. Three automatic approaches are proposed. They provide reasonable performance on the task of distinguishing the four types of inconsistency and are directly applicable to detect inconsistencies in real-world GOA database records. Major challenges resulting from such inconsistencies in the context of several specific application settings are reported. Conclusion: This is the first study to introduce automatic approaches that are designed to address the challenges in current GOA quality assurance workflows.
Recommended citation: Jiyu Chen, Benjamin Goudey, Justin Zobel, Nicholas Geard, Karin Verspoor, Exploring automatic inconsistency detection for literature-based gene ontology annotation, Bioinformatics, Volume 38, Issue Supplement_1, July 2022, Pages i273–i281 https://doi.org/10.1093/bioinformatics/btac230
Implementing an intelligent computer for language understanding and information extraction
Poster presentation: A bag-of-concepts model improves relation extraction in a narrow knowledge domain with limited data
Abstract presentation: Automatic Consistency Assurance for Literature-based Gene Ontology Annotation
Publication: Exploring Automatic Inconsistency Detection for Literature-based Gene Ontology Annotation
Internship, Chengdu Research Centre, Huawei Technologies Co., Ltd, 2016
Studied and deployed object-oriented design patterns; configured and monitored hardware status on Linux systems. My team focused on the implementation of SDH communication networks.
Master of Business Analytics Course
Help students gain familiarity with techniques for analyzing textual data. Help students develop an understanding of key algorithms used in NLP and learn to apply them in a diverse range of contexts including search engines, multilingual information retrieval, machine translation, text mining, question answering, summarization, and grammar correction.
COMP10001 Foundations of Computing, School of Computing and Information Systems, University of Melbourne, 2022
TA for the subject Foundations of Computing (COMP10001) at the University of Melbourne. Help students solve problems in areas that often require manipulating, analyzing, and visualizing data through computer programming. Teach students with little or no background in computer programming how to design and write basic programs in Python 3, and how to solve simple problems using these skills. The delivered content includes fundamental programming constructs; data structures; abstraction; basic program structures; algorithmic problem solving; testing and debugging; and an introduction to the Web, multimedia, and visualization.
Academic Research, CSIRO-Marsfield, 2023
NLP and social media analysis for mental well-being