My research interests include Natural Language Processing (NLP), Machine Learning (ML), and Cognitive Science, with an emphasis on advances and applications relevant to Educational Technology (such as Classroom Engagement Technology—Comprehension SEEDING, Intelligent Tutoring Systems, Question Generation, Fine-Grained Response Assessment, and tools to foster Computational Thinking skills), Health & Clinical Informatics (Clinical Question Answering, Clinical Data Mining), and the confluence of Educational Technology with Health and Wellbeing Technology (Perceptive Emotive Spoken-Dialogue Companion Robots—Companionbots). A common thread running through all of my work is advancing learning, both human and machine.
The advancement of NLP and ML is central to my research. I am interested in both ML theory and application. My past research includes methods to improve class probability estimates, and I am extending this work to advance interactive, semi-supervised, and active learning from large unlabeled corpora.
One of the key open questions in many applications of ML, and particularly in NLP applications, is how to learn effectively from the vast quantities of unlabeled data available from high-bandwidth input streams and from massive data sources such as the web. This breaks down into two broad research questions that I am investigating: learning from massive datasets (big data) and learning from unlabeled data.
Other advances in NLP and ML algorithms that I am pursuing include a new unsupervised soft-clustering algorithm and interactive (human-in-the-loop) learning algorithms. I believe all of these ideas have the potential to facilitate significant advances in the NLP required for spoken-dialogue companionable robots, clinical informatics, educational technology, end-user software engineering, and other applications.
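To make the contrast with hard clustering concrete, the sketch below computes fuzzy-c-means-style soft memberships, in which every point receives a weight for every cluster rather than a single label. This is a generic textbook computation, not the new algorithm mentioned above; the fuzzifier value and Euclidean distance are assumptions for illustration.

```python
import math

def soft_assign(points, centers, m=2.0):
    """Fuzzy c-means style soft assignment: each point gets a membership
    weight for every cluster (weights sum to 1), rather than a single
    hard label. The fuzzifier m > 1 controls how soft the assignment is."""
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0 for d in dists):
            # A point sitting exactly on a center belongs fully to it.
            memberships.append([1.0 if d == 0 else 0.0 for d in dists])
            continue
        row = []
        for d_i in dists:
            # Standard FCM membership: inversely related to distance,
            # normalized across all clusters.
            denom = sum((d_i / d_j) ** (2 / (m - 1)) for d_j in dists)
            row.append(1.0 / denom)
        memberships.append(row)
    return memberships
```

Points near a center receive a membership close to 1 for that cluster, while points between centers receive split weights, which is exactly the information a hard assignment discards.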
My primary research focus is on computational semantics and pragmatics models to facilitate machine understanding of text and spoken dialogue. This includes generating semantic representations (semantic facets, concept relations, predicate argument structure, discourse relations, etc.), extracting lexical and conceptual relations from distributional statistics of large corpora, and recognizing presupposition, implicature and entailment.
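As a small concrete example of extracting lexical relations from distributional statistics, the following computes pointwise mutual information (PMI) over word co-occurrence counts; the whitespace tokenization and window size are illustrative assumptions, not the method used in my research.

```python
import math
from collections import Counter

def pmi_scores(sentences, window=2):
    """Pointwise mutual information over co-occurrence counts: a basic
    distributional statistic for surfacing lexical associations.
    PMI(w, v) = log( p(w, v) / (p(w) * p(v)) )."""
    word_counts = Counter()
    pair_counts = Counter()
    total = 0
    for sent in sentences:
        toks = sent.lower().split()
        word_counts.update(toks)
        total += len(toks)
        for i, w in enumerate(toks):
            # Count co-occurrences within a small forward window.
            for v in toks[i + 1 : i + 1 + window]:
                pair_counts[tuple(sorted((w, v)))] += 1
    scores = {}
    for (w, v), c in pair_counts.items():
        p_wv = c / total
        p_w = word_counts[w] / total
        p_v = word_counts[v] / total
        scores[(w, v)] = math.log(p_wv / (p_w * p_v))
    return scores
```

Word pairs that co-occur more often than their individual frequencies would predict receive positive PMI, a simple signal that the pair may stand in a lexical or conceptual relation.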
In the following sections, I describe some of the applied research where collaborators and I aim to utilize this basic research.
Collaborators and I are conducting research to help instructors assess student knowledge and skills in real time (Comprehension SEEDING; Nielsen PI, IES $1.83M 2011-2016 with ASU and UCD). Students submit free-text responses to instructors' open-ended questions via mobile devices to an NLP system that clusters the answers and provides the instructor with feedback on, among other things, the types of misconceptions present and their frequency. Unlike clicker technology, students must articulate their understanding of a concept, which numerous cognitive science researchers have shown to be key to deep learning. This research will benefit from aspects of my Ph.D. work, which was the first research to successfully assess elementary students' one- to two-sentence constructed-response answers.
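A heavily simplified sketch of the response-clustering step might look like the following greedy bag-of-words grouping; the actual SEEDING system's NLP is far more sophisticated, and the similarity measure and threshold here are assumptions chosen only for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_responses(responses, threshold=0.5):
    """Greedy single-pass clustering of free-text answers: each response
    joins the most similar existing cluster, or starts a new one if no
    cluster clears the similarity threshold."""
    clusters = []  # list of (centroid Counter, member responses)
    for text in responses:
        vec = Counter(text.lower().split())
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c[0])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append((vec, [text]))
        else:
            best[0].update(vec)  # fold response into cluster centroid
            best[1].append(text)
    return [members for _, members in clusters]
```

Each resulting group would then be summarized for the instructor, e.g., by its most frequent terms or a representative response.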
In the context of a known reference answer to a tutor's question, I extract a knowledge representation of the fine-grained facets of the reference answer and classify each according to whether the student's response implies that they understood the facet, contradicted it, left it unaddressed, or expressed something related that is perhaps a misconception. The goal of this fine-grained analysis, classifying more precisely the student's apparent understanding of detailed facets, is to facilitate improved pedagogical dialogue and eventually Socratic tutoring. To that end, I am also researching automatic question generation and question answering.
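The four-way facet labels can be illustrated with a deliberately naive keyword matcher; the label scheme follows the text above, but the matching logic is a toy stand-in for the actual classifier, and the facet representation (a bag of key terms) is an assumption.

```python
NEGATIONS = {"not", "no", "never", "n't"}

def classify_facet(facet_terms, student_answer):
    """Toy facet-level assessment. Labels follow the four-way scheme:
    'understood' / 'contradicted' / 'related' (possible misconception)
    / 'unaddressed'. The matching is naive keyword overlap, a stand-in
    for the real inference-based classifier."""
    tokens = set(student_answer.lower().split())
    hits = sum(1 for t in facet_terms if t in tokens)
    if hits == len(facet_terms):
        # Full facet expressed; a negation flips it to a contradiction.
        return "contradicted" if tokens & NEGATIONS else "understood"
    if hits > 0:
        return "related"  # partial match: perhaps a misconception
    return "unaddressed"
```

A real system must of course handle paraphrase, morphology, and scope of negation, which is precisely where the computational semantics research above comes in.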
To support this work, I had a corpus annotated to indicate elementary school students' apparent understanding of a broad spectrum of science concepts. This corpus, comprising 15,357 student responses and 142,451 facet annotations for questions from 16 different science areas, can be downloaded from my Resources page.
I am also applying my NLP research and software engineering experience to the automatic extraction of semantic representations of software requirements. Specifically, on one project, we are extracting access control policies from software requirements documentation, and on another, we are identifying components, states, and their transitions within the natural language in requirements documents. My goal is to eventually develop tools that can interact naturally with end-users to develop custom software, and in the education setting, to teach computational thinking skills to students outside of computer science.
In work with Harvard Medical School and Mayo Clinic (MiPACQ; Guergana Savova PI, NIH ARRA $1M 2009-2011), we researched the use of statistical computational semantics in clinical question answering (CQA) and achieved state-of-the-art results. Specifically, we annotated a large corpus of clinical notes, biomedical encyclopedic text, and clinical questions with syntactic and semantic information, including the semantic relations between predicates and their arguments, Unified Medical Language System (UMLS) entities and relations, and expected answer types, and we trained classifiers to automatically parse and annotate questions and text with this information. Then, given a question, we used information retrieval tools to find relevant medical articles or clinical notes that might contain the answer, automatically annotated the question and potential answers, and extracted syntactic and semantic features from these annotations. Finally, we used a machine-learned re-ranker to identify the paragraph-level results most likely to answer the question.
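The final re-ranking step could be sketched as a linear model over candidate-paragraph features. The two features and the weights below are illustrative placeholders for the much richer syntactic and semantic features (predicate-argument structure, UMLS types, expected answer types) the system actually used.

```python
def rerank(question, candidates, weights):
    """Re-rank candidate answer paragraphs with a linear model over
    simple features: question-term overlap and a mild length penalty.
    Both features and the weight values are illustrative assumptions."""
    q_terms = set(question.lower().split())

    def score(paragraph):
        p_terms = set(paragraph.lower().split())
        overlap = len(q_terms & p_terms) / max(len(q_terms), 1)
        # Prefer paragraphs near a "typical" answer length of ~30 words.
        length_pen = 1.0 / (1.0 + abs(len(p_terms) - 30) / 30)
        feats = {"overlap": overlap, "length": length_pen}
        return sum(weights[f] * v for f, v in feats.items())

    return sorted(candidates, key=score, reverse=True)
```

In the real pipeline the weights are learned from annotated question-answer pairs rather than set by hand.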
Within the same framework described above, we also investigated NLP and ML techniques for research cohort identification, that is, identifying patients eligible to participate in a given clinical trial, based largely on information extracted from unstructured text in the notes of electronic medical records. I have also investigated NLP methods for identifying patients who have depression or suicidal ideation, and those who have heart disease. We are currently classifying the severity of a variety of mental health conditions.
The number of people over 65 in the U.S. will more than double in roughly the first quarter of this century. Many of these older adults would prefer to maintain their independence and remain in their homes, and many suffer from depression. Collaborators and I are researching means of supporting these seniors via emotive spoken-dialogue companion robots (Companionbots; Nielsen PI, NSF $1.96M total 2011-2016; UNT, CU, DU, UCD Anschutz Medical Campus, and Boulder Language Technologies). The focus of the research is on dialoguing, especially generating and answering questions, in the context of providing education and training related to depression, monitoring participants for signs of physical, mental, or emotional deterioration, and being a companion. The NLP will capitalize on multimodal input and output, be heavily context dependent, and be tightly integrated with a user model and history. The ML will emphasize co-training on multimodal input and user-assisted semi-supervised learning from massive datasets and data streams. Future work will include massive-scale data mining over the information collected by the Companionbots.
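Co-training, mentioned above for the ML component, pairs two learners that each see a different "view" of the same example (e.g., audio vs. video) and label confident unlabeled examples for one another. A minimal generic loop in the spirit of Blum and Mitchell's formulation might look like the following, where `fit_a`/`fit_b` are assumed to return a model mapping a view to a (label, confidence) pair; all names and the pool-growing schedule are illustrative assumptions.

```python
def co_train(labeled, unlabeled, fit_a, fit_b, rounds=3, per_round=1):
    """Minimal co-training loop. Each example is a pair of views
    (xa, xb); `labeled` holds ((xa, xb), label) pairs. In every round,
    each view's model labels its most confident unlabeled examples and
    adds them to the shared labeled pool."""
    pool = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        model_a = fit_a([(xa, y) for (xa, xb), y in pool])
        model_b = fit_b([(xb, y) for (xa, xb), y in pool])
        if not unlabeled:
            break
        for model, view in ((model_a, 0), (model_b, 1)):
            preds = [(model(x[view]), x) for x in unlabeled]
            preds.sort(key=lambda t: -t[0][1])  # most confident first
            for (label, conf), x in preds[:per_round]:
                pool.append((x, label))
                unlabeled.remove(x)
    return pool
```

The key assumption behind co-training is that the two views are individually sufficient and conditionally independent, so each model's confident predictions act as informative labels for the other.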
I benefit from a dual Ph.D. in Computer Science and Cognitive Science with studies in psycholinguistics and human learning theory. I incorporate findings from these areas throughout my research in computational semantics and pragmatics, machine learning, educational technologies, and behavior-change, health, and wellbeing technologies. One project where many of these topics intersect is I Spy. In this project, our systems learn about language in the context of an interactive guessing game, based on multi-modal input (vision, language, and kinematics). The aim is for a robot to learn to ground language in its visual correlates, to ask natural questions about its environment, and to incorporate what it learns from the human's response into its understanding of the world, resulting in increasingly improved human-robot interactions.
In summary, the core of my research involves advancing NLP and ML algorithms and methods to facilitate computational semantics and pragmatics. I emphasize learning from massive unlabeled data sources and primarily focus on applications related to education, health and wellbeing.