- Analytics Meta Learning
- Biomedical Question Answering System
- Procedural Knowledge Discovery
- Heterogeneous Ranking for ArnetMiner
- Large-Scale Social & Text Mining
Analytics Meta Learning
tl;dr budgeted policy optimization in large configuration space for arbitrary information systems
Shortly after IBM Watson’s victory on Jeopardy! and the initiative to adapt and scale its technology to new domains, I joined the Open Advancement of Question Answering (OAQA) team at the Language Technologies Institute of Carnegie Mellon University, working with Prof. Eric Nyberg. My research topic was to design an intelligent software architecture that automatically meta-learns solutions to complex problems, as hard as question answering or decision support, in hours rather than 100+ person-years, by incorporating optimization strategies from reinforcement learning and budgeted multi-armed bandit problems.
- Defining the analytics meta learning (AML) problem, aiming to rapidly and automatically design, construct, and optimize any arbitrary complex intelligent information system, such as a question answering (QA) system or a decision support system, for any given analytics task, and proposing 10² (budgeted) policy optimization strategies for the analytics space exploration MDP. [paper on preliminary work] [thesis]
- Developing extensions to the Apache UIMA Framework – the CSE Framework [code] and AML Framework [code soon], and using the frameworks to build systems that beat the winning systems in real-world challenge problems, selected from over decillions (10³⁵) of component combinations and configurations.
- Designing and implementing a general-domain QA architecture framework – BaseQA [code].
- Proposing and formalizing a novel decision support architecture framework – QUADS, which extends existing question answering systems to understand natural language decision scenarios and automate human decision processes to make predictions of unknown facts. [paper]
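The budgeted exploration idea above can be illustrated with a minimal epsilon-greedy sketch: treat each pipeline configuration as an arm, spend a fixed evaluation budget, and keep the best-scoring configuration. This is a toy illustration with hypothetical component names, not the actual CSE/AML implementation or the thesis's policy set.

```python
import random
from itertools import product

def evaluate(config):
    """Stand-in for running a pipeline configuration on a dev set.
    In practice this would execute the full system and return a metric."""
    # deterministic toy score per configuration (hypothetical)
    return random.Random(hash(config) % (2**32)).random()

def budgeted_search(space, budget, epsilon=0.2, seed=0):
    """Epsilon-greedy search over a combinatorial configuration space."""
    rng = random.Random(seed)
    configs = [dict(zip(space, combo)) for combo in product(*space.values())]
    best, best_score = None, float("-inf")
    for _ in range(budget):
        if best is None or rng.random() < epsilon:
            cand = rng.choice(configs)      # explore: a random configuration
        else:
            cand = dict(best)               # exploit: mutate the incumbent
            key = rng.choice(list(space))
            cand[key] = rng.choice(space[key])
        score = evaluate(tuple(sorted(cand.items())))
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

# hypothetical component choices for a QA-like pipeline
space = {
    "retrieval": ["bm25", "indri"],
    "reranker": ["svm", "logreg", "none"],
    "answer_scorer": ["similarity", "learned"],
}
best, score = budgeted_search(space, budget=20)
```

The real problem is far harder: the space holds ~10³⁵ configurations, evaluations are expensive, and the thesis studies principled bandit/MDP policies rather than this naive mutation loop, but the arm/budget framing is the same.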
Biomedical Question Answering System
tl;dr a BioASQ Challenges winning biomedical question answering system beyond factoid questions
Lovely Trophies, circa 2017. Zi Yang
Healthcare is a priority area for delivering technological innovation that improves people’s lives. I was fortunate to have opportunities (a fellowship, an internship, and a technical consultancy) to work closely with the Roche Translational & Clinical Research Center, designing innovative software architectures and algorithms in NLP, IR, and ML, and creating various biomedical information systems. Most notably, our system won the exact question answering task in the 2015 and 2016 iterations of the Biomedical Semantic Indexing and Question Answering (BioASQ) Challenge, and has since served as the baseline system for subsequent competitions.
- Extending the BaseQA architecture to develop an open-source biomedical QA system – OAQA BioASQ system [code], where we developed learning modules to accomplish biomedical question answering subtasks, including document and passage ranking, answer type prediction, candidate answer extraction, candidate answer scoring, collective answer reranking, yes/no answer prediction, and answer text summarization.
- Leading development teams of students from multiple master’s programs (including MLT, MCDS, and MSBIC) across academic semesters to participate in the 2015 and 2016 BioASQ Challenges (Large-scale Biomedical Question Answering), winning first place in the exact answer extraction task two years in a row [scs news].
- Adapting the general-purpose biomedical question answering system to target validation, a preclinical stage in drug discovery, for enterprise use and internal deployment at Roche. We further extended the QUADS framework to develop a drug target validation support system. [paper]
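The subtask chain listed above (document/passage ranking, candidate extraction, scoring, reranking) can be caricatured as a sequence of stages that enrich a shared analysis object, in the spirit of a UIMA pipeline. This is a simplified sketch with hypothetical stage logic, not the actual BaseQA/BioASQ code:

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    """Shared analysis object passed through the pipeline (cf. a UIMA CAS)."""
    text: str
    documents: list = field(default_factory=list)
    candidates: list = field(default_factory=list)
    answers: list = field(default_factory=list)

def retrieve_documents(q):
    # stand-in for document/passage ranking against a corpus such as PubMed
    q.documents = [f"doc about {w}" for w in q.text.lower().split()[:3]]
    return q

def extract_candidates(q):
    # stand-in for candidate answer extraction from retrieved passages
    q.candidates = [d.split()[-1] for d in q.documents]
    return q

def score_and_rank(q):
    # stand-in for candidate answer scoring and collective reranking
    q.answers = sorted(set(q.candidates))
    return q

def run_pipeline(text, stages):
    q = Question(text)
    for stage in stages:
        q = stage(q)  # each stage reads and enriches the shared object
    return q

result = run_pipeline(
    "What gene causes cystic fibrosis?",
    [retrieve_documents, extract_candidates, score_and_rank],
)
```

The value of this decomposition is that each stage is independently swappable and trainable, which is exactly what makes the configuration-space search in the AML work possible.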
Procedural Knowledge Discovery
tl;dr up to 41% improvement over Google/Bing in task-oriented search tasks, achieved by leveraging procedural knowledge
While descriptive knowledge (such as FreeBase) has demonstrably improved entity-centric search, procedural knowledge (such as wikiHow) had until recently not been incorporated to assist task-oriented search, due to a lack of sufficient coverage and the difficulty of understanding subtask descriptions. During my fellowship term in the Yahoo InMind project, my research focused on bridging the gap between procedural knowledge discovery and task-oriented search. This research can assist and accelerate the manual process of scripting task procedures, and allow rapid development of personal assistant services and task-oriented dialog systems.
- Proposing to utilize procedural (actionable) knowledge in semi-structured procedural knowledge base (e.g. wikiHow) for task understanding and suggestion. The solution significantly outperforms the major commercial search engines by up to 41% [code].
- Proposing to discover procedural knowledge from user generated data, e.g. search query logs, community QA sites, or Web documents. The solution unveils alternative and supplementary actions to those in the existing procedural knowledge bases [code].
Heterogeneous Ranking for ArnetMiner
tl;dr ranking and recommending experts, publications, and conferences simultaneously, plus full-stack development of the early ArnetMiner system
I joined the Knowledge Engineering Group of Tsinghua University in 2007, initially as an undergraduate SRT (Student Research Training) student and then as a master’s student, working with Prof. Jie Tang on the ArnetMiner project. My research focused on retrieval and ranking algorithms over heterogeneous networks (experts, publications, conferences). I also took part in developing early versions (2.0+) of the system, from off-line information extraction modules, such as Perl(!) scripts for profiling, to on-line interactive search frontend components, including social graph visualization in ActionScript/Flash(!) (a snapshot can still be seen in our paper published in 2009, a few years before jQuery and other JS frameworks displaced Flash).
- Proposing a ranking method over heterogeneous social networks that jointly ranks experts, publications, and conferences by exploiting the relations in the network, such as researchers publishing papers, attending conferences, and coauthoring. [paper]
- Developing the initial version of an academic advisor search service on ArnetMiner – Bole Search, where we proposed to partially transfer the knowledge used for generic expert search to ranking Boles, referred to as heterogeneous cross-domain ranking. [papers 1, 2]
- Proposing a topic-level search and recommendation framework, where the topics of a search corpus are automatically learned. [paper]
- Designing a graph search algorithm (such as searching relations in a social graph) and implementing a Flash-based graphical user interface for visualizing the graph search results.
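The joint-ranking idea can be sketched as a random walk over a single heterogeneous graph, so that scores of authors, papers, and venues reinforce one another. This is a toy PageRank-style co-ranking sketch with made-up node names, not the published algorithm:

```python
def co_rank(edges, damping=0.85, iters=50):
    """Power-iteration random walk over a heterogeneous graph.

    `edges` maps each node to its neighbors; authors, papers, and
    venues all live in one graph, so their scores propagate across
    node types (author -> paper -> venue and back)."""
    nodes = sorted(set(edges) | {v for vs in edges.values() for v in vs})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, nbrs in edges.items():
            for v in nbrs:
                new[v] += damping * rank[n] / len(nbrs)
        rank = new
    return rank

# toy heterogeneous graph: authorship, coauthorship, and venue links
edges = {
    "author:alice": ["paper:p1", "author:bob"],
    "author:bob": ["paper:p1", "paper:p2", "author:alice"],
    "paper:p1": ["venue:KDD", "author:alice", "author:bob"],
    "paper:p2": ["venue:KDD", "author:bob"],
    "venue:KDD": ["paper:p1", "paper:p2"],
}
scores = co_rank(edges)
```

In this toy graph, `paper:p1` outranks `paper:p2` because more (and better-connected) nodes link to it; the published method additionally weighs relation types rather than treating all edges uniformly.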
Large-Scale Social & Text Mining
tl;dr the earliest research on predicting who, when, and whether one will retweet my tweet, plus distributed implementations on Hadoop
Social networks started to become an integral part of our lives after 2005, making it more urgent than ever for the research community to understand and predict the behaviors of social network users. We proposed a series of factor graph models, chosen for their structural correspondence to social networks in both representation and inference (e.g., message passing mirrors social cascades), for various social prediction problems. Most notably, we published the earliest work on predicting who, when, and whether one will retweet my tweet. As the local Hadoop expert, I was also heavily involved in Hadoop-ifying other social and text mining algorithms, such as a Gibbs-sampling implementation of LDA. A large body of the work was done in collaboration with IBM Research - China (joint research project + internship).
- Proposing a series of factor graph models for various prediction tasks in social networks and social media, such as community detection, retweet prediction, and social context summarization, and implementing distributed algorithms in the MPI and Hadoop frameworks.
- Developing the earliest Gibbs-sampling implementation on Hadoop for large-scale LDA training and inference.
- Proposing an information compression algorithm for preprocessing large-scale document collections at an acceptable information loss level.
- Designing a predictive model for patent renewal recommendation using IBM’s patent renewal records.
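The core of a collapsed Gibbs sampler for LDA is the per-token resampling step below; in a Hadoop-ified version, each mapper would run this loop over its document shard against (stale) global topic-word counts, and a reducer would merge the count deltas each iteration. This is a minimal single-machine sketch for illustration, not the original implementation:

```python
import random

def gibbs_lda(docs, K, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over token-id documents."""
    rng = random.Random(seed)
    ndk = [[0] * K for _ in docs]                  # doc-topic counts
    nkw = [[0] * vocab_size for _ in range(K)]     # topic-word counts
    nk = [0] * K                                   # topic totals
    z = []                                         # topic assignment per token
    for d, doc in enumerate(docs):                 # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                        # remove current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional p(z = k | everything else)
                weights = [(ndk[d][k2] + alpha) * (nkw[k2][w] + beta)
                           / (nk[k2] + vocab_size * beta) for k2 in range(K)]
                r = rng.random() * sum(weights)
                for k2, wt in enumerate(weights):  # sample a new topic
                    r -= wt
                    if r <= 0:
                        break
                z[d][i] = k2
                ndk[d][k2] += 1; nkw[k2][w] += 1; nk[k2] += 1
    return ndk, nkw

docs = [[0, 0, 1, 1], [2, 2, 3, 3], [0, 1, 2, 3]]
ndk, nkw = gibbs_lda(docs, K=2, vocab_size=4)
```

Distributing this (as in approximate distributed LDA) trades exact sequential sampling for parallelism: shards sample against slightly stale counts, which works well in practice at large scale.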