The vast and fast-growing STEM literature makes it imperative to develop systems for automated extraction of mathematical semantics from technical content, and for semantically enabled processing of such content. Grammar-based techniques alone are inadequate for the task. We present a new project that applies deep learning (DL) to this purpose. It will explore a number of DL and representation-learning models, which have shown superior performance in applications involving sequential data. Because math and science involve sequences of text, symbols, and equations, such deep learning models are expected to deliver good performance in math-semantics extraction and processing. The project has several goals: (1) to apply different DL models to math-semantics extraction and processing, designing more suitable models as needed, for such foundational tasks as accurate tagging and automated translation from LaTeX to semantically resolved, machine-understandable forms such as Content MathML (cMathML); (2) to create and make publicly available labeled math-content datasets for model training and testing, as well as Word2Vec/Math2Vec representations derived from large math datasets; and (3) to conduct extensive comparative performance evaluations that yield insights into which DL models, data representations, and traditional machine learning models are best suited to the above goals.