Emma Strubell

I am an Assistant Professor in the Language Technologies Institute in the School of Computer Science at Carnegie Mellon University and a part-time Research Scientist at Google Research.

I earned my Ph.D. from UMass Amherst working in the Information Extraction and Synthesis Laboratory with Andrew McCallum. Previously, I earned a B.S. in Computer Science from the University of Maine with a minor in math, where I applied models from mathematical biology to the spread of internet worms with Professor David Hiebeler in his Spatial Population Ecological and Epidemiological Dynamics Lab. I've also spent time as an intern and visiting researcher at Amazon, IBM, Google and Facebook AI Research.

Research Interests

I am interested in developing new machine learning techniques to facilitate fast and robust natural language processing.

Core natural language processing (NLP) tasks such as part-of-speech tagging, syntactic parsing and entity recognition have come of age thanks to advances in machine learning. For example, the task of semantic role labeling (annotating who did what to whom) has seen nearly 40% error reduction over the past decade. NLP has reached a level of maturity long awaited by domain experts who wish to leverage natural language analysis to inform better decisions and effect social change. By deploying these systems at scale on billions of documents across many domains, practitioners can consolidate raw text into structured, actionable data. These cornerstone NLP tasks are also crucial building blocks for higher-level natural language understanding (NLU) goals that our field has yet to accomplish, such as whole-document understanding and human-level dialog.

For NLP to process raw text effectively across many domains, we require models that are both robust to different styles of text and computationally efficient. The success described above has been achieved only in the limited domains for which we have expensive annotated data; models that obtain state-of-the-art accuracy in these data-rich settings are typically neither trained nor evaluated for out-of-domain accuracy. Users also have practical concerns about model responsiveness, turnaround time in large-scale analysis, electricity costs, and, consequently, environmental impact, yet the highest-accuracy systems also carry the highest computational demands. As hardware advances, NLP researchers tend to increase model complexity in step.

My research enables a diversity of domain experts to leverage NLU at large scale, with the goal of informing decision-making and practical solutions to far-reaching problems. Toward this end, I pursue fundamental advances in computational efficiency and robustness. To improve computational efficiency, I design new training and inference algorithms that exploit the strengths of the latest tensor processing hardware, and I eliminate redundant computation through joint modeling across many tasks. To achieve high accuracy across diverse natural language domains, I develop joint models in which parameter sharing improves generalization, paired with novel methods for adversarial training that enable transfer to new domains and languages without labeled data. I apply this research broadly, to low-level NLP as well as high-level NLU tasks. In conjunction with these new machine learning techniques, I collaborate with domain experts to make a positive mark on society.

In my spare time, I enjoy cooking (with a focus on making vegetables delicious), fermenting (kombucha, kimchi, yogurt, sourdough), enjoying the outdoors (backpacking and rock climbing), and training my dog.

In search of a fast Scala lexer, I forked JFlex and added the ability to emit Scala code. JFlex-scala, along with its corresponding Maven and sbt plugins, is available on Maven Central. For an example of its use, check out the tokenizer in FACTORIE.
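For flavor, here is a minimal lexer specification in upstream JFlex's three-section format (user code, options and macros, rules); the class name and rules are hypothetical, and whether JFlex-scala accepts exactly this directive set is an assumption on my part, since the fork tracks JFlex's spec syntax:

```
// User code section: copied verbatim into the generated lexer.
%%
%class SimpleLexer   // hypothetical name for the generated lexer class
%unicode
Digit = [0-9]
%%
{Digit}+     { /* integer literal; yytext() holds the matched text */ }
[ \t\r\n]+   { /* skip whitespace */ }
.            { /* fall-through for any other single character */ }
```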

I am also co-author of Plant Jones. He is a semi-intelligent plant who tweets negatively about water when he's thirsty, and positively when he's not. His code is available here.

In my junior year of college I wrote and presented a tutorial on quantum algorithms aimed at undergraduate students in computer science, available here, along with slides: part 1 and part 2.

Gentoo Linux user since 2005.

Pittsburgh, PA, USA


strubell [at] cmu [dot] edu

curriculum vitae (PDF)