Stanza: A Multi-lingual Multi-domain Python Natural Language Processing Toolkit

The growing availability of open-source natural language processing (NLP) toolkits has made it easier for practitioners to build tools with sophisticated linguistic processing, and for researchers to make scientific discoveries in natural language understanding.

In this talk, I will introduce Stanza, our latest Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
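As a rough illustration of the pipeline described above, the sketch below loads Stanza's default pipeline for a language and collects one row per word. It assumes the `stanza` package is installed and that models can be downloaded on first use. The helper names `analyze` and `format_rows` are hypothetical and not part of Stanza.

```python
def analyze(text, lang="en"):
    """Run Stanza's default pipeline for `lang` and return one tuple per word."""
    import stanza  # imported lazily; assumes `pip install stanza`

    stanza.download(lang)        # fetches the default models on first use
    nlp = stanza.Pipeline(lang)  # loads all default processors for the language
    doc = nlp(text)

    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            # head is the 1-based index of the governor within the sentence (0 = root)
            rows.append((word.text, word.lemma, word.upos, word.head, word.deprel))
    return rows


def format_rows(rows):
    """Hypothetical helper: render word tuples as tab-separated lines."""
    return "\n".join("\t".join(str(field) for field in row) for row in rows)


if __name__ == "__main__":
    print(format_rows(analyze("Stanza supports many human languages.")))
```

Named entities, where the language's default models include an NER processor, are available separately through `doc.entities`.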

I will talk about Stanza’s neural architectural design, its simple user interface, and its improved performance against existing toolkits over a range of 112 datasets covering 66 languages. Next, I will talk about our recent efforts to extend Stanza’s language processing capabilities to the biomedical and clinical domains. Stanza’s latest release now offers native support for accurate syntactic analysis and named entity recognition on biomedical literature text and clinical notes. I will explain how these extensions were built and report the performance of these models on standard biomedical NLP benchmarks.
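For readers who want to try the clinical models, a minimal sketch follows. It assumes the `mimic` syntactic package and the `i2b2` NER model as named in Stanza's biomedical model documentation. The constant `CLINICAL_KWARGS` and the function `clinical_entities` are hypothetical names used only for illustration.

```python
# Assumed package names, taken from Stanza's biomedical model documentation:
# "mimic" for clinical syntactic analysis, "i2b2" for clinical NER.
CLINICAL_KWARGS = {"package": "mimic", "processors": {"ner": "i2b2"}}


def clinical_entities(text):
    """Return (entity text, entity type) pairs found in a clinical note."""
    import stanza  # imported lazily; assumes `pip install stanza`

    stanza.download("en", **CLINICAL_KWARGS)  # fetches the models on first use
    nlp = stanza.Pipeline("en", **CLINICAL_KWARGS)
    doc = nlp(text)
    return [(ent.text, ent.type) for ent in doc.entities]


if __name__ == "__main__":
    note = "The patient had a sore throat and was treated with Cepacol lozenges."
    for entity_text, entity_type in clinical_entities(note):
        print(f"{entity_text}\t{entity_type}")
```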

Lastly, I will talk about Stanza’s Python interface to the widely used Stanford CoreNLP library, which extends Stanza’s functionality to an even richer range of tasks. I will close by discussing our future plans for the Stanza library.

About the speaker

Yuhao Zhang

Researcher at Stanford University & Stanza Committer

Yuhao Zhang is a final-year Ph.D. student at Stanford University, and a member of the Stanford NLP Group and the Stanford Center for Artificial Intelligence in Medicine & Imaging (AIMI). He is jointly advised by Prof. Chris Manning in the CS department and Prof. Curtis Langlotz in the Stanford School of Medicine.

Yuhao’s research has focused on extracting knowledge from natural language text and using that knowledge to power downstream applications, particularly in biomedicine.

His recent research has focused on text generation and multimodal learning from text in the medical domain. He is also a co-author of the Stanza NLP library and leads the efforts to extend Stanza’s functionality to more than 60 human languages and to the biomedical domain.

Presented by