How to Harness Wikipedia and Wikidata for Entity Recognition
We live in the golden age of open source data for NLP. For free, with no strings attached, you have access to millions of Wikipedia articles devoted to every conceivable entity, written in hundreds of languages, all interwoven with Wikidata’s structured knowledge base. But it is surprisingly hard to harness this public treasure for training entity recognition models. In this talk I will describe work at Primer on the “Data Factory”, a layer of code on top of Wikipedia and Wikidata that we are using to generate high-quality entity recognition data on demand. I will also share progress on an ambitious use of that data: a true zero-shot entity recognition model. And finally, I will introduce BabyBear, the open source framework we created to make entity recognition models run faster and cheaper.
Director of Science, Primer.ai
John Bohannon is Director of Science at Primer, an AI startup in San Francisco. Before joining Primer, he was a visiting data scientist at the UC Berkeley Institute for Data Science and the Harvard School of Engineering and Applied Sciences. He is the founder of the annual “Dance Your PhD” contest. His PhD, which he has danced, is from the University of Oxford.