Scaling and Accelerating GPT2 Inference in Kubernetes with ONNX, Triton and Seldon

Identifying the right tools for high-performance production machine learning can be overwhelming as the ecosystem continues to grow at breakneck speed.

In this talk, we provide a hands-on guide to productionizing optimized machine learning models in cloud-native ecosystems. We will dive into a practical use case: deploying the GPT-2 NLP model in Kubernetes, using the ONNX Runtime via Seldon Core's Triton server. The result is a scalable production NLP microservice that can power intelligent text generation applications. We will showcase the foundational concepts and best practices to consider when using Kubernetes for production NLP and machine learning inference at scale.

We will present some of the key challenges currently faced in the MLOps space, and show how each tool in the stack interoperates throughout the production machine learning lifecycle. Namely, we will introduce the benefits that the ONNX open standard and runtime bring, and show how the optimized Triton server and the Seldon Core orchestration framework together achieve a robust production machine learning deployment that can scale with the needs of a growing team or organization. By the end of this talk, attendees will have a better understanding of how to use these tools for their own models, as well as for the broad range of pre-trained models available.

We will also provide links and resources that allow attendees to dive deeper into specific areas such as observability, scalability, and governance.

About the speaker

Alejandro Saucedo

Engineering Director (Machine Learning) at Seldon Technologies

Alejandro is the Chief Scientist at the Institute for Ethical AI & Machine Learning, where he leads the development of industry standards on machine learning explainability, adversarial robustness, and differential privacy.

Alejandro is also the Director of Machine Learning Engineering at Seldon Technologies, where he leads large-scale projects implementing open source and enterprise infrastructure for Machine Learning Orchestration and Explainability.

With over 10 years of software development experience, Alejandro has held technical leadership positions across hyper-growth scale-ups and has a strong track record of building cross-functional teams of software engineers.



Sessions: October 5 – 7
Trainings: October 4, 12 – 15

