Eric Battenberg

Software Engineer

Google Research

Biography

I joined Google Research in 2017, where I am a part of the Sound Understanding team within Machine Perception. I am passionate about the potential for machine perception research to make our interactions with technology more natural and seamless, rather than distracting and addictive.

Previously, I was a Research Scientist at the Baidu Silicon Valley Artificial Intelligence Lab (SVAIL) led by Adam Coates and Andrew Ng. At Baidu, I had the privilege of contributing to Deep Speech 2, a revolutionary end-to-end neural speech recognition system. Before that, I developed algorithms for audio event detection and music mood classification at Gracenote in Emeryville, CA.

I received my PhD in Electrical Engineering and Computer Sciences from UC Berkeley, where I worked on signal processing and machine learning techniques for music and audio applications as a member of the Parallel Computing Laboratory (Par Lab). For my thesis work, I developed a system for machine understanding of drum performances.

At Berkeley, I was advised by David Wessel at the Center for New Music and Audio Technologies (CNMAT) and co-advised by Nelson Morgan at the International Computer Science Institute (ICSI).

Interests

Speech Synthesis
Generative Modeling
Machine Perception
Speech and Language Understanding
Deep Learning / Neural Networks
Audio Signal Processing
Parallel Computing

Education

PhD in Electrical Engineering and Computer Sciences, 2012
University of California, Berkeley

MS in Electrical Engineering and Computer Sciences, 2008
University of California, Berkeley

BS in Electrical Engineering, 2005
University of California, Santa Barbara

Projects

End-to-End Speech Synthesis

At Google, I am now a member of the team that brought you Tacotron, an end-to-end speech synthesis system that uses neural networks to convert text directly to audio.

End-to-End Speech Recognition

Baidu’s Deep Speech system does away with the complicated traditional speech recognition pipeline, replacing it with a large neural network that is trained end-to-end to convert audio into text.

Automatic Drum Understanding

Can we teach computers to listen to drum performances the way humans do? (This was the focus of my PhD thesis.)

Parallel Computing for Music and Audio

As a member of UC Berkeley’s Par Lab, I did a variety of projects focused on improving the computational efficiency of music and audio applications.

Open Source Contributions

Some projects I contributed to.

Selected Publications

All Publications

Eric Battenberg, RJ Skerry-Ryan, Soroosh Mariooryad, Daisy Stanton, David Kao, Matt Shannon, Tom Bagby

October 2019 ICASSP

Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis

Using location-relative attention mechanisms allows Tacotron-based TTS models to generalize to very long utterances.

Preprint PDF Project Ref Audio Examples

Eric Battenberg, Soroosh Mariooryad, Daisy Stanton, RJ Skerry-Ryan, Matt Shannon, David Kao, Tom Bagby

June 2019 arXiv

Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis

We introduce techniques that increase the versatility of variational models of speech, allowing the same model to perform well on multiple tasks, including prosody and style transfer.

Preprint PDF Project Audio Examples

RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

July 2018 ICML

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

This work adds a prosody encoder to the Tacotron text-to-speech system that enables the reproduction of the intonation, stress, and rhythm of any spoken utterance.

Preprint PDF Project Poster Slides Video Ref Audio Examples Blog Post

Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, Jie Chen, Jingdong Chen, Zhijie Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Ke Ding, Niandong Du, Erich Elsen, Jesse Engel, Weiwei Fang, Linxi Fan, Christopher Fougner, Liang Gao, Caixia Gong, Awni Hannun, Tony Han, Lappi Vaino Johannes, Bing Jiang, Cai Ju, Billy Jun, Patrick LeGresley, Libby Lin, Junjie Liu, Yang Liu, Weigao Li, Xiangang Li, Dongpeng Ma, Sharan Narang, Andrew Ng, Sherjil Ozair, Yiping Peng, Ryan Prenger, Sheng Qian, Zongfeng Quan, Jonathan Raiman, Vinay Rao, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Kavya Srinet, Anuroop Sriram, Haiyuan Tang, Liliang Tang, Chong Wang, Jidong Wang, Kaifu Wang, Yi Wang, Zhijian Wang, Zhiqian Wang, Shuang Wu, Likai Wei, Bo Xiao, Wen Xie, Yan Xie, Dani Yogatama, Bin Yuan, Jun Zhan, Zhenyao Zhu

June 2016 ICML

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

This paper describes work done at Baidu’s Silicon Valley AI Lab to train end-to-end deep recurrent neural networks for both English and Mandarin speech recognition.

Preprint PDF Project Slides Ref

Eric Battenberg, David Wessel

May 2009 ISMIR

Accelerating Non-Negative Matrix Factorization for Audio Source Separation on Multi-Core and Many-Core Architectures

OpenMP and CUDA implementations of NMF to speed up drum track extraction.

PDF Code Project Project Poster