Selected Publications

Using location-relative attention mechanisms allows Tacotron-based TTS models to generalize to very long utterances.
arXiv, 2019

We introduce techniques that increase the versatility of variational models of speech, allowing the same model to perform well on multiple tasks, including prosody and style transfer.
arXiv, 2019

This work adds a prosody encoder to the Tacotron text-to-speech system that enables the reproduction of the intonation, stress, and rhythm of any spoken utterance.
ICML, 2018

This paper describes work done at Baidu’s Silicon Valley AI Lab to train end-to-end deep recurrent neural networks for both English and Mandarin speech recognition.
ICML, 2016

OpenMP and CUDA implementations of non-negative matrix factorization (NMF) that speed up drum track extraction.
ISMIR, 2009
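For context, the factorization being accelerated is standard NMF with multiplicative updates; a minimal NumPy sketch is below (variable names and the Frobenius-cost update rule are illustrative, not taken from the paper's OpenMP/CUDA code):

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V ~ W @ H using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates monotonically reduce ||V - WH||_F
        # while keeping W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy usage: factor a random non-negative "spectrogram".
V = np.abs(np.random.default_rng(1).random((8, 20)))
W, H = nmf(V, rank=3)
err = np.linalg.norm(V - W @ H)
```

In the drum-transcription setting, the columns of `W` play the role of per-drum spectral templates and the rows of `H` their activations over time; the inner matrix products are what the OpenMP and CUDA implementations parallelize.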

Recent Publications


Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis. arXiv, 2019.

Preprint PDF Project Audio Examples

Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis. arXiv, 2019.

Preprint PDF Project Audio Examples

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron. ICML, 2018.

Preprint PDF Project Poster Slides Video Source Audio Examples Blog Post

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. ICML, 2018.

Preprint PDF Project Source Audio Examples Blog Post

Uncovering Latent Style Factors for Expressive Speech Synthesis. NIPS ML4Audio Workshop, 2017.

Preprint PDF Project Poster Audio Examples Workshop

Exploring Neural Transducers for End-to-End Speech Recognition. ASRU, 2017.

Preprint PDF Project Source

Reducing Bias in Production Speech Models. arXiv, 2017.

Preprint PDF Project

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. ICML, 2016.

Preprint PDF Project Slides Source

Lasagne: First Release. GitHub, 2015.

Code Project 0.1 Ref

LibROSA: Audio and Music Signal Analysis in Python. SciPy, 2015.

PDF Code Project 0.5.0 Ref


End-to-End Speech Synthesis

At Google, I am a member of the team behind Tacotron, an end-to-end speech synthesis system that uses neural networks to convert text directly to audio.

End-to-End Speech Recognition

Baidu’s Deep Speech system does away with the complicated traditional speech recognition pipeline, replacing it with a single large neural network trained end-to-end to convert audio into text.

Open Source Contributions

A selection of open-source projects I have contributed to.

Parallel Computing for Music and Audio

As a member of UC Berkeley’s Par Lab, I did a variety of projects focused on improving the computational efficiency of music and audio applications.

Automatic Drum Understanding

Can we teach computers to listen to drum performances the way humans do? (This was my PhD thesis.)