We are looking for a Gen AI Researcher for Audio to join our team and help develop next-generation voice synthesis models. You'll research and build deep learning systems that can generate expressive, natural-sounding speech from text or audio prompts, and collaborate with cross-functional teams to integrate your work into production-ready pipelines.
Key Responsibilities
- Research and develop state-of-the-art voice synthesis models (e.g., TTS, voice cloning, speech-to-speech).
- Build and fine-tune models using frameworks like PyTorch and HuggingFace.
- Design training pipelines and datasets for scalable voice model training.
- Explore techniques for emotional expressiveness, multilingual synthesis, and speaker adaptation.
- Work closely with product and creative teams to ensure models meet quality and production constraints.
- Stay on top of academic and industrial trends in speech synthesis and related fields.
Must Haves
- Strong background in machine learning and deep learning, with a focus on speech/audio.
- Hands-on experience with TTS, voice cloning, or related voice synthesis tasks.
- Proficiency with Python and PyTorch; experience with libraries like torchaudio, ESPnet, or similar.
- Experience training models at scale and working with large audio datasets.
- Familiarity with vocoders and transformer-based architectures.
- Strong problem-solving skills and the ability to work autonomously in a remote-first environment.
Nice to Have
- PhD in Computer Science or Machine Learning and publications in top venues.
- Contributions to open-source speech research or participation in relevant benchmarks.
- Familiarity with adjacent areas like lip-syncing, audio-driven animation, or expressive speech control.
- Experience with voice datasets or proprietary pipelines.