SightSound-R1
Cross-modal reasoning distillation from vision to audio-language models.
SightSound-R1 investigates how to transfer the chain-of-thought reasoning ability of vision-language foundation models to audio-language models. We distill reasoning traces from vision-language teachers and align them with acoustic evidence so that audio agents can explain what they hear.
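Since the code is not yet released, the snippet below is only a minimal sketch of how a teacher-generated reasoning trace might be packed into a supervised target for an audio-language student. The schema and field names (`ReasoningTrace`, `clip_id`, `rationale`, and so on) are illustrative assumptions, not the project's actual data format.

```python
from dataclasses import dataclass


@dataclass
class ReasoningTrace:
    """One teacher-generated chain-of-thought for an audio-visual clip (illustrative schema)."""
    clip_id: str
    question: str
    rationale: str  # step-by-step reasoning produced by the vision-language teacher
    answer: str


def to_student_example(trace: ReasoningTrace) -> dict:
    """Pack a teacher trace into a prompt/target pair for an audio-language student.

    The student only sees the audio (referenced by clip_id) and the question;
    the teacher's rationale and answer become the supervised target.
    """
    prompt = (
        f"<audio:{trace.clip_id}>\n"
        f"Question: {trace.question}\n"
        "Think step by step, citing what you hear, then answer."
    )
    target = f"Reasoning: {trace.rationale}\nAnswer: {trace.answer}"
    return {"prompt": prompt, "target": target}


if __name__ == "__main__":
    trace = ReasoningTrace(
        clip_id="clip_0001",
        question="What caused the loud bang?",
        rationale="A sharp transient with a metallic ring follows a dragging sound, "
                  "which is consistent with a door slamming shut.",
        answer="A door slamming.",
    )
    print(to_student_example(trace))
```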
Highlights
- Aligns textual rationales with spectrogram and waveform cues using multi-stage contrastive objectives (a minimal sketch follows this list).
- Demonstrates large gains on auditory question answering, Foley sound understanding, and spoken commonsense reasoning benchmarks.
- Releases an evaluation protocol and benchmarking suite that measures step-by-step reasoning quality for sound.
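As a rough illustration of one such contrastive stage, the PyTorch sketch below assumes rationale sentences and audio clips have already been encoded into fixed-size embeddings by off-the-shelf text and audio encoders, and scores them with a standard symmetric InfoNCE loss. The class name, projection heads, and temperature are assumptions for the example, not necessarily the objective used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RationaleAudioAligner(nn.Module):
    """Projects rationale-text and audio embeddings into a shared space and
    trains them with a symmetric InfoNCE loss (CLIP-style sketch)."""

    def __init__(self, text_dim: int, audio_dim: int, proj_dim: int = 256, temperature: float = 0.07):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, proj_dim)
        self.audio_proj = nn.Linear(audio_dim, proj_dim)
        self.temperature = temperature

    def forward(self, text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # Normalize both modalities so the dot product is a cosine similarity.
        t = F.normalize(self.text_proj(text_emb), dim=-1)
        a = F.normalize(self.audio_proj(audio_emb), dim=-1)
        logits = t @ a.T / self.temperature  # (batch, batch) similarity matrix
        labels = torch.arange(t.size(0), device=t.device)
        # Matched rationale/audio pairs sit on the diagonal; pull them together
        # and push mismatched pairs apart, in both directions.
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2


if __name__ == "__main__":
    aligner = RationaleAudioAligner(text_dim=768, audio_dim=512)
    text_emb = torch.randn(8, 768)   # e.g. sentence embeddings of reasoning steps
    audio_emb = torch.randn(8, 512)  # e.g. pooled spectrogram-encoder features
    print(aligner(text_emb, audio_emb).item())
```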
Role
- Contributed to the design of the cross-modal distillation pipeline and reasoning-trace data curation.
- Developed curriculum schedules that progressively unlock higher-order reasoning skills for audio agents (sketched after this list).
- Ran ablation studies and qualitative analyses showcased in the preprint.
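A toy sketch of what such a curriculum might look like, assuming each training example carries a difficulty signal such as the number of reasoning steps in its trace; the stage boundaries and the `num_steps` field are illustrative assumptions, not the schedule used in the preprint.

```python
def curriculum_pool(examples, step, total_steps, stages=(1, 2, 4)):
    """Return the subset of examples visible at a given training step.

    Early in training only single-step traces are admitted; multi-step and
    higher-order reasoning examples are unlocked in later stages.
    `examples` is an iterable of dicts with a "num_steps" field (illustrative).
    """
    stage = min(int(len(stages) * step / total_steps), len(stages) - 1)
    max_steps = stages[stage]
    return [ex for ex in examples if ex["num_steps"] <= max_steps]


if __name__ == "__main__":
    pool = [{"id": i, "num_steps": s} for i, s in enumerate([1, 1, 2, 3, 4, 5])]
    for step in (0, 500, 999):
        visible = curriculum_pool(pool, step, total_steps=1000)
        print(step, [ex["num_steps"] for ex in visible])
```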
Resources
- Paper: arXiv 2509.15661
- GitHub: (coming soon)