SightSound-R1

Cross-modal reasoning distillation from vision to audio-language models.

SightSound-R1 investigates how to transfer the chain-of-thought reasoning ability of vision-language foundation models to audio-language models. We distill the reasoning traces of large multimodal LLMs and align them with acoustic evidence so that audio-language agents can explain what they hear.
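
As a rough illustration of the distillation step described above, the sketch below shows one way to supervise an audio-language student on a teacher's reasoning trace: hard cross-entropy on the teacher's rationale tokens, plus an optional temperature-scaled KL term when teacher logits are available. The function name, tensor shapes, and default weights are illustrative assumptions, not the project's released implementation.

```python
import torch
import torch.nn.functional as F
from typing import Optional


def distill_rationale_loss(
    student_logits: torch.Tensor,                   # (B, T, V) audio-LM logits for rationale tokens
    teacher_tokens: torch.Tensor,                   # (B, T) token ids of the teacher's reasoning trace
    teacher_logits: Optional[torch.Tensor] = None,  # (B, T, V) optional soft targets
    pad_id: int = 0,
    temperature: float = 2.0,
    soft_weight: float = 0.5,
) -> torch.Tensor:
    """Hypothetical reasoning-trace distillation loss (sketch, not the released code)."""
    mask = (teacher_tokens != pad_id).float()                  # ignore padding positions
    ce = F.cross_entropy(student_logits.transpose(1, 2),       # (B, V, T) vs (B, T)
                         teacher_tokens, reduction="none")
    loss = (ce * mask).sum() / mask.sum().clamp(min=1.0)

    if teacher_logits is not None:
        # Soft distillation: match the teacher's token distribution at each step.
        log_p = F.log_softmax(student_logits / temperature, dim=-1)
        q = F.softmax(teacher_logits / temperature, dim=-1)
        kl = F.kl_div(log_p, q, reduction="none").sum(-1)      # (B, T)
        kl = (kl * mask).sum() / mask.sum().clamp(min=1.0)
        loss = loss + soft_weight * (temperature ** 2) * kl
    return loss
```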

Highlights

  • Aligns textual rationales with spectrogram and waveform cues using multi-stage contrastive objectives (a loss sketch follows this list).
  • Demonstrates large gains on auditory question answering, Foley sound understanding, and spoken commonsense reasoning benchmarks.
  • Releases an evaluation protocol and benchmarking suite that measures step-by-step reasoning quality for sound.
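
In its simplest single-stage form, the contrastive alignment in the first highlight could look like a symmetric InfoNCE loss between rationale embeddings and audio embeddings; a multi-stage variant would apply the same objective at several granularities (e.g., waveform-, spectrogram-, and segment-level). The sketch below is a minimal version under those assumptions; `rationale_audio_infonce` and its defaults are hypothetical names, not the project's API.

```python
import torch
import torch.nn.functional as F


def rationale_audio_infonce(
    text_emb: torch.Tensor,    # (B, D) embeddings of textual rationales
    audio_emb: torch.Tensor,   # (B, D) embeddings of the paired audio clips
    temperature: float = 0.07,
) -> torch.Tensor:
    """Symmetric InfoNCE: a matched rationale/audio pair should score higher
    than every in-batch negative, in both retrieval directions."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature            # (B, B) similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    loss_t2a = F.cross_entropy(logits, targets)                # rationale -> audio
    loss_a2t = F.cross_entropy(logits.t(), targets)            # audio -> rationale
    return 0.5 * (loss_t2a + loss_a2t)
```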

Role

  • Contributed to the design of the cross-modal distillation pipeline and reasoning-trace data curation.
  • Developed curriculum schedules that progressively unlock higher-order reasoning skills for audio agents (a schedule sketch follows this list).
  • Ran ablation studies and qualitative analyses showcased in the preprint.
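
One simple way to realize such a curriculum is a step-indexed cap on the reasoning depth of sampled training examples: early stages admit only short, perception-level rationales, and later stages unlock longer multi-step traces. The stage boundaries and helper names below (`max_reasoning_depth`, `keep_example`) are placeholders for illustration, not the schedules actually used in the project.

```python
from typing import Sequence


def max_reasoning_depth(step: int, boundaries: Sequence[int] = (10_000, 30_000)) -> int:
    """Depth of chain-of-thought allowed at a given training step: one more
    reasoning step is unlocked each time a stage boundary is crossed."""
    return 1 + sum(step >= b for b in boundaries)


def keep_example(step: int, rationale_steps: int) -> bool:
    """Filter a candidate training example by the depth allowed at this step."""
    return rationale_steps <= max_reasoning_depth(step)


# Example: a 3-step rationale is skipped in stage 1 and admitted in the final stage.
assert not keep_example(step=5_000, rationale_steps=3)
assert keep_example(step=35_000, rationale_steps=3)
```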
