SightSound-R1

Cross-modal reasoning distillation from vision to audio-language models.

SightSound-R1 investigates how to transfer the chain-of-thought reasoning ability of vision-language foundation models to audio-language models. We distill the reasoning traces of large multimodal LLMs and align them with acoustic evidence so that audio-language agents can explain what they hear.
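
As a rough illustration of the distillation step described above, the sketch below shows one way to supervise an audio-language student on a teacher's reasoning trace: hard cross-entropy on the teacher's rationale tokens, plus an optional temperature-scaled KL term when teacher logits are available. The function name, tensor shapes, and default weights are illustrative assumptions, not the project's released implementation.

```python
import torch
import torch.nn.functional as F
from typing import Optional


def distill_rationale_loss(
    student_logits: torch.Tensor,                   # (B, T, V) audio-LM logits for rationale tokens
    teacher_tokens: torch.Tensor,                   # (B, T) token ids of the teacher's reasoning trace
    teacher_logits: Optional[torch.Tensor] = None,  # (B, T, V) optional soft targets
    pad_id: int = 0,
    temperature: float = 2.0,
    soft_weight: float = 0.5,
) -> torch.Tensor:
    """Hypothetical reasoning-trace distillation loss (sketch, not the released code)."""
    mask = (teacher_tokens != pad_id).float()                  # ignore padding positions
    ce = F.cross_entropy(student_logits.transpose(1, 2),       # (B, V, T) vs (B, T)
                         teacher_tokens, reduction="none")
    loss = (ce * mask).sum() / mask.sum().clamp(min=1.0)

    if teacher_logits is not None:
        # Soft distillation: match the teacher's token distribution at each step.
        log_p = F.log_softmax(student_logits / temperature, dim=-1)
        q = F.softmax(teacher_logits / temperature, dim=-1)
        kl = F.kl_div(log_p, q, reduction="none").sum(-1)      # (B, T)
        kl = (kl * mask).sum() / mask.sum().clamp(min=1.0)
        loss = loss + soft_weight * (temperature ** 2) * kl
    return loss
```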

Highlights

  • Aligns textual rationales with spectrogram and waveform cues using multi-stage contrastive objectives (a loss sketch follows this list).
  • Demonstrates large gains on auditory question answering, Foley sound understanding, and spoken commonsense reasoning benchmarks.
  • Releases an evaluation protocol and benchmarking suite that measures step-by-step reasoning quality for sound.
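
In its simplest single-stage form, the contrastive alignment in the first highlight could look like a symmetric InfoNCE loss between rationale embeddings and audio embeddings; a multi-stage variant would apply the same objective at several granularities (e.g., waveform-, spectrogram-, and segment-level). The sketch below is a minimal version under those assumptions; `rationale_audio_infonce` and its defaults are hypothetical names, not the project's API.

```python
import torch
import torch.nn.functional as F


def rationale_audio_infonce(
    text_emb: torch.Tensor,    # (B, D) embeddings of textual rationales
    audio_emb: torch.Tensor,   # (B, D) embeddings of the paired audio clips
    temperature: float = 0.07,
) -> torch.Tensor:
    """Symmetric InfoNCE: a matched rationale/audio pair should score higher
    than every in-batch negative, in both retrieval directions."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature            # (B, B) similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    loss_t2a = F.cross_entropy(logits, targets)                # rationale -> audio
    loss_a2t = F.cross_entropy(logits.t(), targets)            # audio -> rationale
    return 0.5 * (loss_t2a + loss_a2t)
```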

Role

  • Contributed to the design of the cross-modal distillation pipeline and reasoning-trace data curation.
  • Developed curriculum schedules that progressively unlock higher-order reasoning skills for audio agents (a schedule sketch follows this list).
  • Ran ablation studies and qualitative analyses showcased in the preprint.
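
One simple way to realize such a curriculum is a step-indexed cap on the reasoning depth of sampled training examples: early stages admit only short, perception-level rationales, and later stages unlock longer multi-step traces. The stage boundaries and helper names below (`max_reasoning_depth`, `keep_example`) are placeholders for illustration, not the schedules actually used in the project.

```python
from typing import Sequence


def max_reasoning_depth(step: int, boundaries: Sequence[int] = (10_000, 30_000)) -> int:
    """Depth of chain-of-thought allowed at a given training step: one more
    reasoning step is unlocked each time a stage boundary is crossed."""
    return 1 + sum(step >= b for b in boundaries)


def keep_example(step: int, rationale_steps: int) -> bool:
    """Filter a candidate training example by the depth allowed at this step."""
    return rationale_steps <= max_reasoning_depth(step)


# Example: a 3-step rationale is skipped in stage 1 and admitted in the final stage.
assert not keep_example(step=5_000, rationale_steps=3)
assert keep_example(step=35_000, rationale_steps=3)
```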
