CV
Download a detailed curriculum vitae and explore highlights across education, research, and creative projects.
General Information
| Full Name | Qiaolin Wang |
| Email | qw2443@columbia.edu |
| Phone | (646) 528-0549 |
| Location | New York, NY |
Research Interests
| Focus | I enjoy studying models that can perceive, reason, and speak like humans |
| Areas | Multimodal Large Language Models (MLLMs), Audio-Visual Understanding, Speech Synthesis |
Education
2024 - 2025 M.S. in Electrical Engineering
Columbia University, New York, NY
- Concentration: Speech and Language Processing
- Advisor: Prof. Nima Mesgarani
2020 - 2024 B.Eng. in Computer Science
Wuhan University, Wuhan, China
- Core Courses: Computer Systems, Artificial Intelligence, Intelligent Speech Processing
Publications
2025 Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations
- Linyang He*, Qiaolin Wang*, Xilin Jiang, and Nima Mesgarani
- EMNLP 2025 SAC Highlight
2025 SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models
- Qiaolin Wang, Xilin Jiang, Linyang He, Junkai Wu, and Nima Mesgarani
- Submitted to ICASSP 2026
- arXiv
Research Experience
2025 AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Multimedia Knowledge and Thinking Beyond Text
Columbia University, New York, NY
- Built a large-scale audio-visual meme benchmark to test MLLMs' multimodal and cultural understanding
- Spearheaded evaluation of 15 models (Audio, Video, Omni, and commercial LLMs) with fine-grained QA categorization across 1,000 memes spanning multiple languages and cultures (per-category scoring is sketched below)
- Aiming for public release and preprint submission by Dec 2025
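A minimal sketch of how per-category accuracy can be aggregated in this kind of fine-grained QA evaluation; the record fields and category names are illustrative assumptions, not the benchmark's actual schema.

```python
# Aggregate QA accuracy per question category (hypothetical schema).
from collections import defaultdict

def accuracy_by_category(records):
    """records: iterable of dicts with 'category', 'pred', 'answer' keys."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["pred"] == r["answer"])
    return {c: hits[c] / totals[c] for c in totals}

records = [
    {"category": "cultural reference", "pred": "A", "answer": "A"},
    {"category": "audio-visual joke", "pred": "B", "answer": "C"},
]
print(accuracy_by_category(records))
```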
2025 SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models
Columbia University, New York, NY
- Proposed SightSound-R1, a framework that distills reasoning from vision LLMs into audio LLMs
- Engineered an audio-focused Chain-of-Thought (CoT) generation prompt for Qwen2.5-VL-32B with test-time scaling, and used a GPT-4o fact-checker to filter visual hallucinations
- Implemented a two-stage training strategy (SFT + GRPO) to distill the verified CoT into Qwen2-Audio-7B (the group-relative advantage behind GRPO is sketched below)
- Improved the LALM's reasoning on unseen datasets, outperforming baselines with 66.1% on MMAU-Test-Mini (Sound) and 59.5% on MUSIC-AVQA
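The GRPO stage rests on group-relative advantages: rewards for a group of responses sampled for the same prompt are normalized within the group. A minimal sketch of that computation, assuming a simple binary answer-match reward; this is not the project's actual training code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as in GRPO.

    rewards: (num_prompts, G) scalar rewards for G sampled responses per
    prompt, e.g. 1.0 if the final answer matches the verified CoT's answer.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # per-response advantage

# Example: 2 prompts, 4 sampled answers each; 1.0 marks a correct answer.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```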
2025 Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations
Columbia University, New York, NY
- Conducted speech minimal-pair probing with 116k pairs from BLiMP/COMPS across 71 linguistic tasks
- Probed 16 models (S3M, ASR, AudioLLM, Codec) with layer-wise linear classifiers on frozen representations (the probing setup is sketched below)
- Observed a syntax > morphology > concepts hierarchy: speech models capture form more strongly than meaning
- Found that mean pooling outperformed single-token extraction, yielding more stable speech representations
- Exposed a temporal asymmetry: grammatical evidence in S3M/S2T-ASR peaks 500-600 ms pre-onset, whereas in AudioLLMs and Whisper it accumulates through onset and beyond
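A minimal sketch of the layer-wise probing recipe: a linear classifier is fit per layer on frozen features to separate the grammatical and ungrammatical member of each minimal pair. The random features here are a stand-in for real (e.g. mean-pooled) model activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
num_pairs, num_layers, dim = 500, 12, 256

# Stand-in features: one vector per layer for the grammatical (label 1)
# and ungrammatical (label 0) member of each minimal pair.
feats = rng.normal(size=(2 * num_pairs, num_layers, dim))
labels = np.array([1, 0] * num_pairs)

# One linear probe per layer on the frozen representations.
for layer in range(num_layers):
    X_tr, X_te, y_tr, y_te = train_test_split(
        feats[:, layer], labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"layer {layer:2d}: acc = {probe.score(X_te, y_te):.3f}")
```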
2023 Snoring Sound Dataset Annotation
Wuhan University, Wuhan, China
- Analyzed polysomnography (PSG) data from 40 patients to identify respiratory events and sleep stages
- Synchronized clinical PSG signals with ≈170 hours of audio using PSG4/Sleepware and Audition
- Labeled snores relative to respiratory events and cross-referenced sleep stages, categorizing snores by their temporal relation to events
- Contributed to the foundational dataset published at Interspeech 2023 (PDF)
Work Experience
2024 Research Engineer Intern
Wiz.AI, Singapore
- Developed a multi-task speech large language model that understands both content and emotion
- Engineered a prompt strategy that integrates dialogue history to enhance Chain-of-Thought reasoning
- Utilized a window-level query mechanism to capture fine-grained emotional features from raw speech
- Fine-tuned a cross-modal alignment module between speech (Q-Former) and text (Vicuna LLM)
- Implemented a contrastive-learning loss on emotion embeddings, achieving SOTA speech emotion recognition results of 74.48% on IEMOCAP and 62.61% on MELD (a loss of this family is sketched below)
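A minimal sketch of a supervised contrastive (SupCon-style) loss over emotion embeddings, which pulls same-emotion utterances together and pushes different emotions apart; the exact loss used in the internship may differ.

```python
import torch
import torch.nn.functional as F

def supcon_loss(emb: torch.Tensor, labels: torch.Tensor, tau: float = 0.1):
    """emb: (B, D) emotion embeddings; labels: (B,) emotion class ids."""
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.T / tau                                # (B, B) similarities
    self_mask = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    sim = sim.masked_fill(self_mask, -1e9)                 # exclude self-pairs
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    # Average log-probability over each anchor's positives (same emotion).
    loss = -(log_prob.masked_fill(~pos, 0.0)).sum(dim=1) / pos.sum(dim=1)
    return loss.mean()

emb = torch.randn(8, 64)                 # toy batch of emotion embeddings
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(supcon_loss(emb, labels))
```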
2023 Machine Learning Engineer Intern
Wiz.AI, Singapore
- Spearheaded a systematic evaluation of voice-cloning models and automated the deployment pipeline
- Benchmarked open- and closed-source voice-cloning solutions on speech quality and naturalness
- Fine-tuned a StyleTTS2-based voice-cloning model on LibriSpeech and a proprietary dataset
- Built a speaker verification framework with WeSpeaker, PyAnnote, and SpeechBrain for evaluation (sketched below)
- Achieved speaker similarity (≈0.84) competitive with the commercial model using only 245 hours of data (1/40 of the original dataset)
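A minimal sketch of the speaker-similarity measurement, using SpeechBrain's pretrained ECAPA-TDNN verifier; the file paths are placeholders, and WeSpeaker/PyAnnote backends would be used analogously.

```python
from speechbrain.pretrained import SpeakerRecognition

# Pretrained ECAPA-TDNN speaker verifier from SpeechBrain's model zoo.
verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained/spkrec-ecapa-voxceleb",
)

# Cosine-similarity score between a reference recording and a cloned sample;
# higher means more similar (the ≈0.84 above is a score of this kind).
score, same_speaker = verifier.verify_files("reference.wav", "cloned.wav")
print(f"similarity = {score.item():.3f}, same speaker = {bool(same_speaker)}")
```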
Projects
2022 Speech Synthesis Implementation for Game Avatars
Wuhan University, Wuhan, China
- Developed a high-fidelity text-to-speech (TTS) model for the character "Paimon" from Genshin Impact
- Engineered a data pipeline using ECAPA-TDNN for speaker classification and Whisper for transcription (the transcription step is sketched below)
- Built and annotated a multi-speaker dataset of ≈48,000 clips (15 hrs) from 50 Genshin Impact characters
- Fine-tuned a VITS-based speech synthesis model on a curated set of "Paimon" audio clips
- Launched the project as a technical demo on Bilibili, attracting over 600,000 views, and deployed the model on Google Colab for public inference
- Colab | Bilibili
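A minimal sketch of the Whisper transcription step in that pipeline, using the open-source openai-whisper package; the model size and clip path are placeholder assumptions.

```python
import whisper

# Load an open-source Whisper checkpoint (size chosen for illustration).
model = whisper.load_model("medium")

def transcribe_clip(path: str) -> str:
    # Transcribe one extracted game-audio clip; the language could also
    # be set explicitly via transcribe(path, language=...).
    result = model.transcribe(path)
    return result["text"].strip()

print(transcribe_clip("clips/paimon_0001.wav"))  # hypothetical clip path
```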
Technologies
- Languages: Python, Java, C, C++
- Tools: Linux, Git, PyTorch