Along with NJIT researchers, Fuentes is developing AI audio captioning technology for deaf and hard-of-hearing viewers in a three-year project.
Magdalena Fuentes
Magdalena Fuentes, an NYU assistant professor with dual appointments in music technology at NYU Steinhardt and in integrated design and media at the NYU Tandon School of Engineering, is leading a team that includes New Jersey Institute of Technology (NJIT) colleagues Mark Cartwright and Sooyeon Lee. Together they are developing AI systems that automatically caption non-speech sounds in videos, from footsteps and door slams to background music and off-screen crashes.
The technology aims to address a persistent accessibility gap that affects over one billion deaf and hard-of-hearing people worldwide who cannot access crucial audio information when watching online videos.
Fuentes—who serves in Tandon's Technology, Culture and Society Department as well as in Steinhardt's Music and Audio Research Laboratory (MARL)—and her NJIT collaborators just received nearly $800,000 in grants from the National Science Foundation (NSF) for their three-year project.
The team is creating AI models that identify environmental sounds, determine which are important enough to caption, and adapt descriptions to individual viewer preferences. While automatic speech recognition has made dialogue captions common on YouTube, environmental sounds remain largely uncaptioned, creating barriers as viewers may read dialogue but miss crucial audio cues like approaching footsteps in a thriller or boiling water in a cooking tutorial.
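To illustrate the shape of such a pipeline, here is a minimal sketch in Python: it takes detected non-speech sound events, scores how caption-worthy each one is, and emits timed cues. All class names, function names, thresholds, and example sounds are hypothetical illustrations for this article, not the team's actual models or code.

```python
# Hypothetical sketch of a non-speech captioning pipeline:
# detect sound events, score their importance, emit timed caption cues.

from dataclasses import dataclass


@dataclass
class SoundEvent:
    label: str         # e.g., "footsteps", "door slam"
    start: float       # seconds into the video
    end: float
    confidence: float  # detector confidence in [0, 1]


def detect_events(audio_path: str) -> list[SoundEvent]:
    """Placeholder for an audio tagging model. In practice a neural network
    trained on environmental-sound data would produce events like these."""
    return [
        SoundEvent("footsteps approaching", 12.0, 14.5, 0.91),
        SoundEvent("background music", 0.0, 30.0, 0.88),
        SoundEvent("door slam", 14.6, 15.0, 0.95),
    ]


def importance(event: SoundEvent) -> float:
    """Toy importance score: short, confident sounds rank higher than long
    ambient beds. A real system would learn this weighting from viewer data."""
    duration = event.end - event.start
    ambient_penalty = 0.5 if duration > 10.0 else 1.0
    return event.confidence * ambient_penalty


def caption(events: list[SoundEvent], threshold: float = 0.6) -> list[str]:
    """Keep only events important enough to caption and format them as cues."""
    cues = []
    for ev in sorted(events, key=lambda e: e.start):
        if importance(ev) >= threshold:
            cues.append(f"[{ev.start:06.2f}-{ev.end:06.2f}] [{ev.label}]")
    return cues


if __name__ == "__main__":
    for line in caption(detect_events("video_audio.wav")):
        print(line)
```

In this toy version, the long background-music bed scores below the threshold and is dropped, while the footsteps and door slam are kept; deciding which sounds matter, rather than merely detecting them, is the harder research question the project targets.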
The research team's analysis reveals the scope of the challenge: with over 500 hours of video uploaded to YouTube every minute, only about 4% of popular videos include captions for non-speech sounds. Their studies show this percentage has declined over the past decade as content creators increasingly rely on speech-only automated captioning tools.
"These sounds are often critical for understanding content, yet they're essentially inaccessible to deaf and hard-of-hearing viewers," said Fuentes, who specializes in audio-visual machine learning. "We're developing AI that can identify, prioritize, and describe environmental sounds in ways that serve users' diverse needs."
The technology goes beyond simply detecting sounds. The system will also adapt caption style and detail level to match user preferences: some viewers may want detailed acoustic descriptions while others prefer brief functional summaries.
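As a rough illustration of that kind of personalization, the sketch below renders the same detected sound either as a brief functional cue or as a fuller acoustic description, depending on a viewer's setting. The preference options and wording shown here are assumptions for illustration, not the project's actual interface.

```python
# Hypothetical sketch of preference-aware caption rendering.

from dataclasses import dataclass


@dataclass
class UserPrefs:
    detail: str = "brief"  # "brief" -> short functional cue, "detailed" -> fuller description


def render(label: str, description: str, prefs: UserPrefs) -> str:
    """Return either a terse cue or a fuller acoustic description for the same sound."""
    if prefs.detail == "detailed":
        return f"[{description}]"
    return f"[{label}]"


# The same detected sound, shown two ways:
print(render("tense music", "low, slowly building string music", UserPrefs(detail="detailed")))
print(render("tense music", "low, slowly building string music", UserPrefs(detail="brief")))
```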
The approach recognizes that effective audio captioning cannot be uniform across all users. The team's survey of 168 deaf and hard-of-hearing participants revealed significant diversity in preferences: when researchers analyzed the most popular responses for different types of sounds, they found that guidelines based on majority preferences would not fully satisfy a single participant, and more than half of respondents would have fewer than half of their preferences met.
"This shows us we need adaptive, personalized solutions," said Fuentes. "Our AI needs to understand not just what sounds are present, but why they matter in context and how different users want them described."
The research combines machine learning development with extensive community engagement. The team will conduct surveys and interviews with deaf and hard-of-hearing viewers and content creators, followed by collaborative design workshops to ensure the tools meet actual user needs.
The project has partnerships with Adobe, Google/YouTube, and New York Public Radio, which could facilitate broader adoption of the resulting tools. The researchers plan to release their datasets, algorithms, and software as open-source resources, potentially enabling implementation across the video industry.
The technology could also benefit older adults with hearing difficulties, people in noisy environments, and anyone in situations where audio cannot be used. While primarily designed for accessibility, the tools may find applications in content discovery, media analysis, and automated video production workflows.
The project begins this summer, with initial prototypes expected within two years and a complete platform anticipated by 2028. The work, previously supported by a MARL Seed Award, is part of ongoing efforts to make digital media more inclusive as video content continues to dominate online communication and entertainment.
Learn more about the NSF grants for "HCC: Medium: AI-Supported Audio Captioning of Non-Speech Information" awarded to the NYU and NJIT teams.