ACM Multimedia 2026 Workshop

JAV-CG: The 1st International Workshop on Joint Audio-Video Comprehension and Generation

A dedicated forum for the next wave of audio-video intelligence, spanning robust multimodal understanding, synchronized generation, and unified models that bridge perception and creation.

Venue: Rio de Janeiro, Brazil
Dates: 10-14 November 2026
Submission: 16 July 2026
Scope: Comprehension, Generation, Unified AV
About

Joint audio-video intelligence as one research problem.

Real-world multimedia is naturally audio-visual, yet many multimodal systems still treat sound as a side channel. JAV-CG focuses on models that listen, see, reason, and generate synchronized multimedia in one coherent framework.

The workshop brings together multimedia, audio and speech, computer vision, and multimodal foundation-model communities to sharpen the research agenda for audio-video comprehension and generation.

Important Dates

Submission timeline and event milestones.

01

Workshop Paper Submission

16 July 2026

Official ACM MM 2026 workshop contribution deadline.

02

Author Notification

06 August 2026

Acceptance notification for workshop submissions.

03

Camera-Ready / Final Metadata

20 August 2026

Final accepted material and metadata due under the ACM MM 2026 workshop schedule.

04

Author Registration

20 August 2026

Registration deadline for accepted workshop contributions.

05

ACM Multimedia 2026

10-14 November 2026

Conference venue: Rio de Janeiro, Brazil.

06

Workshop Day

Coming Soon

The final agenda and the exact workshop day will be announced once logistics are confirmed.

Call for Papers

Submissions are welcome across understanding, generation, and unified AV modeling.

JAV-CG welcomes archival workshop papers intended for the ACM MM 2026 workshop proceedings, as well as non-archival featured-paper submissions for workshop presentation. Technical, position, and perspective papers may be up to 8 pages plus references.

ACM format | Double blind | English only | Archival + non-archival

Audio-Visual Comprehension

  • Sound source localization and source separation
  • Audio-visual event detection and localization
  • Question answering, grounding, and scene reasoning
  • Trustworthy and long-form audio-visual understanding

Audio-Video Generation

  • Video-to-audio and text-to-audio-video synthesis
  • Audio-driven video generation and talking heads
  • Foley, spatial audio, multimodal editing, and music
  • Controllable synchronized generation across modalities

Unified AV Frameworks

  • Any-to-any multimodal generation involving audio and video
  • Joint tokenization, alignment, and representation learning
  • Unified encoder-decoder or MLLM architectures
  • Benchmarks, datasets, metrics, safety, and evaluation

The submission portal is live on OpenReview.

Best Paper Award

We will present a Best Paper Award to recognize outstanding workshop submissions.

Keynote Speakers

Confirmed keynote speakers shaping audio-video intelligence.

Yapeng Tian

Assistant Professor, University of Texas at Dallas

Title: Keynote title coming soon
Research Focus

His research integrates computer vision, audition, and machine learning for multisensory perception, audio-visual scene understanding, audio-visual scene generation, accessibility, healthcare, and image/video processing.

Speaker Bio

Yapeng Tian is an Assistant Professor in the Computer Science Department at UT Dallas, where he leads the Computer Vision and Multimodal Computing Lab. His work has been recognized with AAAI New Faculty Highlights, a Cisco Faculty Research Award, and an Amazon Research Award.

Chenliang Xu

Associate Professor, University of Rochester

Title: Keynote title coming soon
Research Focus

His research teaches machines to understand dynamic visual scenes through video, sound, and language, spanning computer vision, audio-visual learning, and trustworthy AI.

Speaker Bio

Chenliang Xu is a tenured Associate Professor of Computer Science at the University of Rochester and an affiliated faculty member of the Goergen Institute for Data Science and Artificial Intelligence. He received his Ph.D. from the University of Michigan in 2016 and was honored with the 2025 Edmund A. Hajim Outstanding Faculty Award.

More keynote speakers

Additional invited speakers will be announced soon.

Status: Program details and invited talk abstracts will be updated as they are confirmed.
Tentative Schedule

Program schedule is TBD.

Program Block 01

Schedule details TBD

Session | Duration | Speaker | Affiliation
Session TBD 01 | Duration TBD | Speaker TBD | Affiliation TBD
Session TBD 02 | Duration TBD | Speaker TBD | Affiliation TBD
Session TBD 03 | Duration TBD | Speaker TBD | Affiliation TBD
Break TBD | Duration TBD | - | -
Session TBD 04 | Duration TBD | Speaker TBD | Affiliation TBD
Session TBD 05 | Duration TBD | Speaker TBD | Affiliation TBD
Program Block 02

Schedule details TBD

Session | Duration | Speaker | Affiliation
Session TBD 06 | Duration TBD | Speaker TBD | Affiliation TBD
Session TBD 07 | Duration TBD | Speaker TBD | Affiliation TBD
Session TBD 08 | Duration TBD | Speaker TBD | Affiliation TBD
Break TBD | Duration TBD | - | -
Session TBD 09 | Duration TBD | Speaker TBD | Affiliation TBD
Session TBD 10 | Duration TBD | Speaker TBD | Affiliation TBD
Organizers

An international team across multimedia, AV learning, and foundation models.

You Qin

National University of Singapore

Shengqiong Wu

University of Oxford

Liang Zheng

Australian National University

Roger Zimmermann

National University of Singapore

Jiebo Luo

University of Rochester

Tat-Seng Chua

National University of Singapore

Contact

Workshop correspondence