FG 2026 Oral Presentation

StyleDiT: A Unified Framework for Diverse Child and Partner Faces Synthesis with Style Latent Diffusion Transformer

A unified kinship face synthesis framework for generating diverse, high-fidelity child and partner faces with controllable age, gender, and relational resemblance.

StyleGAN S-space, Diffusion Transformer, Relational Trait Guidance
Pin-Yen Chiu*, Dai-Jie Wu*, Po-Hsun Chu, Chia-Hsuan Hsu, Hsiang-Chen Chiu, Chih-Yu Wang, Jun-Cheng Chen

Research Center for Information Technology Innovation, Academia Sinica

* Equal contribution

Abstract

Modeling kinship as a controllable latent distribution.

StyleDiT combines a diffusion transformer with StyleGAN's style latent space to sample kinship-aware face latents from two conditioning images. The framework handles both child prediction from parents and partner prediction from a child-parent pair, while preserving fine-grained controls over age and gender. Relational Trait Guidance gives independent control over each conditioning face, improving the balance between diversity and fidelity.

Unified Tasks

One framework supports child synthesis from father-mother pairs and partner synthesis from a child and one parent.

Diverse Samples

Sampling in StyleGAN S-space with a diffusion prior produces multiple plausible outcomes from the same kinship conditions.

Attribute Control

Age, gender, and parent-specific resemblance can be adjusted while keeping high visual fidelity from the StyleGAN generator.

Interactive Browser

Diverse generated children from selected parent pairs.

The browser displays offline-generated assets: two fathers, two mothers, gender tabs, an age slider, and four stochastic child predictions for each condition.

Partner Synthesis

Predicting the missing parent from a child-parent pair.

Figure 7 demonstrates the second task in StyleDiT: given a child and one known parent, the same framework synthesizes diverse candidate partners while preserving plausible family traits.

Child + mother -> father; child + father -> mother
Partner face synthesis from Figure 7, including comparison with ParentGAN and diverse stochastic predictions.

Method

Style latents, transformer denoising, and RTG.

Input faces are encoded into StyleGAN style latents. StyleDiT denoises a sampled latent conditioned on both inputs, then StyleGAN2 decodes the result. Relational Trait Guidance extends classifier-free guidance to independently tune the influence of each condition.

Overview diagram of the StyleDiT pipeline, diffusion process, denoising transformer, and tokenizer
StyleDiT encodes input faces, samples a kinship-aware StyleGAN S-space latent with a diffusion transformer, and decodes the output through frozen StyleGAN2.
Image Encoder: Maps input faces into style latents used as parent or child-partner conditions.
StyleDiT Prior: Models the complex distribution of kinship relationships in 9,088-dimensional S-space.
RTG: Weights each relational condition independently during inference for controllable resemblance.
StyleGAN2: Decodes predicted latents into high-resolution faces while preserving attribute controls.
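
A minimal sketch of how per-condition guidance could be combined at a single denoising step, assuming the usual classifier-free guidance form extended with one weight per parent. The function name `rtg_combine`, the weights `w_father` and `w_mother`, and the exact combination rule are illustrative assumptions, not the paper's formulation. Plain Python lists stand in for S-space noise predictions.

```python
from typing import List, Sequence


def rtg_combine(
    eps_uncond: Sequence[float],
    eps_father: Sequence[float],
    eps_mother: Sequence[float],
    w_father: float,
    w_mother: float,
) -> List[float]:
    """Combine three denoiser outputs with one guidance weight per condition.

    Assumed rule (hypothetical, modeled on classifier-free guidance):
        eps = eps_uncond
              + w_father * (eps_father - eps_uncond)
              + w_mother * (eps_mother - eps_uncond)

    eps_uncond: prediction with both conditions dropped.
    eps_father: prediction conditioned on the father only.
    eps_mother: prediction conditioned on the mother only.
    """
    if not (len(eps_uncond) == len(eps_father) == len(eps_mother)):
        raise ValueError("all predictions must have the same length")
    return [
        u + w_father * (f - u) + w_mother * (m - u)
        for u, f, m in zip(eps_uncond, eps_father, eps_mother)
    ]


# Toy 2-dimensional example; real S-space latents have 9,088 dimensions.
uncond = [0.0, 1.0]
father = [1.0, 3.0]
mother = [-1.0, 0.0]
father_leaning = rtg_combine(uncond, father, mother, w_father=2.0, w_mother=1.0)
```

Under this assumed rule, setting both weights to zero recovers the unconditional prediction, and raising one weight relative to the other pushes the sample toward that parent's condition.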

Results

Controllability, baselines, and data limits.

We highlight the experiments most tied to StyleDiT's contributions: Relational Trait Guidance, qualitative child synthesis against prior baselines, and the limitation of relying on scarce real kinship data.

Relational Trait Guidance

RTG exposes direct control over diversity and resemblance.

Varying father and mother guidance scales shifts the generated child toward the selected condition. The ablation also shows that RTG notably improves diversity while preserving competitive identity similarity.

Child faces generated under different father and mother RTG guidance scales
Changing parent-specific RTG scales controls resemblance during inference.
Diversity comparison between StyleDiT with RTG, without RTG, and baseline methods
RTG increases diversity among generated children from the same parent pair.
Ablation table for diffusion process and RTG
Diffusion and RTG improve diversity while preserving competitive identity similarity.
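
One simple way to quantify diversity among children generated from the same parent pair is the mean pairwise distance between their latents or embeddings. The sketch below is illustrative only: the helper name `mean_pairwise_distance` and the choice of Euclidean distance are assumptions, and the diversity metric reported in the paper may differ.

```python
import math
from itertools import combinations
from typing import Sequence


def mean_pairwise_distance(samples: Sequence[Sequence[float]]) -> float:
    """Average Euclidean distance over all unordered pairs of samples.

    Returns 0.0 when fewer than two samples are given, since no pair exists.
    """
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Toy example: four "child" vectors sampled for one hypothetical parent pair.
children = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [3.0, 4.0]]
diversity = mean_pairwise_distance(children)
```

A higher value means the samples are more spread out; identical samples give zero.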

Child Prediction

Qualitative comparison against child face generation baselines.

Compared with StyleGene, ChildNet, KinStyle, ChildPredictor, and FreeMorph, StyleDiT better balances diversity, visual quality, parental traits, and explicit age/gender control.

Qualitative comparison on FIW and TSKinFace across multiple child face synthesis baselines.

Real Data Ablation

Existing real kinship data is not enough by itself.

Training with synthetic plus real data, fine-tuning on real data, or using real data only did not improve overall identity similarity. This supports the paper's observation that current real kinship datasets remain limited in quantity and quality.

Qualitative comparison of synthetic-only, synthetic plus real, real fine-tuning, and real-only training
Real-data variants can capture coarse facial contours but are less stable than the default synthetic-only setup.
Identity similarity comparison across real-data training configurations
Identity similarity drops under the real-data variants across FIW, TSKinFace, and FF-Database.
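
Identity similarity between a generated child and a reference face is commonly measured as the cosine similarity of face-recognition embeddings. The sketch below assumes that convention; the function name `cosine_similarity` and the toy vectors are hypothetical, and the embedding model and exact protocol used in the paper are not specified here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy embeddings standing in for the output of a face-recognition model.
generated_child = [1.0, 0.0]
real_child = [1.0, 0.0]
score = cosine_similarity(generated_child, real_child)
```

Scores closer to 1 indicate that the two faces are closer in the embedding space.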

Citation

BibTeX

Accepted to the 2026 IEEE International Conference on Automatic Face and Gesture Recognition as an oral presentation.

@inproceedings{chiu2026styledit,
  title={StyleDiT: A Unified Framework for Diverse Child and Partner Faces Synthesis with Style Latent Diffusion Transformer},
  author={Chiu, Pin-Yen and Wu, Dai-Jie and Chu, Po-Hsun and Hsu, Chia-Hsuan and Chiu, Hsiang-Chen and Wang, Chih-Yu and Chen, Jun-Cheng},
  booktitle={2026 IEEE International Conference on Automatic Face and Gesture Recognition (FG)},
  year={2026}
}