This repository contains the official implementation of DiffusionRank, a diffusion-based deep generative approach to Learning-to-Rank (LTR) that models the joint distribution over feature vectors and relevance labels.
Traditional Learning-to-Rank methods use discriminative machine learning approaches that model the probability of document relevance given query-document features. DiffusionRank takes an alternative approach by extending TabDiff, a denoising diffusion-based generative model for tabular data, to create generative equivalents of classical discriminative pointwise and pairwise LTR objectives.
- Generative LTR: Models the full joint distribution over features and relevance labels
- Pointwise & Pairwise: Supports both pointwise and pairwise training approaches
- Efficient Inference: Single-step denoising for relevance prediction, comparable to discriminative models
- Flexible Architecture: Uses feedforward networks with minimal parameter overhead
DiffusionRank/
├── generative/ # DiffusionRank implementation
│ ├── main.py # Main training/testing script
│ ├── tabdiff/ # TabDiff-based diffusion modules
│ │ ├── models/ # Diffusion model implementations
│ │ ├── modules/ # Neural network modules
│ │ ├── trainer.py # Training logic
│ │ └── tabdiff_configs.toml # Model configurations
│ └── synetune_launcher.py # Hyperparameter tuning
├── discriminative/ # Discriminative baseline models
│ ├── model.py # Neural network architecture
│ ├── pointwise.py # Pointwise discriminative model
│ ├── pairwise.py # Pairwise discriminative model (RankNet)
│ ├── pointwise_perturbed.py # Pointwise with perturbed features
│ ├── pairwise_perturbed.py # Pairwise with perturbed features
│ ├── listwise_lambdarank.py # LambdaRank implementation
│ └── xgb.py # XGBoost baseline
├── EDA/ # Exploratory data analysis
├── ltr_dataset_to_numpy.py # Convert raw LTR data to numpy and create fraction subsets
├── compute_ranking_metrics.py # Evaluation script
├── ndcg_significance_test.py # Statistical significance testing
└── utils.py # Utility functions
pip install -r requirements.txt- Python 3.8+
- PyTorch
- NumPy
- scikit-learn
- Weights & Biases (wandb)
We evaluate on three standard LTR benchmarks:
| Dataset | Queries (Train/Val/Test) | Features | Relevance Labels |
|---|---|---|---|
| MQ2007 | 1,017 / 339 / 336 | 46 | 3 (0-2) |
| MQ2008 | 471 / 157 / 156 | 46 | 3 (0-2) |
| MSLR-WEB10K | 6,000 / 2,000 / 2,000 | 136 | 5 (0-4) |
Place raw LETOR files under data/{dataset}/raw/. Then run:
python ltr_dataset_to_numpy.py --dataset MQ2007 --fold 1This converts raw text to numpy and creates fraction subsets in one step, producing:
data/
├── MQ2007/
│ └── by_fraction/
│ └── Fold1/
│ ├── k1.0/
│ │ ├── X_train.npy, y_train.npy, idx_train.npy
│ │ ├── X_train_non.npy, y_train_non.npy, idx_train_non.npy
│ │ ├── X_val.npy, y_val.npy, idx_val.npy
│ │ └── X_test.npy, y_test.npy, idx_test.npy
│ ├── k0.5/
│ │ └── ...
│ ├── k0.25/
│ ├── ...
├── MQ2008/
│ └── ...
└── MSLR-WEB10K/
└── ...
Fraction values: 1.0, 0.5, 0.25, 0.0625, 0.015625, 0.00390625
python generative/main.py \
--dataname MQ2007 \
--approach pointwise \
--mode train \
--steps 15000 \
--lr 5e-6 \
--batch_size 4096 \
--dim_t 256 \
--num_layers 4 \
--gpu 0python generative/main.py \
--dataname MQ2007 \
--approach pairwise \
--mode train \
--steps 15000 \
--lr 5e-6 \
--batch_size 4096 \
--dim_t 256 \
--num_layers 4 \
--gpu 0Use the --k parameter to train with a subset of data:
python generative/main.py \
--dataname MSLR-WEB10K \
--approach pointwise \
--mode train \
--k 0.25 \
--gpu 0python discriminative/pointwise.py \
--dataset MQ2007 \
--task train \
--num_hidden_nodes 256 \
--lr 5e-6 \
--k 1.0python discriminative/pairwise.py \
--dataset MQ2007 \
--task train \
--num_hidden_nodes 256 \
--lr 5e-6 \
--k 1.0python baselines/xgb.py \
--dataset MQ2007 \
--approach pointwise \
--k 1.0python generative/main.py \
--dataname MQ2007 \
--approach pointwise \
--mode test \
--ckpt_path checkpoints/MQ2007/your_experiment/best_model.pt \
--gpu 0Evaluate predictions using NDCG@10 and MAP@10:
python compute_ranking_metrics.py --run_file predictions/your_predictions.txtpython ndcg_significance_test.py| Argument | Description | Default |
|---|---|---|
--dataname |
Dataset name (MQ2007, MQ2008, MSLR-WEB10K, MSLR-WEB30K) | - |
--approach |
Training approach (pointwise, pairwise) | pointwise |
--mode |
Mode (train, test) | train |
--steps |
Number of training steps | 15000 |
--lr |
Learning rate | 5e-6 |
--batch_size |
Batch size | 4096 |
--dim_t |
Hidden dimension | 256 |
--num_layers |
Number of hidden layers | 4 |
--k |
Fraction of training data to use | 1.0 |
--gpu |
GPU index (-1 for CPU) | 0 |
--no_wandb |
Disable Weights & Biases logging | False |
| Argument | Description | Default |
|---|---|---|
--dataset |
Dataset name | - |
--task |
Task (train, test) | - |
--num_hidden_nodes |
Hidden layer size | - |
--lr |
Learning rate | 5e-6 |
--k |
Fraction of training data | 1.0 |
--checkpoint |
Path to model checkpoint (for testing) | None |
--no_wandb |
Disable Weights & Biases logging | False |
DiffusionRank extends the discriminative model architecture by:
- Input: Adding the (possibly masked) relevance label and diffusion time step as additional inputs
- Output: Jointly predicting the relevance label and the noise added to features
The model uses a feedforward network with:
- 4 hidden layers (default)
- SiLU activation
- Layer normalization
- Dropout (0.1)
- Hidden size: 256 (LETOR 4.0) or 1024 (MSLR-WEB10K)
DiffusionRank demonstrates consistent improvements over discriminative baselines on standard LTR benchmarks, particularly on larger datasets like MSLR-WEB10K. The generative training objective helps the model learn more robust representations by modeling the joint distribution of features and labels.
Training progress is logged to Weights & Biases. To disable logging:
python generative/main.py --dataname MQ2007 --approach pointwise --mode train --no_wandbThis work builds upon TabDiff, a mixed-type diffusion model for tabular data generation.
