Understanding PSMGD: Fast Multi-Objective Optimization with Periodic Gradient Weights


Introduction

Modern machine learning systems rarely optimize only one clean objective. A multi-task perception model may need to segment an image, estimate depth, and predict surface normals at the same time. A graph neural network for molecular property prediction may need to predict many chemical quantities from one shared representation. A ranking system may need to balance relevance, fairness, diversity, and user satisfaction.

This is the setting of multi-objective optimization (MOO). Instead of minimizing one loss, we optimize a vector of losses:

$$ \min_{\mathbf{x} \in \mathcal{D}} \mathbf{F}(\mathbf{x}) := \left[ f_1(\mathbf{x}), f_2(\mathbf{x}), \cdots, f_S(\mathbf{x}) \right], $$

where $\mathbf{x} \in \mathcal{D} \subseteq \mathbb{R}^d$ denotes the model parameters and $f_s$ is the $s$-th objective. In multi-task learning, each $f_s$ is often the loss of one task.

The paper PSMGD: Periodic Stochastic Multi-Gradient Descent for Fast Multi-Objective Optimization studies a practical bottleneck in this setting. Gradient manipulation methods can produce good update directions by dynamically combining task gradients, but they often pay a high computational cost at every training iteration. PSMGD asks a simple question:

If the dynamic task weights do not change much over short training intervals, do we really need to recompute them at every iteration?

The answer proposed in the paper is no. PSMGD computes dynamic weights periodically, reuses them between recomputation steps, and still preserves strong convergence guarantees.

Paper links:

Why Multi-Objective Optimization Is Hard

The most direct way to handle multiple losses is linear scalarization:

$$ \min_{\mathbf{x}} \sum_{s=1}^{S} \lambda_s f_s(\mathbf{x}), \quad \lambda_s \ge 0, \quad \sum_{s=1}^{S} \lambda_s = 1. $$

This converts MOO into a standard single-objective problem. It is computationally cheap because one weighted loss can be backpropagated once.

The weakness is that fixed weights can behave poorly when task gradients conflict. Let $\nabla f_i(\mathbf{x})$ and $\nabla f_j(\mathbf{x})$ be two objective gradients. If their inner product is negative,

$$ \nabla f_i(\mathbf{x})^\top \nabla f_j(\mathbf{x}) < 0, $$

then moving in the direction that helps one objective may hurt the other. This is common in multi-task learning: improving one task’s loss can temporarily damage another task’s loss.

The more principled target is not a single optimum for one scalar loss, but a Pareto stationary point. A point $\mathbf{x}$ is Pareto stationary if there is no direction $\mathbf{d}$ that decreases every objective at once:

$$ \nabla f_s(\mathbf{x})^\top \mathbf{d} < 0, \quad \forall s \in [S]. $$

This condition is important because exact Pareto optimality is generally too hard to seek directly in large non-convex models. Pareto stationarity is a practical first-order goal.

The Standard Gradient-Manipulation View

Multiple Gradient Descent Algorithm (MGDA) is a useful reference point. Given the gradients of all objectives, MGDA finds a convex combination with minimum norm:

$$ \boldsymbol{\lambda}^{*}(\mathbf{x}) = \operatorname*{argmin}_{\boldsymbol{\lambda}} \left\| \sum_{s=1}^{S} \lambda_s \nabla f_s(\mathbf{x}) \right\|^2, \quad \sum_{s=1}^{S} \lambda_s = 1, \quad \lambda_s \ge 0. $$

The update direction is then

$$ \mathbf{d} = \sum_{s=1}^{S} \lambda_s^{*} \nabla f_s(\mathbf{x}). $$

If $\mathbf{x}$ is not Pareto stationary, this type of direction can serve as a common descent direction. This is the main strength of gradient manipulation methods: they adapt weights according to the current gradient geometry rather than relying on static hand-tuned coefficients.

The cost is the extra dynamic-weight computation. In many deep-learning systems, one must compute or approximate each task gradient, solve an additional optimization problem, combine the gradients, and then update the model. This can become expensive when:

  • the number of objectives $S$ is large,
  • the model dimension $d$ is large,
  • the method recomputes dynamic weights at every iteration,
  • the training loop is already bottlenecked by backpropagation.

This is the bottleneck PSMGD targets.

The Key Observation Behind PSMGD

The paper’s core observation is empirical and algorithmic:

Dynamic weights often change slowly over short intervals.

In the pedagogical Fonseca example, MGDA’s dynamic weights converge quickly and then stay stable. In the paper’s multi-task learning studies, the same idea appears at both epoch and iteration scales: the average dynamic weights are stable and show low variability.

This suggests that recomputing $\boldsymbol{\lambda}_t$ at every iteration may be wasteful. If the current dynamic weights are close to the previous dynamic weights, we can reuse the old weights for several iterations and spend the saved computation on actual model updates.

This is the central design idea of Periodic Stochastic Multi-Gradient Descent (PSMGD).

Algorithmic Architecture

PSMGD has three moving parts:

  1. Periodic weight calculation: every $R$ iterations, solve the dynamic-weight problem.
  2. Weight smoothing: mix the newly computed weights with the previous periodic weights.
  3. Weight reuse: between periodic calculation steps, reuse the most recent weights and train with a weighted loss.

The architecture can be read as:

Objective losses f_1, ..., f_S
        |
        v
Compute task gradients when t % R == 0
        |
        v
Solve dynamic-weight QP for lambda_hat_t
        |
        v
Momentum smoothing: lambda_t = alpha_t lambda_{t-R} + (1 - alpha_t) lambda_hat_t
        |
        v
Reuse lambda_t for intermediate iterations
        |
        v
Weighted gradient direction d_t = sum_s lambda_{t,s} grad f_s
        |
        v
Parameter update x_{t+1} = x_t - eta_t d_t

More formally, when $t \bmod R = 0$, PSMGD computes:

$$ \hat{\boldsymbol{\lambda}}_t^{*} \in \operatorname*{argmin}_{\boldsymbol{\lambda}} \left\| \sum_{s \in [S]} \lambda_s \nabla f_s(\mathbf{x}_t, \xi_t) \right\|^2, \quad \text{s.t.} \quad \sum_{s \in [S]} \lambda_s = 1, \quad \lambda_s \ge 0. $$

Here $\xi_t$ denotes the stochastic mini-batch randomness. The computed weights are then smoothed:

$$ \boldsymbol{\lambda}_t = \alpha_t \boldsymbol{\lambda}_{t-R} + (1-\alpha_t)\hat{\boldsymbol{\lambda}}_t^{*}. $$

When $t \bmod R \ne 0$, PSMGD simply reuses the previous weights:

$$ \boldsymbol{\lambda}_t = \boldsymbol{\lambda}_{t-1}. $$

Finally, it constructs the stochastic common direction:

$$ \mathbf{d}_t = \sum_{s \in [S]} \lambda_{t,s} \nabla f_s(\mathbf{x}_t, \xi_t), $$

and updates the parameters:

$$ \mathbf{x}_{t+1} = \mathbf{x}_t - \eta_t \mathbf{d}_t. $$

This is a small change in code structure but a large change in computational behavior. Most iterations no longer need to solve the dynamic-weight problem.

Pseudocode

Input: initial parameter x_0, learning rates eta_t, period R
Initialize lambda_0

for t = 0, 1, ..., T - 1:
    if t mod R == 0:
        compute stochastic gradients grad f_s(x_t, xi_t) for all objectives
        solve the MGDA-style quadratic problem to obtain lambda_hat_t
        lambda_t = alpha_t lambda_{t-R} + (1 - alpha_t) lambda_hat_t
    else:
        lambda_t = lambda_{t-1}

    d_t = sum_s lambda_{t,s} grad f_s(x_t, xi_t)
    x_{t+1} = x_t - eta_t d_t

The hyperparameter $R$ controls the trade-off:

  • $R=1$ resembles standard dynamic gradient manipulation with frequent weight updates.
  • Larger $R$ reduces dynamic-weight computation.
  • If $R$ is too large, the reused weights may become stale.

The main symbols are:

  • $S$: number of objectives or tasks.
  • $R$: period for recomputing the dynamic weights.
  • $\eta_t$: learning rate at iteration $t$.
  • $\boldsymbol{\lambda}_t$: objective-weight vector used at iteration $t$.
  • $\alpha_t$: momentum coefficient for smoothing newly computed weights.
  • $\xi_t$: stochastic mini-batch randomness.

The algorithm is therefore best understood as a principled “lazy update” scheme for multi-objective gradient weights.

Why Reusing Weights Can Still Work

The risk of reusing $\boldsymbol{\lambda}$ is that the training landscape changes while the weights stay fixed. The paper handles this by analyzing two cases:

  1. Calculation steps, where the current stochastic gradients influence the newly computed weights.
  2. Reuse steps, where the weights are fixed from previous steps and are independent of the current mini-batch sampling.

This distinction matters because the stochastic analysis is different in the two cases. At calculation steps, the weight vector depends on the same stochastic gradients used in the update. At reuse steps, the weight vector behaves like a fixed coefficient vector for the current stochastic gradient estimate.

The paper defines the weighted objective:

$$ G(\mathbf{x}_t,\boldsymbol{\lambda}_t) = \sum_{s \in [S]} \lambda_{t,s} f_s(\mathbf{x}_t), $$

with gradient:

$$ \nabla G(\mathbf{x}_t,\boldsymbol{\lambda}_t) = \sum_{s \in [S]} \lambda_{t,s} \nabla f_s(\mathbf{x}_t). $$

The convergence analysis then studies progress under common assumptions such as smoothness, bounded stochastic-gradient variance, bounded weights, and bounded gradients.

At a high level, the theory says:

  • For strongly convex objectives, PSMGD reaches an $\mathcal{O}(1/T)$ convergence rate.
  • For general convex objectives, PSMGD reaches an $\mathcal{O}(1/\sqrt{T})$ rate.
  • For non-convex objectives, PSMGD also reaches an $\mathcal{O}(1/\sqrt{T})$ stationarity-style rate.

These match state-of-the-art iteration rates for stochastic MOO methods, while reducing the number of expensive dynamic-weight computations.

Backpropagation Complexity

Iteration complexity alone is not enough for MOO. Two methods can have the same iteration rate but very different per-iteration costs. For deep learning, the more relevant unit is often backpropagation.

The paper introduces Backpropagation Complexity (BP complexity):

the total number of backpropagation operations required to reach a specified performance threshold $\epsilon$.

This metric captures the practical cost of computing per-objective gradients. PSMGD’s main efficiency result is that periodic weight calculation reduces BP complexity by a factor related to $R$.

For strongly convex functions, the paper reports:

$$ \mathcal{O} \left( \frac{S}{\epsilon R} + \frac{R-1}{\epsilon R} \right). $$

For general convex and non-convex functions, it reports:

$$ \mathcal{O} \left( \frac{S}{\epsilon^2 R} + \frac{R-1}{\epsilon^2 R} \right). $$

The interpretation is direct:

  • The term involving $S$ comes from steps where all objective gradients are needed for dynamic-weight calculation.
  • The term involving $R-1$ comes from reuse steps, which behave more like a weighted single-objective update.
  • Increasing $R$ amortizes the expensive all-objective gradient computation.

Most importantly, if $R = \Omega(S)$, the BP complexity becomes objective-independent in order:

$$ \mathcal{O}(1/\epsilon) \quad \text{for strongly convex functions}, $$

and

$$ \mathcal{O}(1/\epsilon^2) \quad \text{for general convex and non-convex functions}. $$

This is the most important theoretical message of the paper: with a suitable update period, PSMGD can require the same order of backpropagation operations as classic SGD for single-objective learning.

Connection to Multi-Task Learning

In multi-task learning, a common model has shared parameters and task-specific heads. A typical training loss can be written as:

$$ \mathcal{L}(\theta) = \sum_{s=1}^{S} \lambda_s \mathcal{L}_s(\theta_{\mathrm{shared}}, \theta_s), $$

where $\theta_{\mathrm{shared}}$ denotes shared parameters and $\theta_s$ denotes task-specific parameters.

The difficult part is choosing $\lambda_s$. Static weights are simple but may ignore conflict. Fully dynamic weights are adaptive but costly. PSMGD sits between them:

  • It preserves a gradient-aware dynamic weighting mechanism.
  • It avoids recomputing the dynamic weights at every iteration.
  • It uses periodic recomputation to keep the weights aligned with the current training state.

This makes PSMGD especially natural for deep multi-task models where the main bottleneck is repeatedly computing and manipulating multiple task gradients.

Experimental Results in Brief

The paper evaluates PSMGD mainly in multi-task learning settings, using datasets with 2 to 40 objectives:

  • QM-9 for molecular property regression with 11 tasks.
  • Multi-MNIST for two-task image classification.
  • CelebA for 40-task facial attribute classification.
  • NYU-v2 for dense prediction with segmentation, depth, and surface normal tasks.
  • CityScapes for dense prediction with two tasks.

The experimental section compares PSMGD with linear scalarization and many MOO or multi-task learning baselines, including MGDA, PCGrad, CAGrad, IMTL-G, NashMTL, FAMO, FairGrad, and SDMGrad. The paper reports final task quality using two aggregate metrics:

$$ \Delta m\% = \frac{1}{N} \sum_{n=1}^{N} (-1)^{\delta_n} \frac{M_{m,n}-M_{B,n}}{M_{B,n}} \times 100, $$

where $M_{B,n}$ is the single-task baseline value for metric $n$, $M_{m,n}$ is method $m$’s value, and $\delta_n$ marks whether higher values are better. Lower $\Delta m%$ is better. The second metric is Mean Rank (MR), the average rank of a method across task metrics; lower MR is also better.

The main takeaway is not that PSMGD wins every single task metric. Instead, the message is more practical:

  1. PSMGD is competitive in final task quality and sometimes improves aggregate task metrics.
  2. It substantially reduces training time compared with classical gradient manipulation methods.
  3. It often reaches target loss thresholds earlier, especially in early training.
  4. It gives a strong performance-efficiency trade-off, which is exactly the design goal.

For example, on QM-9, PSMGD has the shortest total time to reach the reported average-loss thresholds, even though linear scalarization has the cheapest per-epoch cost. On NYU-v2, PSMGD has strong dense-prediction convergence behavior and reaches several task-loss thresholds fastest among the compared methods.

What I Think Is the Core Principle

The main principle behind PSMGD is:

Do expensive coordination only when it is likely to matter.

In MOO, gradient manipulation is a coordination mechanism among objectives. It decides how the objectives should negotiate an update direction. But if the negotiation result changes slowly, then renegotiating every iteration is unnecessary.

PSMGD exploits this temporal structure. It does not discard dynamic weighting; it schedules it. This is why the method is simple but meaningful: it respects the geometry of multi-objective descent while reducing redundant computation.

There is also a broader lesson for optimization algorithm design. A method can be theoretically elegant but practically slow if it optimizes the wrong unit of cost. PSMGD explicitly shifts attention from iteration count to backpropagation count, which is closer to the real cost in deep learning.

Limitations and Open Questions

PSMGD introduces the period hyperparameter $R$, and this choice matters. If $R$ is too small, the method approaches ordinary dynamic gradient manipulation and saves less computation. If $R$ is too large, the reused weights may become stale and less aligned with the current gradients.

The method also assumes that dynamic weights are sufficiently stable over short intervals. This is often reasonable in the studied settings, but it may be less reliable in highly non-stationary training regimes, rapidly changing curricula, reinforcement learning, or tasks whose losses change at very different speeds.

Several natural extensions follow:

  • Adaptive period selection: update weights when gradient conflict changes enough, not only every fixed $R$ steps.
  • Preference-aware PSMGD: combine periodic weighting with user-specified trade-off preferences.
  • Pareto set learning: use the efficiency gain to explore multiple trade-off solutions rather than one solution.
  • Distributed or federated MOO: reduce communication and backpropagation cost when many objectives or clients are involved.

Takeaways

PSMGD is built around a clean observation: dynamic MOO weights are often stable enough to reuse. By periodically computing gradient-aware weights and reusing them between updates, the method preserves the benefits of gradient manipulation while reducing its most painful computational overhead.

The paper’s central contributions are:

  • a periodic stochastic multi-gradient algorithm for fast MOO,
  • convergence guarantees for strongly convex, convex, and non-convex objectives,
  • a BP complexity metric that better reflects deep-learning training cost,
  • objective-independent BP complexity when $R$ scales with the number of objectives,
  • empirical evidence that the method improves the performance-time trade-off in multi-task learning.

In one sentence: PSMGD makes multi-objective gradient manipulation more practical by realizing that task-weight negotiation does not need to happen at every step.


Author: Mingjing Xu
Reprint policy: All articles in this blog are used except for special statements CC BY 4.0 reprint policy. If reproduced, please indicate source Mingjing Xu !
 Current
Understanding PSMGD: Fast Multi-Objective Optimization with Periodic Gradient Weights Understanding PSMGD: Fast Multi-Objective Optimization with Periodic Gradient Weights
A technical explanation of PSMGD, covering multi-objective optimization, gradient conflict, periodic stochastic multi-gradient descent, convergence intuition, backpropagation complexity, and key empirical findings.
2026-06-04
Next 
Speaking the Language of Probability-From Classifier Evaluation to Generative Classifiers Speaking the Language of Probability-From Classifier Evaluation to Generative Classifiers
Study notes connecting classifier accuracy, confusion matrices, precision, recall, ROC/AUC, generative classifiers, Gaussian class-conditional models, sigmoid/softmax, LDA, and QDA.
2025-10-30
  TOC