I am talking about training models on thousands of machines, each with thousands of GPU streaming processors.
For data parallelism, if you want deterministic results, you need to merge weights (AllReduce, in the general case) in a deterministic way. So either you wait until all workers catch up to the same progress (going as slow as the weakest link), or you fix the differences due to data skew afterward. AFAIK, no one has developed reversible computation in DL in a way that allows fixing the data skew post facto in the general case. (1)
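To make the determinism point concrete, here is a minimal sketch in plain Python (no real AllReduce library involved). It assumes four hypothetical workers that each contribute one scalar "gradient"; the values and the `reduce_in_order` helper are made up for illustration. Because floating-point addition is not associative, summing in arrival order can give different results from run to run, while waiting for everyone and summing in a fixed rank order cannot.

```python
from itertools import permutations

def reduce_in_order(grads, order):
    """Sum per-worker values in the given order (a stand-in for a reduction)."""
    total = 0.0
    for rank in order:
        total += grads[rank]
    return total

# Hypothetical per-worker contributions, chosen to expose rounding effects.
grads = [1e16, 1.0, -1e16, 1.0]

# "Arrival order" varies between runs; enumerate every possible order.
possible_results = {
    reduce_in_order(grads, order) for order in permutations(range(len(grads)))
}

# Barrier-style reduction: wait for all workers, then always sum by rank.
rank_order = tuple(range(len(grads)))
deterministic_result = reduce_in_order(grads, rank_order)

print("results over all arrival orders:", sorted(possible_results))
print("fixed rank order:", deterministic_result)
```

The fixed order is what forces the barrier: the reduction cannot finish until the slowest worker has delivered its contribution.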
For model parallelism, you are bound by the other graph nodes that a computation depends on.
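A rough way to see the dependency bound is to compute the critical path of a small graph. The sketch below assumes a hypothetical four-node graph with made-up per-node costs; `finish_time`, `cost`, and `deps` are illustrative names, not part of any framework.

```python
def finish_time(node, cost, deps, memo=None):
    """Earliest finish time of a node: its own cost plus its slowest dependency."""
    if memo is None:
        memo = {}
    if node not in memo:
        slowest_dep = max(
            (finish_time(d, cost, deps, memo) for d in deps.get(node, ())),
            default=0,
        )
        memo[node] = cost[node] + slowest_dep
    return memo[node]

# Hypothetical graph: two branches computed on different devices, then merged.
cost = {"embed": 2, "branch_a": 3, "branch_b": 7, "merge": 1}
deps = {
    "branch_a": ["embed"],
    "branch_b": ["embed"],
    "merge": ["branch_a", "branch_b"],
}

baseline = finish_time("merge", cost, deps)

# Making the fast branch even faster does not help: the slow branch still gates "merge".
faster_a = dict(cost, branch_a=1)
after_speedup = finish_time("merge", faster_a, deps)

print(baseline, after_speedup)
```

However many devices you add, the step cannot finish before the slowest chain of dependencies does.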
This problem can be seen in large-scale reinforcement learning or simulation, or in other active learning scenarios, where exploring the unknown environment/data at different speeds can skew the learning. A simple example: imagine a VR world where the pace at which you can generate experiences depends on the number of objects in the scene, and where some parts of the world (deserts) are computationally expensive to cross but provide few rewards to sustain exploration before an agent can reach a reward-rich area. Without "countermeasures", agents are less likely to reach the reward-rich area if there are other avenues of exploration, even if the global optimum lies there.
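The desert example can be put in numbers with a toy calculation. Everything below is an assumption for illustration: the costs, the desert width, and the budget are invented, and `steps_within_budget` is a hypothetical helper. It compares a shared wall-clock budget with a per-worker step budget, which is one possible "countermeasure".

```python
def steps_within_budget(wall_clock_budget, cost_per_step):
    """Number of environment steps a worker completes in a fixed wall-clock budget."""
    return wall_clock_budget // cost_per_step

# Hypothetical numbers: the desert is 10x more expensive to simulate per step.
DESERT_COST = 10   # time units per step in the expensive region
CHEAP_COST = 1     # time units per step elsewhere
DESERT_WIDTH = 50  # steps needed to cross the desert and reach the rewards
BUDGET = 200       # wall-clock time units given to every worker

desert_steps = steps_within_budget(BUDGET, DESERT_COST)
cheap_steps = steps_within_budget(BUDGET, CHEAP_COST)

# Under a wall-clock budget the desert worker never gets across,
# and its experiences are a small share of the training data.
crossed_with_time_budget = desert_steps >= DESERT_WIDTH
desert_share = desert_steps / (desert_steps + cheap_steps)

# Countermeasure sketch: give every worker the same number of steps instead.
# The desert gets crossed, but everyone waits for the slowest worker.
STEP_BUDGET = 60
crossed_with_step_budget = STEP_BUDGET >= DESERT_WIDTH
wall_clock_needed = STEP_BUDGET * DESERT_COST

print(crossed_with_time_budget, round(desert_share, 3))
print(crossed_with_step_budget, wall_clock_needed)
```

The second variant is the "go as slow as the weakest link" trade-off again: the skew disappears only because the cheap regions sit idle while the desert is simulated.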
(1) IMHO, finding a solution to this problem that doesn't depend on storing or recomputing gradients is equivalent to finding a training algorithm that can work in the presence of skewed/non-homogeneous datasets for the forward-forward approach https://www.cs.toronto.edu/~hinton/FFA13.pdf that Geoffrey Hinton proposed.