There are two "directions" along which you can parallelize:
- explore different parts of the parameter hyperspace in parallel.
- for a given parametrization, split the model and/or objective function so that their parts can be computed in parallel.
The second approach is model-specific and gives you nice speedups (make each evaluation of your model N times faster and you converge N times faster in wall-clock time), but it is often not well suited to accelerators, GPUs included, because of the latency of moving data back and forth; with model-specific tuning you can sometimes make it work. For traditional problems, SIMD on the CPU is usually the better fit here.
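To make that splitting concrete, here is a minimal sketch (the exponential model, the data, and the `objective` name are made up for illustration): a sum-of-squared-residuals objective whose per-data-point terms are independent, so NumPy computes them in one vectorized pass, which is the CPU-SIMD-friendly kind of splitting meant above.

```python
import numpy as np

def objective(params, x, y):
    """Sum of squared residuals for a toy exponential model.

    The per-data-point residuals are independent, so NumPy evaluates
    them in one vectorized pass (SIMD on the CPU) instead of a Python
    loop over data points.
    """
    a, b = params
    residuals = y - a * np.exp(-b * x)   # all residuals computed in parallel
    return np.dot(residuals, residuals)  # reduce the parallel parts to one scalar

# Made-up data just to make the sketch runnable.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 10_000)
y = 2.5 * np.exp(-1.3 * x) + 0.05 * rng.standard_normal(x.size)

print(objective(np.array([2.5, 1.3]), x, y))
```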
The first approach pretty much requires the second anyway, so that the model and the optimizer can run on the same computing unit, and it isn't particularly great on its own: the computations you parallelize are suboptimal and/or redundant to begin with. Any speedup isn't obvious; it depends on the optimization algorithm and on the convergence characteristics of your problem. Also, as you follow several paths in parallel you eventually need to sync up, and because the paths have divergent control flow you can't make the most of the computing resources, which end up stalling quite often. With enough tuning for your particular problem and method, you can often make it work.
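A minimal sketch of this first approach, reusing the same made-up objective (the 64-candidate population and the shrinking-spread schedule are arbitrary choices for illustration, not a recommended algorithm): a batch of candidate parameter vectors is evaluated in one vectorized call, then everything syncs up to pick a winner and resample around it.

```python
import numpy as np

def batched_objective(param_batch, x, y):
    """Evaluate the toy objective for many candidate parameter vectors at once.

    param_batch has shape (n_candidates, 2); broadcasting evaluates every
    candidate against every data point in a single vectorized pass.
    """
    a = param_batch[:, 0:1]
    b = param_batch[:, 1:2]
    residuals = y[None, :] - a * np.exp(-b * x[None, :])
    return np.sum(residuals * residuals, axis=1)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 1_000)
y = 2.5 * np.exp(-1.3 * x) + 0.05 * rng.standard_normal(x.size)

# Parallel exploration with periodic synchronization: evaluate a whole
# population, keep the best candidate, resample around it, repeat.
best = np.array([1.0, 1.0])
spread = 1.0
for generation in range(20):
    population = best + spread * rng.standard_normal((64, 2))  # explore in parallel
    scores = batched_objective(population, x, y)
    best = population[np.argmin(scores)]   # sync point: pick a winner
    spread *= 0.8                          # shrink the search region
print(best)  # should move toward (2.5, 1.3)
```

The batched evaluation is where parallel hardware helps; the argmin-and-resample step is the synchronization point where divergent paths would otherwise leave it idle.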
So why don't generic libraries do it on GPU? Because unless you tune everything for your particular problem, it's just not going to perform as well as on CPU.