I once worked on a project where a critical system started to leak memory during a critical weekend-long event. We basically took shifts every 4 hours to log into the boxes and manually rolling-restart the affected services.
We did find and fix the leak the following week once the event completed and traffic dropped to normal levels. But sometimes, you got to do what you got to do to keep the app running.
We did find and fix the leak the following week once the event completed and traffic dropped to normal levels. But sometimes, you got to do what you got to do to keep the app running.