Thoughts on dealing with Salesforce flow failures, from a discussion with a client:

For reliability and maintainability, my recommendations in order of preference would be:

  1. document how the flows work, and rewrite them as Apex triggers with thorough test coverage
  2. document how the flows work, and start building tests for the record-triggered flows (GA as of Winter '23)
  3. build the Batch Apex job to close Opportunities, and continue troubleshooting flow issues as they occur
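
To make option 3 more concrete, here is a minimal sketch of what a Batch Apex job for closing Opportunities could look like. The class name, the selection criteria (open Opportunities whose close date has passed), and the target stage `Closed Lost` are all assumptions for illustration; the real rules would come from documenting what the flows do today.

```apex
// Hypothetical sketch: the class name, query criteria, and target stage
// are assumptions, not the client's actual rules.
public class CloseStaleOpportunitiesBatch implements Database.Batchable<SObject> {

    public Database.QueryLocator start(Database.BatchableContext bc) {
        // Assumed criteria: still open and past the close date.
        return Database.getQueryLocator(
            'SELECT Id, StageName FROM Opportunity ' +
            'WHERE IsClosed = false AND CloseDate < TODAY'
        );
    }

    public void execute(Database.BatchableContext bc, List<Opportunity> scope) {
        for (Opportunity opp : scope) {
            opp.StageName = 'Closed Lost'; // assumed stage value
        }
        // allOrNone = false: one failing record does not roll back the whole chunk.
        List<Database.SaveResult> results = Database.update(scope, false);
        for (Database.SaveResult result : results) {
            if (!result.isSuccess()) {
                for (Database.Error err : result.getErrors()) {
                    System.debug(LoggingLevel.ERROR,
                        err.getStatusCode() + ': ' + err.getMessage());
                }
            }
        }
    }

    public void finish(Database.BatchableContext bc) {
        // Intentionally empty; a summary notification could be sent from here.
    }
}
```

A job like this could be started from anonymous Apex with `Database.executeBatch(new CloseStaleOpportunitiesBatch(), 200);`. Using partial-success updates means a validation failure on one record is logged rather than failing the whole chunk, which is usually what you want from a cleanup job.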

My general opinion is that once flows pass a threshold of complexity (say, they no longer fit on a single screen), they become difficult to maintain. Building software with no-code tools like flows is not that different from building software in a programming language, and one should follow similar practices for building reliable, maintainable software: using decomposition to make the software easy to understand, performing peer review, and writing automated tests. Salesforce doesn't make it easy to do these things with flows, but it's getting better (with tests for record-triggered flows, for example).
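
For comparison, here is a rough sketch of how decomposition and automated tests tend to look on the Apex side. The three pieces would live in separate files in a real org. The handler name and the rule it applies (filling in a blank `Description`) are placeholders I made up, standing in for whatever logic the flows currently carry.

```apex
// File 1 (trigger): kept thin, it only delegates to the handler.
trigger OpportunityTrigger on Opportunity (before insert, before update) {
    OpportunityTriggerHandler.applyDefaults(Trigger.new);
}

// File 2 (handler class): holds the logic so it can be tested directly.
// The rule below is a hypothetical placeholder, not the client's real logic.
public class OpportunityTriggerHandler {
    public static final String DEFAULT_DESCRIPTION = 'No description provided';

    public static void applyDefaults(List<Opportunity> opps) {
        for (Opportunity opp : opps) {
            if (String.isBlank(opp.Description)) {
                opp.Description = DEFAULT_DESCRIPTION;
            }
        }
    }
}

// File 3 (test class): exercises the handler in memory, with no DML needed.
@IsTest
private class OpportunityTriggerHandlerTest {
    @IsTest
    static void fillsBlankDescription() {
        Opportunity opp = new Opportunity(
            Name = 'Test', StageName = 'Prospecting', CloseDate = Date.today());

        OpportunityTriggerHandler.applyDefaults(new List<Opportunity>{ opp });

        System.assertEquals(
            OpportunityTriggerHandler.DEFAULT_DESCRIPTION, opp.Description);
    }

    @IsTest
    static void keepsExistingDescription() {
        Opportunity opp = new Opportunity(Name = 'Test', Description = 'Keep me');

        OpportunityTriggerHandler.applyDefaults(new List<Opportunity>{ opp });

        System.assertEquals('Keep me', opp.Description);
    }
}
```

Each piece is small enough to read in one sitting and to review as an ordinary diff, which is the property that gets hard to preserve once a flow outgrows a single screen.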
