Great post, and this is something we've faced as well. Luckily our jobs are mainly idempotent, and the ones that are not aren't that critical. This is a pretty nice solution! Ethan, the errors you still see from jobs that take more than PRE_TERM_TIMEOUT seconds... I'm assuming that's a separate, job-specific issue, like talking to external services that time out?
I noticed the "wait 5 seconds, and then a KILL signal if it has not quit" comment in the code above the new_kill_child method. Without jumping into the code, is the normal process sending a TERM, then forcing a KILL after 5 seconds? Just curious.
Yeah, it tends to be from unresponsive external web services that crop up every once in a while. Having a couple of jobs fail that way isn't the end of the world for us, even if we don't retry them.
Yes, the situation you're describing is the RESQUE_TERM_TIMEOUT option, which dictates how long the parent process waits after sending the TERM signal to the child before it sends a KILL signal. On Heroku you want that to be less than 10 seconds (and in practice more like 8 at most); otherwise Heroku will terminate both processes with a KILL signal at the same time.
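For anyone who wants to see the TERM-then-KILL pattern without digging into the code, here is a rough sketch of the idea. This is not Resque's actual implementation; `term_then_kill` and its `term_timeout` argument are hypothetical names I made up for illustration, and the polling interval is an arbitrary choice.

```ruby
# Hypothetical helper, NOT Resque's real code: a sketch of the general pattern.
# Sends TERM to the child, waits up to term_timeout seconds for it to exit,
# and sends KILL if it is still running after that.
# Returns :term if the child exited during the grace period, :kill if it was forced.
def term_then_kill(pid, term_timeout)
  Process.kill("TERM", pid)

  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + term_timeout
  while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
    # With WNOHANG, waitpid returns nil while the child is still running
    # and the child's pid once it has exited and been reaped.
    return :term if Process.waitpid(pid, Process::WNOHANG)
    sleep 0.1
  end

  # Grace period is over: force the child to quit and reap it.
  Process.kill("KILL", pid)
  Process.waitpid(pid)
  :kill
end
```

With a grace period of about 8 seconds, as suggested above, the forced KILL would still come from the parent rather than from Heroku's own shutdown KILL hitting both processes at once.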