Too Many Signals – Resque on Heroku (joingrouper.com)
42 points by ejlangev on June 27, 2014 | 34 comments


Am I missing something, or is this solution only going to prevent running jobs multiple times in case Resque is being shut down in an orderly fashion with TERM? What if your instance simply dies (which could happen for any number of reasons)?

Solving this problem can be rather complex whenever third-party services are involved, but somehow this feels like you've only lowered the likelihood of multiple job executions, which isn't something I'd be comfortable with when it comes to things like credit card charges.


Said maintainer who hasn't merged this PR yet here. The reason I haven't is because I looked at this, went, uhhhhhh I'm not sure, and haven't had the time to figure out if it's a good change yet. I don't want to change signal handling and then break things for other people.


Solid point. My purpose here was more to start a discussion around this problem and what we did about it than to say this is the only or best solution. Also, I was sure other people must be having this issue on Heroku, but despite spending a good amount of time searching, I couldn't find any other solutions that would work.


Absolutely! I'm also not saying you're wrong, I'm just saying please don't take my failures as a maintainer as a signal either way. I appreciate the patch.


I'm not sure I follow what you're saying re: preventing running jobs multiple times. If Resque isn't shut down in an orderly fashion then no retry logic would be invoked. Processes would likely be terminated via KILL in this case and thus given no chance to clean themselves up or re-enqueue themselves.


Somehow I was under the impression Resque provides durability in case a worker process crashes (like [afaik] Gearman and Sidekiq Pro), which isn't the case, so my comment doesn't really make sense for your scenario.


Resque maintainer here. Ruby + Redis = imperfect durability, as far as I can tell. At least, for the whole system. That said, we've been considering moving to RPOPLPUSH for a long time, but the patch hasn't landed yet. You're right that Resque could be better with regards to durability.

Discussion: https://github.com/resque/resque/issues/758
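
For context, the RPOPLPUSH pattern under discussion looks roughly like this with the redis-rb gem (a minimal sketch, not Resque's actual implementation; the key names and the perform helper are made up):

  require "redis"

  redis = Redis.new

  # Atomically move the next job from the main queue onto a "processing"
  # list, so a crashed worker doesn't silently lose it.
  job = redis.rpoplpush("queue:default", "queue:default:processing")

  if job
    perform(job)                                    # hypothetical job runner
    redis.lrem("queue:default:processing", 1, job)  # remove only after success
  end

  # A separate reaper process can push stale entries from the processing
  # list back onto the main queue if a worker dies mid-job.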


> but somehow this feels like you've only lowered the likelihood of multiple job executions

This is basically the best you can do. Imagine a strongly consistent store used for queues, plus N workers; the store is strictly CP. One worker gets task "A" and executes it, but crashes before it can acknowledge completion. In order to guarantee the at-least-once-execution property, the system has to re-issue the task to another worker after a timeout, causing multiple executions of the job.

If you want to avoid this, you get as a side effect a failure mode where jobs can be lost forever, which is a lot worse, since the original problem can be solved by making all jobs idempotent.

Basically, Redis-based queues should provide guarantees about the durability of jobs, and just best-effort mechanisms to avoid duplicate execution where possible.
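
To make the idempotency point concrete, one common approach is to record a per-job key before performing the side effect, so a redelivered job can detect that the work was already done. A minimal sketch with redis-rb (the key name, TTL, and charge_credit_card helper are made up):

  require "redis"

  redis = Redis.new

  def charge_once(redis, job_id)
    # SET with NX returns false if the key already exists, i.e. this job
    # (or a duplicate delivery of it) was already handled.
    first_time = redis.set("job:#{job_id}:done", "1", nx: true, ex: 86_400)
    return unless first_time

    charge_credit_card(job_id)  # hypothetical side effect
  end

Note the trade-off: marking the job done before the side effect gives at-most-once behavior for that side effect, so a crash in between loses the charge; checking with the external service itself is the only way to get both.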


Could someone enlighten me and explain why Heroku sends TERM signals to the running processes? Doesn't sound very healthy, that's for sure. Nor something I'd personally tolerate from someone I'm purchasing a "cloud hosting" service from.

Is it simply a case that this is the way that Heroku responds to being told to shutdown an instance? If so, why isn't the managing app that sends the shutdown call to the instance also handling the graceful "shutdown" of the processes on that instance?


I cannot say why Heroku sends TERM signals, but here's why I would do it if I were designing a PaaS:

* You want to instill the right culture in your customers' code: everything can fail, often, and they have to build software with that in mind. Stuff will always fail, and your customers will have to handle failures anyway; otherwise they will blame you.

* It's easier and cheaper to manage your fleet if all you have to care about is that ninety-something percent of your hosts are healthy.

* You can also detect broken machines more easily if you can remove them from the cluster as soon as you suspect them, knowing that no customers will be hurt. "Broken" can mean anything: sometimes instances just run slowly, have bad I/O, a slow network, whatever. You don't care, because you know you can kill and respawn the containers as long as the total number of dynos meets the requirements and you don't exceed some predetermined rate of churn, which would affect the customer.

* You need to perform maintenance on the machines where you run your customers' containers/VMs. You can implement live migration, but it has a cost (implementation, management, storage, etc.), even more so a few years ago.

* You need to perform maintenance within the customers' containers themselves; live migration won't help you with that. You don't want to bother your customers with maintenance windows.

* It's easy and cheap to "move" containers across machines in order to balance load or spread an application across power domains.


TERM is the canonical way to signal a process to exit gracefully. The management application has no way to determine what other special handling your app may require.
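
For what it's worth, the graceful-exit contract is simple to honor in a plain Ruby worker (a generic sketch, not Resque's code; dequeue_next_job and process are hypothetical):

  shutdown = false
  Signal.trap("TERM") { shutdown = true }  # only set a flag inside the trap

  until shutdown
    job = dequeue_next_job
    process(job) if job
  end
  # fall through and exit cleanly before the platform escalates to KILL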


Heroku sends them as a normal course of operations. Dynos get cycled daily. Why? Not sure, but it happens, and it's well documented by them that it happens. It's likely impossible to completely insulate against it, but if you design your jobs to be idempotent and safely retriable, rather than trying to trap the signals, your jobs will be a lot more bullet-proof.


"Heroku sends them as a normal course of operations." Wow, I didn't actually know that. Makes me glad I didn't pick Heroku recently for one of my mini-projects. It requires long-running processes.

I guess Heroku "dynos" are more suited for "worker" type jobs, then. In which case, sending the TERM signal to all processes isn't necessarily a really bad way of notifying the worker to shut down. Although we are in 2014, and I don't see why they can't easily come up with a more robust solution, even if it's in the form of a "shut-down" process, or giving the worker more than 10s to shut down.


Author of Sidekiq here.

I sympathize. I've spent a heckuva lot of time getting clean shutdown working well (and someone just fixed a rare but persistent issue this morning!). There are a lot of edge cases. Steve and the Resque team are doing the right thing: you don't want a fix for one edge case to break another, and this stuff is nearly impossible to test.


Mike, I'm curious if you think Sidekiq suffers from a similar issue on Heroku, and what the solutions - ideas or already implemented - look like?


AFAIK this problem is endemic to any job processing system where jobs can take more than N seconds to process. What Heroku does:

  * Heroku sends the TERM signal.
  * The process has 10 seconds to exit itself.
  * After 10 seconds, the KILL signal is sent to terminate the process without notice.
Sidekiq does this:

  * Upon TERM, the job fetcher thread is halted immediately so no more work is started.
  * Sidekiq waits 8 seconds for any busy Processors to finish their job.
  * After 8 seconds, Sidekiq::Shutdown is raised on each busy Processor.  The corresponding jobs are pushed back to Redis so they can be restarted later.  This must be done within 2 seconds.
  * Sidekiq exits or is KILLed.
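
A rough sketch of that sequence, just to make the timing concrete (this is not Sidekiq's actual code; stop_fetching, busy_threads, current_job, and requeue are hypothetical helpers):

  Signal.trap("TERM") { $shutdown = true }

  # ...later, once the main loop sees the flag:
  stop_fetching                 # no new jobs are picked up
  deadline = Time.now + 8

  sleep 0.1 while busy_threads.any? && Time.now < deadline

  busy_threads.each do |thread|
    thread.raise(Sidekiq::Shutdown) if thread.alive?  # interrupt the running job
    requeue(current_job(thread))                      # push it back to Redis
  end

  exit 0                        # beat the platform's 10-second KILL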


+1 on that question, I'd be very interested in that as well. My guess would be that since Sidekiq is thread-based rather than process-based it wouldn't have to deal with the issue of all processes receiving the signal at the same time.


Great post, and this is something we've faced as well. Luckily our jobs are mostly idempotent, and the ones that aren't aren't that critical. This is a pretty nice solution! Ethan, the errors you still see from jobs that take more than PRE_TERM_TIMEOUT seconds... I'm assuming that's a separate, job-specific issue, like talking to external services that time out, etc.?

I noticed the "wait 5 seconds, and then a KILL signal if it has not quit" comment in the code above the new_kill_child method. Without jumping into the code, is the normal process sending a TERM, then forcing a KILL after 5 seconds? Just curious.


Yeah it tends to be from unresponsive external web services that crop up every once in a while. Having a couple of jobs that fail that way isn't the end of the world for us even if we don't retry them.

Yes, the situation you're describing is the RESQUE_TERM_TIMEOUT option, which dictates how long the parent process waits after sending TERM to the child before sending it a KILL signal. On Heroku you want that to be less than 10 seconds (and in practice more like 8 at most), otherwise Heroku will terminate both processes with a KILL signal at the same time.
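
Concretely, on Heroku that usually means starting the worker with something like this (TERM_CHILD and RESQUE_TERM_TIMEOUT are standard Resque 1.x worker environment options; the 8-second value is just the rule of thumb mentioned above):

  # Procfile
  worker: env TERM_CHILD=1 RESQUE_TERM_TIMEOUT=8 bundle exec rake resque:work QUEUE=*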


I'm currently trying to decide if I should implement Resque again, or opt for RabbitMQ instead due to this and similar compromises which stem from the ruby/redis combination. What would people say are the major differences, the major pros and cons between the two systems? I'm dealing with simple intermittent, longish running jobs which seem well suited to resque, but I can't shake the feeling that I might be better off with rabbit/0mq/etc.

Obviously resque is closer to a "turnkey" solution and so forth, but what are the real fundamental differences?


RabbitMQ has its own set of durability issues (see the recent Jepsen writeup on it), but if the data store itself is stable, then it's really very good.

The primary difference you'll notice is that RMQ has an explicit-ack mode. It will send a message to a client, the client processes it and sends an explicit ack (message consumed), at which point RMQ will send the next message. The client can also send a nack (push the job back onto the queue and redeliver it), and if the connection is dropped without the job being ack'd, then RMQ will requeue it and send it to another client.

If you're performing all your state mutations in a transaction or something similar that rolls back when a worker terminates, then you can avoid losing jobs and ending up in invalid state even during non-clean shutdowns.

As far as other notable changes go, you can have multi-queue routing (one message can be routed into multiple queues) and dead letter exchanges (so that TTL expired messages can be sent to a different queue rather than just being dropped). There's a lot more to it, as well; as a message queue, I do think that RMQ is flatly superior to Redis, but Redis has drop-dead simplicity going for it that is really nice if you don't need the extra features RMQ offers.
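
To illustrate the explicit-ack flow, here's a consumer sketch using the Bunny gem (my choice of Ruby client, not something from the thread; the queue name and process helper are made up):

  require "bunny"

  conn = Bunny.new
  conn.start

  channel = conn.create_channel
  channel.prefetch(1)                            # at most one unacked message at a time
  queue   = channel.queue("jobs", durable: true)

  queue.subscribe(manual_ack: true, block: true) do |delivery_info, _props, payload|
    begin
      process(payload)                                 # hypothetical job handler
      channel.ack(delivery_info.delivery_tag)          # consumed; remove from queue
    rescue StandardError
      # nack with requeue: RabbitMQ redelivers the message to another consumer
      channel.nack(delivery_info.delivery_tag, false, true)
    end
  end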


The underlying technology doesn't change the fact that Heroku will send SIGTERM to all processes. Resque has good tools for automatic retry, so if you can make your jobs idempotent, configure them for retry and forget about it.


Sorry, I should have been more clear: I'm running this stuff on a dedicated box, not on heroku. I asked because the thread concerns resque and queueing generally.


As someone who literally just implemented Resque for our Heroku-hosted app yesterday and is about to add billing to it, I'd like to know a little more. What percentage of jobs end up getting killed? Are you flagging those jobs somehow so that you can rerun them and check whether the third-party service already received them?


He explained this in the post a little bit, but when you deploy to Heroku, scale down dynos, etc., Resque workers will be killed, and if they're processing a job the job will be killed. He also mentioned that they use resque-retry to retry the jobs that were killed. You just need to trap the signal and perform cleanup, which is typically something you should be doing anyway.


Even if Heroku is working 100% normally and your code is working 100% normally, your jobs will get killed. The workers get SIGTERM'd at least once per day as dynos cycle. The more workers you have in flight on average, the more you will see this. The best thing to do is make your jobs retriable, meaning they are idempotent or can otherwise pick up where they left off. Then use resque-retry to have them retry automatically. That's what we've done, and now the only failed jobs we get are legit issues, not Heroku.
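
For reference, a minimal resque-retry job looks roughly like this (a sketch; the class, queue, and retry values are made up, but extend Resque::Plugins::Retry, @retry_limit, and @retry_delay are the plugin's documented hooks):

  require "resque-retry"

  class SyncInvoiceJob
    extend Resque::Plugins::Retry

    @queue       = :invoices
    @retry_limit = 5
    @retry_delay = 60  # seconds between attempts

    def self.perform(invoice_id)
      # must be idempotent: safe to run again if the worker was TERM'd mid-flight
      Invoice.find(invoice_id).sync_to_external_service
    end
  end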


Have you considered using something like https://github.com/chanks/que/blob/master/README.md for critical jobs?


I haven't seen that particular project before. It would solve the problem for changes to the local database, but I don't think it's a solution for jobs that talk to external web services. Unless I'm missing something?


Hi, I'm the author of Que. It's true that you can't completely solve the idempotence problem for jobs that write to external web services (unless those web services provide ways for you to check whether you've already performed a write - see the guide to writing reliable jobs in the /docs directory), but that's a limitation that applies to any queuing system. I'd definitely say that Que, being transactional and backed by Postgres' durability guarantees, gives you better tooling for writing reliable jobs than a Redis-backed queue would in general.
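
For anyone curious what that looks like in practice: because Que jobs live in Postgres, a job can delete itself inside the same transaction as its database writes, so the work and its completion commit or roll back together. A sketch (the class, model, and column names are made up):

  class GrantCreditJob < Que::Job
    def run(account_id, amount)
      ActiveRecord::Base.transaction do
        Account.find(account_id).increment!(:credit, amount)
        destroy  # removes the job row in the same transaction
      end
    end
  end

  # Enqueue inside the transaction that creates the need for the job,
  # so the job only exists if that transaction commits:
  GrantCreditJob.enqueue(account.id, 100)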

I'm happy to answer any questions you or anyone else might have.


Has anyone successfully replaced Resque with RabbitMQ to solve these type of issues?


We use both and RabbitMQ is not a solution for this problem. Message handlers/listeners are equally susceptible to this problem on Heroku (or, generally, to being killed).

RabbitMQ can be configured not to ack messages where an exception was raised, so if you have a durable store and the code responding to messages is idempotent/retriable, you are good to go. Such a system can easily be configured with Resque jobs using resque-retry, so it's mostly down to how you design your jobs/listeners/message handlers, not the underlying tech.


It sounds like you should not [edit] cannot [/edit] rely on Heroku for things like long-running background jobs...


"Heroku reserves the right to send TERM signals to any dyno whenever it wants. "

I stopped reading right there, and thought to myself: Thank God I didn't choose Heroku as my service provider. Overpriced, and underpredictable.


heroku strikes again



