Alright, there’s been enough speculation (or plain provocative commentary) on why queues backed by a Relational Database Management System (RDBMS) are a nightmare, to the point that you’d strongly consider keeping quiet about using them even where their real perils are far from manifesting. At Locus, solutioning follows first principles, and evangelistic projections of the “right tool” are outweighed by the fit of the tool to the job. So let’s find out how this stereotypically tagged “bound-to-fail” solution has been powering async processing at Locus.
“It doesn’t scale.”
The argument is simple — a relational database is not designed to be used as a queue. As such, the solution already has a hacky appeal. While putting in jobs that will run sometime in the future and querying for the next available job might not seem too twisted on the surface, the volume and pattern of these requests in the context of a job queue uncover problems that underline the inapplicability of an RDBMS to the problem of queuing.
Two things pose a challenge:
- Contention for the next job – Everyone wants it; only one can have it.
- Lack of specificity of time – Everyone wants the next “available” job. This effectively means that queries have at least one range-based key that all consumers will have the same upper bound for.
The availability of specialized tools catered to various forms of queuing/scheduling typically drives out any thoughts of building a queuing system around an RDBMS. Sure, most scheduler libraries have bindings for RDBMS backends, but they’re used more as a fallback than a preference.
Both arguments are solid. The objective of this post is not to refute them. It is to highlight why table-backed queues still turn out to be a dependable solution for all async processing at Locus.
Mountain To Climb – What we’re solving with table-backed queues
Before exploring why the solution works for us, let’s explore the problem and constraints we’re working with.
We run a variety of asynchronous jobs, ranging from seconds to hours in execution time. These jobs are highly critical – missing out or delaying even a single job could pose serious operational challenges for our customers. This effectively means job execution should be highly reliable and fault-tolerant. These jobs are scheduled throughout the day for immediate or deferred execution.
Mountaineering – Why it works for us
Our reliability and fault-tolerance requirements make RDBMS an apt choice. It provides a framework in which the reliability of other components in the system can be discounted to a reasonable extent, at the expense of limiting us to the constructs and tools it provides, which are familiar to the entire team.
Ordering guarantee not required
Ordering is critical to queues, but not schedulers. While we do call our service a “queue”, we don’t associate strict ordering requirements with it. The only detail that matters to us is when a job will be available.
One of the key requirements for the kind of scheduling we do is that each job must be executed exactly once. This has a two-fold implication:
- There must be some mechanism to lock a job that’s being picked up.
- There must be some mechanism to force-release locks on “abandoned” jobs.
Both of these are easily achieved with an RDBMS. The former is made possible by exclusive locks, and the latter by adding a “visibility timeout” that makes abandoned messages visible to workers again. For jobs that are still being processed, this timeout is periodically extended by means of a heartbeat.
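Using the column names that appear in the queries later in this post, the heartbeat extension could look roughly like this (an illustrative sketch, not our production statement):

```sql
-- Illustrative: extend the visibility timeout for a job the worker still
-- holds, identified by the lock key acquired when the job was picked up.
UPDATE JOB
SET LAST_HEARTBEAT_AT = NOW(),
    TIMEOUT = TIMESTAMPADD(SECOND, <timeout duration>, NOW())
WHERE LOCK_KEY = <lock key>
```

If a worker stops heartbeating, the timeout simply lapses and the job becomes eligible for pickup again.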
Familiarity
Familiarity is often overlooked (though usually decisive) when choosing technology to back a solution. While it’s almost always good to keep exploring the “latest and greatest”, the effort and time invested aren’t justified unless the solution aligns with the business objectives of the organization. Engineering must be utilitarian in that familiar technology saves bandwidth and allows for more efficient debugging, profiling, and experimentation. Of course, all of this presumes that the technology under scrutiny is appropriate for the use case.
Cloud-agnosticism
Multi-cloud is a requirement for most enterprise companies, which makes over-reliance on the services, constructs, and contracts of any single provider a problem when porting your services elsewhere. For something as critical as our messaging backbone, we decided to homebrew a solution that could work effectively across all stacks regardless of their individual nuances. This came after struggling with the inconsistent contracts that parallel services offered in terms of the features we relied upon.
No head-of-line blocking
Our topics contain similar messages. A typical problem with table-backed queues is that when picking jobs in bulk, a batch can experience head-of-line blocking if lightweight jobs sit behind heavyweight ones. We get around this by using separate queues for jobs of different types.
On a high level, this is what our async workflow looks like:
- Internal services seed messages to various queues via a thin API layer.
- Dumb workers poll the API layer for available messages.
- A heartbeat is emitted periodically to indicate liveness.
- Once processed, the job is deleted from the table.
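The last step above might translate to something like this (a sketch; identifying jobs by the acquired lock key is an assumption on my part):

```sql
-- Illustrative: remove successfully processed jobs so they are never redelivered.
DELETE FROM JOB WHERE LOCK_KEY = <lock key>
```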
For reference, here’s a simplified representation of our message schema:
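The original schema figure isn’t reproduced here, but its shape can be inferred from the queries that follow. A minimal sketch (the column names come from those queries; the types, the ID and PAYLOAD columns, and the index are assumptions):

```sql
CREATE TABLE JOB (
    ID                BIGINT AUTO_INCREMENT PRIMARY KEY,  -- assumed surrogate key
    QUEUE             VARCHAR(255) NOT NULL,              -- logical queue/topic name
    PAYLOAD           TEXT,                               -- assumed: the message body
    AVAILABLE_AT      TIMESTAMP NOT NULL,                 -- when the job becomes visible
    LOCK_KEY          VARCHAR(255),                       -- set when a worker acquires the job
    LAST_HEARTBEAT_AT TIMESTAMP NULL,                     -- updated by worker heartbeats
    TIMEOUT           TIMESTAMP NULL,                     -- lock expiry / visibility timeout
    RECEIVE_COUNT     INT NOT NULL DEFAULT 0,             -- times the job has been picked up
    INDEX idx_queue_available (QUEUE, AVAILABLE_AT)       -- assumed index for the receive query
);
```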
Based on this, let’s try to answer some of the questions we’re interested in:
- Give me some available jobs. This happens in two parts:
- Acquire a timed lock on some available jobs:
UPDATE JOB
SET LOCK_KEY = <lock key>,
    LAST_HEARTBEAT_AT = NOW(),
    TIMEOUT = TIMESTAMPADD(SECOND, <timeout duration>, NOW()),
    RECEIVE_COUNT = RECEIVE_COUNT + 1
WHERE QUEUE = <queue name>
  AND AVAILABLE_AT <= NOW()
LIMIT <size>
- Fetch jobs by lock key
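The second part presumably translates to something like this (a sketch, not the production query):

```sql
-- Illustrative: return the jobs locked in the previous step.
SELECT * FROM JOB WHERE LOCK_KEY = <lock key>
```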
- How many jobs are pending for a given queue:
- This is a simple query for messages with a RECEIVE_COUNT of 0.
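For instance (a sketch; the per-queue filter is an assumption):

```sql
-- Illustrative: jobs that have never been picked up for this queue.
SELECT COUNT(*) FROM JOB
WHERE QUEUE = <queue name> AND RECEIVE_COUNT = 0
```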
- How many messages are currently processing?
- Translates to all jobs with RECEIVE_COUNT > 0
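For instance (a sketch; the per-queue filter is an assumption):

```sql
-- Illustrative: jobs picked up at least once and not yet deleted.
SELECT COUNT(*) FROM JOB
WHERE QUEUE = <queue name> AND RECEIVE_COUNT > 0
```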
- What’s the age of the oldest available message in a queue?
- What happens if a worker dies while working on a job?
- Every worker emits a heartbeat to indicate job progress. With each heartbeat, the timeout for the lock is extended, and the message is slated to become available again only after that timeout.
- Thus, if a worker dies halfway through, the message will be visible to other workers post the expiration of the lock.
For reference, the age of the oldest available message per queue translates to:

SELECT QUEUE, TIMESTAMPDIFF(SECOND, MIN(AVAILABLE_AT), NOW()) AS AGE FROM JOB GROUP BY QUEUE
The setup has been very stable for us over the past year. Usage has organically scaled 3x over the last six months without any major optimizations. While our regular peak TPS for receive queries is 250, temporary scale-outs or spiky workloads haven’t revealed any performance issues. Latencies have been very consistent in general.
It hasn’t been all roses, though. We did have our own share of problems in the initial days. Consider this:
- All our workers try acquiring exclusive locks on one or more eligible jobs at the same time.
- Jobs are mostly added for immediate processing.
This implies there’s a lot of contention for new jobs, and as mentioned earlier, all workers querying at a given time will have the same upper bound (i.e. the current timestamp). At any given time, the query acquiring a lock on the latest record will also gap-lock the range between that upper bound and the scheduled time of the record, making other queries with overlapping ranges wait. Note that this doesn’t interfere with the addition of new jobs, since jobs aren’t added for past timestamps.
There are a few things that we have tried here:
- Picking up jobs in bulk – This doesn’t address the core problem but optimizes the total number of messages being received, i.e. more throughput per lock acquired. Since our topics have similar-sized jobs, this is a viable solution for short-lived tasks.
- Using a truncated timestamp for querying – This was aimed at reducing deadlocks due to gap locking by querying over discrete ranges. Using second- and minute-level truncation still led to deadlocks, though, so this didn’t seem to help all that much for our query pattern.
- Retries at the database layer – We chose to stick to retries instead of relying on complicated logic to deal with deadlocks. This has successfully alleviated our deadlock issues without putting any significant stress on the database.
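To illustrate the truncated-timestamp variant from the list above (a sketch; the truncation granularity and the exact expression are assumptions, not our production query):

```sql
-- Illustrative variant of the lock-acquisition query: NOW() is truncated to the
-- start of the current minute so concurrent receivers share identical range bounds.
UPDATE JOB
SET LOCK_KEY = <lock key>,
    LAST_HEARTBEAT_AT = NOW(),
    TIMEOUT = TIMESTAMPADD(SECOND, <timeout duration>, NOW()),
    RECEIVE_COUNT = RECEIVE_COUNT + 1
WHERE QUEUE = <queue name>
  AND AVAILABLE_AT <= DATE_FORMAT(NOW(), '%Y-%m-%d %H:%i:00')
LIMIT <size>
```

As the list notes, this variant still deadlocked under our query pattern, which is why plain retries won out.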
This setup has allowed us to build async workflows across clouds without any customizations or changes in the contract. Operational issues have been pretty minimal, and having end-to-end control over the workflow has allowed us to fine-tune the system based on evolving requirements. While we don’t have stress tests in place yet, operational metrics indicate this will be very viable for the mid-term. We have had a pretty educative experience building this, and the sheer operability of it is something we hadn’t anticipated. So make sure you don’t rule table-backed queues out the next time you go whiteboarding!
First Published Source: Table-Backed Queues as a Sustainable Messaging Backbone