A small list of best practices we live by when designing Ruby apps.
After working on so many different codebases over the years, you start to see repeated anti-patterns that age poorly.
Folks who are new to the industry might not have run into these problems yet, and even experienced developers can find it helpful to be reminded of them.
Wherever you are in your career, this post will give you some concrete examples of things to avoid when querying data in your Ruby apps.
Paginate your API responses by default
Add pagination to your endpoints from the get-go.
If you have a really good reason to not paginate a specific endpoint, you can easily leave it out as an exception. But having all your endpoints paginated consistently will save you a lot of future trouble.
When APIs have unlimited page sizes, clients will rely on having all that data available. More clients make requests, data grows… then your database starts choking.
To mitigate the problem, you decide to introduce pagination or limit the responses somehow. But that’s a breaking change and people will complain, especially big paying customers.
Always assume your data will grow without bound. If pagination is introduced early, when there aren't many clients yet, your API will be more reliable from the beginning.
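As a rough illustration, here's what paginating by default can look like in a plain Rails controller. The Cat model, parameter names, and limits are all hypothetical, and in practice a gem like Pagy or Kaminari handles this for you:

class CatsController < ApplicationController
  DEFAULT_PER_PAGE = 25
  MAX_PER_PAGE = 100 # never let a client request an unbounded page

  def index
    page = [params.fetch(:page, 1).to_i, 1].max
    per_page = params.fetch(:per_page, DEFAULT_PER_PAGE).to_i.clamp(1, MAX_PER_PAGE)

    cats = Cat.order(:id).offset((page - 1) * per_page).limit(per_page)

    render json: { data: cats, page: page, per_page: per_page }
  end
end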
If you want to take this best practice even further, you could implement keyset pagination (aka cursor-based) when it makes sense.
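For instance, a minimal keyset sketch: instead of an OFFSET that gets slower the deeper you page, the client sends the last id it saw and you fetch the next page from there (the `after` parameter is a made-up name):

scope = Cat.order(:id).limit(25)
scope = scope.where("id > ?", params[:after]) if params[:after].present?

cats = scope.to_a
next_cursor = cats.last&.id # the client sends this back as ?after=... for the next page

Because the WHERE clause walks the primary key index directly, page ten thousand costs about the same as page one.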
Curb your enthusiasm queries
Stefanni has worked on fixing slow queries in an extremely large Rails app. Those queries were written a long time ago, when a given result set contained at most hundreds of thousands of rows.
But now, after a period of exponential growth at the company, that same simple query could return hundreds of millions of rows. For example, it was common for a query like this to send tens of thousands of IDs as parameters:
# `some_filtered_scope` here is an in-memory array of IDs (for example, the
# result of a `pluck`), so Rails interpolates every value into the IN list:
Cat.where(id: some_filtered_scope)
#=> SELECT * FROM cats WHERE id IN (?, ?, ?, ?, ...)
Unbounded queries like this will push your database to the limit. If you don’t limit your queries, your database will cry. ;_;
Either add a limit to how many items are processed at a time, or process them in batches (see the sketch below). Otherwise, your app is at risk of going down whenever a user hits a page or endpoint that triggers one of these unbounded queries.
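A rough sketch of both mitigations, assuming the same hypothetical Cat model (the `adopted` column is made up too; `find_each` is standard Active Record):

# 1. Cap how many IDs you're willing to send in a single query:
ids = some_filtered_scope.first(1_000)
Cat.where(id: ids)

# 2. Or walk the whole set in fixed-size batches, one query per batch:
Cat.where(adopted: false).find_each(batch_size: 1_000) do |cat|
  # process one record at a time; each query fetches at most 1,000 rows
end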
To learn more about limiting ranges and lists, we recommend reading Efficient MySQL Performance. To give you the gist of it:
With respect to limiting ranges and lists, there’s an important factor to verify: does the application limit the input used in a query? Way back, I related a story: “Long story short, the query was used to look up data for fraud detection, and occasionally a big case would look up several thousand rows at once, which caused MySQL to switch query execution plans.” In that case, the solution was simple: limit application input to one thousand values per request. That case also highlights the fact that a human can input a flood of values. Normally, engineers are careful to limit input when the user is another computer, but their caution relaxes when the user is another human because they think a human wouldn’t or couldn’t input too many values. But they’re wrong: with copy-paste and a looming deadline, the average human can overload any computer.
Don’t ship code changes alongside migrations
A while ago, Thiago ran a poll on Twitter about this and it looks like not a lot of people deploy migrations separately from code changes:
do you deploy migrations and code changes together, or in separate deployments?
— Thiago Araujo (@thdaraujo) June 19, 2024
We both ship migrations first. Then, after the deploy, we ship new code that relies on the migration. Why?
Some migrations can carry a lot of risk when we’re dealing with large tables on a high-traffic web app.
Sure, it requires an extra Pull Request (or several), but it makes deploys safer when there are schema changes. Shipping migrations first also gives us confidence that our code won't have trouble accessing the new column or table.
If you don’t, it’s possible that while the migration runs in one process/container/pod/server, the web application is still running elsewhere, reading from a stale schema cache. The same goes for workers that implement graceful shutdown (like Sidekiq).
This can cause all sorts of trouble: reads and writes against the wrong columns or attributes, incorrect column information, incorrect defaults, potential data loss, and so on.
Shipping migrations separately from code changes will save you from these kinds of problems.
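Dropping a column is the classic example of why separate deploys matter. Here the safe order happens to be reversed (code change first, then the migration), but the principle is the same. A minimal sketch using Rails' ignored_columns, with made-up model and column names:

# Deploy 1: tell Active Record to stop reading and writing the column,
# so no running process depends on it anymore.
class Cat < ApplicationRecord
  self.ignored_columns += ["nickname"]
end

# Deploy 2: once every process runs the code above, drop the column.
class RemoveNicknameFromCats < ActiveRecord::Migration[7.1]
  def change
    remove_column :cats, :nickname
  end
end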
To learn more about shipping migrations separately from code changes, read this issue discussion and this Rails Pull Request.
Catch unsafe migrations during development using the strong_migrations gem.
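For example, on Postgres, adding an index the default way locks writes on large tables, so strong_migrations raises in development and points you toward something like this instead (table and column names are made up):

class AddIndexToCatsName < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def change
    add_index :cats, :name, algorithm: :concurrently
  end
end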
Now, for managing multiple Pull Requests that depend on each other, consider stacking your git branches with git rebase --update-refs.
What about you? What other database performance best practices would you add to this list?