When I look at how we build corporate web sites today, I see way too much complexity in our designs, technologies and the way we approach the problems in general.
Combining a few trends that have been gaining more and more traction these days, we can come up with much simpler designs that scale exceptionally well.
But first, what are we expecting from our web sites?
We use databases to keep our users' preferences and generate a particular version of our pages for each of them, running code from various frameworks to produce the pages on demand.
We build sites that connect to our users' data in real time so that any change in the database will be seen instantaneously.
We poll any data that could change at the exact moment we build each page so that a simple refresh will guarantee our users always have the latest information.
And we do this for Every Single Page they view.
This works well for a few hundred users, but as the numbers grow, we must add more processing to generate these pages. Regardless of the technologies we use (PHP, Rails, Python...), we need to add more servers to our solution, not only for redundancy, but simply to scale to the demand.
Soon, the database will become a bottleneck and we will need to scale up, and then out, in order to respond to the load, not because of the size of the data, but because of the number of queries that our system will generate to always push those updated pages to our users.
We will then probably see slowdowns partly caused by the synchronisation our distributed databases require, and sooner or later we will start using some kind of in-memory cache (Redis, Memcached, etc.) simply to keep our response time acceptable.
We will then have to juggle compromises between the TTL (time to live) of the cached data and the performance of the system, with all kinds of complexity related to having the same cached data available on all servers, and that doesn't even begin to cover the ramp-up problems when we have to flush those caches or reboot a server for maintenance.
The next bottleneck will probably come from the added time needed to generate pages that rarely change, and we will add a reverse proxy (Varnish, nginx) to our solution to improve response time, again having to keep that extra layer in sync between servers while dealing with stale data.
Next, our very successful site will start generating load on the external services we pull data from, and we will have to add another level of caching between our servers and those service providers, again having to strike a difficult balance between the freshness of our data and the responsiveness of our site.
This might describe most mid-size projects (we are not talking Amazon size here) we have to deal with in a typical corporate environment. We end up with a dozen VMs, five or more layers, sync and TTL complexity, just to serve a few thousand users.
Let me first get a few principles of the effective simplicity approach out of the way.
Let's start with a simple question. Do you know how many static resources a single instance of nginx can push out per minute?
Let me put it this way. Probably enough to saturate whatever bandwidth we have.
Chances are the only reason we would need two instances would be for redundancy.
Let's imagine that our site is mostly static and served from a single web server.
Let's say we have a couple of thousand users, and that our static site has 30K different pages, all safely saved on disk. After a while they will mostly be cached in memory by the OS, making things even faster.
Each user would have his own pages already generated. A RESTful approach would only require our solution to make sure he is authorized to access a particular resource. Putting all of his pages under https://mysite/user/joe/... would probably be enough to cover a lot of use cases.
Things that we often assume must be done on demand
Blog entries are usually written once, sometimes a few times, and must be generated by most "solutions" on every page view. A blog page should be generated once when written, and from then on read from a static file from disk. There is no reason whatsoever to do it any other way. But we do. Let's not.
What about comments?
Even if we were regenerating a blog page every time a new comment was added, that would still be orders of magnitude fewer generations than reads of that page.
But we could be smarter than that. We could generate the blog page once but add some JavaScript code that would query a json file containing the list of comments and render them at the bottom of the page with client-side processing.
Again, we could change that json file ONLY WHEN A COMMENT IS ADDED, making this simple and scalable.
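Here is a rough sketch of that client-side piece. It assumes the comments for a post live next to it as a static file, say /blog/my-post/comments.json with author, date and text fields; the names are purely illustrative:

```
// Fetch the static comments file and render it under the post.
async function renderComments() {
  const response = await fetch('/blog/my-post/comments.json');
  if (!response.ok) return; // no comments yet, nothing to render

  const comments = await response.json();
  const container = document.getElementById('comments');

  for (const comment of comments) {
    const item = document.createElement('div');
    item.className = 'comment';
    item.textContent = `${comment.author} (${comment.date}): ${comment.text}`;
    container.appendChild(item);
  }
}

document.addEventListener('DOMContentLoaded', renderComments);
```

The page itself never changes; only the json file does, and only when someone posts a comment.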
These are just variations on the blog comments above. Nothing special about them. They might change a few times an hour, NEVER on every read.
This is one of my favorites. How many times does the temperature change in a day?
10? 15? 20 times?
Can you tell me why we would go fetch the current temperature from a weather service and render it on every single page view, maybe 100 times a minute?
Add any level of caching you want to this "solution", and you only get more "problems"...
What if we could have a simple json file on disk that could be cached in the server, in the user's browser cache, or anywhere along the way where it would make sense to cache it?
What if some JavaScript in the page was smart enough to simply look at the response header from the server and realize that it can use the version in its browser cache because this file will not change more than once an hour?
What if we had a single process, outside our web server, that would, once a minute, check whether the temperature or the forecast had changed since the previous minute, and if not leave the json file well alone?
That little process would be very easy to write, its work would be completely independent of the number of connected users, and it would not require any change to the static pages we serve our users; only the temperature json file would need to change.
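A minimal sketch of what that process could look like, as a small Node.js script. The weather endpoint, file path and field names are assumptions made up for the example:

```
const fs = require('fs');

const OUTPUT_FILE = '/var/www/static/data/temperature.json';
let lastTemperature = null;

async function checkTemperature() {
  // Hypothetical upstream service; any data source would do.
  const response = await fetch('https://weather.example.com/api/current');
  const data = await response.json();

  // Only rewrite the file when the value actually changed, so the web
  // server and every cache in front of it keep serving the same bytes.
  if (data.temperature !== lastTemperature) {
    lastTemperature = data.temperature;
    fs.writeFileSync(OUTPUT_FILE, JSON.stringify({
      temperature: data.temperature,
      updatedAt: new Date().toISOString(),
    }));
  }
}

// Once a minute, completely independent of how many users are connected.
setInterval(checkTemperature, 60 * 1000);
checkTemperature();
```

The web server can then serve that file with a Cache-Control max-age header so that browsers and intermediate caches reuse their copy instead of asking again.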
All kinds of schedules have a few things in common: they are known well in advance, and they change far less often than they are read.
Finally time to get real!
When most of us talk about real time we usually mean that it must be fast, with a short delay, in a word, QUICK.
Quick is actually only part of what real time is. To be considered real time, we must also GUARANTEE that the work will be done within a certain delay, at a PRECISE time, EVERY time.
This is a far cry from the kind of constraints we usually have with the services we provide on our web sites.
The first time I had to deal with real time was in the video industry. We had a small window (less than 12ms) to move data from memory to a video card, and we absolutely needed to be finished before that window closed. We needed to do that exactly 29.97 times per second (don't get me started with why NTSC needs to be aligned with 30/1001) and every single time we got it even a little wrong, it was a bug, even if it happened only once per hour.
I also worked at a startup where we coded a portable base for cell phones and had to play with much higher frequencies, making the windows even shorter.
I can tell you that I have yet to see a single case of real time in a web application. Actually, by design, TCP/IP pretty much takes real time out of the equation. Auto retries are nice, but they rarely have a place in real time. As anybody who has used the web lately knows, we can't guarantee delivery in a web app.
So this simplified definition of real time, along with customization, is probably the most common justification we use for generating each page on every request.
In almost every case, the misconception that we need real time is just that: a misconception. Nobody will notice that a schedule that changes once a week was updated 20 seconds late, or that the outside temperature just went from 20 to 21 with a 25-second lag. Especially when most users will not refresh the page for another minute.
So how can looking at the problem differently make things better?
Well let's look at it from the outside. What about those parts of our site that are public and change very seldom?
This probably covers a sizable part of our site, and there is no reason not to generate those pages once instead of once per view. I mean seriously, how many changes do you make to your about page in a year?
I am not telling you not to use your favorite language, platform, toolkit or framework; I am telling you we should use those tools to generate the pages once per change (for some of them that is actually once) and not more.
Let's use the blog example again for a more detailed look. In the simple case, the site will need to be updated when a new post is added. So we could be using any tool we want as long as it can generate the page to a file. Plugins for this exist in many cases, and writing one for your blog engine of choice is probably trivial, certainly simpler than getting into the kind of scenario that we looked at in the first part of this text.
So a new blog page needs to be generated, and the front page needs to be regenerated to include the new link. Fairly easy. A very outspoken blogger who makes one new contribution per day will only cause 730 page generations per year (365 new posts plus 365 front-page updates), even if there are millions of page views per day.
Another change happens when a comment is added to a blog page. It is not hard to imagine a service that will receive the POST and either regenerate the blog page, or add the comment to a json file that can be queried user-side and rendered by JavaScript.
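A minimal sketch of such a service, using Node.js and Express. The URL, field names and file layout are assumptions for the example, and validation, authentication and spam protection are left out:

```
const express = require('express');
const fs = require('fs');
const path = require('path');

const BLOG_ROOT = '/var/www/static/blog'; // where the static blog pages live

const app = express();
app.use(express.json());

app.post('/comments/:post', (req, res) => {
  // A real service would sanitize req.params.post before using it in a path.
  const file = path.join(BLOG_ROOT, req.params.post, 'comments.json');

  // Read the existing comments (or start a new list), append, write back.
  const comments = fs.existsSync(file)
    ? JSON.parse(fs.readFileSync(file, 'utf8'))
    : [];
  comments.push({
    author: req.body.author,
    text: req.body.text,
    date: new Date().toISOString(),
  });
  fs.writeFileSync(file, JSON.stringify(comments, null, 2));

  res.status(201).end();
});

app.listen(3000);
```

This little service is the only moving part; everything else on the blog stays static.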
In any case, even a very popular entry will probably get something like 20 views for each comment added, so we still save 95% of the page generations.
The astute reader might also notice that the database will be accessed only when new content is added, which means that it won't need to scale with the number of page views, or if you prefer, with the popularity of the site. As a bonus, the database could be offline and most of the content of the site would still be accessible.
Also, no need to add cache for this since nginx can very easily serve all this static content and be our caching solution.
Now this might prove to be a bit more interesting. Let's pretend that our site has a few thousand users, and that each one of them sees a slightly different home page.
How do we effectively simplify this?
By applying the same basic rules we did for the public part of course!
What I am saying here is that in most cases, those personalized home pages won't change very often, and that we should generate them on change (or when a user is created). So yes, we will trade a bit of disk space for simplicity and scaling.
Even if each user had a few different pages, remember that we are talking about HTML pages here, not megabytes of data, so the result should fit in a gigabyte or so even at the high end: 4,000-5,000 users with 3-5 pages of 20-40 KB each (5,000 × 5 × 40 KB ≈ 1 GB in the worst case).
But of course, it can't be that simple, can it? User data should be secure and each user should only be able to see his own pages, so how are we going to do that with these thousands of files?
Our users still need to login to gain access to customized content, especially if it should be private. No way around that one, we have to somehow generate something on every view, right?
Well, maybe not. Of course we need to run some code when the user logs in, but that again is a single POST per user per session, not per page view.
The usual way to do this is to go to the database to check the user's credentials, and if they match, store a session ID in the DB or in a memory database and save it client-side in a cookie. That session ID is then used on every subsequent page view, by whichever server gets the query, to make sure, after another trip to the DB, that the user has the right to see that page. The page will then be generated with data coming from many more database queries and template code crunching.
But what if each user's private content (static HTML files) was placed under a different path? (mysite.com/user/joe123/content...)
When the user logs in, we would encrypt that path inside a cookie using our server's private key.
On any subsequent page view for a file in the protected area, we would match the path from that cookie against the requested URL to make sure the user actually has the right to see it.
Since we encrypt the path into the cookie, no user could make up this information. Without logging in, or access to our private key, it would be impossible to fake, even knowing the path where the user's data is saved.
Note that the only time we hit the DB is to do the actual login. Once that is done, we don't even use the DB, and all the data can be cached.
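Here is a rough sketch of the idea in Node.js. The description above talks about encrypting the path with a server key; signing it with an HMAC, as below, gives the same cannot-be-forged property. The secret, cookie format and URL layout are all illustrative assumptions:

```
const crypto = require('crypto');

const SECRET = process.env.COOKIE_SECRET; // server-side key, never sent to clients

// Done once, at login time, after the credentials have been checked in the DB.
function makeAuthCookie(userPath) {
  const signature = crypto.createHmac('sha256', SECRET)
    .update(userPath)
    .digest('hex');
  return `${userPath}|${signature}`; // e.g. "/user/joe123|ab12..."
}

// Done on every request for a protected URL, without touching the database.
function isAllowed(cookieValue, requestedUrl) {
  const [userPath, signature = ''] = cookieValue.split('|');
  const expected = crypto.createHmac('sha256', SECRET)
    .update(userPath)
    .digest('hex');

  const signatureValid =
    expected.length === signature.length &&
    crypto.timingSafeEqual(Buffer.from(expected), Buffer.from(signature));

  // The cookie only grants access to files under the user's own path.
  return signatureValid && requestedUrl.startsWith(userPath + '/');
}
```

A check like isAllowed could live in a tiny service that the web server consults before handing out the static file (nginx's auth_request module is one way to wire that up), so the pages themselves still never need to be generated per request.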
So what about the content that must be, for need of a better expression, almost real time?
It all becomes a question of how many times that information will be served for each time it is written.
If a piece of information can change every 5 seconds and it is imperative that the user sees the very latest version of it, we could still have it saved to a file and served normally with a very short TTL.
If multiple users request that information in that time span, we just delivered the best and easiest cache possible.
If on the other hand we have only a single user interested in that data, we still 'generate' it only once per view, which is no worse than the typical approach where we would generate it on every view.
The only case where we would do more work is if we generate these files and nobody looks at them.
I would then seriously reconsider the need for their 'almost real time' status...
Well, let's start with what we lost; I don't think we will miss it all that much...
Since most of our site is static, we still don't need to scale any part of it.
So what did we gain?
Here is what wikipedia has to say about elegance:
"Elegance is beauty that shows unusual effectiveness and simplicity."
Obviously, we will have some exceptions now and then. But what if we can do this for the biggest part of our sites?
What if we try to look at the problem with effective simplicity to see where we end up instead of starting with a complex solution that we know won't scale easily?
You are the only one who can answer that question.
I know what my answer will be.