Global overview of the Seams-CMS infrastructure

Creating a highly available content platform is not easy. We must ensure that the data is always available and that the platform does not buckle under heavy traffic. We must also take into account things like privacy, encryption, and backups. This blog post gives you some insight into our infrastructure and how we solve some of the problems that come with building a SaaS platform.

Infrastructure

Our infrastructure is located in the AWS cloud, which makes it easy for us to shape the infrastructure as we see fit. We use EC2 machines instead of containers. In theory we could move to containers and use an orchestrator like Kubernetes, but for now that would be overkill: it adds the complexity of maintaining a Kubernetes cluster (even though AWS can take care of much of that through its EKS service). We have many different machines for different purposes. Our production platform runs in a dedicated network (VPC), while development machines (for instance, the machines used for end-to-end testing) run in a separate network.

Data storage

The heart of our system is a MongoDB cluster that holds all content, content types, relationships between content, and asset meta information. Early in the design phase, we had to choose between MongoDB and PostgreSQL for storing our content. Since we already had some experience with handling lots of documents in MongoDB, we opted for MongoDB. However, we created a layer in our code that would allow us to switch easily between MongoDB and PostgreSQL in case MongoDB did not meet our expectations. Fortunately, MongoDB holds up perfectly, but the layer is still present in our code.
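A storage layer like this usually boils down to an interface that the rest of the code depends on, with one implementation per database. The following is only a minimal sketch of that idea, not our actual code: the names `ContentStore`, `InMemoryStore` and `publish` are hypothetical, and the in-memory backend simply stands in for a MongoDB or PostgreSQL implementation.

```python
from typing import Optional, Protocol


class ContentStore(Protocol):
    """Hypothetical storage interface; the real layer's methods may differ."""

    def get(self, content_id: str) -> Optional[dict]: ...

    def put(self, content_id: str, document: dict) -> None: ...


class InMemoryStore:
    """Stand-in backend. A MongoDB or PostgreSQL backend would expose the same methods."""

    def __init__(self) -> None:
        self._docs: dict[str, dict] = {}

    def get(self, content_id: str) -> Optional[dict]:
        return self._docs.get(content_id)

    def put(self, content_id: str, document: dict) -> None:
        # Store a copy so later changes by the caller do not leak into the store.
        self._docs[content_id] = dict(document)


def publish(store: ContentStore, content_id: str, title: str) -> Optional[dict]:
    # Application code talks to the interface only, never to a specific database.
    store.put(content_id, {"title": title, "published": True})
    return store.get(content_id)
```

Swapping the database then means writing a new class with the same two methods, while callers such as `publish` stay unchanged.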

We maintain our MongoDB cluster entirely ourselves. It consists of multiple machines, each serving as a node in the cluster. When a node disconnects (because of network issues, or because the machine goes down), the cluster keeps functioning. As long as a majority of the nodes can still reach each other, the cluster stays available, so we can lose almost half of our nodes before MongoDB becomes unavailable.
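How many nodes can be lost depends on the size of the cluster. The sketch below assumes a standard MongoDB replica set in which every node is a voting member and a primary can only be elected while a majority of them is reachable; the function name is ours, purely for illustration.

```python
def tolerated_failures(node_count: int) -> int:
    """Number of nodes that can fail while a majority is still reachable."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    majority = node_count // 2 + 1
    return node_count - majority
```

Under this assumption a three-node cluster survives the loss of one node and a five-node cluster survives the loss of two. An even number of nodes adds no extra tolerance: four nodes still survive only one failure.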

We continuously monitor MongoDB cluster performance and are automatically notified when something goes wrong.

Another important topic is the backup of our content data (well, your data, actually). We do this by snapshotting the disks of each of our MongoDB nodes. This is a very fast and non-intrusive operation, so we have a complete snapshot of the data for every hour. If something goes wrong within that hour, we still have binary logs that allow us to replay the changes from the last hour onto the cluster.

These snapshots also allow us to scale the cluster quickly. We can simply create a new machine from the latest snapshot, so the new node only needs to catch up on the last hour of changes.
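The combination of hourly snapshots and log replay can be illustrated with a little arithmetic: pick the newest snapshot taken at or before the moment you want to restore to, then replay the logs for the remaining gap. This is a simplified sketch with hypothetical function names, not our actual restore tooling.

```python
from datetime import datetime, timedelta


def latest_snapshot_before(snapshots: list[datetime], target: datetime) -> datetime:
    """Return the newest snapshot taken at or before the target time."""
    candidates = [taken_at for taken_at in snapshots if taken_at <= target]
    if not candidates:
        raise ValueError("no snapshot exists at or before the target time")
    return max(candidates)


def replay_window(snapshots: list[datetime], target: datetime) -> timedelta:
    """Span of logged changes to replay on top of the chosen snapshot."""
    return target - latest_snapshot_before(snapshots, target)
```

With snapshots at 10:00, 11:00 and 12:00, restoring to 11:40 starts from the 11:00 snapshot and replays 40 minutes of changes. With hourly snapshots, the replay window never exceeds one hour.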

Assets

Another large part of the content consists of the assets of our users. These assets are stored in S3 buckets, which provide high availability because the data is stored in multiple places at once. Because of this, we do not currently back up this data ourselves, but we are looking into the possibilities (and costs) of storing it off-site.

This data is accessed through Amazon's CloudFront CDN. This allows us to give you the quickest access to your data (based on geolocation, for instance) and to convert images to the requested size. The conversion is done by a secondary system that sits between the CDN and the S3 buckets. If you request an image in a certain dimension, this system fetches the image from the S3 bucket, scales it to the proper size, and saves it on the CDN. The next time the same image is requested with the same dimensions, it is served automatically from the CDN.
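The resize-on-first-request behaviour can be sketched as a cache keyed on the image path plus the requested dimensions. Everything here is an assumption made for illustration: the class name, the two callables that stand in for "fetch from S3" and "scale the image", and the plain dictionary that stands in for the CDN.

```python
from typing import Callable


class ResizeCache:
    """Serves resized images, producing each (path, width, height) variant once."""

    def __init__(
        self,
        fetch_original: Callable[[str], bytes],
        resize: Callable[[bytes, int, int], bytes],
    ) -> None:
        self._fetch_original = fetch_original  # stand-in for reading from the bucket
        self._resize = resize                  # stand-in for the image scaler
        self._cache: dict[tuple[str, int, int], bytes] = {}
        self.origin_fetches = 0

    def get(self, path: str, width: int, height: int) -> bytes:
        key = (path, width, height)
        if key not in self._cache:
            # First request for this size: fetch the original and scale it.
            self.origin_fetches += 1
            original = self._fetch_original(path)
            self._cache[key] = self._resize(original, width, height)
        return self._cache[key]
```

Requesting the same image in the same dimensions a second time does not touch the bucket again, while a new dimension counts as a separate variant and triggers one more fetch.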

Secondary data

We have so-called secondary data as well. This is data that is not content-related per se but still needs to be available. It consists of user account info, workspaces, permissions, API keys, etc. Since this data is not accessed a lot, it is stored in a MySQL server running on Amazon RDS. We create backups of this data each hour as well, and store them encrypted and off-site.

Accessing content

To combat spikes in traffic, we offload our data to a Content Delivery Network (CDN). In our case, this is Fastly. Whenever you query one of our APIs, you connect to one of the servers of the CDN. If the data is present there, it is served directly; otherwise, the CDN fetches the data from our servers. This means that our servers are only queried when data has changed or when new data is requested.
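The effect on our servers can be shown with a small model of an edge cache. This is a sketch under assumptions, not a description of how Fastly works internally: the `EdgeCache` class and its `purge` method are hypothetical, and we assume here that a cached response is explicitly removed when the underlying content changes.

```python
from typing import Callable


class EdgeCache:
    """Toy model of a CDN node in front of an origin server."""

    def __init__(self, origin: Callable[[str], str]) -> None:
        self._origin = origin
        self._entries: dict[str, str] = {}
        self.origin_hits = 0

    def get(self, url: str) -> str:
        if url not in self._entries:
            # Cache miss: only now is the origin server queried.
            self.origin_hits += 1
            self._entries[url] = self._origin(url)
        return self._entries[url]

    def purge(self, url: str) -> None:
        # Assumed to be triggered when the content behind this URL changes.
        self._entries.pop(url, None)
```

However many times the same URL is requested, the origin is hit once per purge, so a traffic spike on popular content lands on the CDN rather than on our servers.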