How do you scale an AWS (Amazon Web Services) infrastructure? This article will give you a detailed reply in two parts: the tools you can use to make the most of Amazon’s dynamic approach, and the architectural model you should adopt for a scalable infrastructure. I base my report on my experience gained in several AWS production projects in casual gaming (Facebook), e-commerce infrastructures and within the mainstream GIS (Geographic Information System). It’s true that my experience in gaming (IsCool, The Game) is currently the most representative in terms of scalability, due to the number of users (over 800 thousand DAU – daily active users – at peak usage and over 20 million page views every day), however my experiences in e-commerce and GIS (currently underway :o)) provide a different view of scalability, taking into account the various problems of availability and data management. I will therefore attempt to provide a detailed overview of the factors to take into account in order to optimise the dynamic nature of an infrastructure constructed in a Cloud Computing environment, and in this case, in the AWS environment.
The tools
I will provide examples of tools I used in my various production experiments and which won me over with their efficiency. However, it’s not so much the tools themselves which are emphasised, rather the underlying features that must be installed to fully exploit the AWS.
What is the difference between an AWS and a standard infrastructure?
There are two replies:
- Firstly, the AWS enable the Hardware and Infrastructure components to be handled via the code (through the APIs): this gives them momentum which needs to be intensified by correctly selecting your tools. Amazon also offers a certain number of technical guarantees with their services, which should be used to full advantage in order to concentrate on what is essential.
- The second answer should be: ‘None!’ However, it must be acknowledged that often the cosy aspects of ‘home’-hosted infrastructure can lead to a certain lack of rigor on many issues. The fact that Amazon, with its AWS, offers a dynamic and volatile solution for the EC2s, compels you to install mechanisms which should be standard.
Centralised configuration management
In order to instantiate multiple servers (EC2) on the fly to counter a load peak or perform occasional procedures, we have to use a centralised configuration management tool. This has multiple uses:
- Rapid instantiation of a machine corresponding to a pre-defined type: Web server, cache server, etc.
- Ensuring uniform configuration of the entire set of instances of the same type at all times.
- Capitalising on the expertise: packages, configurations, services etc. for each type of instance are contained within a descriptor. This enables new users (and others) to learn the composition of the servers simply by reading the associated descriptor(s.)
Puppet is a very attractive solution for this task. Puppet architecture comprises:
- A Puppet Master, guardian of configurations. It contains the descriptors of the packages to be installed, configuration files to be pushed and services to be started up for a new EC2 instance in order to prepare it according to the type of node or nodes with which you have associated it, or simply to keep an existing instance updated.
- Multiple Puppet clients installed on EC2 instances. The client regularly polls the master to check that his configuration is up to date. It is also possible to trigger the clients from the master to initiate an urgent update.
It is thus very simple to mount a new instance of a given type, and to be sure that from one server to another, the configuration is identical in every way. It should be noted that the Puppet Master descriptors run on a node system, and a server may reference more than one node. You will therefore obtain a very modular configuration.
Puppet Module Descriptor
The centralised configuration manager is indispensable on a scalable infrastructure, which can have a large number of instances, some of which should be mounted rapidly, in order to reply to user requests during load peak.
‘But’, you may say, ‘you do have to install the client on the new instance, don’t you?’ Oh yes, all the more so as there is a system of certificate request/acceptation between the client and the master, so that not just anybody can access the configurations. ‘There’s nothing simpler!’ I reply. Instead of using an HMI (Human Machine Interface) such as ElasticFox to connect to the Amazon services APIs, you need simply to connect with command lines via a script which instantiates an EC2, creates if necessary an EBS volume and associates it, connects with SSH on the new instance and installs the client, connects on the master and accepts the certificate request, reboots the client and ‘Bingo!’ The instance installs and configures itself. In general, the reactive and dynamic capacities inherent in EC2 and AWS require it, otherwise it would be an under-exploitation, or even a misuse of the tool.
And what about the AMIs (instance-store root devices) or the EBS root devices (note that for the latter there is not yet a great choice of templates)? It is however possible to build an AMI or even an EBS root device with the Puppet client installed on it (and to use, if necessary, the user-data at the launch of the instance in order to set the parameters if a ‘static’ configuration on the AMI is not sufficient). However, in a production environment, the AMI and the EBS root device do not allow you to replace a Puppet: if you have to modify the AMI every time you change the parameters and redeploy the EC2… you’ll never finish. The AMI is a good tool for instantiating a basic template, but not for maintaining a production infrastructure in good working order.
To sum up, it is even possible to envisage auto-scalability (a more complete service than the one offered by Amazon’s Auto Scaling, based on CloudWatch), which would amount to starting up this script when a value on a monitoring graph (metrology) exceeds a threshold, thereby instantiating a new instance to weather the peak period, an instance which would be ‘terminated’ after dipping below another threshold on the aforementioned graph. The possibilities are enormous. However, the cost of the final elements to be installed on a large production infrastructure to access the auto-scalability function is considerable (exponential complexity). If you get the chance, it’s an excellent handiwork, otherwise you’ll have to ‘settle’ for an easily scalable solution whereby you simply launch a script manually (or else at a fixed time based on the study of the metrics produced by the metrology, which is a satisfactory alternative). If necessary, manually adding one or two references in a file is a perfectly reasonable alternative.
Execution of automated tasks in parallel
Because of the variation in the number of instances according to the load of a scalable infrastructure, and particularly in the potentially large number of these instances, you cannot envisage carrying out maintenance and deployment operations on an individual basis for each machine (or manually, which is even worse!). This would not only be too time-consuming, but would also lead to errors. It is therefore essential to use tools specialising in automation and parallel task execution, such as Capistrano, which has numerous advantages:
- Scripting a certain number of tasks, whether complex or not (deliveries, backups, site publication/maintenance, etc.) executing them rapidly in parallel on X instances with a single command.
- Ensuring the reproducibility of a task.
- Capitalise on know-how (Cf. Puppet).
Capistrano Task Descriptor
This tool is very practical and easy to use. It executes tasks in parallel (by connecting with SSH – don’t use the ‘root’ key of your EC2 instances in the tool, it’s classier to use a dedicated key…) on one or a group of servers defined by roles.
This tool should be used as a supplement to Puppet and not a replacement, as they do not have the same targets:
Objective
Puppet – Uniform configuration of the server pool.
Capistrano – Task reproducibility and in-parallel execution.
Operating mode
Puppet – Regular pull requests by clients (or even push from the master).
Capistrano – Occasional tasks (manual request or by cron).
Orientation
Puppet – Pre-defined objects / notions such as ‘Package’, ‘Service’, ‘File’, etc.
Capistrano – Generic commands such as ‘upload’, ‘download’, ‘system’, ‘run’, etc.
Target
Puppet – Infrastructure and services.
Capistrano – Services and applications.
Note that Webistrano is an HMI enabling access to the Capistrano features and introducing the notion of project management by managing access to the tasks according to profile, to trace who deployed what on which server and to send email signals in response to certain actions. I find this joint operation with project management very interesting.
Monitoring
This is one of the crucial elements of a scalable architecture, if not the main one. It is essential to have the metrics to hand, precisely so you know when to scale. This enables you to identify the clogged points of the infrastructure, to analyse your site traffic and to make deductions about user behaviour. It also enables signals to be triggered.
The provision of metrics is more representative of metrology than of supervision. We will see that, in the context of an AWS infrastructure, metrology has greater added value than supervision (even if the latter remains indispensable). Speaking of which, there follows a clarification of the two monitoring disciplines that you need to recognise:
- Supervision verifies the state of a host or a service and sends out an alarm upon detecting any abnormal behaviour (over-long response time, status NOK, etc.) and requires immediate action on the part of the interface partner concerned. An alarm must signify that the host or the service is unusable (critical) or may soon be (warning) and must not send back ‘simple information’ which would obstruct the visibility of real incidents.
- Metrology enables data to be archived, and if necessary, processed or filtered, before it is presented in the form of graphs or reports. This data enables corrections to service development or configuration to be carried out after the event, in order to optimise the aforementioned services, and equally to define the resources to be attributed to them in the most effective way. This aspect of monitoring is just as important, for it will enable the service to be improved (and thereby the users feedback) and also to reduce costs. The metrology results should clearly highlight the improvements to be made by correlating the values recovered.
Quite a number of monitoring tools are available on the market, each with its pros and cons, some more supervision-oriented, others more towards metrology. Among them: Centreon/Nagios, Zabbix, Cacti and Munin. Personally I use Centreon/Nagios mainly for supervision and Cacti for software-oriented metrology with pre-packaged graphs such as for Apache, MySQL, Memcached, etc. I tried Munin, it lacks some of the useful features of Cacti but it is really easy to use and it matchs very well with Puppet (configuration based on descriptors and symlinks). I’ve also heard many good things about Zabbix (comprehensive range of features).
Thread Scoreboard Apache - Yearly - 1 Day Average
It should be noted that these tools operate mostly (Centreon, Cacti and Munin) on RRDtool (Round-Robin Database), a database management tool which also enables a graphic representation of the data contained in the base. The big advantage stems from the fact that the stored data volume is minimal, whether for storage of several months, or even years. This is made possible by calculating the averages of the time periods (from 1 minute to 1 day): the recent data is exact, whereas the older data is approximate. This is very practical, as you have both the recent exact metrics and you keep the historic, a representation of the evolution of your architecture. This is excellent for visualising the evolution of the key metrics and understanding the impact of the evolution of the components of the aforementioned infrastructure, or the applications supported as and when the different releases are delivered.
As for Cloud Computing-type infrastructures, metrology is the discipline with the highest added value, as it will enable feedback to your automation tools.
Keep in mind also that it’s interesting to be able to integrate dynamically the new instances installed and configured by Puppet in the monitoring tool. Give priority therefore to monitoring tools with modular configuration or simple access via APIs.
Managing centralised logs
It’s undeniable: the more instances you have, the more scattered logs there’ll be on the various instances. Moreover, sending all the logs to the syslog daemon becomes difficult, as the infrastructure components do not necessarily manage it very well (redirection, respecting the syslog protocol, etc.), not to mention the application(s) / functional block(s) which can log in the various files on the server.
Consider trying a tool such as Syslog-NG: it comprises a server and several clients, in fact it’s the same tool but with different configurations. What’s more, it is capable of regularly checking the log files (of business applications for example) and transmitting only the differential, even if it doesn’t respect the syslog protocol:
- The client recovers the files/pipe/unix-dgram/etc. logs and sends the information to the network (TCP/UDP).
- The server recovers the information from the network (TCP/UDP) and registers it in files, databases, etc.
Syslog-NG Architecture
We’re talking about only minimal differences in terms of configuration. The component can also act as a relay on more complex infrastructures by receiving the network logs from diverse clients and sending them back to the network for a server.
It is possible to apply filters before sending the logs to the network in order to send only the essential ones. Next, filters can also be applied at the server level in order to manage differently the logs coming from clients, according to their urgency, sender program, source host, etc.
It’s a simple and non-intrusive tool able to take into account the logs of scattered applications on a server (even if these logs do not respect the syslog protocol format – there is however an advantage to respecting this protocol: greater accuracy of the facilities and priorities).
Keep in mind, the same as for monitoring, that it is interesting to be able to integrate (and withdraw) dynamically the new instances to be managed at the logs system level.
Conclusion
Over and above the tools presented here, it is the underlying functions and capacities which are important. What stands out are the following:
- Tools such as centralised configuration managers, tools for the execution in parallel of automated tasks and also log managers, will become even more important. Working without them on such infrastructures is no longer conceivable.
- Monitoring is still and will remain a key value, but metrology allows you, with the analysis of key defined metrics in your infrastructure, to sustain the automation tools.
- Automation is indeed the key word promoted by AWS: the accessibility of the Hardware and Infrastructure resources via code allows us to move towards ever more autonomous architectures. However you must always position the cursor correctly, in the knowledge that the final stages of complete automation are the most expensive.
I hope that this technical feedback will interest you as much as I enjoyed working on these infrastructures. I will continue in the second part of this article: Choosing the correct architecture model for scalable infrastructures.
Frédéric FAURE @Twitter @Ysance