Warning, this is a highly opinionated post. Proceed with caution.
In this post I'm going to start a fictitious company - ACME Books Unlimited - and walk it through the growth of various services until it's hosting many AWS accounts and solving several complex issues.
We're focusing on the benefits of multiple AWS accounts, with a use case focusing on centralizing users while providing least privilege, security, logging, and cost isolation.
About ACME Books Unlimited
ACME Books Unlimited has about 20 employees, a mix of technical and non-technical staff. They don't have a systems administrator, but their CEO's son comes in on occasion. Their main product is producing ebooks from old textbooks and making them available online. They have a handful of servers on premises that are used to authenticate access, provide access to the ebooks (in reality these are just HTML versions of the books), and allow staff to make adjustments as required.
The current website was already slow, and they've recently landed a contract to host a lot more content for a third party. This content is expected to be the sum of all their current content times a thousand. To make matters a bit more stressful, there is a provision in the contract with a minimum SLA for response times.
The top brass have okay'd a migration to "the cloud", provided it will offer on-demand elasticity as they grow the business. It's our job to move them to the cloud.
Phase #1 - Lift and shift
To start, you head to the company's server room to find the specifications of the current website. In the back you find:
- 6 x CentOS 5 Apache web servers
- 2 x Windows Server 2003 MS SQL 2005 database servers
- 1 x CentOS 6 cron box with scheduled tasks
- 2 x Windows Server 2008 R2 Active Directory Domain Controllers
- 1 x Windows Server 2008 R2 File Server
The hardware looks to be mostly "black boxes" with unknown capacity, put together by hand. The newer three servers were purchased from an actual supplier and look to be in decent shape.
You arrange to interview the CEO's son, Joseph. Joseph reveals the following about the configuration:
- Access to all of the servers has been hooked up via Active Directory username + password.
- The website was written by a friend of his in PHP.
- The cron box pulls some data from the database and places it in files on the file server. There are some jobs that pull from the file server and push to the web servers.
- The cron box also runs memcache.
- All staff have access to the share on the file server. This is where they upload new content, make edits, and see data on "hits".
Following the interview with Joseph, you arrange a focus group with some of the staff to hear about how things work from their perspectives:
- Mary the receptionist explains that she's accidentally deleted the website before.
- Jon complains that his staff can't see their updates right away, and that it's "usually" the next day that they see their changes. But sometimes it can be a few days.
- Lisa explains that visitors to the website typically complain about how slow and unusable the website is.
With all this in mind you flesh out the costs for approval. Currently you're thinking of just a raw conversion of their on-premises setup. You come up with the following transition plan:
Part 1 - minimal footprint
| Purpose | Instance Type | Instance Count | Hourly Rate (per instance) | Expected Duration (hrs) | Expected Cost |
|---|---|---|---|---|---|
| web | t2.micro | 1 | $0.013 / hr | 720 hrs | $9.36 / m |
| db | db.t2.micro | 1 | $0.017 / hr | 720 hrs | $12.24 / m |
This means they can run a proof of concept of the move for under $25 per month, even if you include some cost for moving and syncing the current spreadsheets from the on-premises file server to S3. Note: I'm excluding the EBS costs from this calculation.
We'll assume that everything moves over nicely and is confirmed working. The staff test the new version and you get approval to cost out and scale up a full platform.
Part 2 - fully AWS
| Purpose | Instance Type | Instance Count | Hourly Rate (per instance) | Expected Duration (hrs) | Expected Cost |
|---|---|---|---|---|---|
| web | c4.large | 3 | $0.105 / hr | 720 hrs | $226.80 / m |
| elb | n/a | 1 | $0.025 / hr | 720 hrs | $18.00 / m |
| db | db.m4.large | 2 | $0.175 / hr | 720 hrs | $252.00 / m |
| memcache | cache.m3.large | 2 | $0.182 / hr | 720 hrs | $262.08 / m |
This estimate brings the bill to approximately $750 per month. The CEO is less fond of this number, but agrees to give it a try. After testing it out they notice that the website is much, much, much faster. You're given the green light to make the DNS switch.
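The arithmetic behind the Part 2 estimate can be checked with a few lines of Python. This is a minimal sketch that uses the hourly rates and the 720-hour month from the table; the `monthly_cost` helper and the `PART_2` layout are names I made up for illustration, not part of any AWS tooling.

```python
from decimal import Decimal

HOURS_PER_MONTH = 720  # 30 days x 24 hours, as used in the tables


def monthly_cost(hourly_rate, count, hours=HOURS_PER_MONTH):
    """Return the monthly on-demand cost in USD for `count` instances."""
    return Decimal(hourly_rate) * count * hours


# (purpose, hourly rate in USD per instance, instance count) from Part 2
PART_2 = [
    ("web", "0.105", 3),
    ("elb", "0.025", 1),
    ("db", "0.175", 2),
    ("memcache", "0.182", 2),
]

part_2_total = sum(monthly_cost(rate, count) for _, rate, count in PART_2)

for purpose, rate, count in PART_2:
    print(f"{purpose:<10}${monthly_cost(rate, count):>8.2f} / m")
print(f"{'total':<10}${part_2_total:>8.2f} / m")
```

The line items add up to $758.88 per month, which is the "approximately $750" quoted to the CEO.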
Everything goes well, and you do a couple other miscellaneous tasks, including:
- Enabling CloudTrail.
- Enabling Config.
- Ensuring auto scaling is running for the instances with a custom AMI.
- Enabling ELB logging.
- Enabling S3 logging.
- Enabling S3 versioning and lifecycle policies.
- Ensuring that the services are all using IAM roles.
Since the company isn't sure about the new platform yet, you're paying for everything on demand, but plan to move to reserved instance pricing when it's a reasonable proposition.
Phase #2 - An influx of business
Remember that contract that was mentioned in the beginning of Phase #1? Well they managed to keep everything going and met the required SLA. Now they've managed to sign 6 new contracts, some even bigger than the first.
You've even been given an official budget! Up to $10k per month is acceptable although they'd rather see it closer to $3k per month as long as performance doesn't suffer.
The new requirements are:
- Increase speed for users in another region.
- Increase speed globally.
- Have a failure plan for bad code deploys.
- Automatically scale based on load, while staying within specified limits.
- They want to add a new feature to output PDF versions of courses on demand, and share a link to a private download. This is a contractual obligation to onboard a new client.
To meet these, we need to clone our current stack to another region (by copying the AMI over and setting up the same resources), configure CloudFront as a CDN pointing at our load balancer, and configure a failover record in Route 53. We can also configure the auto scaling group to scale based on average load balancer latency or a similar metric, with minimum and maximum sizes keeping it within the specified limits. The last objective is a bit harder.
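The failover record is a pair of Route 53 record sets, one marked PRIMARY and one SECONDARY. Here is a rough sketch of the change batch, assuming alias records that point at a load balancer in each region. The domain name, load balancer DNS names, and hosted zone IDs below are placeholders I invented, and `failover_record` is a hypothetical helper.

```python
import json


def failover_record(name, role, elb_dns_name, elb_zone_id):
    """Build one alias record for a Route 53 failover pair.

    `role` must be "PRIMARY" or "SECONDARY".
    """
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError(f"unexpected failover role: {role}")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{name}-{role.lower()}",
            "Failover": role,
            "AliasTarget": {
                "HostedZoneId": elb_zone_id,  # the load balancer's zone ID
                "DNSName": elb_dns_name,
                # Lets Route 53 fail over when the target is unhealthy.
                "EvaluateTargetHealth": True,
            },
        },
    }


# Placeholder values: none of these names or IDs are real.
change_batch = {
    "Comment": "Failover pair for the public website",
    "Changes": [
        failover_record("www.example.com.", "PRIMARY",
                        "primary-elb.example.com.", "ZPRIMARYEXAMPLE"),
        failover_record("www.example.com.", "SECONDARY",
                        "secondary-elb.example.com.", "ZSECONDARYEXAMPLE"),
    ],
}

print(json.dumps(change_batch, indent=2))
```

With boto3 the resulting dictionary would be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, along with the `HostedZoneId` of your own zone.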
For the new feature we have to pick an implementation. In this example the web servers use the AWS SDK for PHP to publish an SNS message when a client requests a PDF. The message triggers a Lambda function, which processes the request and creates a signed URL that is emailed to the client. This is a quick and dirty approach that works if the Lambda function can process the request in a timely manner.
An alternative approach would be to send the request to an SQS queue, which gets processed by a Docker container hosted on ECS and auto scaled depending on the queue size.
For this example, we're going to assume that there are very few invocations in general, and the function takes 8 seconds to process each one. This means it's probably cheaper to go the Lambda route for now.
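To put a rough number on "probably cheaper", here is a back-of-the-envelope sketch. Everything in it apart from the 8 seconds and the c4.large rate from the earlier table is an assumption on my part: the Lambda prices are list prices as I remember them (check current AWS pricing), the 1 GB memory setting is a guess, and a single always-on c4.large stands in for the ECS worker. It ignores the free tier, SNS, SQS, and data transfer.

```python
# Assumed Lambda list prices in USD; verify against current AWS pricing.
PRICE_PER_REQUEST = 0.20 / 1_000_000
PRICE_PER_GB_SECOND = 0.0000166667

DURATION_SECONDS = 8  # from the scenario in this post
MEMORY_GB = 1.0       # assumption: function configured with 1 GB of memory

# Assumption: one always-on c4.large at $0.105 / hr stands in for ECS.
ALWAYS_ON_MONTHLY = 0.105 * 720


def lambda_monthly_cost(invocations):
    """Estimated monthly Lambda cost in USD for a number of invocations."""
    compute = invocations * DURATION_SECONDS * MEMORY_GB * PRICE_PER_GB_SECOND
    requests = invocations * PRICE_PER_REQUEST
    return compute + requests


def break_even_invocations():
    """Invocations per month at which Lambda costs as much as the worker."""
    return int(ALWAYS_ON_MONTHLY / lambda_monthly_cost(1))


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} invocations: ${lambda_monthly_cost(n):.2f} / m")
print(f"break-even: about {break_even_invocations()} invocations / m")
```

Under these assumptions Lambda stays cheaper until well into the hundreds of thousands of invocations per month, so the queue and container approach only starts to pay off once the feature sees heavy use.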
Phase #3 - Increasing the number of services
You should know the drill by now - more contracts, more traffic, more features. The company has really taken off and has started doing more internal development with a focus on adding value to their current implementations.
They've also become addicted to the ease of launching stuff on AWS.
So what have they done:
- Connected a site-to-site VPN to their office.
- Added an internal wiki in AWS.
- Added a Jenkins host for deploying changes.
- Added a web service for internal users to administer the public web service.
- Added a feature for clients to customize the look and feel of the web version of their content.
- Added a feature for clients to create and update a provider profile with custom details.
Somewhere in this they've spawned numerous development environments that are running next to the production instances. They've hacked apart your original IAM role, basically giving it administrative privileges.
We're not going to do anything this phase. It's just to show organic growth.
Phase #4 - Cost and access accounting
Okay, so after the business has seen exponential growth we're seeing a lot of costs climbing and a lot of badly configured access. We've been called in again to help with the following objectives:
- Separate out costs. They want to have a dev, qa, prod, and internal environment.
- All the different services should be network separated.
- All access should be logged to a central location.
- Billing should be aggregated in one spot.
- Other than the separate root users, each person accessing the accounts should have only one user that works across all of the accounts.
To accomplish this:
- We're going to shard each group into a separate account. Each service will be network separated with its own VPC. And each instance / service (RDS, ElastiCache) will be tagged with billing information.
- Where applicable we're going to send all logging information to another account, so CloudTrail, Config, ELB, S3, and OS logs will be available in a centralized location.
- We're going to create a billing account that's accessed by the company's accountant. All other accounts will use consolidated billing to the billing account.
- For users to access our accounts we're going to create an account where we store our IAM users. These users will have essentially no privileges except to assume roles in other accounts. The roles they assume in each account will provide the IAM permissions.
This will result in a total of seven AWS accounts being created. Three for different stages of their platform, one for internal systems, one for logging, one for billing, and one for users.
To paint a word picture - as a user you log into the one account on the console or the command line. From there you can assume a role in another account depending on what you're working on. The key here is that you can restrict which roles a user can assume and clearly define what each role can do. Whether a user should have access to an environment becomes easy to define, whereas when everything is in one account you need to be very particular about your policies.
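The users account setup boils down to two small policy documents: a trust policy on each role in a target account, and a permissions policy on the user (or group) in the users account. The sketch below builds both as Python dictionaries. The account IDs and the `developer` role name are made up, and the helper names are mine; treat it as an illustration of the shape rather than a finished policy set.

```python
import json

# Hypothetical account IDs, for illustration only.
USERS_ACCOUNT_ID = "111111111111"
PROD_ACCOUNT_ID = "222222222222"


def trust_policy(users_account_id, require_mfa=True):
    """Trust policy attached to a role in a target account.

    It lets principals from the users account assume the role,
    optionally only when they signed in with MFA.
    """
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{users_account_id}:root"},
        "Action": "sts:AssumeRole",
    }
    if require_mfa:
        statement["Condition"] = {
            "Bool": {"aws:MultiFactorAuthPresent": "true"}
        }
    return {"Version": "2012-10-17", "Statement": [statement]}


def assume_role_policy(role_arns):
    """Permissions policy attached to a user in the users account.

    The only thing it allows is assuming the listed roles.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": sorted(role_arns),
            }
        ],
    }


prod_role_arn = f"arn:aws:iam::{PROD_ACCOUNT_ID}:role/developer"
print(json.dumps(trust_policy(USERS_ACCOUNT_ID), indent=2))
print(json.dumps(assume_role_policy([prod_role_arn]), indent=2))
```

What a user can actually do in production is then defined by the permissions attached to the `developer` role itself, which lives in the production account.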
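On the command line or in scripts, the same hop between accounts goes through STS. Below is a minimal sketch, assuming boto3 is used and the caller already has credentials for the users account. The `assume_role` wrapper, the account ID, and the role and session names are hypothetical; the underlying `assume_role` call on the STS client is the real API.

```python
def assume_role(sts_client, account_id, role_name, session_name):
    """Assume a role in another account and return temporary credentials.

    `sts_client` is expected to behave like boto3.client("sts").
    The returned dict uses the keyword names boto3.Session accepts.
    """
    response = sts_client.assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/{role_name}",
        RoleSessionName=session_name,
    )
    credentials = response["Credentials"]
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }


# Example usage (needs boto3 and real credentials, so it is left commented):
#
#   import boto3
#   creds = assume_role(boto3.client("sts"), "222222222222",
#                       "developer", "mary-prod-session")
#   prod = boto3.Session(**creds)
#   print(prod.client("sts").get_caller_identity()["Arn"])
```

The session name ends up in CloudTrail, so using something that identifies the person makes the central logs easier to read.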
Billing tags can be used to distinguish what's being used by what services. A handy thing to do for cost control.
Anyways, this covers their goals for now, next section should be a fun one.
Phase #5 - Auditing
So after we helped them out last time, they landed a financial contract that requires them to get a standard compliance certification (SOC 1 or SOC 2, for example). Now they need our help to prepare for the audit.
In the previous phase we actually did several things that are going to help them out. I'm just going to focus on the technology here, but be aware that most of these compliance certifications require organizational controls as well as technical controls.
What we've got so far:
- A secure account that doesn't run anything and just holds our users.
- Another account that controls and audits our costs.
- Another account for internal systems that only a select few need access to.
- Three accounts that host our platform with separate VPCs for different components and limited IAM functionality.
- Any server with a role is assigned an IAM role with restricted permissions as needed to perform that role.
- Resources are tagged with billing information.
- All logs are sent to our logging account, where only a select group of users have access for auditing purposes.
We're in pretty good shape for a basic compliance audit. Things to consider:
- Ensure all users are using MFA for console access at a minimum.
- Security groups are set to only allow a minimal amount of traffic between hosts. Bonus points if you set up a proxy or something that only exists to proxy requests for yum or similar.
- Don't log in as root / ec2-user / any other single user account. Every person logging in should use their own set of ssh credentials. And not the username + password kind.
- Enable MFA Delete on the S3 buckets that hold logs and log versions. Basically only root should be able to delete audit logs, and only if root has the MFA token.
- Use CloudWatch alarms to monitor your log files and notify an SNS topic as needed.
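As one illustration of the log protection item, here is a sketch of a bucket policy that denies object deletion on the log bucket unless the request was authenticated with MFA. Note that this is a policy-based complement to S3's MFA Delete versioning setting, not the same feature. The bucket name is a placeholder and `deny_delete_without_mfa` is a helper name I made up.

```python
import json


def deny_delete_without_mfa(bucket_name):
    """Bucket policy that blocks deletes for requests made without MFA."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLogDeletionWithoutMFA",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                # The key is absent (null) when no MFA was used to sign in.
                "Condition": {"Null": {"aws:MultiFactorAuthAge": "true"}},
            }
        ],
    }


# "acme-central-logs" is a hypothetical bucket name.
print(json.dumps(deny_delete_without_mfa("acme-central-logs"), indent=2))
```

Because it is an explicit deny, it also applies to roles used by servers and services, which never have MFA. For audit logs that is usually the intent.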
Otherwise the main thing for audits is the ability to generate reports that the auditors can understand. This is where scripting and the AWS CLI or AWS SDK can be pretty useful. A really robust approach might involve having a collection of Lambda functions that run scripts to generate reports on a scheduled basis, storing them and emailing them out as desired.
So back to the original thought here - we just grew a company through several stages of AWS growth. They're now a big deal, with certifications, a bunch of big clients, etc. We helped them solve several issues and we focused on the benefits of multiple AWS accounts, with a use case focusing on centralizing users while providing least privilege, security, logging, and cost isolation.
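As an example of the kind of report script this could be, the sketch below reads an IAM credential report (a CSV file) and lists console users who have no MFA device. The sample data is invented and trimmed down to four of the report's columns, and `users_without_mfa` is a hypothetical helper.

```python
import csv
import io


def users_without_mfa(report_csv):
    """Return users with a console password enabled but no MFA device."""
    reader = csv.DictReader(io.StringIO(report_csv))
    return [
        row["user"]
        for row in reader
        if row["password_enabled"] == "true" and row["mfa_active"] == "false"
    ]


# Invented sample, trimmed to a few columns of the real report format.
SAMPLE_REPORT = """user,arn,password_enabled,mfa_active
<root_account>,arn:aws:iam::111111111111:root,not_supported,true
mary,arn:aws:iam::111111111111:user/mary,true,false
jon,arn:aws:iam::111111111111:user/jon,true,true
deploy-bot,arn:aws:iam::111111111111:user/deploy-bot,false,false
"""

print(users_without_mfa(SAMPLE_REPORT))
```

With boto3 the real report can be fetched by calling `generate_credential_report` and then `get_credential_report` on the IAM client; the `Content` field of the response holds the CSV as bytes. Wrapping this in a scheduled Lambda function that emails the result is one way to cover the MFA item for the auditors.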
Now to the highly opinionated part warned about at the start: as you grow your AWS account it can get messy. Clean it up by practicing good account hygiene or split it out so that you can adequately scope the mess. When things get messy it's harder to spot that one configuration issue. On that note, use CloudFormation so that your infrastructure changes are under version control!
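To make that last point concrete, here is a tiny sketch of a CloudFormation template kept as code. It defines a single versioned S3 bucket; the logical name `AuditLogBucket` and the description are made up. The idea is that the generated JSON gets committed to version control and changes go through review like any other code.

```python
import json

# Minimal template: one S3 bucket with versioning switched on.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Hypothetical example: versioned bucket for audit logs",
    "Resources": {
        "AuditLogBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "VersioningConfiguration": {"Status": "Enabled"},
            },
        }
    },
}

template_json = json.dumps(template, indent=2, sort_keys=True)
print(template_json)
```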