Table of Contents
AWS in General
|Specific AWS Services||Basics||Tips||Gotchas|
|RDS Aurora MySQL||?||?||?|
|RDS Aurora PostgreSQL||?||?||?|
|RDS MySQL and MariaDB||?||?||?|
|RDS SQL Server||?||?||?|
|Security and IAM||?||?||?|
|VPCs, Network Security, and Security Groups||?||?||?|
Figures and Tables
- Figure: Tools and Services Market Landscape: A selection of third-party companies/products
- Figure: AWS Data Transfer Costs: Visual overview of data transfer costs
- Table: Service Matrix: How AWS services compare to alternatives
- Table: AWS Product Maturity and Releases: AWS product releases
- Table: Storage Durability, Availability, and Price: A quantitative comparison
Why an Open Guide?
A lot of information on AWS is already written. Most people learn AWS by reading a blog or a “getting started guide” and referring to the standard AWS references. Nonetheless, trustworthy and practical information and recommendations aren’t easy to come by. AWS’s own documentation is a great but sprawling resource that few have time to read fully, and it includes only official facts, so it omits the experiences of engineers. The information in blogs or on Stack Overflow is also not consistently up to date.
This guide is by and for engineers who use AWS. It aims to be a useful, living reference that consolidates links, tips, gotchas, and best practices. It arose from discussion and editing over beers by several engineers who have used AWS extensively.
This is an early in-progress draft! It’s our first attempt at assembling this information, so it is still far from comprehensive and is likely to have omissions or errors.
Please help by joining the Slack channel (we like to talk about AWS in general, even if you only have questions — discussion helps the community and guides improvements) and contributing to the guide. This guide is open to contributions, so unlike a blog, it can keep improving. Like any open source effort, we combine efforts but also review to ensure high quality.
- Currently, this guide covers selected “core” services, such as EC2, S3, Load Balancers, EBS, and IAM, and partial details and tips around other services. We expect it to expand.
- It is not a tutorial, but rather a collection of information you can read and return to. It is for both beginners and the experienced.
- The goal of this guide is to be:
- Brief: Keep it dense and use links
- Practical: Basic facts, concrete details, advice, gotchas, and other “folk knowledge”
- Current: We can keep updating it, and anyone can contribute improvements
- Thoughtful: The goal is to be helpful rather than present dry facts. Thoughtful opinion with rationale is welcome. Suggestions, notes, and opinions based on real experience can be extremely valuable. (We believe this is possible with a guide of this format, unlike in some other venues.)
- This guide is not sponsored by AWS or AWS-affiliated vendors. It is written by and for engineers who use AWS.
- ? Marks standard/official AWS pages and docs
- ? Important or often overlooked tip
- ❗ “Serious” gotcha (used where risks or time or resource costs are significant: critical security risks, mistakes with significant financial cost, or poor architectural choices that are fundamentally difficult to correct)
- ? “Regular” gotcha, limitation, or quirk (used where consequences are things not working, breaking, or not scaling gracefully)
- ? Undocumented feature (folklore)
- ? Relatively new (and perhaps immature) services or features
- ⏱ Performance discussions
- ⛓ Lock-in: Products or decisions that are likely to tie you to AWS in a new or significant way — that is, later moving to a non-AWS alternative would be costly in terms of engineering effort
- ? Alternative non-AWS options
- ? Cost issues, discussion, and gotchas
- ? A mild warning attached to “full solution” or opinionated frameworks that may take significant time to understand and/or might not fit your needs exactly; the opposite of a point solution (the cathedral is a nod to Raymond’s metaphor)
- ??? Colors indicate basics, tips, and gotchas, respectively.
- ? Areas where correction or improvement are needed (possibly with link to an issue — do help!)
When to Use AWS
- AWS is the dominant public cloud computing provider.
- In general, “cloud computing” can refer to one of three types of cloud: “public,” “private,” and “hybrid.” AWS is a public cloud provider, since anyone can use it. Private clouds are within a single (usually large) organization. Many companies use a hybrid of private and public clouds.
- The core features of AWS are infrastructure-as-a-service (IaaS) — that is, virtual machines and supporting infrastructure. Other cloud service models include platform-as-a-service (PaaS), which typically are more fully managed services that deploy customers’ applications, or software-as-a-service (SaaS), which are cloud-based applications. AWS does offer a few products that fit into these other models, too.
- In business terms, with infrastructure-as-a-service you have a variable cost model — it is OpEx, not CapEx (though some pre-purchased contracts are still CapEx).
- AWS’s annual revenue was $17.46 billion in 2017 according to their SEC 10-K filing, or roughly 10% of Amazon.com’s total 2017 revenue.
- Main reasons to use AWS:
- If your company is building systems or products that may need to scale
- and you have technical know-how
- and you want the most flexible tools
- and you’re not significantly tied into different infrastructure already
- and you don’t have internal, regulatory, or compliance reasons you can’t use a public cloud-based solution
- and you’re not on a Microsoft-first tech stack
- and you don’t have a specific reason to use Google Cloud
- and you can afford, manage, or negotiate its somewhat higher costs
- … then AWS is likely a good option for your company.
- Each of those reasons above might point to situations where other services are preferable. In practice, many, if not most, tech startups as well as a number of modern large companies can or already do benefit from using AWS. Many large enterprises are partly migrating internal infrastructure to Azure, Google Cloud, and AWS.
- Costs: Billing and cost management are such big topics that we have an entire section on this.
- ?EC2 vs. other services: Most users of AWS are most familiar with EC2, AWS’ flagship virtual server product, and possibly a few others like S3 and CLBs. But AWS products now extend far beyond basic IaaS, and often companies do not properly understand or appreciate all the many AWS services and how they can be applied, due to the sharply growing number of services, their novelty and complexity, branding confusion, and fear of ⛓lock-in to proprietary AWS technology. Although a bit daunting, it’s important for technical decision-makers in companies to understand the breadth of the AWS services and make informed decisions. (We hope this guide will help.)
- ?AWS vs. other cloud providers: While AWS is the dominant IaaS provider (31% market share in this 2016 estimate), there is significant competition, and some alternatives are better suited to some companies. This Gartner report has a good overview of the major cloud players:
- Google Cloud Platform. GCP arrived later to market than AWS, but has vast resources and is now used widely by many companies, including a few large ones. It is gaining market share. Not all AWS services have similar or analogous services in GCP. And vice versa: In particular, GCP offers some more advanced machine learning-based services like the Vision, Speech, and Natural Language APIs. It’s not common to switch once you’re up and running, but it does happen: Spotify migrated from AWS to Google Cloud. There is more discussion on Quora about relative benefits. Of particular note is that VPCs in GCP are global by default with subnetworks per region, while AWS’ VPCs have to live within a particular region. This gives GCP an edge if you’re designing applications with geo-replication from the beginning. It’s also possible to share one GCP VPC between multiple projects (roughly analogous to AWS accounts), while in AWS you’d have to peer them. It’s also possible to peer GCP VPCs in a similar manner to how it’s done in AWS.
- Microsoft Azure is the de facto choice for companies and teams that are focused on a Microsoft stack, and it has now placed significant emphasis on Linux as well.
- In China, AWS’ footprint is relatively small. The market is dominated by Alibaba’s Alibaba Cloud, formerly called Aliyun.
- Companies at (very) large scale may want to reduce costs by managing their own infrastructure. For example, Dropbox migrated to their own infrastructure.
- Other cloud providers such as Digital Ocean offer similar services, sometimes with greater ease of use, more personalized support, or lower cost. However, none of these match the breadth of products, mind-share, and market domination AWS now enjoys.
- Traditional managed hosting providers such as Rackspace offer cloud solutions as well.
- ?AWS vs. PaaS: If your goal is just to put up a single service that does something relatively simple, and you’re trying to minimize time managing operations engineering, consider a platform-as-a-service such as Heroku. The AWS approach to PaaS, Elastic Beanstalk, is arguably more complex, especially for simple use cases.
- ?AWS vs. web hosting: If your main goal is to host a website or blog, and you don’t expect to be building an app or more complex service, you may wish to consider one of the myriad web hosting services.
- ?AWS vs. managed hosting: Traditionally, many companies pay managed hosting providers to maintain physical servers for them, then build and deploy their software on top of the rented hardware. This makes sense for businesses that want direct control over hardware, due to legacy, performance, or special compliance constraints, but is considered old-fashioned or unnecessary by many developer-centric startups and younger tech companies.
- Complexity: AWS will let you build and scale systems to the size of the largest companies, but the complexity of the services when used at scale requires significant depth of knowledge and experience. Even very simple use cases often require more knowledge to do “right” in AWS than in a simpler environment like Heroku or Digital Ocean. (This guide may help!)
- Geographic locations: AWS has data centers in over a dozen geographic locations, known as regions, in Europe, East Asia, North and South America, and now Australia and India. It also has many more edge locations globally for reduced latency of services like CloudFront.
- See the current list of regions and edge locations, including upcoming ones.
- If your infrastructure needs to be in close physical proximity to another service for latency or throughput reasons (for example, latency to an ad exchange), viability of AWS may depend on the location.
- ⛓Lock-in: As you use AWS, it’s important to be aware when you are depending on AWS services that do not have equivalents elsewhere.
- Lock-in may be completely fine for your company, or a significant risk. It’s important from a business perspective to make this choice explicitly, and consider the cost, operational, business continuity, and competitive risks of being tied to AWS. AWS is such a dominant and reliable vendor, many companies are comfortable with using AWS to its full extent. Others can tell stories about the dangers of “cloud jail” when costs spiral.
- Generally, the more AWS services you use, the more lock-in you have to AWS — that is, the more engineering resources (time and money) it will take to change to other providers in the future.
- Basic services like virtual servers and standard databases are usually easy to migrate to other providers or on premises. Others like load balancers and IAM are specific to AWS but have close equivalents from other providers. The key thing to consider is whether engineers are architecting systems around specific AWS services that are not open source or relatively interchangeable. For example, Lambda, API Gateway, Kinesis, Redshift, and DynamoDB do not have substantially equivalent open source or commercial service equivalents, while EC2, RDS (MySQL or Postgres), EMR, and ElastiCache more or less do. (See more below, where these are noted with ⛓.)
- Combining AWS and other cloud providers: Many customers combine AWS with other non-AWS services. For example, legacy systems or secure data might be in a managed hosting provider, while other systems are on AWS. Or a company might only use S3 with another provider doing everything else. However, small startups or projects starting fresh will typically stick to AWS or Google Cloud only.
- Hybrid cloud: In larger enterprises, it is common to have hybrid deployments encompassing private cloud or on-premises servers and AWS — or other enterprise cloud providers like IBM/Bluemix, Microsoft/Azure, NetApp, or EMC.
- Major customers: Who uses AWS and Google Cloud?
- AWS’s list of customers includes large numbers of mainstream online properties and major brands, such as Netflix, Pinterest, Spotify (moving to Google Cloud), Airbnb, Expedia, Yelp, Zynga, Comcast, Nokia, and Bristol-Myers Squibb.
- Azure’s list of customers includes companies such as NBC Universal, 3M and Honeywell Inc.
- Google Cloud’s list of customers is large as well, and includes a few mainstream sites, such as Snapchat, Best Buy, Domino’s, and Sony Music.
Which Services to Use
- AWS offers a lot of different services — about a hundred at last count.
- Most customers use a few services heavily, a few services lightly, and the rest not at all. What services you’ll use depends on your use cases. Choices differ substantially from company to company.
- Immature and unpopular services: Just because AWS has a service that sounds promising, it doesn’t mean you should use it. Some services are very narrow in use case, immature, overly opinionated, or otherwise limited, so building your own solution may be better. We try to give a sense for this by breaking products into categories.
- Must-know infrastructure: Most typical small to medium-size users will focus on the following services first. If you manage use of AWS systems, you likely need to know at least a little about all of these. (Even if you don’t use them, you should learn enough to make that choice intelligently.)
- IAM: User accounts and identities (you need to think about accounts early on!)
- EC2: Virtual servers and associated components, including:
- S3: Storage of files
- Route 53: DNS and domain registration
- VPC: Virtual networking, network security, and co-location; you automatically use
- CloudFront: CDN for hosting content
- CloudWatch: Alerts, paging, monitoring
- Managed services: Existing software solutions you could run on your own, but with managed deployment:
- Optional but important infrastructure: These are key and useful infrastructure components that are less widely known and used. You may have legitimate reasons to prefer alternatives, so evaluate with care to be sure they fit your needs:
- ⛓Lambda: Running small, fully managed tasks “serverless”
- CloudTrail: AWS API logging and audit (often neglected but important)
- ⛓?CloudFormation: Templatized configuration of collections of AWS resources
- ?Elastic Beanstalk: Fully managed (PaaS) deployment of packaged Java, .NET, PHP, Node.js, Python, Ruby, Go, and Docker applications
- ?EFS: Network filesystem compatible with NFSv4.1
- ⛓?ECS: Docker container/cluster management (note Docker can also be used directly, without ECS)
- ? EKS: Kubernetes (K8s) Docker container/cluster management
- ⛓ECR: Hosted private Docker registry
- ?Config: AWS configuration inventory, history, change notifications
- ?X-Ray: Trace analysis and debugging for distributed applications such as microservices.
- Special-purpose infrastructure: These services are focused on specific use cases and should be evaluated if they apply to your situation. Many also are proprietary architectures, so tend to tie you to AWS.
- ⛓DynamoDB: Low-latency NoSQL key-value store
- ⛓Glacier: Slow and cheap alternative to S3
- ⛓Kinesis: Streaming (distributed log) service
- ⛓SQS: Message queueing service
- ⛓Redshift: Data warehouse
- ?QuickSight: Business intelligence service
- SES: Send and receive e-mail for marketing or transactions
- ⛓API Gateway: Proxy, manage, and secure API calls
- ⛓IoT: Manage bidirectional communication over HTTP, WebSockets, and MQTT between AWS and clients (often but not necessarily “things” like appliances or sensors)
- ⛓WAF: Web firewall for CloudFront to deflect attacks
- ⛓KMS: Store and manage encryption keys securely
- Inspector: Security audit
- Trusted Advisor: Automated tips on reducing cost or making improvements
- ?Certificate Manager: Manage SSL/TLS certificates for AWS services
- ?⛓Fargate: Docker container management, backend for ECS and EKS
- Compound services: These are similarly specific, but are full-blown services that tackle complex problems and may tie you in. Usefulness depends on your requirements. If you have large or significant need, you may have these already managed by in-house systems and engineering teams.
- Machine Learning: Machine learning model training and classification
- Lex: Automatic speech recognition (ASR) and natural language understanding (NLU)
- Polly: Text-to-speech engine in the cloud
- Rekognition: Service for image recognition
- ⛓?Data Pipeline: Managed ETL service
- ⛓?SWF: Managed state tracker for distributed polyglot job workflow
- ⛓?Lumberyard: 3D game engine
- Mobile/app development:
- Enterprise services: These are relevant if you have significant corporate cloud-based or hybrid needs. Many smaller companies and startups use other solutions, like Google Apps or Box. Larger companies may also have their own non-AWS IT solutions.
- AppStream: Windows apps in the cloud, with access from many devices
- Workspaces: Windows desktop in the cloud, with access from many devices
- WorkDocs (formerly Zocalo): Enterprise document sharing
- WorkMail: Enterprise managed e-mail and calendaring service
- Directory Service: Microsoft Active Directory in the cloud
- Direct Connect: Dedicated network connection between office or data center and AWS
- Storage Gateway: Bridge between on-premises IT and cloud storage
- Service Catalog: IT service approval and compliance
- Probably-don’t-need-to-know services: Bottom line, our informal polling indicates these services are just not broadly used — and often for good reasons:
- Snowball: If you want to ship petabytes of data into or out of Amazon using a physical appliance, read on.
- Snowmobile: Appliances are great, but if you’ve got exabyte scale data to get into Amazon, nothing beats a tractor trailer full of drives.
- CodeCommit: Git service. You’re probably already using GitHub or your own solution (Stackshare has informal stats).
- ?CodePipeline: Continuous integration. You likely have another solution already.
- ?CodeDeploy: Deployment of code to EC2 servers. Again, you likely have another solution.
- ?OpsWorks: Management of your deployments using Chef or (as of November 2017) Puppet Enterprise.
- AWS in Plain English offers a friendlier explanation of what all the other services are.
Tools and Services Market Landscape
There are now enough cloud and “big data” enterprise companies and products that few can keep up with the market landscape. (See the Big Data Evolving Landscape – 2016 for one attempt at this.)
We’ve assembled a landscape of a few of the services. This is far from complete, but tries to emphasize services that are popular with AWS practitioners: services that specifically help with AWS, that complement it, or that almost anyone using AWS must learn.
? Suggestions to improve this figure? Please file an issue.
- ? The AWS General Reference covers a bunch of common concepts that are relevant for multiple services.
- AWS allows deployments in regions, which are isolated geographic locations that help you reduce latency or offer additional redundancy. Regions contain availability zones (AZs), which are typically the first tool of choice for high availability. AZs are physically separate from one another even within the same region, and may span multiple physical data centers. While they are connected via low-latency links, natural disasters afflicting one should not affect others.
- Each service has API endpoints for each region. Endpoints differ from service to service and not all services are available in each region, as listed in these tables.
- Amazon Resource Names (ARNs) are specially formatted identifiers for identifying resources. They start with ‘arn:’ and are used in many services, and in particular for IAM policies.
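As a rough illustration of the ARN format mentioned above, the sketch below splits an ARN into its colon-separated fields. It assumes the general layout `arn:partition:service:region:account-id:resource`. The helper name `parse_arn` and the example ARN are invented for illustration and are not part of any AWS SDK. Some services leave the region or account field empty.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str      # empty for global services such as IAM
    account_id: str  # empty for some resources, such as S3 buckets
    resource: str


def parse_arn(arn: str) -> Arn:
    """Split an ARN into its fields (hypothetical helper, not an AWS API)."""
    # The resource part may itself contain colons, so split at most 5 times.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return Arn(*parts[1:])


# Made-up example ARN for an IAM user.
example = parse_arn("arn:aws:iam::123456789012:user/alice")
print(example.service, example.resource)
```

Splitting with a maximum count keeps resource identifiers that contain colons intact, which matters when matching ARNs in IAM policies.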
Many services within AWS can at least be compared with Google Cloud offerings or with internal Google services. And often you could assemble the same thing yourself with open source software. This table is an effort at listing these rough correspondences. (Remember that this table is imperfect, as in almost every case there are subtle differences in features!)
|Service||AWS||Google Cloud||Google Internal||Microsoft Azure||Other providers||Open source “build your own”||OpenStack|
|Virtual server||EC2||Compute Engine (GCE)||Virtual Machine||DigitalOcean||OpenStack||Nova|
|PaaS||Elastic Beanstalk||App Engine||App Engine||Web Apps||Heroku, AppFog, OpenShift||Meteor, AppScale, Cloud Foundry, Convox|
|Serverless, microservices||Lambda, API Gateway||Functions||Function Apps||PubNub Blocks, Auth0 Webtask||Kong, Tyk||Qinling|
|Container, cluster manager||ECS, EKS, Fargate||Container Engine, Kubernetes||Borg or Omega||Container Service||Kubernetes, Mesos, Aurora||Zun|
|Object storage||S3||Cloud Storage||GFS||Storage Account||DigitalOcean Spaces||Swift, HDFS, Minio||Swift|
|Block storage||EBS||Persistent Disk||Storage Account||DigitalOcean Volumes||NFS||Cinder|
|SQL datastore||RDS||Cloud SQL||SQL Database||MySQL, PostgreSQL||Trove (stores NoSQL as well)|
|Sharded RDBMS||Cloud Spanner||F1, Spanner||Crate.io, CockroachDB|
|Key-value store, column store||DynamoDB||Cloud Datastore||Megastore||Tables, DocumentDB||Cassandra, CouchDB, RethinkDB, Redis|
|Memory cache||ElastiCache||App Engine Memcache||Redis Cache||Memcached, Redis|
|Search||CloudSearch, Elasticsearch (managed)||Search||Algolia, QBox, Elastic Cloud||Elasticsearch, Solr|
|Data warehouse||Redshift||BigQuery||Dremel||SQL Data Warehouse||Oracle, IBM, SAP, HP, many others||Greenplum|
|Business intelligence||QuickSight||Data Studio 360||Power BI||Tableau|
|Lock manager||DynamoDB (weak)||Chubby||Lease blobs in Storage Account||ZooKeeper, Etcd, Consul|
|Message broker||SQS, SNS, IoT||Pub/Sub||PubSub2||Service Bus||RabbitMQ, Kafka, 0MQ|
|Streaming, distributed log||Kinesis||Dataflow||PubSub2||Event Hubs||Kafka Streams, Apex, Flink, Spark Streaming, Storm|
|MapReduce||EMR||Dataproc||MapReduce||HDInsight, DataLake Analytics||Qubole||Hadoop|
|Metric management||Borgmon, TSDB||Application Insights||Graphite, InfluxDB, OpenTSDB, Grafana, Riemann, Prometheus|
|CDN||CloudFront||Cloud CDN||CDN||Akamai, Fastly, Cloudflare, Limelight Networks||Apache Traffic Server|
|Load balancer||CLB/ALB||Load Balancing||GFE||Load Balancer, Application Gateway||nginx, HAProxy, Apache Traffic Server|
|SES||Sendgrid, Mandrill, Postmark|
|Git hosting||CodeCommit||Cloud Source Repositories||Visual Studio Team Services||GitHub, BitBucket||GitLab|
|User authentication||Cognito||Firebase Authentication||Azure Active Directory||oauth.io|
|Mobile app analytics||Mobile Analytics||Firebase Analytics||HockeyApp||Mixpanel|
|Mobile app testing||Device Farm||Firebase Test Lab||Xamarin Test Cloud||BrowserStack, Sauce Labs, Testdroid|
|Managing SSL/TLS certificates||Certificate Manager||Let’s Encrypt, Comodo, Symantec, GlobalSign|
|Automatic speech recognition and natural language understanding||Lex||Cloud Speech API, Natural Language API||Cognitive services||AYLIEN Text Analysis API, Ambiverse Natural Language Understanding API||Stanford’s Core NLP Suite, Apache OpenNLP, Apache UIMA, spaCy|
|Text-to-speech engine in the cloud||Polly||Nuance, Vocalware, IBM||Mimic, eSpeak, MaryTTS|
|Image recognition||Rekognition||Vision API||Cognitive services||IBM Watson, Clarifai||TensorFlow, OpenCV|
|File Share and Sync||WorkDocs||Google Docs||OneDrive||Dropbox, Box, Citrix File Share||ownCloud|
|Machine Learning||SageMaker, DeepLens, ML||ML Engine, Auto ML||ML Studio||Watson ML|
Selected resources with more detail on this chart:
AWS Product Maturity and Releases
It’s important to know the maturity of each AWS product. Here is a mostly complete list of first release date, with links to the release notes. Most recently released services are first. Not all services are available in all regions; see this table.
|Service||Original release||Availability||CLI Support||HIPAA Compliant||PCI-DSS Compliant|
|?Database Migration Service||2016-03||General||✓||✓|
|Fulfillment Web Service||2008-03||Obsolete?|
|Flexible Payments Service||2007-08||Retired|
|Alexa Top Sites||2006-01||General ❗HTTP-only|
|Alexa Web Information Service||2005-10||General ❗HTTP-only|
1: Excludes use of Amazon API Gateway caching
2: RDS MySQL, Oracle, and PostgreSQL engines only
3: MySQL-compatible Aurora edition only
4: Excludes Lambda@Edge
5: Includes EC2 Systems Manager
6: Includes Elastic Block Storage (EBS)
7: Includes Elastic Load Balancing
8: Includes S3 Transfer Acceleration
9: Includes RDS MySQL, Oracle, PostgreSQL, SQL Server, and MariaDB
10: Includes Auto-Scaling
11: Streams only
12: Kubernetes uses a custom CLI for Pod/Service management called kubectl. AWS CLI only handles Kubernetes Master concerns
- Many applications have strict requirements around reliability, security, or data privacy. The AWS Compliance page has details about AWS’s certifications, which include PCI DSS Level 1, SOC 1, 2, and 3, HIPAA, and ISO 9001.
- Security in the cloud is a complex topic, based on a shared responsibility model, where some elements of compliance are provided by AWS, and some are provided by your company.
- Several third-party vendors offer assistance with compliance, security, and auditing on AWS. If you have substantial needs in these areas, assistance is a good idea.
- From inside China, AWS services outside China are generally accessible, though there are at times breakages in service. There are also AWS services inside China.
Getting Help and Support
- Forums: For many problems, it’s worth searching or asking for help in the discussion forums to see if it’s a known issue.
- Premium support: AWS offers several levels of premium support.
- The first tier, called “Developer support,” lets you file support tickets with a 12 to 24 hour turnaround time. It starts at $29 per month, but once your monthly spend reaches around $1000 it changes to a 3% surcharge on your bill.
- The higher-level support services are quite expensive — and increase your bill by up to 10%. Many large and effective companies never pay for this level of support. They are usually more helpful for midsize or larger companies needing rapid turnaround on deeper or more perplexing problems.
- Keep in mind, a flexible architecture can reduce need for support. You shouldn’t be relying on AWS to solve your problems often. For example, if you can easily re-provision a new server, it may not be urgent to solve a rare kernel-level issue unique to one EC2 instance. If your EBS volumes have recent snapshots, you may be able to restore a volume before support can rectify the issue with the old volume. If your services have an issue in one availability zone, you should in any case be able to rely on a redundant zone or migrate services to another zone.
- Larger customers also get access to AWS Enterprise support, with dedicated technical account managers (TAMs) and shorter response time SLAs.
- There is definitely some controversy about how useful the paid support is. The support staff don’t always seem to have the information and authority to solve the problems that are brought to their attention. Often your ability to have a problem solved may depend on your relationship with your account rep.
- Account manager: If you are at significant levels of spend (thousands of US dollars or more per month), you may be assigned (or may wish to ask for) a dedicated account manager.
- These are a great resource, even if you’re not paying for premium support. Build a good relationship with them and make use of them, for questions, problems, and guidance.
- Assign a single point of contact on your company’s side, to avoid confusing or overwhelming them.
- Contact: The main web contact point for AWS is here. Many technical requests can be made via these channels.
- Consulting and managed services: For more hands-on assistance, AWS has established relationships with many consulting partners and managed service partners. The big consultants won’t be cheap, but depending on your needs, may save you costs long term by helping you set up your architecture more effectively, or offering specific expertise, e.g. security. Managed service providers provide longer-term full-service management of cloud resources.
- AWS Professional Services: AWS provides consulting services alone or in combination with partners.
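The Developer support pricing described above (a $29 minimum that turns into a 3% surcharge at higher spend) can be sketched as a small calculation. The figures are the ones quoted in this guide and may be out of date, and the function name is hypothetical.

```python
def developer_support_cost(monthly_spend: float,
                           minimum: float = 29.0,
                           rate: float = 0.03) -> float:
    """Estimate the monthly Developer support fee, using the guide's figures."""
    # The fee is the larger of the flat minimum and the percentage surcharge.
    return max(minimum, rate * monthly_spend)


# The surcharge overtakes the minimum at minimum / rate, a little under $1000.
break_even = 29.0 / 0.03

for spend in (200.0, 1000.0, 5000.0):
    fee = developer_support_cost(spend)
    print(f"${spend:,.0f} monthly spend -> ${fee:,.2f} support fee")
```

Running the numbers for your own expected spend is a quick way to see when the percentage surcharge starts to matter.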
Restrictions and Other Notes
- ?Lots of resources in Amazon have limits on them. This is actually helpful, so you don’t incur large costs accidentally. You have to request that quotas be increased by opening support tickets. Some limits are easy to raise, and some are not. (Some of these are noted in sections below.) Additionally, not all service limits are published.
- Obtaining Current Limits and Usage: Limit information for a service may be available from the service API, Trusted Advisor, both, or neither (in which case you’ll need to contact Support). This page from the awslimitchecker tool’s documentation provides a nice summary of available retrieval options for each limit. The tool itself is also valuable for automating limit checks.
- ?AWS terms of service are extensive. Much is expected boilerplate, but it does contain important notes and restrictions on each service. In particular, there are restrictions against using many AWS services in safety-critical systems. (Those appreciative of legal humor may wish to review clause 57.10.)
- OpenStack is a private cloud alternative to AWS used by large companies that wish to avoid public cloud offerings.
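To make the idea of automated limit checks more concrete, here is a minimal sketch that flags limits whose usage is close to the quota. It does not use awslimitchecker or any AWS API. The limit names, numbers, and the 80% threshold are all invented for illustration, and in practice the usage and limit values would come from the sources described above.

```python
def limits_near_exhaustion(usage: dict, limits: dict, threshold: float = 0.8) -> list:
    """Return the names of limits whose usage is at or above threshold * limit."""
    flagged = []
    for name, limit in limits.items():
        used = usage.get(name, 0)
        # Skip zero or unknown limits to avoid dividing by zero.
        if limit > 0 and used / limit >= threshold:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical limits and current usage for one account.
limits = {"ec2/running-instances": 20, "vpc/vpcs-per-region": 5, "ebs/snapshots": 10000}
usage = {"ec2/running-instances": 18, "vpc/vpcs-per-region": 2, "ebs/snapshots": 4200}

for name in limits_near_exhaustion(usage, limits):
    print(f"approaching limit: {name} ({usage[name]} of {limits[name]})")
```

A check like this could run on a schedule so that a limit increase is requested before a deployment fails.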
Learning and Career Development
- Certifications: AWS offers certifications for IT professionals who want to demonstrate their knowledge.
- Certified Cloud Practitioner
- Certified Solutions Architect Associate
- Certified Developer Associate
- Certified SysOps Administrator Associate
- Certified Solutions Architect Professional
- Certified DevOps Engineer Professional
- Getting certified: If you’re interested in studying for and getting certifications, this practical overview tells you a lot of what you need to know. The official page is here and there is an FAQ.
- Do you need a certification? Especially in consulting companies or when working in key tech roles in large non-tech companies, certifications are important credentials. In others, including in many tech companies and startups, certifications are not common or considered necessary. (In fact, fairly or not, some Silicon Valley hiring managers and engineers see them as a “negative” signal on a resume.)
Managing Infrastructure State and Change
A great challenge in using AWS to build complex systems (and with DevOps in general) is to manage infrastructure state effectively over time. In general, this boils down to three broad goals for the state of your infrastructure:
- Visibility: Do you know the state of your infrastructure (what services you are using, and exactly how)? Do you also know when you — and anyone on your team — make changes? Can you detect misconfigurations, problems, and incidents with your service?
- Automation: Can you reconfigure your infrastructure to reproduce past configurations or scale up existing ones without a lot of extra manual work, or requiring knowledge that’s only in someone’s head? Can you respond to incidents easily or automatically?
- Flexibility: Can you improve your configurations and scale up in new ways without significant effort? Can you add more complexity using the same tools? Do you share, review, and improve your configurations within your team?
Much of what we discuss below is really about how to improve the answers to these questions.
There are several approaches to deploying infrastructure with AWS, from the console to complex automation tools, to third-party services, all of which attempt to help achieve visibility, automation, and flexibility.
AWS Configuration Management
The first way most people experiment with AWS is via its web interface, the AWS Console. But using the Console is a highly manual process, and often works against automation or flexibility.
So if you’re not going to manage your AWS configurations manually, what should you do? Sadly, there are no simple, universal answers — each approach has pros and cons, and the approaches taken by different companies vary widely, and include directly using APIs (and building tooling on top yourself), using command-line tools, and using third-party tools and services.
- The AWS Console lets you control much (but not all) functionality of AWS via a web interface.
- Ideally, you should only use the AWS Console in a few specific situations:
- It’s great for read-only usage. If you’re trying to understand the state of your system, logging in and browsing it is very helpful.
- It is also reasonably workable for very small systems and teams (for example, one engineer setting up one server that doesn’t change often).
- It can be useful for operations you’re only going to do rarely, like less than once a month (for example, a one-time VPC setup you probably won’t revisit for a year). In this case using the console can be the simplest approach.
- ❗Think before you use the console: The AWS Console is convenient, but also the enemy of automation, reproducibility, and team communication. If you’re likely to be making the same change multiple times, avoid the console. Favor some sort of automation, or at least have a path toward automation, as discussed next. Not only does using the console preclude automation, which wastes time later, but it prevents documentation, clarity, and standardization around processes for yourself and your team.
- The aws command-line interface (CLI), used via the aws command, is the most basic way to save and automate AWS operations.
- Don’t underestimate its power. It also has the advantage of being well-maintained — it covers a large proportion of all AWS services, and is up to date.
- In general, whenever you can, prefer the command line to the AWS Console for performing operations.
- Even in the absence of fancier tools, you can write simple Bash scripts that invoke aws with specific arguments, and check these into Git. This is a primitive but effective way to document operations you’ve performed. It improves automation, allows code review and sharing on a team, and gives others a starting point for future work.
- For use that is primarily interactive (not scripted), consider instead using the aws-shell tool from AWS. It is easier to use, with auto-completion and a colorful UI, but still works on the command line. If you’re using SAWS, a previous version of the program, you should migrate to aws-shell.
APIs and SDKs
- Retry logic: An important aspect to consider whenever using SDKs is error handling; under heavy use, a wide variety of failures, from programming errors to throttling to AWS-related outages or failures, can be expected to occur. SDKs typically implement exponential backoff to address this, but this may need to be understood and adjusted over time for some applications. For example, it is often helpful to alert on some error codes and not on others.
- ❗Don’t use APIs directly. Although AWS documentation includes lots of API details, it’s better to use the SDKs for your preferred language to access APIs. SDKs are more mature, robust, and well-maintained than something you’d write yourself.
- A good way to automate operations in a custom way is Boto3, also known as the Amazon SDK for Python. Boto2, the previous version of this library, has been in wide use for years, but now there is a newer version with official support from Amazon, so prefer Boto3 for new projects.
- Boto3 contains a variety of APIs that operate at either a high level or a low level. Here is an explanation of both:
- The low-level APIs (Client APIs) are mapped to AWS Cloud service-specific APIs, and all service operations are supported by clients. Clients are generated from a JSON service definition file.
- The high-level option, Resource APIs, lets you avoid making low-level network calls and instead provides an object-oriented way to interact with AWS Cloud services.
- Boto3 has a lot of helpful features like waiters, which provide a structure that allows code to wait for changes to occur in the cloud. For example, when you are creating an EC2 instance, you may need to wait until the instance is running in order to perform another task.
- If you find yourself writing a Bash script with more than one or two CLI commands, you’re probably doing it wrong. Stop, and consider writing a Boto script instead. This has the advantages that you can:
- Check return codes easily so success of each step depends on success of past steps.
- Grab interesting bits of data from responses, like instance ids or DNS names.
- Add useful environment information (for example, tag your instances with git revisions, or inject the latest build identifier into your initialization script).
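As an illustration of the points above, here is a minimal sketch of a Boto3 script that launches an instance, waits for it to run, and tags it with the current Git revision. The helper names (`build_tags`, `current_git_revision`, `launch_and_tag`), the default region, and the retry settings are assumptions made for this example, not recommendations from the guide. Any failed API call raises an exception, so later steps only run if earlier ones succeeded.

```python
import subprocess


def build_tags(extra):
    """Convert a plain dict into the Key/Value list that EC2 tagging calls expect."""
    return [{"Key": key, "Value": str(value)} for key, value in sorted(extra.items())]


def current_git_revision():
    """Return the short Git revision of the working directory, or 'unknown'."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def launch_and_tag(ami_id, instance_type, tags, region="us-east-1"):
    """Launch one instance, block until it is running, tag it, return (id, DNS name)."""
    # Imported here so the helpers above work even without the SDK installed.
    import boto3
    from botocore.config import Config

    # Ask the SDK to retry throttled or transient failures with backoff.
    # The values are illustrative; tune them for your workload.
    config = Config(retries={"max_attempts": 10, "mode": "standard"})
    ec2 = boto3.client("ec2", region_name=region, config=config)

    response = ec2.run_instances(
        ImageId=ami_id, InstanceType=instance_type, MinCount=1, MaxCount=1
    )
    instance_id = response["Instances"][0]["InstanceId"]

    # The waiter polls DescribeInstances until the instance reaches "running".
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    ec2.create_tags(Resources=[instance_id], Tags=build_tags(tags))

    described = ec2.describe_instances(InstanceIds=[instance_id])
    instance = described["Reservations"][0]["Instances"][0]
    return instance_id, instance.get("PublicDnsName", "")
```

A call might look like `launch_and_tag("ami-...", "t3.micro", {"git-revision": current_git_revision()})`, where the AMI id is a placeholder you would fill in yourself.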
- Tagging resources is an essential practice, especially as organizations grow, to better understand your resource usage. For example, through automation or convention, you can add tags:
- For the org or developer that “owns” that resource
- For the product that resource supports
- To label lifecycles, such as temporary resources or ones that should be deprovisioned in the future
- To distinguish production-critical infrastructure (e.g. serving systems vs backend pipelines)
- To distinguish resources with special security or compliance requirements
- To (once enabled) allocate cost. Note that cost allocation tags only apply on a forward-looking basis; you can’t retroactively apply them to items already billed.
- For many years, there was a notorious 10 tag limit per resource, which could not be raised and caused many companies significant pain. As of 2016, this was raised to 50 tags per resource.
- In 2017, AWS introduced the ability to enforce tagging on instance and volume creation, deprecating portions of third-party tools such as Cloud Custodian.
- Tags are case sensitive; ‘environment’ and ‘Environment’ are two different tags. Automation in setting tags is likely the only sensible option at significant scale.
- There is a bug in the ASG console where spaces after tag names are preserved. So if you type “Name ” with a space at the end you will not get the expected behavior. This is probably true in other locations and SDKs also. Be sure you do not add trailing spaces to tag keys unless you really mean it. (As of Jul 2018)
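Because tag keys are case sensitive and stray whitespace is preserved, it can help to validate tags in automation before applying them. The following is a rough sketch under stated assumptions: the `check_tags` function and the required keys (`Owner`, `Product`, `Environment`) are hypothetical conventions chosen for illustration, and the 50-tag limit is the figure cited above.

```python
MAX_TAGS_PER_RESOURCE = 50  # limit cited in this guide (raised from 10 in 2016)


def check_tags(tags, required_keys=("Owner", "Product", "Environment")):
    """Return a list of human-readable problems found in a dict of tags."""
    problems = []
    if len(tags) > MAX_TAGS_PER_RESOURCE:
        problems.append(
            f"{len(tags)} tags exceeds the limit of {MAX_TAGS_PER_RESOURCE}"
        )

    seen = {}  # case-folded, stripped key -> first original key seen
    for key in tags:
        if key != key.strip():
            problems.append(f"tag key {key!r} has leading or trailing whitespace")
        folded = key.strip().lower()
        if folded in seen:
            problems.append(
                f"tag keys {seen[folded]!r} and {key!r} differ only by case or spacing"
            )
        else:
            seen[folded] = key

    # Required keys are matched exactly, since AWS treats keys as case sensitive.
    for required in required_keys:
        if required not in tags:
            problems.append(f"missing required tag {required!r}")
    return problems
```

Running a check like this in a deployment script or CI job catches the ‘environment’ vs ‘Environment’ and “Name ” mistakes before they reach AWS.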
Managing Servers and Applications
AWS vs Server Configuration
This guide is about AWS, not DevOps or server configuration management in general. But before getting into AWS in detail, it’s worth noting that in addition to the configuration management for your AWS resources, there is the long-standing problem of configuration management for servers themselves.
- Heroku’s Twelve-Factor App principles list some established general best practices for deploying applications.
- Pets vs cattle: Treat servers like cattle, not pets. That is, design systems so infrastructure is disposable. It should be minimally worrisome if a server is unexpectedly destroyed.
- The concept of immutable infrastructure is an extension of this idea.
- Minimize application state on EC2 instances. In general, instances should be able to be killed or die unexpectedly with minimal impact. State that is in your application should quickly move to RDS, S3, DynamoDB, EFS, or other data stores not on that instance. EBS is also an option, though it generally should not be the bootable volume, and EBS will require manual or automated re-mounting.
Server Configuration Management
- There is a large set of open source tools for managing configuration of server instances.
- These are generally not dependent on any particular cloud infrastructure, and work with any variety of Linux (or in many cases, a variety of operating systems).
- Leading configuration management tools are Puppet, Chef, Ansible, and Saltstack. These aren’t the focus of this guide, but we may mention them as they relate to AWS.
Containers and AWS
- Docker and the containerization trend are changing the way many servers and services are deployed in general.
- Containers are designed as a way to package up your application(s) and all of their dependencies in a known way. When you build a container, you are including every library or binary your application needs, outside of the kernel. A big advantage of this approach is that it’s easy to test and validate a container locally without worrying about some difference between your computer and the servers you deploy on.
- A consequence of this is that you need fewer AMIs and boot scripts; for most deployments, the only boot script you need is a template that fetches an exported docker image and runs it.
- Companies that are embracing microservice architectures will often turn to container-based deployments.
- AWS launched ECS as a service to manage clusters via Docker in late 2014, though many people still deploy Docker directly themselves. See the ECS section for more details.
- AWS launched EKS as a service to manage Kubernetes Clusters mid 2018, though many people still deploy ECS or use Docker directly themselves. See the EKS section for more details.
- Store and track instance metadata (such as instance id, availability zone, etc.) and deployment info (application build id, Git revision, etc.) in your logs or reports. The instance metadata service can help collect some of the AWS data you’ll need.
- Use log management services: Be sure to set up a way to view and manage logs externally from servers.
- Cloud-based services such as Sumo Logic, Splunk Cloud, Scalyr, LogDNA, and Loggly are the easiest to set up and use (and also the most expensive, which may be a factor depending on how much log data you have).
- Major open source alternatives include Elasticsearch, Logstash, and Kibana (the “Elastic Stack”) and Graylog.
- If you can afford it (you have little data or lots of money) and don’t have special needs, it makes sense to use hosted services whenever possible, since setting up your own scalable log processing systems is notoriously time consuming.
- Track and graph metrics: The AWS Console can show you simple graphs from CloudWatch, but you will typically want to track and graph many kinds of metrics, from CloudWatch and from your applications. Collect and export helpful metrics everywhere you can (as long as the volume is manageable enough that you can afford it).
- Services like Librato, KeenIO, and Datadog have fancier features or better user interfaces that can save a lot of time. (A more detailed comparison is here.)
- Use Prometheus or Graphite as timeseries databases for your metrics (both are open source).
- Grafana can visualize with dashboards the stored metrics of both timeseries databases (also open source).
Tips for Managing Servers
- ❗Timezone settings on servers: unless absolutely necessary, always set the timezone on servers to UTC (see instructions for your distribution, such as Ubuntu, CentOS or Amazon Linux). Numerous distributed systems rely on time for synchronization and coordination, and UTC provides the universal reference plane: it is not subject to daylight saving changes and adjustments in local time. It will also save you a lot of headaches debugging elusive timezone issues and provide a coherent timeline of events in your logging and audit systems.
- NTP and accurate time: If you are not using Amazon Linux (which comes preconfigured), you should confirm your servers configure NTP correctly, to avoid insidious time drift (which can then cause all sorts of issues, from breaking API calls to misleading logs). This should be part of your automatic configuration for every server. If time has already drifted substantially (generally >1000 seconds), remember NTP won’t shift it back, so you may need to remediate manually (for example, like this on Ubuntu).
- Testing immutable infrastructure: If you want to be proactive about testing your service’s ability to cope with instance termination or failure, it can be helpful to introduce random instance termination during business hours, which will expose any such issues at a time when engineers are available to identify and fix them. Netflix’s Simian Army (specifically, Chaos Monkey) is a popular tool for this. Alternatively, chaos-lambda by the BBC is a lightweight option which runs on AWS Lambda.
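To make the advice about UTC timestamps and deployment info in logs concrete, here is a small sketch using Python's standard `logging` module. The `make_logger` function, the field names, and the log format are assumptions for illustration; the instance id would typically come from the instance metadata service, which this sketch does not call.

```python
import logging
import time


def make_logger(name, build_id, git_revision, instance_id="unknown", stream=None):
    """Return a logger whose records carry UTC timestamps and deployment info."""
    formatter = logging.Formatter(
        fmt="%(asctime)sZ %(levelname)s instance=%(instance_id)s "
            "build=%(build_id)s rev=%(git_revision)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
    )
    # Format timestamps in UTC regardless of the server's timezone setting.
    formatter.converter = time.gmtime

    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    handler.setFormatter(formatter)

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False

    # LoggerAdapter injects the same extra fields into every record.
    extra = {
        "instance_id": instance_id,
        "build_id": build_id,
        "git_revision": git_revision,
    }
    return logging.LoggerAdapter(logger, extra)
```

With every line carrying the instance id, build id, and revision, logs shipped to an external log service can still be traced back to a specific deployment after the instance is gone.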
Security and IAM
We cover security basics first, since configuring user accounts is something you usually have to do early on when setting up your system.
Security and IAM Basics
- IAM Homepage ∙ User guide ∙ FAQ
- The AWS Security Blog is one of the best sources of news and information on AWS security.
- IAM is the service you use to manage accounts and permissioning for AWS.
- Managing security and access control with AWS is critical, so every AWS administrator needs to use and understand IAM, at least at a basic level.
- IAM identities include users (people or services that are using AWS), groups (containers for sets of users and their permissions), and roles (containers for permissions assigned to AWS service instances). Permissions for these identities are governed by policies. You can use AWS pre-defined policies or custom policies that you create.
- IAM manages various kinds of authentication, for both users and for software services that may need to authenticate with AWS, including:
- Passwords to log into the console. These are a username and password for real users.
- Access keys, which you may use with command-line tools. These are two strings: one is the “id”, which is an upper-case alphanumeric string of the form ‘AXXXXXXXXXXXXXXXXXXX’, and the other is the secret, which is a 40-character mixed-case base64-style string. These are often set up for services, not just users.
- Access keys that start with AKIA are normal keys. Access keys that start with ASIA are session/temporary keys from STS, and will require an additional “SessionToken” parameter to be sent along with the id and secret. See the documentation for a complete list of access key prefixes.
- Multi-factor authentication (MFA), which is the highly recommended practice of using a keychain fob or smartphone app as a second layer of protection for user authentication.
- IAM allows complex and fine-grained control of permissions, dividing users into groups, assigning permissions to roles, and so on. There is a policy language that can be used to customize security policies in a fine-grained way.
- At the beginning, IAM policy may be very simple, but for large systems, it will grow in complexity, and need to be managed with care.
- Make sure one person (perhaps with a backup) in your organization is formally assigned ownership of managing IAM policies, and make sure every administrator works with that person to have changes reviewed. This goes a long way to avoiding accidental and serious misconfigurations.
- It is best to give each user or service the minimum privileges needed to perform their duties. This is the principle of least privilege, one of the foundations of good security. Organize all IAM users and groups according to levels of access they need.
- IAM has the permission hierarchy of:
- Explicit deny: The most restrictive policy wins.
- Explicit allow: Access permissions to any resource have to be explicitly given.
- Implicit deny: All permissions are implicitly denied by default.
- You can test policy permissions via the AWS IAM policy simulator tool. This is particularly useful if you write custom policies.
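The permission hierarchy described above (explicit deny, then explicit allow, then implicit deny) can be sketched as a small evaluator. This is a deliberately simplified model for building intuition, not how AWS implements evaluation: it ignores conditions, principals, and other policy types, matches case-sensitively, and uses shell-style wildcards via `fnmatchcase`. The `evaluate` function and the example ARNs are made up for illustration.

```python
from fnmatch import fnmatchcase


def _matches(patterns, value):
    """True if value matches any pattern; '*' wildcards are allowed."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(fnmatchcase(value, pattern) for pattern in patterns)


def evaluate(statements, action, resource):
    """Return 'Allow' or 'Deny' for one request against a list of statements."""
    allowed = False
    for statement in statements:
        if not _matches(statement["Action"], action):
            continue
        if not _matches(statement["Resource"], resource):
            continue
        if statement["Effect"] == "Deny":
            return "Deny"   # explicit deny always wins
        if statement["Effect"] == "Allow":
            allowed = True  # remember it, but keep looking for a deny
    # No matching statement at all means implicit deny.
    return "Allow" if allowed else "Deny"


statements = [
    {"Effect": "Allow", "Action": "s3:*",
     "Resource": "arn:aws:s3:::example-bucket/*"},
    {"Effect": "Deny", "Action": "s3:DeleteObject",
     "Resource": "arn:aws:s3:::example-bucket/logs/*"},
]
```

With these statements, reading an object under `logs/` is allowed, deleting it is denied by the explicit deny, and any EC2 action is denied implicitly because nothing allows it. For real policies, use the policy simulator rather than a model like this.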
Security and IAM Tips
- Use IAM to create individual user accounts and use IAM accounts for all users from the beginning. This is slightly more work, but not that much.
- That way, you define different users, and groups with different levels of privilege (if you want, choose from Amazon’s default suggestions, of administrator, power user, etc.).
- This allows credential revocation, which is critical in some situations. If an employee leaves, or a key is compromised, you can revoke credentials with little effort.
- You can set up Active Directory federation to use organizational accounts in AWS.
- ❗Enable MFA on your account.
- You should always use MFA, and the sooner the better — enabling it when you already have many users is extra work.
- Unfortunately it can’t be enforced in software, so an administrative policy has to be established.
- Most users can use the Google Authenticator app (on iOS or Android) to support two-factor authentication. For the root account, consider a hardware fob.
- ❗Restrict use of significant IAM credentials as much as possible. Remember that in the cloud, loss of a highly capable IAM credential could essentially mean “game over,” for your deployment, your users, or your whole company.
- Do NOT use the Root User account other than when you initially create your account. Create custom IAM users and/or roles and use those for your applications instead.
- Lock up access and use of the root credentials as much as possible. Ideally they should be effectively “offline.” For critical deployments, this means attached to an actual MFA device, physically secured and rarely used.
- ❗Turn on CloudTrail: One of the first things you should do is enable CloudTrail. Even if you are not a security hawk, there is little reason not to do this from the beginning, so you have data on what has been happening in your AWS account should you need that information. You’ll likely also want to set up a log management service to search and access these logs.
- Use IAM roles for EC2: Rather than assigning IAM users to applications and services and then sharing the sensitive credentials, define and assign roles to EC2 instances and have applications retrieve credentials from the instance metadata.
- Assign IAM roles by realm — for example, to development, staging, and production. If you’re setting up a role, it should be tied to a specific realm so you have clean separation. This prevents, for example, a development instance from connecting to a production database.
- Best practices: AWS’ list of best practices is worth reading in full up front.
- IAM Reference: This interactive reference for all IAM actions, effects, and resources is great to have open while writing new or trying to understand existing IAM policies.
- Multiple accounts: Decide on whether you want to use multiple AWS accounts and research how to organize access across them. Factors to consider:
- Number of users
- Importance of isolation
- Resource Limits
- Permission granularity
- API Limits
- Regulatory issues
- Size of infrastructure
- Cost of multi-account “overhead”: Internal AWS service management tools may need to be custom built or adapted.
- It can help to use separate AWS accounts for independent parts of your infrastructure if you expect a high rate of AWS API calls, since AWS throttles calls at the AWS account level.
- Inspector is an automated security assessment service from AWS that helps identify common security risks. This allows validation that you adhere to certain security practices and may help with compliance.
- Trusted Advisor addresses a variety of best practices, but also offers some basic security checks around IAM usage, security group configurations, and MFA. At paid support tiers, Trusted Advisor exposes additional checks around other areas, such as reserved instance optimization.
- Use KMS for managing keys: AWS offers KMS for securely managing encryption keys, which is usually a far better option than handling key security yourself. See below.
- AWS WAF is a web application firewall to help you protect your applications from common attack patterns.
- Security auditing:
- Security Monkey is an open source tool that is designed to assist with security audits.
- Scout2 is an open source tool that uses AWS APIs to assess an environment’s security posture. Scout2 is stable and actively maintained.
- Export and audit security settings: You can audit security policies simply by exporting settings using AWS APIs, e.g. using a Boto script like SecConfig.py (from this 2013 talk) and then reviewing and monitoring changes manually or automatically.
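As one small example of auditing security settings through the API, the sketch below lists IAM users that have no MFA device registered. The `users_without_mfa` function is a hypothetical helper written for this guide; it assumes it is given an object that behaves like `boto3.client("iam")`, and it does not distinguish console users from service-only users.

```python
def users_without_mfa(iam):
    """Return sorted user names that have no MFA device registered.

    `iam` is expected to behave like boto3.client("iam").
    """
    missing = []
    # list_users is paginated, so walk every page rather than one response.
    paginator = iam.get_paginator("list_users")
    for page in paginator.paginate():
        for user in page["Users"]:
            name = user["UserName"]
            devices = iam.list_mfa_devices(UserName=name)["MFADevices"]
            if not devices:
                missing.append(name)
    return sorted(missing)
```

In practice you would call `users_without_mfa(boto3.client("iam"))` from a scheduled job and alert on a non-empty result. Passing the client in as a parameter also makes the function easy to test with a stand-in object.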
Security and IAM Gotchas and Limitations
- ❗Don’t share user credentials: It’s remarkably common for first-time AWS users to create one account and one set of credentials (access key or password), and then use them for a while, sharing among engineers and others within a company. This is easy. But don’t do this. This is an insecure practice for many reasons, but in particular, if you do, you will have reduced ability to revoke credentials on a per-user or per-service basis (for example, if an employee leaves or a key is compromised), which can lead to serious complications.
- ❗Instance metadata throttling: The instance metadata service has rate limiting on API calls. If you deploy IAM roles widely (as you should!) and have lots of services, you may hit global account limits easily.
- One solution is to have code or scripts cache and reuse the credentials locally for a short period (say 2 minutes). For example, they can be put into the ~/.aws/credentials file but must also be refreshed automatically.
- But be careful not to cache credentials for too long, as they expire. (Note the other dynamic metadata also changes over time and should not be cached a long time, either.)
- Some IAM operations are slower than other API calls (many seconds), since AWS needs to propagate these globally across regions.
- ❗The uptime of IAM’s API has historically been lower than that of the instance metadata API. Be wary of incorporating a dependency on IAM’s API into critical paths or subsystems — for example, if you validate a user’s IAM group membership when they log into an instance and aren’t careful about precaching group membership or maintaining a back door, you might end up locking users out altogether when the API isn’t available.
- ❗Don’t check in AWS credentials or secrets to a git repository. There are bots that scan GitHub looking for credentials. Use scripts or tools, such as git-secrets to prevent anyone on your team from checking in sensitive information to your git repositories.
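The caching approach suggested above for instance metadata throttling can be sketched as a small time-based cache. This is an illustration only: `CachedCredentials`, its `fetch` callback, and the 120-second default are assumptions for the example. The AWS SDKs generally cache and refresh instance role credentials for you, so something like this is mainly relevant to custom scripts that query the metadata service directly.

```python
import time


class CachedCredentials:
    """Cache the result of `fetch` and refresh it after `ttl_seconds`."""

    def __init__(self, fetch, ttl_seconds=120, clock=time.monotonic):
        self._fetch = fetch          # caller-supplied function that gets credentials
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._value = None
        self._fetched_at = None

    def get(self):
        """Return cached credentials, refetching only when the TTL has passed."""
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            self._value = self._fetch()  # one call to the metadata service
            self._fetched_at = now
        return self._value
```

A short TTL keeps the number of metadata calls low while staying well inside the lifetime of temporary credentials. A more careful version would also refresh ahead of the expiration time reported with the credentials.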
S3
- Homepage ∙ Developer guide ∙ FAQ ∙ Pricing
- S3 (Simple Storage Service) is AWS’ standard cloud storage service, offering file (opaque “blob”) storage of arbitrary numbers of files of almost any size, from 0 to 5TB. (Prior to 2011 the maximum size was 5 GB; larger sizes are now well supported via multipart support.)
- Items, or objects, are placed into named buckets and stored under names that are usually called keys. The main content of an object is its value.
- Objects are created, deleted, or updated. Large objects can be streamed, but you cannot modify parts of a value; you need to update the whole object. Partial data access can work via S3 Select.
- Every object also has metadata, which includes arbitrary key-value pairs, and is used in a way similar to HTTP headers. Some metadata is system-defined, some is significant when serving HTTP content from buckets or CloudFront, and you can also define arbitrary metadata for your own use.
- S3 URIs: Although often bucket and key names are provided in APIs individually, it’s also common practice to write an S3 location in the form ‘s3://bucket-name/path/to/key’ (where the key here is ‘path/to/key’). (You’ll also see ‘s3n://’ and ‘s3a://’ prefixes in Hadoop systems.)
- S3 vs Glacier, EBS, and EFS: AWS offers many storage services, and several besides S3 offer file-type abstractions. Glacier is for cheaper and infrequently accessed archival storage. EBS, unlike S3, allows random access to file contents via a traditional filesystem, but can only be attached to one EC2 instance at a time. EFS is a network filesystem many instances can connect to, but at higher cost. See the comparison table.
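Since APIs usually want the bucket and key separately while people tend to write ‘s3://bucket-name/path/to/key’, a small parsing helper is often handy. The sketch below is one possible version; the `parse_s3_uri` name is an assumption, and it splits the string by hand rather than with a URL parser so that keys containing characters like ‘?’ or ‘#’ are kept intact.

```python
def parse_s3_uri(uri):
    """Split 's3://bucket-name/path/to/key' into (bucket, key)."""
    scheme, separator, rest = uri.partition("://")
    # Also accept the Hadoop-style prefixes mentioned above.
    if not separator or scheme not in ("s3", "s3n", "s3a"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = rest.partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return bucket, key
```

For example, `parse_s3_uri("s3://bucket-name/path/to/key")` gives the bucket `bucket-name` and the key `path/to/key`, which can then be passed as separate arguments to an SDK call.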
- For most practical purposes, you can consider S3 capacity unlimited, both in total size of files and number of objects. The number of objects in a bucket is essentially also unlimited. Customers routinely have millions of objects.
- If you’re storing business data on Amazon S3, it’s important to manage permissions sensibly. In 2017 companies like Dow Jones and Verizon saw data breaches due to poorly-chosen S3 configuration for sensitive data. Fixing this later can be a difficult task if you have a lot of assets and internal users.
- There are 3 different ways to grant permissions to access Amazon S3 content in your buckets.
- IAM policies use the familiar Identity and Access Management permission scheme to control access to specific operations.
- Bucket policies grant or deny permissions to an entire bucket. You might use this when hosting a website in S3, to make the bucket publicly readable, or to restrict access to a bucket by IP address. Amazon’s sample bucket policies show a number of use cases where these policies come in handy.
- Access Control Lists (ACLs) can also be applied to every bucket and object stored in S3. ACLs grant additional permissions beyond those specified in IAM or bucket policies. ACLs can be used to grant access to another AWS user, or to predefined groups like the general public. This is powerful but can be dangerous, because you need to inspect every object to see who has access.
- AWS’ predefined access control groups allow access that may not be what you’d expect from their names:
- “All Users”, or “Everyone”, grants permission to the general public, not only to users defined in your own AWS account. If an object is available to All Users, then it can be retrieved with a simple HTTP request of the form http://s3.amazonaws.com/bucket-name/filename. No authorization or signature is required to access data in this category.
- “Authenticated Users” grants permissions to anyone with an AWS account, again not limited to your own users. Because anyone can sign up for AWS, for all intents and purposes this is also open to the general public.
- “Log Delivery” group is used by AWS to write logs to buckets and should be safe to enable on the buckets that need it.
- A typical use case for this ACL is in conjunction with the requester pays functionality of S3.
- ❗ Bucket permissions and object permissions are two different things and independent of each other. A private object in a public bucket can be seen when listing the bucket, but not downloaded. At the same time, a public object in a private bucket won’t be seen because the bucket contents can’t be listed, but can still be downloaded by anyone who knows its exact key. Users that don’t have access to set bucket permissions can still make objects public if they have s3:PutObjectAcl or s3:PutObjectVersionAcl permissions.
- In August 2017, AWS added AWS Config rules to ensure your S3 buckets are secure.
- ❗These AWS Config rules only check the security of your bucket policy and bucket-level ACLs. You can still create object ACLs that grant additional permissions, including opening files to the whole world.
- Do create new buckets if you have different types of data with different sensitivity levels. This is much less error prone than complex permissions rules. For example, if data is for administrators only, like log data, put it in a new bucket that only administrators can access.
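Because object ACLs can open data to the public independently of bucket settings, it can be useful to scan ACLs for grants to the predefined “All Users” and “Authenticated Users” groups. The sketch below inspects a dict shaped like the response of the S3 `get_bucket_acl` or `get_object_acl` calls; the `public_grants` helper is hypothetical, and it only looks at ACLs, not at bucket or IAM policies.

```python
# Group URIs that S3 uses for its predefined public-facing groups.
PUBLIC_GROUPS = {
    "http://acs.amazonaws.com/groups/global/AllUsers": "All Users",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers": "Authenticated Users",
}


def public_grants(acl):
    """Return (group name, permission) pairs that open the resource to outsiders.

    `acl` is a dict shaped like the response of get_bucket_acl or get_object_acl.
    """
    found = []
    for grant in acl.get("Grants", []):
        grantee = grant.get("Grantee", {})
        if grantee.get("Type") != "Group":
            continue  # grants to specific accounts are not "public"
        name = PUBLIC_GROUPS.get(grantee.get("URI"))
        if name:
            found.append((name, grant["Permission"]))
    return found
```

Grants to the Log Delivery group are intentionally not reported, in line with the note above that it should be safe on buckets that need it. Checking every object this way is slow for large buckets, which is one more reason to separate sensitive data into its own bucket.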
- For more guidance, see:
- Bucket naming: Buckets are chosen from a global namespace (across all regions, even though S3 itself stores data in whichever S3 region you select), so you’ll find many bucket names are already taken. Creating a bucket means taking ownership of the name until you delete it. Bucket names have a few restrictions on them.
- Bucket names can be used as part of the hostname when accessing the bucket or its contents, like <bucket_name>.s3-us-east-1.amazonaws.com, as long as the name is DNS compliant.
- A common practice is to use the company name acronym or abbreviation to prefix (or suffix, if you prefer DNS-style hierarchy) all bucket names (but please, don’t use a check on this as a security measure — this is highly insecure and easily circumvented!).
- Bucket names with ‘.’ (periods) in them can cause certificate mismatches when used with SSL. Use ‘-’ instead, since this both conforms with SSL expectations and is DNS compliant.
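A quick pre-flight check on bucket names can catch naming problems before you try to create a bucket. The sketch below encodes a conservative subset of the naming rules (length, allowed characters, no IP-address lookalikes) plus the advice above about periods. The `bucket_name_problems` function is made up for this example, and the authoritative rules are in the S3 documentation.

```python
import re

# A DNS-style label: lowercase letters, digits, hyphens; alphanumeric at both ends.
_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")
_IP_LIKE = re.compile(r"^\d+\.\d+\.\d+\.\d+$")


def bucket_name_problems(name):
    """Return a list of reasons the name is risky or invalid (empty if none found)."""
    problems = []
    if not 3 <= len(name) <= 63:
        problems.append("length must be between 3 and 63 characters")
    if _IP_LIKE.match(name):
        problems.append("must not be formatted like an IP address")
    if not all(_LABEL.match(label) for label in name.split(".")):
        problems.append(
            "each dot-separated label must use lowercase letters, digits, "
            "and hyphens, and start and end with a letter or digit"
        )
    if "." in name:
        # Not invalid, but flagged because of the SSL issue described above.
        problems.append("contains '.', which can cause certificate mismatches over SSL")
    return problems
```

A name like `acme-logs-production` passes cleanly, while `logs.acme.example` is flagged for its periods even though it is technically valid.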
- Versioning: S3 has optional versioning support, so that all versions of objects are preserved on a bucket. This is mostly useful if you want an archive of changes or the ability to back out mistakes (caution: it lacks the featureset of full version control systems like Git).
- Durability: Durability of S3 is extremely high, since internally it keeps several replicas. If you don’t delete it by accident, you can count on S3 not losing your data. (AWS offers the seemingly improbable durability rate of 99.999999999%, but this is a mathematical calculation based on independent failure rates and levels of replication — not a true probability estimate. Either way, S3 has had a very good record of durability.) Note this is much higher durability than EBS!
- S3 pricing depends on storage, requests, and transfer.
- For transfer, putting data into AWS is free, but you’ll pay on the way out. Transfer from S3 to EC2 in the same region is free. Transfer to other regions or the Internet in general is not free.
- Deletes are free.
- S3 Reduced Redundancy and Infrequent Access: Most people use the Standard storage class in S3, but there are other storage classes with lower cost:
- Reduced Redundancy Storage (RRS) has been effectively deprecated, and has lower durability (99.99%, so just four nines) than standard S3. Note that it no longer participates in S3 price reductions, so it offers worse redundancy for more money than standard S3. As a result, there’s no reason to use it.
- Infrequent Access (IA) lets you get cheaper storage in exchange for more expensive access. This is great for archives like logs you already processed, but might want to look at later. To get an idea of the cost savings when using Infrequent Access (IA), you can use this S3 Infrequent Access Calculator.
- Glacier is a third alternative discussed as a separate product.
- See the comparison table.
- ⏱Performance: Maximizing S3 performance means improving overall throughput in terms of bandwidth and number of operations per second.
- S3 is highly scalable, so in principle you can get arbitrarily high throughput. (A good example of this is S3DistCp.)
- But usually you are constrained by the pipe between the source and S3 and/or the level of concurrency of operations.
- Throughput is of course highest from within AWS to S3, and between EC2 instances and S3 buckets that are in the same region.
- Bandwidth from EC2 depends on instance type. See the “Network Performance” column at ec2instances.info.
- Throughput of many objects is extremely high when data is accessed in a distributed way, from many EC2 instances. It’s possible to read or write objects from S3 from hundreds or thousands of instances at once.
- However, throughput is very limited when objects are accessed sequentially from a single instance. Individual operations take many milliseconds, and bandwidth to and from instances is limited.
- Therefore, to perform large numbers of operations, it’s necessary to use multiple worker threads and connections on individual instances, and for larger jobs, multiple EC2 instances as well.
- Multi-part uploads: For large objects you want to take advantage of the multi-part uploading capabilities (starting with minimum chunk sizes of 5 MB).
- Large downloads: Also you can download chunks of a single large object in parallel by exploiting the HTTP GET range-header capability.
- ?List pagination: Listing contents happens at 1000 responses per request, so for buckets with many millions of objects listings will take time.
- ❗Key prefixes: Previously randomness in the beginning of key names was necessary in order to avoid hot spots, but that is no longer necessary as of July, 2018.
- For data outside AWS, DirectConnect and S3 Transfer Acceleration can help. For S3 Transfer Acceleration, you pay about the equivalent of 1-2 months of storage for the transfer in either direction for using nearer endpoints.
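The parallel ranged-download idea from the performance tips above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned downloader: `byte_ranges`, `download_in_parallel`, and the caller-supplied `fetch_range` callable are hypothetical names, and the 8 MB default chunk size is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(object_size, chunk_size=8 * 1024 * 1024):
    """Return HTTP Range header values that together cover object_size bytes."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size > 0")
    ranges = []
    start = 0
    while start < object_size:
        # HTTP byte ranges are inclusive on both ends.
        end = min(start + chunk_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges


def download_in_parallel(fetch_range, object_size,
                         chunk_size=8 * 1024 * 1024, workers=8):
    """Fetch every range concurrently and reassemble the object.

    fetch_range is a hypothetical callable supplied by the caller: it takes
    one Range header value (e.g. "bytes=0-1023") and returns those bytes.
    """
    headers = byte_ranges(object_size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so parts join correctly.
        parts = list(pool.map(fetch_range, headers))
    return b"".join(parts)
```

With boto3, `fetch_range` could plausibly wrap `get_object(Bucket=..., Key=..., Range=header)["Body"].read()`; the object size would come from a prior `head_object` call. The same range arithmetic applies when planning multi-part upload chunks.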
- Command-line applications: There are a few ways to use S3 from the command line:
- Originally, s3cmd was the best tool for the job. It’s still used heavily by many.
- The regular aws command-line interface now supports S3 well, and is useful for most situations.
- s4cmd is a replacement, with greater emphasis on performance via multi-threading, which is helpful for large files and large sets of files, and also offers Unix-like globbing support.
- GUI applications: You may prefer a GUI, or wish to support GUI access for less technical users. Some options:
- The AWS Console does offer a graphical way to use S3. Use caution telling non-technical people to use it, however, since without tight permissions, it offers access to many other AWS features.
- Transmit is a good option on macOS for most use cases.
- Cyberduck is a good option on macOS and Windows with support for multipart uploads, ACLs, versioning, lifecycle configuration, storage classes and server side encryption (SSE-S3 and SSE-KMS).
- S3 and CloudFront: S3 is tightly integrated with the CloudFront CDN. See the CloudFront section for more information, as well as S3 transfer acceleration.
- Static website hosting:
- S3 has a static website hosting option that is simply a setting that enables configurable HTTP index and error pages and HTTP redirect support to public content in S3. It’s a simple way to host static assets or a fully static website.
- Consider using CloudFront in front of most or all assets:
- Like any CDN, CloudFront improves performance significantly.
- ?SSL is only supported on the built-in amazonaws.com domain for S3. S3 supports serving these sites through a custom domain, but not over SSL on a custom domain. However, CloudFront allows you to serve a custom domain over https. Amazon provides free SNI SSL/TLS certificates via Amazon Certificate Manager. SNI does not work on very outdated browsers/operating systems. Alternatively, you can provide your own certificate to use on CloudFront to support all browsers/operating systems for a fee.
- ?If you are including resources across domains, such as fonts inside CSS files, you may need to configure CORS for the bucket serving those resources.
- Since pretty much everything is moving to SSL nowadays, and you likely want control over the domain, you probably want to set up CloudFront with your own certificate in front of S3 (and to ignore the AWS example on this as it is non-SSL only).
- That said, if you do, you’ll need to think through invalidation or updates on CloudFront. You may wish to include versions or hashes in filenames so invalidation is not necessary.
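One way to put hashes in filenames, as suggested above for avoiding CloudFront invalidations, is to derive the object key from the file's content. The sketch below is an assumption about how you might do it; `versioned_key` and the 12-character digest length are hypothetical choices, not anything S3 or CloudFront requires.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_key(key, content, digest_len=12):
    """Insert a short content hash before the file extension of an S3 key.

    Any change to the content yields a new key, so cached copies of the old
    key never need to be invalidated.
    """
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    path = PurePosixPath(key)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


if __name__ == "__main__":
    # Prints something like css/site.<12 hex chars>.css
    print(versioned_key("css/site.css", b"body { margin: 0; }"))
```

The HTML that references these assets then has to be generated with the hashed names, which is why this is usually done in a build step.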
- Data lifecycles:
- When managing data, understanding the lifecycle of the data is as important as understanding the data itself. When putting data into a bucket, think about its lifecycle: its end of life, not just its beginning.
- ?In general, data with different expiration policies should be stored under separate prefixes at the top level. For example, some voluminous logs might need to be deleted automatically monthly, while other data is critical and should never be deleted. Having the former in a separate bucket or at least a separate folder is wise.
- ?Thinking about this up front will save you pain. It’s very hard to clean up large collections of files created by many engineers with varying lifecycles and no coherent organization.
- Alternatively you can set a lifecycle policy to archive old data to Glacier. Be careful with archiving large numbers of small objects to Glacier, since it may actually cost more.
- There is also a storage class called Infrequent Access that has the same durability as Standard S3, but is discounted per GB. It is suitable for objects that are infrequently accessed.
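To make the lifecycle advice above concrete, here is a sketch of a lifecycle configuration expressed as the dictionary that boto3's `put_bucket_lifecycle_configuration` accepts. The prefixes (`logs/`, `archive/`), rule IDs, and day counts are illustrative assumptions; the third rule also cleans up incomplete multi-part uploads, which otherwise keep accruing storage charges.

```python
import json


def lifecycle_configuration(log_prefix="logs/", expire_days=30,
                            archive_prefix="archive/", glacier_after_days=90,
                            abort_upload_days=7):
    """Build a lifecycle configuration with one rule per data lifecycle."""
    return {
        "Rules": [
            {   # Voluminous logs: delete automatically after expire_days.
                "ID": "expire-logs",
                "Filter": {"Prefix": log_prefix},
                "Status": "Enabled",
                "Expiration": {"Days": expire_days},
            },
            {   # Long-lived data: move to Glacier after glacier_after_days.
                "ID": "archive-to-glacier",
                "Filter": {"Prefix": archive_prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"}
                ],
            },
            {   # Whole bucket: abort multi-part uploads that never finished.
                "ID": "abort-incomplete-uploads",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": abort_upload_days
                },
            },
        ]
    }


if __name__ == "__main__":
    # To apply it (assumes boto3 and credentials are available):
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="my-bucket", LifecycleConfiguration=lifecycle_configuration())
    print(json.dumps(lifecycle_configuration(), indent=2))
```

Keeping each lifecycle under its own top-level prefix is what makes rules like these easy to write; mixed prefixes would need tag-based filters or manual cleanup.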
- Data consistency: Understanding data consistency is critical for any use of S3 where there are multiple producers and consumers of data.
- Creation and updates to individual objects in S3 are atomic, in that you’ll never upload a new object or change an object and have another client see only part of the change.
- The uncertainty lies with when your clients and other clients see updates.
- New objects: If you create a new object, you’ll be able to read it instantly, which is called read-after-write consistency.
- Well, with the additional caveat that if you do a read on an object before it exists, then create it, you get eventual consistency (not read-after-write).
- This does not apply to any list operations; newly created objects are not guaranteed to appear in a list operation right away.
- Updates to objects: If you overwrite or delete an object, you’re only guaranteed eventual consistency, i.e. the change will happen but you have no guarantee of when.
- ?For many use cases, treating S3 objects as immutable (i.e. deciding by convention they will be created or deleted but not updated) can greatly simplify the code that uses them, avoiding complex state management.
- ?Note that until 2015, the ‘us-standard’ region had a weaker eventual consistency model, and the other (newer) regions were read-after-write. This was finally corrected, but watch for many old blogs mentioning this!
- Slow updates: In practice, “eventual consistency” usually means within seconds, but expect rare cases of minutes or hours.
- S3 as a filesystem:
- In general S3’s APIs have inherent limitations that make S3 hard to use directly as a POSIX-style filesystem while still preserving S3’s own object format. For example, appending to a file requires rewriting, which cripples performance, and atomic rename of directories, mutual exclusion on opening files, and hardlinks are impossible.
- s3fs is a FUSE filesystem that goes ahead and tries anyway, but it has performance limitations and surprises for these reasons.
- Riofs (C) and Goofys (Go) are more recent efforts that attempt to adopt a different data storage format to address those issues, and so are likely improvements on s3fs.
- S3QL (discussion) is a Python implementation that offers data de-duplication, snap-shotting, and encryption, but only one client at a time.
- ObjectiveFS (discussion) is a commercial solution that supports filesystem features and concurrent clients.
- If you are primarily using a VPC, consider setting up a VPC Endpoint for S3 in order to allow your VPC-hosted resources to easily access it without the need for extra network configuration or hops.
- Cross-region replication: S3 has a feature for replicating a bucket between one region and another. Note that S3 is already highly replicated within one region, so usually this isn’t necessary for durability, but it could be useful for compliance (geographically distributed data storage), lower latency, or as a strategy to reduce region-to-region bandwidth costs by mirroring heavily used data in a second region.
- IPv4 vs IPv6: For a long time S3 only supported IPv4 at the default endpoint https://BUCKET.s3.amazonaws.com. However, as of Aug 11, 2016 it now supports both IPv4 & IPv6! To use both, you have to enable dualstack either in your preferred API client or by directly using this url scheme https://BUCKET.s3.dualstack.REGION.amazonaws.com. This extends to S3 Transfer Acceleration as well.
- S3 event notifications: S3 can be configured to send an SNS notification or SQS message, or to invoke an AWS Lambda function, on bucket events.
- ?Limit your individual users (or IAM roles) to the minimal required S3 locations, and catalog the “approved” locations. Otherwise, S3 tends to become the dumping ground where people put data to random locations that are not cleaned up for years, costing you big bucks.
- If a bucket is deleted in S3, it can take up to 10 hours before a bucket with the same name can be created again. (discussion)
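For the S3 event notifications mentioned in the tips above, a Lambda function receives a JSON event containing one record per affected object. The sketch below shows one way to pull out the bucket and key; `objects_from_s3_event` is a hypothetical helper name. Note that object keys arrive URL-encoded in the event, so they need decoding before use.

```python
from urllib.parse import unquote_plus


def objects_from_s3_event(event):
    """Return (event_name, bucket, key) for each S3 record in an event."""
    found = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if not s3:
            continue  # Not an S3 record; ignore it.
        bucket = s3["bucket"]["name"]
        # Keys are URL-encoded in notifications ("+" for space, %XX escapes).
        key = unquote_plus(s3["object"]["key"])
        found.append((record.get("eventName"), bucket, key))
    return found


def handler(event, context):
    """Lambda entry point: log every object the notification refers to."""
    for event_name, bucket, key in objects_from_s3_event(event):
        print(f"{event_name}: s3://{bucket}/{key}")
```

Because of the consistency caveats discussed above, a handler like this should tolerate the object being briefly unavailable or already overwritten by the time it runs.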
S3 Gotchas and Limitations
- ❗S3 buckets sit outside the VPC and can be accessed from anywhere in the world if bucket policies are not set to deny it. Read the permissions section above carefully, there are countless cases of buckets exposed to the public.
- ?For many years, there was a notorious 100-bucket limit per account, which could not be raised and caused many companies significant pain. As of 2015, you can request an increase, but the limit will still be capped (generally below ~1000 per account).
- ?Be careful not to make implicit assumptions about transactionality or sequencing of updates to objects. Never assume that if you modify a sequence of objects, the clients will see the same modifications in the same sequence, or if you upload a whole bunch of files, that they will all appear at once to all clients.
- ?S3 has an SLA with 99.9% uptime. If you use S3 heavily, you’ll inevitably see occasional errors accessing or storing data as disks or other infrastructure fail. Availability is usually restored in seconds or minutes. Although availability is not extremely high, as mentioned above, durability is excellent.
- ?After uploading, any change that you make to the object causes a full rewrite of the object, so avoid appending-like behavior with regular files.
- ?Eventual data consistency, as discussed above, can be surprising sometimes. If S3 suffers from internal replication issues, an object may be visible from a subset of the machines, depending on which S3 endpoint they hit. Those usually resolve within seconds; however, we’ve seen isolated cases when the issue lingered for 20-30 hours.
- ?MD5s and multi-part uploads: In S3, the ETag header is a hash of the object, and in many cases it is the MD5 hash. However, this is not the case in general when you use multi-part uploads. One workaround is to compute MD5s yourself and put them in a custom header (such as is done by s4cmd).
- ?Incomplete multi-part upload costs: Incomplete multi-part uploads accrue storage charges even if the upload fails and no S3 object is created. Amazon (and others) recommend using a lifecycle policy to clean up incomplete uploads and save on storage costs. Note that if you have many of these, it may be worth investigating whatever’s failing regularly.
- ?US Standard region: Previously, the us-east-1 region (also known as the US Standard region) was replicated across coasts, which led to greater variability of latency. Effective Jun 19, 2015 this is no longer the case. All Amazon S3 regions now support read-after-write consistency. Amazon S3 also renamed the US Standard region to the US East (N. Virginia) region to be consistent with AWS regional naming conventions.
- ?S3 authentication versions and regions: In newer regions, S3 only supports the latest authentication. If an S3 file operation using CLI or SDK doesn’t work in one region, but works correctly in another region, make sure you are using the latest authentication signature.
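To illustrate why the ETag differs from the MD5 after a multi-part upload: the widely observed scheme is that S3 hashes each part, then hashes the concatenated part digests and appends the part count. This is observed behavior rather than a documented contract, and it may not hold for every encryption setting, so treat the sketch below as an assumption. `multipart_etag` is a hypothetical helper name.

```python
import hashlib


def multipart_etag(data, part_size=8 * 1024 * 1024):
    """Reproduce the commonly observed ETag of a multi-part upload.

    data is the full object as bytes; part_size must match the part size
    that was actually used for the upload, or the result will differ.
    """
    if not data or part_size <= 0:
        raise ValueError("need non-empty data and a positive part_size")
    part_digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


def plain_md5(data):
    """The value you would store yourself in custom metadata as a workaround."""
    return hashlib.md5(data).hexdigest()
```

Since the result depends on the part size, two uploads of identical content can carry different ETags, which is the practical reason to store your own MD5 alongside the object.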
Storage Durability, Availability, and Price
As an illustration of comparative features and price, the table below gives S3 Standard, RRS, IA, in comparison with Glacier, EBS, EFS, and EC2 d2.xlarge instance store using Virginia region as of Sept 2017.
|Storage option||Durability (per year)||Availability “designed”||Availability SLA||Storage (per TB per month)||GET or retrieve (per million)||Write or archive (per million)|
|S3 IA||Eleven 9s||99.9%||99%||$12.50||$1||$10|
|S3 RRS||99.99%||99.99%||99.9%||$24 (first TB)||$0.40||$5|
|S3 Standard||Eleven 9s||99.99%||99.9%||$23||$0.40||$5|
|EC2 d2.xlarge instance store||Unstated||Unstated||–||$25.44||$0||$0|
Especially notable items are in boldface. Sources: S3 pricing, S3 SLA, S3 FAQ, RRS info (note that this is considered deprecated), Glacier pricing, EBS availability and durability, EBS pricing, EFS pricing, EC2 SLA
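As a worked example of the Standard versus Infrequent Access trade-off, the sketch below plugs the Sept 2017 Virginia prices from the table above into a simple monthly cost model. It is deliberately incomplete: it ignores data transfer, IA's per-GB retrieval fee, and IA's minimum object size and minimum storage duration charges, so it understates IA's real cost. The function names are hypothetical.

```python
# Prices copied from the table above (USD, us-east-1, Sept 2017).
PRICES = {
    "standard": {"storage_per_tb": 23.00, "get_per_million": 0.40, "put_per_million": 5.00},
    "ia": {"storage_per_tb": 12.50, "get_per_million": 1.00, "put_per_million": 10.00},
}


def monthly_cost(storage_class, tb_stored, gets_millions, puts_millions):
    """Approximate monthly bill for storage plus request charges only."""
    p = PRICES[storage_class]
    return (tb_stored * p["storage_per_tb"]
            + gets_millions * p["get_per_million"]
            + puts_millions * p["put_per_million"])


def break_even_gets_per_tb():
    """Millions of GETs per TB-month at which IA stops being cheaper."""
    storage_saving = PRICES["standard"]["storage_per_tb"] - PRICES["ia"]["storage_per_tb"]
    extra_per_million = PRICES["ia"]["get_per_million"] - PRICES["standard"]["get_per_million"]
    return storage_saving / extra_per_million


if __name__ == "__main__":
    for cls in ("standard", "ia"):
        print(cls, round(monthly_cost(cls, tb_stored=10, gets_millions=5, puts_millions=1), 2))
    print("break-even (million GETs per TB):", round(break_even_gets_per_tb(), 2))
```

Under these assumptions IA saves $10.50 per TB per month and costs $0.60 more per million GETs, so the request charges alone only erase the saving at roughly 17.5 million GETs per TB per month. In practice the omitted retrieval and minimum-size charges move the break-even point much lower.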
- ? Homepage ∙ Documentation ∙ FAQ ∙ Pricing (see also ec2instances.info)
- EC2 (Elastic Compute Cloud) is AWS’ offering of the most fundamental piece of cloud computing: A virtual private server. These “instances” can run most Linux, BSD, and Windows operating systems. Internally, they’ve used a heavily modified Xen virtualization. That said, new instance classes are being introduced with a KVM-derived hypervisor instead, called Nitro. So far, this is limited to the C5 and M5 instance types. Lastly, there’s a “bare metal hypervisor” available for i3.metal instances.
- The term “EC2” is sometimes used to refer to the servers themselves, but technically refers more broadly to a whole collection of supporting services, too, like load balancing (CLBs/ALBs/NLBs), IP addresses (EIPs), bootable images (AMIs), security groups, and network drives (EBS) (which we discuss individually in this guide).
- ?EC2 pricing and cost management is a complicated topic. It can range from free (on the AWS free tier) to a lot, depending on your usage. Pricing is by instance type, by second or hour, and changes depending on AWS region and whether you are purchasing your instances On-Demand, on the Spot market or pre-purchasing (Reserved Instances).
- Network Performance: For some instance types, AWS uses general terms like Low, Medium, and High to refer to network performance. Users have done benchmarking to provide expectations for what these terms can mean.
EC2 Alternatives and Lock-In
- Running EC2 is akin to running a set of physical servers, as long as you don’t do automatic scaling or tooled cluster setup. If you just run a set of static instances, migrating to another VPS or dedicated server provider should not be too hard.
- ?Alternatives to EC2: The direct alternatives are Google Cloud, Microsoft Azure, Rackspace, DigitalOcean, AWS’s own Lightsail offering, and other VPS providers, some of which offer similar APIs for setting up and removing instances. (See the comparisons above.)
- Should you use Amazon Linux? AWS encourages use of their own Amazon Linux, which is evolved from Red Hat Enterprise Linux (RHEL) and CentOS. It’s used by many, but others are skeptical. Whatever you do, think this decision through carefully. It’s true Amazon Linux is heavily tested and better supported in the unlikely event you have deeper issues with OS and virtualization on EC2. But in general, many companies do just fine using a standard, non-Amazon Linux distribution, such as Ubuntu or CentOS. Using a standard Linux distribution means you have an exactly replicable environment should you use another hosting provider instead of (or in addition to) AWS. It’s also helpful if you wish to test deployments on local developer machines running the same standard Linux distribution (a practice that’s getting more common with Docker, too). Amazon now supports an official Amazon Linux Docker image, aimed at assisting with local development on a comparable environment, though this is new enough that it should be considered experimental. Note that the currently-in-testing Amazon Linux 2 supports on-premise deployments explicitly.
- EC2 costs: See the section on this.
- ?Picking regions: When you first set up, consider which regions you want to use first. Many people in North America just automatically set up in the us-east-1 (N. Virginia) region, which is the default, but it’s worth considering if this is best up front. You’ll want to evaluate service availability (some services are not available in all regions), costing (baseline costs also vary by region by up to 10-30% (generally lowest in us-east-1 for comparison purposes)), and compliance (various countries have differing regulations with regard to data privacy, for example).
- Instance types: EC2 instances come in many types, corresponding to the capabilities of the virtual machine in CPU architecture and speed, RAM, disk sizes and types (SSD or magnetic), and network bandwidth.
- Selecting instance types is complex since there are so many types. Additionally there are different generations, released over the years.
- ?Use the list at ec2instances.info to review costs and features. Amazon’s own list of instance types is hard to use, and doesn’t list features and price together, which makes it doubly difficult.
- Prices vary a lot, so use ec2instances.info to determine the set of machines that meet your needs and ec2price.com to find the cheapest type in the region you’re working in. Depending on the timing and region, it might be much cheaper to rent an instance with more memory or CPU than the bare minimum.
- Turn off your instances when they aren’t in use. For many situations such as testing or staging resources, you may not need your instances on 24/7, and you won’t need to pay EC2 running costs when they are suspended. Given that costs are calculated based on usage, this is a simple mechanism for cost savings. This can be achieved using Lambda and CloudWatch, an open source option like cloudcycler, or a SaaS provider like GorillaStack. (Note: if you turn off instances with an ephemeral root volume, any state will be lost when the instance is turned off. Therefore, for stateful applications it is safer to turn off EBS backed instances).
- Dedicated instances and dedicated hosts are assigned hardware, instead of usual virtual instances. They are more expensive than virtual instances but can be preferable for performance, compliance, financial modeling, or licensing reasons.
- 32 bit vs 64 bit: A few micro, small, and medium instances are still available as 32-bit (“i386”) architecture, but you’ll almost always be using 64-bit (“amd64”) EC2 instances nowadays. Use 64 bit unless you have legacy constraints or other good reasons to use 32.
- HVM vs PV: There are two kinds of virtualization technology used by EC2, hardware virtual machine (HVM) and paravirtual (PV). Historically, PV was the usual type, but now HVM is becoming the standard. If you want to use the newest instance types, you must use HVM. See the instance type matrix for details.
- Operating system: To use EC2, you’ll need to pick a base operating system. It can be Windows or Linux, such as Ubuntu or Amazon Linux. You do this with AMIs, which are covered in more detail in their own section below.
- Limits: You can’t create arbitrary numbers of instances. Default limits on numbers of EC2 instances per account vary by instance type, as described in this list.
- ❗Use termination protection: For any instances that are important and long-lived (in particular, aren’t part of auto-scaling), enable termination protection. This is an important line of defense against user mistakes, such as accidentally terminating many instances instead of just one.
- SSH key management:
- When you start an instance, you need to have at least one ssh key pair set up, to bootstrap, i.e., allow you to ssh in the first time.
- Aside from bootstrapping, you should manage keys yourself on the instances, assigning individual keys to individual users or services as appropriate.
- Avoid reusing the original boot keys except by administrators when creating new instances.
- Avoid sharing keys and add individual ssh keys for individual users.
- GPU support: You can rent GPU-enabled instances on EC2 for use in machine learning or graphics rendering workloads.
- There are three types of GPU-enabled instances currently available:
- The P3 series offers NVIDIA Tesla V100 GPUs in 1, 4 and 8 GPU configurations targeting machine learning, scientific workloads, and other high performance computing applications.
- The P2 series offers NVIDIA Tesla K80 GPUs in 1, 8 and 16 GPU configurations targeting machine learning, scientific workloads, and other high performance computing applications.
- The G3 series offers NVIDIA Tesla M60 GPUs in 1, 2, or 4 GPU configurations targeting graphics and video encoding.
- AWS offers two different AMIs that are targeted to GPU applications. In particular, they target deep learning workloads, but also provide access to more stripped-down driver-only base images.
- AWS offers both an Amazon Linux Deep Learning AMI (based on Amazon Linux) as well as an Ubuntu Deep Learning AMI. Both come with most NVIDIA drivers and ancillary software (CUDA, CUBLAS, CuDNN, TensorFlow, PyTorch, etc.) installed to lower the barrier to usage.
- ⛓ Note that using these AMIs can lead to lock-in, because you have no direct access to software configuration or versioning.
- ? The compendium of frameworks included can lead to long instance startup times and difficult-to-reason-about environments.
- ?As with any expensive EC2 instance types, Spot instances can offer significant savings with GPU workloads when interruptions are tolerable.
- All current EC2 instance types can take advantage of IPv6 addressing, so long as they are launched in a subnet with an allocated CIDR range in an IPv6-enabled VPC.
EC2 Gotchas and Limitations
- ❗Never use ssh passwords. Just don’t do it; they are too insecure, and consequences of compromise too severe. Use keys instead. Read up on this and fully disable ssh password access to your ssh server by making sure ‘PasswordAuthentication no’ is in your /etc/ssh/sshd_config file. If you’re careful about managing ssh private keys everywhere they are stored, it is a major improvement on security over password-based authentication.
- ?For all newer instance types, when selecting the AMI to use, be sure you select the HVM AMI, or it just won’t work.
- ❗When creating an instance and using a new ssh key pair, make sure the ssh key permissions are correct.
- ?Sometimes certain EC2 instances can get scheduled for retirement by AWS due to “detected degradation of the underlying hardware,” in which case you are given a couple of weeks to migrate to a new instance.
- If your instance root device is an EBS volume, you can typically stop and then start the instance which moves it to healthy host hardware, giving you control over timing of this event. Note however that you will lose any instance store volume data (ephemeral drives) if your instance type has instance store volumes.
- The instance public IP (if it has one) will likely change unless you’re using Elastic IPs. This could be a problem if other systems depend on the IP address.
- ?Periodically you may find that your server or load balancer is receiving traffic for (presumably) a previous EC2 server that was running at the same IP address that you are handed out now (this may not matter, or it can be fixed by migrating to another new instance).
- ❗If the EC2 API itself is a critical dependency of your infrastructure (e.g. for automated server replacement, custom scaling algorithms, etc.) and you are running at a large scale or making many EC2 API calls, make sure that you understand when they might fail (calls to it are rate limited and the limits are not published and subject to change) and code and test against that possibility.
- ❗Many newer EC2 instance types are either EBS-only, or backed by local NVMe disks assigned to the instance. Make sure to factor in EBS performance and costs when planning to use them.
- ❗If you’re operating at significant scale, you may wish to break apart API calls that enumerate all of your resources, and instead operate either on individual resources, or a subset of the entire list. EC2 APIs will time out! Consider using filters to restrict what gets returned.
- ❗⏱ Instances come in two types: Fixed Performance Instances (e.g. M3, C3, and R3) and Burstable Performance Instances (e.g. T2). A T2 instance receives CPU credits continuously, the rate of which depends on the instance size. T2 instances accrue CPU credits when they are idle, and use CPU credits when they are active. However, once an instance runs out of credits, you’ll notice a severe degradation in performance. If you need consistently high CPU performance for applications such as video encoding, high volume websites or HPC applications, it is recommended to use Fixed Performance Instances.
- Instance user-data is limited to 16 KB. (This limit applies to the data in raw form, not base64-encoded form.) If more data is needed, it can be downloaded from S3 by a user-data script.
- Very new accounts may not be able to launch some instance types, such as GPU instances, because of an initially imposed “soft limit” of zero. This limit can be raised by making a support request. See AWS Service Limits for the method to make the support request. Note that this limit of zero is not currently documented.
- Since multiple AWS instances all run on the same physical hardware, early cloud adopters encountered what became known as the Noisy Neighbor problem, usually described as neighbors “stealing” CPU time. This feeling of not getting what you are paying for led to user frustration. However, “steal” may not be the best word to describe what’s actually happening, based on a detailed explanation of how the kernel determines steal time. Avoiding having CPU steal affect your application in the cloud may be best handled by properly designing your cloud architecture.
- AWS introduced Dedicated Tenancy in 2011. This allows customers to have all resources from a single server. Some saw this as a way to solve the noisy neighbor problem since only that customer uses the CPU. This approach comes with a significant risk if that physical system needs any type of maintenance. If a customer has 20 instances running using shared tenancy and one underlying server needs maintenance, only the instance on that server goes offline. If that customer has 20 instances running using dedicated tenancy on one server, all 20 instances go offline when that server needs maintenance.
- ?Only i3.metal instances currently provide the ability to run Android x86 emulators on AWS.
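Because EC2 API calls are rate limited with unpublished limits, as noted in the gotchas above, code that depends on the API should expect throttling and retry with backoff. The sketch below shows exponential backoff with full jitter in plain Python. `ThrottledError` is a hypothetical stand-in for whatever your client raises when throttled (for EC2 this typically surfaces as a `RequestLimitExceeded` error code), and the delay defaults are arbitrary assumptions.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a rate-limit error from an API client."""


def call_with_backoff(fn, max_attempts=6, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on ThrottledError with exponential backoff.

    sleep and rng are injectable so the behavior can be tested without
    actually waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: let the caller see the failure.
            # The cap doubles each attempt; full jitter picks a random
            # delay below it so many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
```

The AWS SDKs already retry some throttled calls on their own, so a wrapper like this matters most for long-running automation where the built-in retries run out. Combining it with filters that shrink each request reduces how often it triggers at all.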
- ? Homepage ∙ Documentation ∙ FAQ ∙ Pricing
- CloudWatch monitors resources and applications, captures logs, and sends events.
- CloudWatch monitoring is the standard mechanism for keeping tabs on AWS resources. A wide range of metrics and dimensions are available via CloudWatch, allowing you to create time based graphs, alarms, and dashboards.
- Alarms are the most practical use of CloudWatch, allowing you to trigger notifications from any given metric.
- Alarms can trigger SNS notifications, Auto Scaling actions, or EC2 actions.
- Alarms also support alerting when any M out of N datapoints cross the alarm threshold.
- Publish and share graphs of metrics by creating customizable dashboard views.
- Monitor and report on EC2 instance system check failure alarms.
- Using CloudWatch Events:
- Events create a mechanism to automate actions in various services on AWS. You can create event rules from instance states, AWS APIs, Auto Scaling, Run commands, deployments or time-based schedules (think Cron).
- Triggered events can invoke Lambda functions, send SNS/SQS/Kinesis messages, or perform instance actions (terminate, restart, stop, or snapshot volumes).
- Custom payloads can be sent to targets in JSON format; this is especially useful when triggering Lambdas.
- Using CloudWatch Logs:
- CloudWatch Logs is a streaming log storage system. By storing logs within AWS you have access to unlimited paid storage, but you also have the option of streaming logs directly to ElasticSearch or custom Lambdas.
- A log agent installed on your servers will process logs over time and send them to CloudWatch Logs.
- You can export logged data to S3 or stream results to other AWS services.
- CloudWatch Logs can be encrypted using keys managed through KMS.
- Detailed monitoring: Detailed monitoring for EC2 instances must be enabled to get granular metrics, and is billed under CloudWatch.
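The "M out of N datapoints" alarm behavior mentioned above can be illustrated with a small pure-Python model. This is a simplified sketch, not CloudWatch's actual evaluation logic: it only handles a "greater than threshold" comparison and ignores CloudWatch's configurable treatment of missing data. `alarm_state` is a hypothetical function name.

```python
def alarm_state(datapoints, threshold, datapoints_to_alarm, evaluation_periods):
    """Evaluate an M-out-of-N alarm over the most recent datapoints.

    datapoints_to_alarm is M, evaluation_periods is N. The alarm fires when
    at least M of the last N datapoints exceed the threshold.
    """
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("need 1 <= datapoints_to_alarm <= evaluation_periods")
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


if __name__ == "__main__":
    cpu = [10, 95, 20, 97, 99]  # Percent CPU, one datapoint per period.
    print(alarm_state(cpu, threshold=90, datapoints_to_alarm=3, evaluation_periods=5))
```

Setting M below N makes an alarm tolerant of brief dips back under the threshold, which helps with metrics that flap around the limit.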
CloudWatch Alternatives and Lock-In
- CloudWatch offers fairly basic functionality that doesn’t create significant (additional) AWS lock-in. Most of the metrics provided by the service can be obtained through APIs that can be imported into other aggregation or visualization tools or services (many specifically provide CloudWatch data import services).
- ? Alternatives to CloudWatch monitoring services include NewRelic, Datadog, Sumo Logic, Zabbix, Nagios, Ruxit, Elastic Stack, open source options such as StatsD or collectd with Graphite, and many others.
- ? CloudWatch Log alternatives include Splunk, Sumo Logic, Loggly, LogDNA, Logstash, Papertrail, Elastic Stack, and other centralized logging solutions.
- Some very common use cases for CloudWatch are billing alarms, instance or load balancer up/down alarms, and disk usage alerts.
- You can use EC2Config to monitor memory and disk metrics on Windows platform instances. For Linux, there are example scripts that do the same thing.
- You can publish your own metrics using the AWS API. This incurs additional cost.
- You can stream directly from CloudWatch Logs to a Lambda or ElasticSearch cluster by creating subscriptions on Log Groups.
- Don’t forget to take advantage of the CloudWatch non-expiring free tier.
CloudWatch Gotchas and Limitations
- ?Metrics in CloudWatch originate on the hypervisor. The hypervisor doesn’t have access to OS information, so certain metrics (most notably memory utilization) are not available unless pushed to CloudWatch from inside the instance.
- ?You cannot use more than one metric for an alarm.
- ?Notifications you receive from alarms will not have any contextual detail; they have only the specifics of the threshold, alarm state, and timing.
- ?By default, CloudWatch metric resolution is 1 minute. If you send multiple values of a metric within the same minute, they will be aggregated into minimum, maximum, average and total (sum) per minute.
- ?In July 2017, a new high-resolution option was added for CloudWatch metrics and alarms. This feature allows you to record metrics with 1-second resolution, and to evaluate CloudWatch alarms every 10 seconds.
- The blog post introducing this feature describes how to publish a high-resolution metric to CloudWatch. Note that when calling the PutMetricData API, StorageResolution is an attribute of each item you send in the MetricData array, not a direct parameter of the PutMetricData API call.
- ?Data about metrics is kept in CloudWatch for 15 months, starting November 2016 (used to be 14 days). Minimum granularity increases after 15 days.
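The point above about where `StorageResolution` belongs is easy to get wrong, so here is a sketch of the request shape for a high-resolution custom metric. The helper names, namespace, and metric name are hypothetical; the dictionary keys follow the PutMetricData parameters as exposed by boto3.

```python
import datetime


def high_resolution_datum(name, value, unit="Count", when=None):
    """Build one MetricData item that asks for 1-second storage resolution."""
    return {
        "MetricName": name,
        "Timestamp": when or datetime.datetime.now(datetime.timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # StorageResolution lives on each datum (1 = high resolution,
        # 60 = standard), not on the PutMetricData call itself.
        "StorageResolution": 1,
    }


def put_metric_data_kwargs(namespace, data):
    """Assemble the keyword arguments for a PutMetricData call."""
    return {"Namespace": namespace, "MetricData": list(data)}


if __name__ == "__main__":
    kwargs = put_metric_data_kwargs(
        "MyApp", [high_resolution_datum("QueueDepth", 42)]
    )
    # To send it (assumes boto3 and credentials are available):
    #   boto3.client("cloudwatch").put_metric_data(**kwargs)
    print(kwargs)
```

Remember that custom metrics are billed, so it is worth reserving 1-second resolution for the few metrics that genuinely need it.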
- ? User guide
- AMIs (Amazon Machine Images) are immutable images that are used to launch preconfigured EC2 instances. They come in both public and private flavors. Access to public AMIs is either freely available (shared/community AMIs) or bought and sold in the AWS Marketplace.
- Many operating system vendors publish ready-to-use base AMIs. For Ubuntu, see the Ubuntu AMI Finder. Amazon of course has AMIs for Amazon Linux.
- AMIs are built independently based on how they will be deployed. You must select AMIs that match your deployment when using them or creating them:
- EBS or instance store
- PV or HVM virtualization types
- 32 bit (“i386”) vs 64 bit (“amd64”) architecture
- As discussed above, modern deployments will usually be with 64-bit EBS-backed HVM.
- You can create your own custom AMI by snapshotting the state of an EC2 instance that you have modified.
- AMIs backed by EBS storage have the necessary image data loaded into the EBS volume itself and don’t require an extra pull from S3, which results in EBS-backed instances coming up much faster than instance storage-backed ones.
- AMIs are per region, so you must look up AMIs in your region, or copy your AMIs between regions with the AMI Copy feature.
- As with other AWS resources, it’s wise to use tags to version AMIs and manage their lifecycle.
- If you create your own AMIs, there is always some tension in choosing how much installation and configuration you want to “bake” into them.
- Baking less into your AMIs (for example, just a configuration management client that downloads, installs, and configures software on new EC2 instances when they are launched) allows you to minimize time spent automating AMI creation and managing the AMI lifecycle (you will likely be able to use fewer AMIs and will probably not need to update them as frequently), but results in longer waits before new instances are ready for use and results in a higher chance of launch-time installation or configuration failures.
- Baking more into your AMIs (for example, pre-installing but not fully configuring common software along with a configuration management client that loads configuration settings at launch time) results in a faster launch time and fewer opportunities for your software installation and configuration to break at instance launch time but increases the need for you to create and manage a robust AMI creation pipeline.
- Baking even more into your AMIs (for example, installing all required software as well and potentially also environment-specific configuration information) results in fast launch times and a much lower chance of instance launch-time failures but (without additional re-deployment and re-configuration considerations) can require time consuming AMI updates in order to update software or configuration as well as more complex AMI creation automation processes.
- Which option you favor depends on how quickly you need to scale up capacity, and on the size and maturity of your team and product.
- When instances boot fast, auto-scaled services require less spare capacity built in and can more quickly scale up in response to sudden increases in load. When setting up a service with autoscaling, consider baking more into your AMIs and backing them with the EBS storage option.
- As systems become larger, it is common to have more complex AMI management, such as a multi-stage AMI creation process in which a few (ideally one) common base AMIs are infrequently regenerated when components that are common to all deployed services are updated, followed by a more frequently run “service-level” AMI generation process that includes installation and possibly configuration of application-specific software.
- More thinking on AMI creation strategies here.
- Use tools like Packer to simplify and automate AMI creation.
- If you use RHEL instances and have existing on-premises Red Hat subscriptions, you can use Red Hat’s Cloud Access program to migrate a portion of your subscriptions to AWS, so that AWS does not charge you for RHEL subscriptions a second time. You can use either your own self-created RHEL AMIs or the Red Hat-provided Gold Images, which are added to your private AMIs once you sign up for Red Hat Cloud Access.
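The AMI selection criteria listed earlier (storage backing, virtualization type, architecture) map onto filters of the EC2 DescribeImages API. The sketch below assumes boto3; the region is passed explicitly because AMIs are per region. Note that the EC2 API spells the 64-bit architecture `x86_64` rather than “amd64”. The function names are illustrative only.

```python
def ami_filters(root_device="ebs", virtualization="hvm", architecture="x86_64"):
    """Build DescribeImages filters for the deployment style you want.

    Defaults match the common modern case: 64-bit, EBS-backed, HVM.
    """
    return [
        {"Name": "root-device-type", "Values": [root_device]},
        {"Name": "virtualization-type", "Values": [virtualization]},
        {"Name": "architecture", "Values": [architecture]},
        {"Name": "state", "Values": ["available"]},
    ]


def find_amis(region, owner="self"):
    """List matching AMIs in one region, newest first (requires boto3)."""
    import boto3  # imported lazily so ami_filters works without it

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_images(Owners=[owner], Filters=ami_filters())
    # CreationDate is an ISO 8601 string, so string order is date order.
    return sorted(
        response["Images"],
        key=lambda image: image["CreationDate"],
        reverse=True,
    )
```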
AMI Gotchas and Limitations
- Amazon Linux package versions: By default, instances based on Amazon Linux AMIs are configured to point to the latest versions of packages in Amazon’s package repository. This means that the package versions that get installed are not locked and it is possible for changes, including breaking ones, to appear when applying updates in the future. If you bake your AMIs with updates already applied, this is unlikely to cause problems in running services whose instances are based on those AMIs: breaks will appear at the earlier AMI-baking stage of your build process, and will need to be fixed or worked around before new AMIs can be generated. There is a “lock on launch” feature that allows you to configure Amazon Linux instances to target the repository of a particular major version of the Amazon Linux AMI. This reduces the likelihood that breaks caused by Amazon-initiated package version changes will occur at package install time, but at the cost of not having updated packages installed automatically by future update runs. Pairing use of the “lock on launch” feature with a process to advance the Amazon Linux AMI at your discretion can give you tighter control over update behaviors and timings.
- Cloud-Init Defaults: Oftentimes users create AMIs after performing customizations (whether manually or via a tool such as Packer or Ansible). If you’re not careful to alter the cloud-init settings that correspond to the system service you’ve customized (e.g. sshd), you may find that your changes are no longer in effect after booting your new AMI for the first time, because cloud-init has overwritten them. Some distros have different files than others, but all are generally located in /etc/cloud/, regardless of distro. You will want to review these files carefully for your chosen distro before rolling your own AMIs. A complete reference to cloud-init is available on the cloud-init site. This is an advanced configuration mechanism, so test any changes made to these files in a sandbox prior to any serious usage.
Auto Scaling Basics
- Homepage ∙ User guide ∙ FAQ ∙ Pricing at no additional charge
- Auto Scaling Groups (ASGs) are used to control the number of instances in a service, reducing manual effort to provision or deprovision EC2 instances.
- They can be configured through Scaling Policies to automatically increase or decrease instance counts based on metrics like CPU utilization, or based on a schedule.
- There are three common ways of using ASGs: dynamic (automatically adjust instance count based on metrics for things like CPU utilization), static (maintain a specific instance count at all times), and scheduled (maintain different instance counts at different times of day or on days of the week).
- ASGs have no additional charge themselves; you pay for underlying EC2 and CloudWatch services.
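The three usage patterns above can be pictured as three ways of computing a target instance count. The following is a toy model only, not how AWS evaluates scaling policies; the CPU thresholds, step size, and schedule format are all hypothetical choices made for illustration.

```python
def static_target(count):
    """Static: always keep the same number of instances."""
    return count


def scheduled_target(hour, schedule, default):
    """Scheduled: pick a count by hour of day.

    `schedule` maps (start_hour, end_hour) tuples to instance counts;
    end_hour is exclusive. Hours outside every window use `default`.
    """
    for (start, end), count in schedule.items():
        if start <= hour < end:
            return count
    return default


def dynamic_target(current, cpu_percent, high=70.0, low=30.0,
                   step=1, minimum=1, maximum=10):
    """Dynamic: step the count up or down based on a metric, within bounds."""
    if cpu_percent > high:
        current += step
    elif cpu_percent < low:
        current -= step
    return max(minimum, min(maximum, current))


# Hypothetical usage: scale out under load, but never past the maximum.
busy = dynamic_target(current=4, cpu_percent=85.0)
capped = dynamic_target(current=10, cpu_percent=95.0)
```

In all three cases the group's job is the same: converge the running instance count on the computed target.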
Auto Scaling Tips
- Better matching your cluster size to your current resource requirements through use of ASGs can result in significant cost savings for many types of workloads.
- Pairing ASGs with CLBs is a common pattern used to deal with changes in the amount of traffic a service receives.
- Dynamic Auto Scaling is easiest to use with stateless, horizontally scalable services.
- Even if you are not using ASGs to dynamically increase or decrease instance counts, you should seriously consider maintaining all instances inside of ASGs: given a target instance count, the ASG will work to ensure that the number of instances running is equal to that target, replacing instances for you if they die or are marked as being unhealthy. This results in consistent capacity and better stability for your service.
- Autoscalers can be configured to terminate instances that a CLB or ALB has marked as being unhealthy.
Auto Scaling Gotchas and Limitations
- ReplaceUnhealthy setting: By default, ASGs will kill instances that the EC2 instance manager considers to be unresponsive. It is possible for instances whose CPU is completely saturated for minutes at a time to appear to be unresponsive, causing an ASG with the default ReplaceUnhealthy setting turned on to replace them. When instances that are managed by ASGs are expected to consistently run with very high CPU, consider deactivating this setting. If you do so, however, detecting and killing unhealthy nodes will become your responsibility.
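One way to deactivate ReplaceUnhealthy is to suspend that scaling process on the group through the Auto Scaling API. The sketch below assumes boto3 and uses a hypothetical group name, `web-asg`; the helper names are illustrative.

```python
def suspend_replace_unhealthy_request(group_name):
    """Build the arguments for suspending only the ReplaceUnhealthy process."""
    return {
        "AutoScalingGroupName": group_name,
        "ScalingProcesses": ["ReplaceUnhealthy"],
    }


def suspend_replace_unhealthy(group_name):
    """Apply the suspension (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without it

    autoscaling = boto3.client("autoscaling")
    autoscaling.suspend_processes(**suspend_replace_unhealthy_request(group_name))


# Hypothetical usage:
# suspend_replace_unhealthy("web-asg")
```

Once the process is suspended, the group no longer replaces instances it considers unhealthy, so some other mechanism has to detect and remove them.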
EBS Basics
- Homepage ∙ User guide ∙ FAQ ∙ Pricing
- EBS (Elastic Block Store) provides block level storage. That is, it offers storage volumes that can be attached as filesystems, like traditional network drives.
- EBS volumes can only be attached to one EC2 instance at a time. In contrast, EFS can be shared but has a much higher price point (a comparison).
- ⏱RAID: Use RAID drives for increased performance.
- ⏱A worthy read is AWS’ post on EBS IO characteristics as well as their performance tips.
- ⏱One can provision IOPS (that is, pay for a specific level of I/O operations per second) to ensure a particular level of performance for a disk.
- ⏱A single EBS volume allows 10k IOPS max. To get the maximum performance out of an EBS volume, it has to be of a maximum size and attached to an EBS-optimized EC2 instance.
- Standard EBS volumes improve IOPS with size. It may make sense for you to simply enlarge a volume instead of paying for better performance explicitly. This can in many cases reduce costs by two-thirds.
- A standard block size for an EBS volume is 16 KB.
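The relationship between volume size and IOPS can be estimated with a few lines of arithmetic. This sketch assumes the gp2 baseline of 3 IOPS per GiB with a floor of 100 IOPS, which comes from AWS's gp2 documentation rather than from this guide, and it uses the 10k per-volume ceiling quoted above as a default. Check current limits before relying on either number. The throughput figure is a simple IOPS times I/O size product and ignores per-volume throughput caps.

```python
def gp2_baseline_iops(size_gib, per_gib=3, floor=100, ceiling=10_000):
    """Estimate baseline IOPS for a gp2 volume of the given size.

    per_gib and floor are assumptions taken from AWS's gp2 documentation;
    ceiling defaults to the per-volume maximum quoted in this guide.
    """
    return max(floor, min(ceiling, size_gib * per_gib))


def throughput_mib_per_s(iops, io_size_kib=16):
    """Upper-bound throughput if every I/O is io_size_kib in size."""
    return iops * io_size_kib / 1024


# Hypothetical comparison: enlarging a volume raises its baseline IOPS.
small = gp2_baseline_iops(100)   # 300 IOPS
large = gp2_baseline_iops(1000)  # 3000 IOPS
```

Comparing the price of a larger volume against the price of provisioned IOPS for the same target is the calculation behind the "enlarge instead of provisioning" tip.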
EBS Gotchas and Limitations
- ❗EBS durability is reasonably good for a regular hardware drive (annual failure rate of between 0.1% – 0.2%). On the other hand, that is very poor if you don’t have backups! By contrast, S3 durability is extremely high. If you care about your data, back it up to S3 with snapshots.
- EBS has an SLA with 99.99% uptime. See notes on high availability below.
- ❗EBS volumes have a volume type indicating the physical storage type. The types called “standard” (st1 or sc1) are actually old spinning-platter disks, which deliver only hundreds of IOPS — not what you want unless you’re really trying to cut costs. Modern SSD-based gp2 or io1 are typically the options you want.
- ❗When restoring a snapshot to create an EBS volume, blocks are lazily read from S3 the first time they’re referenced. To avoid an initial period of high latency, you may wish to use dd or fio as per the official documentation.
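The warm-up step amounts to reading every block of the restored volume once so that it is pulled from S3 before your workload needs it. The official documentation uses dd or fio for this, and those tools are the better choice in practice; the Python sketch below only illustrates the idea. The device path `/dev/xvdf` is hypothetical, and reading a raw device normally requires root.

```python
def read_all_blocks(path, block_size=1024 * 1024):
    """Read a file or block device from start to end, discarding the data.

    Returns the number of bytes read. Each read forces the underlying
    blocks to be fetched, which is the point of the warm-up.
    """
    total = 0
    # buffering=0 gives unbuffered binary reads straight from the device.
    with open(path, "rb", buffering=0) as device:
        while True:
            chunk = device.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total


# Hypothetical usage on a freshly restored volume:
# read_all_blocks("/dev/xvdf")
```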
EFS Basics
- Homepage ∙ User guide ∙ FAQ ∙ Pricing
- EFS is Amazon’s network filesystem. It’s presented as an NFSv4.1 server. Any compatible NFSv4 client can mount it.
- It is designed to be highly available and durable and each EFS file system object is redundantly stored across multiple availability zones.
- EFS is designed to be used as a shared network drive and it can automatically scale up to petabytes of stored data and thousands of instances attached to it.
- EFS can offer higher throughput (multiple gigabytes per second) and better durability and availability than EBS (see the comparison table), but with higher latency.
- EFS is priced based on the volume of data stored, and costs much more than EBS: roughly three times as much as general purpose gp2 EBS volumes.
- ⏱ Performance is dependent on the volume of data stored, as is the price:
- Like EBS, EFS uses a credit based system. Credits are earned at a rate of 50 KiB/s per GiB of storage and consumed in bursts during reading/writing files or metadata. Unlike EBS, operations on metadata (file size, owner, date, etc.) also consume credits. The BurstCreditBalance metric in CloudWatch should be monitored to make sure the file system doesn’t run out of credits.
- Throughput capacity during bursts is also dependent on size. Under 1 TiB, throughput can go up to 100 MiB/s. Above that, 100 MiB/s is added for each stored TiB. For instance, a file system storing 5 TiB would be able to burst at a rate of 500 MiB/s. Maximum throughput per EC2 instance is 250 MiB/s.
- EFS has two performance modes that can only be set when a file system is created. One is “General Purpose”, the other is “Max I/O”. Max I/O scales higher, but at the cost of higher latency. When in doubt, use General Purpose, which is also the default. If the PercentIOLimit metric in CloudWatch hovers around 100%, Max I/O is recommended. Changing performance mode means creating a new EFS and migrating data.
- High availability is achieved by having mount targets in different subnets / availability zones.
- Because EFS is based on NFSv4.1, any directory on the file system can be mounted directly; it doesn’t have to be the root directory. One application could mount fs-12345678:/prog1, another fs-12345678:/prog2.
- User and group level permissions can be used to control access to certain directories on the EFS file system.
- ⏱ Sharing EFS filesystems: One EFS filesystem can be used for multiple applications or services, but it should be considered carefully:
- Pro: Because performance is based on total size of stored files, having everything on one drive will increase performance for everyone. One application consuming credits faster than it can accumulate might be offset by another application that just stores files on EFS and rarely accesses them.
- Con: Since credits are shared, if one application over-consumes them, it will affect the others.
- Con: A compromise is made with regards to security: all clients will have to have network access to the drive. Someone with root access on one client instance can mount any directory on the EFS and they have read-write access to all files on the drive, even if they don’t have access to the applications hosted on other clients. There isn’t a no-root-squash equivalent for EFS.
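The EFS performance figures quoted earlier (credits earned at 50 KiB/s per GiB stored, burst throughput of 100 MiB/s up to 1 TiB and 100 MiB/s per stored TiB beyond that, 250 MiB/s maximum per EC2 instance) can be turned into a quick estimate. This is a back-of-the-envelope sketch based only on those numbers; the function names are illustrative, and current limits may differ.

```python
KIB_PER_MIB = 1024
GIB_PER_TIB = 1024
PER_INSTANCE_LIMIT_MIB_S = 250.0


def baseline_mib_per_s(stored_gib):
    """Rate at which burst credits are earned: 50 KiB/s per GiB stored."""
    return stored_gib * 50 / KIB_PER_MIB


def burst_mib_per_s(stored_gib):
    """File-system burst throughput: 100 MiB/s, or 100 MiB/s per TiB if larger."""
    stored_tib = stored_gib / GIB_PER_TIB
    return max(100.0, 100.0 * stored_tib)


def per_instance_burst_mib_per_s(stored_gib):
    """Burst throughput a single EC2 instance can see, after the instance cap."""
    return min(burst_mib_per_s(stored_gib), PER_INSTANCE_LIMIT_MIB_S)


# The guide's own example: a 5 TiB file system can burst at 500 MiB/s.
five_tib = 5 * GIB_PER_TIB
example_burst = burst_mib_per_s(five_tib)
```

This also shows why sharing one file system can help: a larger total stored size raises both the credit earn rate and the burst ceiling for every application on it.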