HYLA TechTalk: Introducing the first of many blogs to come, offering insights into the technology we develop and the data we use across our analytics, trade-in, insurance, and processing solutions, which serve carriers, OEMs, retailers, and teams throughout HYLA.
Recently, here at HYLA Mobile, our team migrated the entire trade-in and processing platform to Amazon Web Services (AWS). Along the way, we made a few application-related improvements to leverage the benefits of the target environment. A recent LinkedIn post prompted a few folks to ask what we learned along the way. Here is a detailed write-up on our experience pushing through this change; we hope it helps others on a similar journey.
- Buy-In: Moving from a legacy environment to any cloud environment is not a trivial activity. Hence, it is important that the change is not treated as a hobby project, but has the support of your senior leadership. The move should be recognized as an important strategic initiative and confirmed to be aligned with business goals. On the execution side, it is important to ensure that there is an excellent project management team, commitment from the execution staff, and, most importantly, good old-fashioned grit to see things through.
- Leading Change: The mechanics of pushing through the change boil down to (1) leading the change from the front, (2) influencing the team to push through the change, or (3) issuing top-down edicts to execute the change. However, any change implementation comes with a degree of Fear, Uncertainty, and Doubt (FUD), and the leadership stamina required to see the change through is directly proportional to the intensity of that FUD. Accordingly, an appropriate strategy needs to be applied. If the leader has the ability and bandwidth to lead from the front, the probability of success is relatively high. However, if the change is being pushed top-down, then the leader(s) must ensure that the team is adequately staffed and trained to manage through it successfully.
- Architecture: While it sounds trite, it is important to pen down high-level and next-level architectural diagrams that include low-level details such as the VPC, subnets for the various tiers, ACLs for the subnets, and security groups for EC2, including ports. It is also important to keep in mind the various services the architecture needs to accommodate: SSH (22), SMTP (25), Tomcat (8080), etc. Using the architecture as a blueprint, CloudFormation (or other scripting) needs to be written to build the infrastructure.
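As a sketch of what "security groups including ports" can look like in a template, here is a hypothetical CloudFormation fragment (all resource names and references are illustrative, not our actual stack):

```yaml
# Hypothetical security group for an app tier: Tomcat (8080) is allowed
# only from the web tier's security group, and SSH (22) only from the
# bastion's security group. Names are placeholders.
AppServerSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: App tier - Tomcat from web tier, SSH from bastion
    VpcId: !Ref AppVpc
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 8080
        ToPort: 8080
        SourceSecurityGroupId: !Ref WebServerSecurityGroup
      - IpProtocol: tcp
        FromPort: 22
        ToPort: 22
        SourceSecurityGroupId: !Ref BastionSecurityGroup
```

Referencing other security groups instead of CIDR ranges keeps the rules valid even as instances come and go.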
- Application State: When porting over legacy applications, this is the one area that will most likely cause a lot of heartburn. The underlying issue is what Martin Fowler calls the "Snowflake Server." This is where folks need to spend energy decoupling the application state from the environment. The long pole here happens to be property files. The best way to tackle this is to pivot to something like Cloud Config, ZooKeeper, or Consul. However, under timeline pressure such a pivot may be hard, and in those cases S3 can be leveraged to store the application state and configuration files.
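The S3 approach can be as simple as pulling the environment-specific property file at boot instead of baking it into the AMI. A hypothetical UserData fragment (bucket name, paths, and parameters are illustrative):

```yaml
# Hypothetical EC2 UserData: fetch config from S3 at boot time so the
# same AMI works in every environment. ConfigBucket and EnvName are
# assumed stack parameters; paths are placeholders.
UserData:
  Fn::Base64: !Sub |
    #!/bin/bash
    aws s3 cp s3://${ConfigBucket}/myapp/${EnvName}/application.properties \
      /opt/myapp/conf/application.properties
    service tomcat restart
```

The instance profile needs read access to the config bucket for this to work.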
- AWS Accounts: Before building anything, it is important to think through accounts and hierarchies. One could design a fine-grained hierarchy or stay coarse-grained, but the final design needs to be driven by department or company objectives. In our case, we just needed four separate accounts, one for each environment: prod, uat, qa, and dev. However, in a larger organization, it may be wise to put deeper thought into account organization.
- VPC and CIDR Ranges: It is equally important to put thought into segregating CIDR ranges based on environment and business domains. In our case, we had to go through a few iterations to pick the right CIDR ranges for dev, qa, uat, and prod. This could have been avoided had time been spent on it early on.
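One hypothetical way to carve up the space (these exact ranges are an example, not our allocation) is a dedicated /16 per environment, leaving room to peer VPCs later without collisions:

```yaml
# Illustrative non-overlapping CIDR plan: one /16 per environment,
# with /24 subnets per tier inside each.
dev:  10.10.0.0/16   # e.g. 10.10.0.0/24 web, 10.10.1.0/24 app, 10.10.2.0/24 db
qa:   10.20.0.0/16
uat:  10.30.0.0/16
prod: 10.40.0.0/16
```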
- Building Up the Infrastructure: Building by hand through the console is great for learning. However, folks need to invest time and energy in building up the infrastructure using CloudFormation, Terraform, or the CloudFormation Template Generator (from Monsanto Engineering). In our case, we ended up using CloudFormation. Once the scripts are in place, it is important to start treating them as code, specifically, infrastructure as code. The idea that infrastructure is no different from the rest of the code base needs to become ingrained in the software organization's culture. In our case, the CloudFormation scripts are in Git, and going forward, changes to the environment will get no different treatment than changes to the code supporting our product suites.
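In practice, "infrastructure as code" means the template in Git is the source of truth and changes flow through the CLI, not the console. A hypothetical workflow (stack name, file path, and parameters are placeholders):

```shell
# Sketch: apply a reviewed, version-controlled template via the CLI so
# environment changes get the same review process as application code.
git checkout master
aws cloudformation create-stack \
  --stack-name core-network \
  --template-body file://templates/core-network.yaml \
  --parameters ParameterKey=EnvName,ParameterValue=dev
```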
- Service Limits: It is a good idea to be aware of what the service limits are and to request increases ahead of time. It is not ideal when an application under load hits a limit while trying to scale up and breaks down.
- Accessing EC2: If set up correctly, only a few people will need SSH access to EC2. In fact, in a well-automated state, even SSH access may not be required. One of the reasons developers need access to EC2 instances is to view application logs. The logs, however, can be piped to CloudWatch Logs, and if IAM is set up correctly, this should address the need to access logs for debugging purposes. Another strategy, and actually the most ideal solution, is to send all the log data to Elasticsearch. This not only enables enhanced search capabilities, but also opens up opportunities to perform log analytics through Kibana.
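Piping logs to CloudWatch Logs is mostly a matter of agent configuration. A hypothetical stanza for the awslogs agent (log paths and group names are placeholders):

```ini
# Illustrative /etc/awslogs/awslogs.conf entry: ship the Tomcat log to
# CloudWatch Logs so developers never need SSH just to read logs.
[/opt/tomcat/logs/catalina.out]
file = /opt/tomcat/logs/catalina.out
log_group_name = /myapp/dev/tomcat
log_stream_name = {instance_id}
datetime_format = %d-%b-%Y %H:%M:%S
```

Using `{instance_id}` as the stream name keeps logs separable per instance even as the ASG replaces them.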
- Static IPs: In the cloud environment, there is little to no need for static IPs. However, this idea takes a little getting used to, especially for those of us who have relied on fixed IPs throughout our software lives. In our case, only the NAT Gateways have Elastic IPs. Pretty much everything else in our environment has virtual IPs, and almost all of them are private too. The SSH bastions have public IPs, but they are not static; if the CloudFormation stack that built a bastion were deleted and recreated, the bastion would get a new IP. We felt this was OK given that only a few people had access.
- Private IPs: Almost all of our IPs are private, and none of them are visible to the outside world. The public IPs are for the NAT Gateways, external-facing ELBs, and bastions. One can access the private IPs only from a bastion. Initially, this caused a bit of pain because every time we needed to SSH to an EC2 resource, we had to figure out what its IP was, which meant logging into the console to look up the private IP. This required a few more clicks than before. However, with automation leveraging AWS, this problem is being aggressively tackled by our capable DevOps team.
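One small quality-of-life improvement on this front is a client-side SSH config that tunnels through the bastion automatically. A hypothetical `~/.ssh/config` (host names, key file, and addresses are placeholders; `ProxyJump` requires OpenSSH 7.3+):

```yaml
# Illustrative ~/.ssh/config: reach private IPs in one hop via the bastion.
Host bastion
    HostName 54.0.0.0          # placeholder for the bastion's current public IP
    User ec2-user
    IdentityFile ~/.ssh/ops.pem

Host 10.10.*
    User ec2-user
    IdentityFile ~/.ssh/ops.pem
    ProxyJump bastion
```

With this in place, `ssh 10.10.1.25` transparently jumps through the bastion.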
- ASG: To scale, we had set up CPU High and Low Alarms. Here too, it is a good idea to put some thought into what the high and low thresholds should look like. At one point our application servers were thrashing pretty badly. In the middle of debugging, a server just powered off. The shutdown felt arbitrary, with no apparent reason. We chased our tail, thinking the environment was "unstable" and suspecting something was wrong with the UserData part of the EC2 configuration. In the end, it turned out that the High CPU Alarm threshold was not right: the bar was set too low, so when the application crossed it, the ASG terminated the instance and replaced it with a new instance, which was then promptly terminated as well. Resetting the High CPU Alarms for Auto Scaling brought stability and relief.
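The fix amounts to leaving a wide gap between the thresholds so a busy-but-healthy server is not cycled. A hypothetical CloudFormation fragment for the high-CPU side (threshold values and names are illustrative):

```yaml
# Illustrative high-CPU alarm: 75% average over two 5-minute periods
# triggers scale-out. A matching low-CPU alarm (e.g. below 20%) would
# trigger scale-in, keeping a wide dead band between the two.
CPUHighAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    MetricName: CPUUtilization
    Namespace: AWS/EC2
    Statistic: Average
    Period: 300
    EvaluationPeriods: 2
    Threshold: 75
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref ScaleOutPolicy
```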
- Tags: Putting thought into tags is extremely important. Tags are free-form text, so it is important to establish a solid naming convention for all resources and to stick to it diligently. Tagging has the potential to become chaotic if not controlled from the beginning.
- SSL Termination: Terminating SSL at the ELB offloads the SSL overhead from the web servers. In addition, AWS provides nicely packaged, predefined security policies, which makes this part a breeze.
- RDS: Going down this route takes away a lot of the freedom that comes with, say, setting up Postgres (or MySQL) on EC2. AWS retains the true "superuser" rights, and the admin user is limited to a restricted set of privileges. For legacy applications, this is another area where people may have to spend time cleaning up. Another neat thing about RDS is that encrypting data at rest is a breeze. However, it is a good idea to generate a key in KMS and use that rather than the default one.
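Using a customer-managed KMS key instead of the account default comes down to two properties on the DB instance. A hypothetical fragment (engine, sizes, and names are illustrative):

```yaml
# Illustrative encrypted RDS instance: StorageEncrypted turns on
# encryption at rest, and KmsKeyId points at a customer-managed key
# (an AWS::KMS::Key resource assumed to be defined elsewhere).
AppDatabase:
  Type: AWS::RDS::DBInstance
  Properties:
    Engine: postgres
    DBInstanceClass: db.m4.large
    AllocatedStorage: 100
    StorageEncrypted: true
    KmsKeyId: !Ref AppDatabaseKey
    MasterUsername: appadmin
    MasterUserPassword: !Ref DbPasswordParam
```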
- IAM Groups and Users: Time needs to be put into designing and building IAM groups with appropriate sets of permissions. Users can then be assigned to groups, which gives better control over limiting permissions and achieves a well-thought-out separation of responsibilities.
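As an example of a narrowly scoped group policy, here is a hypothetical policy one might attach to a "developers" group so they can read CloudWatch Logs without SSH access (the Sid and scoping are illustrative; `Resource` should be narrowed to specific log groups in practice):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyLogsForDevelopers",
      "Effect": "Allow",
      "Action": [
        "logs:GetLogEvents",
        "logs:FilterLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": "*"
    }
  ]
}
```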
- Getting Help: The free support through the AWS Forums is totally useless; questions go unanswered. Investing dollars in paid support is well worth it, for the reasons mentioned in the next bullet.
- Still Not Perfect: AWS is not yet perfect. For instance, during our production DB build-out, the read-only replica failed for an unknown reason, and it took multiple attempts, with some help from AWS support, to get rid of the zombie replica that sat in a limbo state for 12+ hours. Another time, we hit an issue with a CloudFormation script: we were unable to delete a stack because it relied on another stack that had been deleted successfully earlier. The error message indicated that the stack could not be deleted because it used an export from the other stack, which was long gone but had managed to remain behind the curtains in a phantom state.
- /var/log/cloud-init-output.log: During the build-out phase, reviewing the output log at this location makes debugging UserData a breeze. The output clearly tells you what went wrong.
- CodeDeploy Woes: We used the "AutoScalingGroups" binding in "AWS::CodeDeploy::DeploymentGroup". However, every now and then, the ASG went into a weird state. Fixing it meant cleaning things up manually: getting the list of ASG lifecycle hooks, identifying the one for CodeDeploy, and deleting it via the CLI. When this became a recurring pain, we switched over to Ec2TagFilters, which made life a lot easier.
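The tag-based targeting looks roughly like this in a template (application, role, and tag values are illustrative):

```yaml
# Illustrative deployment group using Ec2TagFilters instead of the
# AutoScalingGroups binding: CodeDeploy targets any instance carrying
# the tag, so no ASG lifecycle hooks are involved.
AppDeploymentGroup:
  Type: AWS::CodeDeploy::DeploymentGroup
  Properties:
    ApplicationName: !Ref AppCodeDeployApplication
    ServiceRoleArn: !GetAtt CodeDeployRole.Arn
    Ec2TagFilters:
      - Key: Role
        Value: appserver
        Type: KEY_AND_VALUE
```

The trade-off is that new instances are picked up by tag at deployment time rather than being automatically deployed to on scale-out.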
- CloudFormation: Keeping the scripts small and building one bit on top of another keeps them organized, manageable, and error-free. We started with monolithic scripts containing thousands and thousands of lines of code. We quickly realized this was going to be problematic and pivoted to breaking them apart. So we built the core infrastructure (VPC, Internet Gateway, NAT Gateway, route tables, routes, etc.), followed by the web infrastructure (ELB, SG, etc.), the web server (ASG, alarms, etc.), the app server, and so on, building each one on top of exports from the previous script.
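The layering mechanism is CloudFormation's export/import feature. A hypothetical sketch across two stacks (names are illustrative):

```yaml
# Illustrative cross-stack wiring.
# --- core-network.yaml: the base stack exports the VPC id ---
Outputs:
  VpcId:
    Value: !Ref AppVpc
    Export:
      Name: core-network-VpcId

# --- web-infrastructure.yaml: a later stack imports it ---
Resources:
  WebSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Web tier
      VpcId: !ImportValue core-network-VpcId
```

Note that CloudFormation refuses to delete a stack whose exports are still imported elsewhere, which is what bit us in the phantom-export issue above.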
- Lambda: We used Lambdas to execute custom logic in CodePipeline. The custom logic involved executing shell scripts on EC2 instances and moving files from one S3 bucket to another. The shell scripts were executed from CodePipeline through Lambda and SSM (it is a bit more complex than we would like). In addition, we used Lambda to send EC2, ASG, and RDS alarms and CodePipeline approvals to a HipChat room. We think Lambdas offer solid potential for automating many manual tasks in the AWS environment.
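The alarm-to-chat pattern is simple: SNS delivers the CloudWatch alarm JSON to a Lambda, which formats a short message and posts it to the room. A minimal Python sketch, assuming the standard SNS event shape (the HipChat POST itself is elided since the room URL and token are deployment-specific):

```python
import json

def format_alarm_message(sns_event):
    """Turn a CloudWatch alarm delivered via SNS into a one-line chat message."""
    record = sns_event["Records"][0]["Sns"]
    alarm = json.loads(record["Message"])  # SNS wraps the alarm JSON as a string
    return "{0}: {1} ({2})".format(
        alarm["NewStateValue"], alarm["AlarmName"], alarm["NewStateReason"])

def handler(event, context):
    message = format_alarm_message(event)
    # Here one would POST `message` to the HipChat room notification URL
    # (e.g. with urllib2/requests); omitted since the URL/token are secrets.
    return message
```

Keeping the formatting separate from the posting makes the interesting part trivially unit-testable.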
- AWS Lock-In: AWS provides an amazing set of tools (CLI) and SDKs (Java, Python, etc.) that make automation a breeze. In addition, AWS is starting to offer neat solutions for code build, deploy, etc. that seamlessly interoperate with other AWS services and technology stacks. Leveraging more and more of these means tightly coupling our applications and processes to this "virtual" environment. Such coupling means moving to another cloud provider like Azure or GCP in the future will be a lot harder to execute. So, before digging deeper, it is important to evaluate the long-term cloud strategy and have a clear view of the path being taken.
Note: There were areas we just couldn't get to before the production push but plan to tackle soon: (1) evaluate ELB health checks instead of EC2 health checks for auto scaling determinations, (2) evaluate the federation option in lieu of clustering to avoid a network partition issue that seemed to happen every now and then, (3) evaluate custom metrics instead of the free ones, (4) use StackSets for CloudFormation, (5) CloudTrail for auditing, (6) granular billing alerts, (7) evaluate reserved instances to save some more money, and (8) explore Cloudian to reduce costs even further.