Adventures in Azure
In the last three years I have spent a lot of time working in Azure. In that time I've gone from throwing up some App Services and SQL Databases to provisioning a dozen different PaaS services and concerning myself with security, high availability and management.
Cloud has put more and more power in the hands of developers. New PaaS services can be spun up easily to meet the needs of the software being built, without the hassle of provisioning VMs and worrying about proxy settings and firewalls. Of course, with great power comes great responsibility and there is a risk of poor security implementations and cost blowouts for the unwary.
In this short article, I discuss some of my experiences working in Azure and my journey to maturity.
This is a simple one: my company was already invested in Azure. I'm certainly not going to dive into the merits of one cloud provider against another, but Azure is clearly attractive to organisations that already have Active Directory on-premises and/or Microsoft 365.
Moving your stable of virtual machines to the cloud may have many advantages for an infrastructure team, although cost may not be one of them. For a developer, VMs can be in the cloud or in the data centre, but your hosting challenges remain the same.
Containerisation removes a lot of those challenges but adds its own. We were looking to build out ten applications with a few shared services, and, most importantly, scalability was not a big concern.
One of the simplest steps you can take when thinking about securing your code and config is to use Azure Key Vault. You can use Key Vault references directly in App Service configuration or Azure App Configuration, or connect it as a configuration source in .NET Core. It's a great way to securely store passwords, keys and other secrets and keep them out of config files and CI/CD tool chains.
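To illustrate, an App Service app setting can point at a Key Vault secret using a reference instead of the raw value. The vault and secret names below are hypothetical:

```json
{
  "name": "DbPassword",
  "value": "@Microsoft.KeyVault(SecretUri=https://my-vault.vault.azure.net/secrets/DbPassword/)",
  "slotSetting": false
}
```

The application reads `DbPassword` as an ordinary setting; App Service resolves the reference at runtime, so the secret value never appears in config files or pipelines.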
Of course, you can avoid storing so many passwords in the first place if you use Managed Identity. This allows you to assign an Azure AD identity to a resource and then grant that identity access to other Azure resources, not least SQL databases. This drastically reduces the need to store keys and passwords, and gives you greater visibility of what access an application has.
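To make this concrete, granting an App Service's managed identity access to an Azure SQL database comes down to creating a database user from that identity. The names and roles below are hypothetical, a sketch of the pattern rather than a full setup:

```sql
-- Run against the target database, connected as an Azure AD admin.
-- 'my-app-service' is the App Service that has a managed identity assigned.
CREATE USER [my-app-service] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [my-app-service];
ALTER ROLE db_datawriter ADD MEMBER [my-app-service];
```

The app then connects with `Authentication=Active Directory Managed Identity` in its connection string and no password at all.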
How secure do you need to be? It's an important question to ask, especially when cost is a factor.
Many resource types in Azure sit behind firewalls. This gives you a certain level of control, but generally you'll want to have an exception for other Azure services. It is very hard, and often impossible, to lock down access to just the services you own.
Virtual networking provides a much greater level of security at the cost of extra resources and management overhead. You'll also need a way into your virtual network, via a VPN or ExpressRoute. Many PaaS resources support a technology called Private Link, which brings the resource into the virtual network. Other resources can be attached to the network so that they remain publicly visible but are also able to access protected resources.
Designing for High Availability and Disaster Recovery
You might think, as I did when I started, that high availability (HA) would just be sorted: that services either just worked, or that you could, at the click of a button, set up a replica in a different region.
The fact is, every PaaS service has its own approach to HA. I discuss some of the more interesting ones below.
For the purposes of this discussion, I am defining high availability as avoiding dependency on a single Azure region. Within a region, you can define high availability in terms of zones, scale-out or deployment slots.
Services that Just Work™
Some services truly support built-in HA without requiring any extra configuration or cost on your part. These include Key Vault, Event Grid and Front Door. However, be mindful of your recovery time objective (RTO) and whether these services will come back within it after a regional outage.
Storage Accounts, Cosmos DB and Azure SQL all require minimal configuration to set up. For Azure SQL databases, Failover Groups will create a read-only replica of your database in one or more alternative regions. Naturally, you have to pay for the replica, but you can consider scaling down the replica if you have a read-heavy application with minimal writes.
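One nicety of Failover Groups is that they give you stable listener endpoints, so connection strings don't change when a failover happens. Assuming a hypothetical group named `fog-myapp`, the two listeners look like this:

```
-- Read-write listener: always resolves to the current primary
Server=tcp:fog-myapp.database.windows.net,1433;Database=mydb;

-- Read-only listener: resolves to a secondary replica
Server=tcp:fog-myapp.secondary.database.windows.net,1433;Database=mydb;ApplicationIntent=ReadOnly;
```

The read-only listener is how you offload reads to the replica you're already paying for.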
Services that Require Code
Some services either do not provide cross-region capability out of the box, or do if you are willing to pay premium prices (Service Bus for example). In these instances, you can look at solutions in code to switch services if one fails.
Alternatively, you can configure each deployment of your application to use only the PaaS services in its own region. Should a region go down, you know that the services in the secondary region will continue to work.
App Services are not highly available out of the box. You'll need to implement one of three technologies on top.
Traffic Manager will do the job cheaply and with no fuss. It'll balance load across multiple backends and direct traffic to an operating node if one goes down.
Application Gateway is a pricier option but with many more load balancing features. It works within a single region so is great when an app is distributed locally.
Front Door is a global edge service, ensuring that traffic is directed to the node geographically closest to the end user. However, it is not as feature-rich as Application Gateway; it does not support WebSockets, for example.
Again, for the purposes of this discussion, I am defining disaster as the total loss of an Azure region. Another type of disaster is the loss of Azure Active Directory, in which case recovery is out of your hands.
Less severe issues include the loss of a zone, a sub-division of an Azure region, or even the loss of a specific rack, although that is more an issue with VMs.
You may not be aware, but certain services, such as storage, actually keep multiple copies of the data within the region to buffer against these issues or planned maintenance work.
Testing PaaS services for disaster is one of the biggest headaches for an Azure developer. You can turn off an App Service or force a DB failover, but how do you stop a Service Bus or an Event Grid?
Microsoft has released Azure Chaos Studio to help with this problem, but at the time of writing it covers precious few PaaS resource types.
Nothing is perfect, especially when you're learning on the go. I often felt like I was building the railway track just in front of the moving train. In this environment, mistakes are made and lessons learned. Here are a couple of mine.
When deciding on a technology for pub/sub, I looked at both Service Bus Topics and Event Grid. I chose Event Grid as the newer technology that seemed to be the way Microsoft was going. It was one of the few decisions I came to regret, for two reasons: the event schema being one, but mostly dead-letter handling.
Event Grid offers two schema types: Microsoft's own and the platform-agnostic CloudEvents schema. I chose Microsoft's, and this came back to bite me when I realised that I needed to add metadata to the event, such as a correlation ID. By then it was too late to easily change schemas, so I had to come up with a wholly unsatisfactory way of adding query string parameters to the Subject field.
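For a feel of what that workaround looks like, here is a minimal sketch. The Subject field is free text to Event Grid, so metadata can be smuggled in as a query string; the field names are hypothetical:

```python
from urllib.parse import parse_qs, urlencode, urlparse

def subject_with_metadata(subject: str, correlation_id: str) -> str:
    """Append metadata to an Event Grid Subject as a query string."""
    return f"{subject}?{urlencode({'correlationId': correlation_id})}"

def parse_subject(subject: str):
    """Split the Subject back into its path and its metadata dict."""
    parsed = urlparse(subject)
    return parsed.path, {k: v[0] for k, v in parse_qs(parsed.query).items()}

path, meta = parse_subject(subject_with_metadata("/orders/1234", "abc-123"))
# path == "/orders/1234", meta == {"correlationId": "abc-123"}
```

It works, but every subscriber has to know about the convention, which is exactly why I call it unsatisfactory.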
More importantly, when I turned my attention to handling and resubmitting dead letters I realised it was not going to be easy to retrigger the failing subscriber. To explain, let's look at how Service Bus Topics work.
Every Topic subscriber gets its very own queue and its own dead-letter queue. If one particular subscriber fails, it is straightforward to fix the bug and requeue the failed message just for that subscriber. Event Grid is nowhere near that sophisticated: the dead letters end up in a storage account, and if you use Azure Functions with Event Grid triggers, there's no easy way to retrigger just the Function that failed. Of course there are solutions to this, such as adding HTTP triggers to your Functions or placing each Function behind a queue, but the work involved, compared to what Service Bus Topics gives you out of the box, made me regret my decision.
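To show the kind of plumbing you end up writing, the sketch below turns a dead-letter blob back into per-event HTTP requests aimed at a hypothetical HTTP-triggered copy of the failing Function. Event Grid writes dead letters as JSON arrays of events; the URL and event shape here are assumptions for illustration:

```python
import json

def build_resubmit_requests(dead_letter_blob: bytes, function_url: str):
    """Split a dead-letter blob (a JSON array of failed events) into
    one (url, body) pair per event, ready to POST back to the handler."""
    events = json.loads(dead_letter_blob)
    return [(function_url, json.dumps(event)) for event in events]

# A single dead-lettered event, roughly as it might appear in the blob.
blob = b'[{"id": "1", "subject": "/orders/1234", "data": {"total": 42}}]'
requests = build_resubmit_requests(
    blob, "https://my-app.azurewebsites.net/api/RetryOrderHandler"
)
```

Each pair can then be POSTed with any HTTP client. The point is that all of this is code you maintain, versus simply receiving from a Service Bus dead-letter queue.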
Infrastructure as Code (IaC)
Don't leave it until the end. Or the middle. Or skip it altogether. Defining your infrastructure in code allows for fast, reliable, repeatable deployments. Resource and time constraints meant that, for us, this work was delayed, and it became a much harder slog to deal with the inevitable differences between one product deployment and the next, and even the same product across environments.
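As a sketch of what this looks like in practice, here is a minimal Bicep template for one app, parameterised per environment so that dev, test and prod are all deployed from the same definition. All names, tags and the SKU are hypothetical:

```bicep
param env string
param location string = resourceGroup().location

// One plan and one app per environment, named and tagged consistently.
resource plan 'Microsoft.Web/serverfarms@2022-09-01' = {
  name: 'plan-myapp-${env}'
  location: location
  sku: { name: 'P1v3' }
}

resource app 'Microsoft.Web/sites@2022-09-01' = {
  name: 'app-myapp-${env}'
  location: location
  properties: { serverFarmId: plan.id }
  tags: { environment: env, owner: 'platform-team' }
}
```

Deploying the same file with `env` set to `dev` or `prod` removes exactly the kind of drift between environments that we paid for by leaving IaC late.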
Together with a solid naming and tagging strategy, IaC means that your ever-increasing stable of stuff remains manageable.