When your connected product begins to sell and business grows, you may find yourself in a situation familiar to many who’ve traveled on this road before: Your infrastructure is stretched and starting to show signs of stress.
Connected products typically depend on cloud-based infrastructure to power their features, and once you amass a good number of devices in the field, they can generate quite a bit of traffic. When your system is stressed, you sometimes get subtle hints, such as slower response times to web service requests. Unfortunately, just as often the hints are less subtle, such as a database crashing hard and causing prolonged downtime.
Let’s look at some of the things connected product companies need to think about when they grow so they can scale gracefully.
Most cloud-based infrastructures start with servers running some variant of Linux or a Microsoft operating system. Applications that run on these servers communicate with the connected products. When these applications are pushed and the servers start to run short of resources – high CPU usage, high RAM usage, etc. – you must decide how to scale. The first thing most people think of is vertical scaling: make the server bigger and give the application more resources, such as more CPU cores or more RAM. An alternative approach, and often the route we recommend for long-term stability, is to scale horizontally: add more servers to share the burden.
The benefit of scaling vertically is that it is generally quite easy to do. Cloud service providers like Amazon Web Services and Microsoft Azure let their customers grow or shrink many of their servers at the push of a button, so growing a server to meet demand can be done quite quickly. The downside is that this only works for so long before a server cannot grow any bigger. In the meantime, you only create a bigger single point of failure in your system (more on redundancy below). Scaling horizontally, on the other hand, typically takes more time to engineer up front because of the increased complexity, but it provides a longer-term solution to growth.
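To make "add more servers to share the burden" concrete, here is a minimal sketch of round-robin dispatch across a pool of servers. The `Server` and `RoundRobinPool` names are hypothetical and exist only for illustration; in practice this job is usually done by a load balancer rather than by hand-written code.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator, List


@dataclass
class Server:
    name: str             # hypothetical label for a server in the pool
    healthy: bool = True  # would be set by a health check in a real system


class RoundRobinPool:
    """Hand out servers in rotation, skipping any marked unhealthy."""

    def __init__(self, servers: List[Server]) -> None:
        if not servers:
            raise ValueError("pool needs at least one server")
        self._servers = servers
        self._rotation: Iterator[Server] = cycle(servers)

    def next_server(self) -> Server:
        # Look at each server at most once per call, so a pool with no
        # healthy servers raises instead of looping forever.
        for _ in range(len(self._servers)):
            candidate = next(self._rotation)
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")
```

In this sketch, scaling horizontally means building the pool with a longer list of servers: each one then receives a smaller share of the requests, and an unhealthy server is simply skipped.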
When looking to build growth capacity into your cloud architecture, we also recommend you look beyond server-based solutions to what you might call managed solutions. Cloud service companies offer various products that provide some of the most common functions that were historically built into servers. For instance, instead of installing and maintaining an MQTT broker on your servers for your connected product fleet, we’d recommend you consider a managed service like AWS IoT Core or Azure IoT Hub. These offerings scale with your fleet because the provider, Amazon or Microsoft in this case, offers scalability as a key feature – no need to manage servers or worry about horizontal vs. vertical scaling.
An important part of supporting a large user base is redundancy and the elimination of single points of failure. All computer hardware fails, and servers are no exception. As you use more resources, the probability that you’ll experience a failure somewhere in your infrastructure only goes up. This means that as your company grows, it becomes more important to eliminate single points of failure. For instance, you should have more than one web server to handle incoming traffic. This redundancy helps your infrastructure meet demand when you receive an unexpected burst in traffic, and it also provides a way forward when one server’s disk unexpectedly fills because of a runaway log file.
The rule of thumb is clear: no single item in your architecture should be able to bring the whole system down when it fails.
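A little arithmetic shows why both halves of this argument hold. The sketch below assumes that every component fails independently and with the same probability, which is a simplification real systems rarely satisfy; the function names and the 1% figure used afterwards are hypothetical.

```python
def prob_any_failure(p_fail: float, n: int) -> float:
    """Chance that at least one of n components fails.

    Assumes independent failures with identical probability p_fail.
    """
    return 1.0 - (1.0 - p_fail) ** n


def prob_all_fail(p_fail: float, n: int) -> float:
    """Chance that all n redundant copies fail at the same time.

    Same independence assumption as above.
    """
    return p_fail ** n
```

With a hypothetical 1% chance of failure per component, ten components give roughly a 10% chance that something fails, while two redundant web servers have only about a 0.01% chance of both being down together. More parts mean more failures, and redundancy is what keeps those failures from becoming outages.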
This goes hand in hand with the previous topic: backing up your data is critical to business continuity in case disaster strikes. As you grow, it becomes even more important that you can recover when things go south. If your database gets corrupted or a hacker finds a vulnerability that causes data loss, you need a way forward.
In this example, your databases should have daily snapshots that can be restored. More importantly, this should all be automated – manual backups typically mean you’re not doing them often enough. Think of it this way: if you do weekly backups, are you prepared for a week’s worth of data to disappear? And it’s not only the data that should be backed up, but also encryption keys, SSH keys, code bases, server configurations, etc.
It is prudent to occasionally run a recovery drill and try to restore your systems from a backup. It’s not uncommon for automated backups to stop working without warning, and you’ll want to know when that happens so you don’t find out after it is too late.
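One simple guard against backups that quietly stop is an automated freshness check. The sketch below is only an illustration: `backup_is_stale` is a hypothetical helper, and how you find the time of the last successful backup depends entirely on your backup tooling.

```python
from datetime import datetime, timedelta
from typing import Optional


def backup_is_stale(last_success: Optional[datetime],
                    now: datetime,
                    max_age: timedelta) -> bool:
    """Return True when the newest good backup is missing or too old."""
    if last_success is None:
        # No successful backup on record at all.
        return True
    return now - last_success > max_age
```

For daily snapshots, a `max_age` a little over one day (for example, `timedelta(hours=26)`) leaves room for a slow run while still raising a flag within a day or so of a missed backup. The same number is a reminder that the worst-case data loss is roughly the interval between backups.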
System security seems to be one of those things that people think just happens if you’re doing development right. All too often, unfortunately, it is never a real priority until after the first security incident.
It is worth mentioning here that as you grow, security should be a priority before someone else (indirectly and maliciously) makes it a priority for you. A bigger infrastructure means a bigger attack surface: more possible entry points for hackers to find. And as the system grows beyond what any one engineer can wrap their head around, it becomes even more important that you have a plan and a process to ensure your infrastructure and the data stored there are safe. Tools like threat modeling, red teaming and vulnerability scanning should be in your team’s arsenal.
When a service offering is small and simple, it is fairly easy to know when something goes wrong. The same might be said of finding the cause of an issue and fixing it. But as the system gets large and complex and becomes a combination of many different systems, it is important to have it well instrumented, as an engineer might say.
If a developer or system administrator is to diagnose issues quickly and accurately, they need the tools and information to do so.
There are a lot of tools out there to help with this, but the main point is that performance and traffic statistics from each sub-component should be captured and aggregated for continual analysis and team alerting. As a product grows, the number of customers adversely affected by performance issues grows with it. It becomes critical that supporting teams know of issues in a timely manner – and even, if possible, before they occur.
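As a rough illustration of capturing a statistic and alerting on it, the sketch below keeps a rolling window of response times and reports when the window average crosses a threshold. The `LatencyMonitor` class, the window size and the threshold are all hypothetical; an off-the-shelf monitoring stack would normally do this work.

```python
from collections import deque
from typing import Deque


class LatencyMonitor:
    """Track recent response times and flag a slow rolling average."""

    def __init__(self, window: int, threshold_ms: float) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        # deque(maxlen=...) silently drops the oldest sample when full.
        self._samples: Deque[float] = deque(maxlen=window)
        self._threshold_ms = threshold_ms

    def average(self) -> float:
        """Mean of the samples currently in the window (0.0 if empty)."""
        if not self._samples:
            return 0.0
        return sum(self._samples) / len(self._samples)

    def record(self, latency_ms: float) -> bool:
        """Store a sample; return True when the average is over threshold."""
        self._samples.append(latency_ms)
        return self.average() > self._threshold_ms
```

Alerting on a rolling average rather than on single samples is one way to catch the "subtle hints", such as gradually slowing responses, before they turn into an outage, without paging the team for every isolated slow request.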
Cloud-based infrastructures cannot grow on their own; they need a team behind them. And business growth comes not simply from maintenance but from continual development, which also requires a team.
So, how do you grow a team?
Well, it’s important that you fill the right roles at the right time. As teams grow, it will become more apparent when certain roles can no longer be shared but must be broken out as independent jobs (e.g., developer vs. DevOps, or front-end vs. back-end developers). Hand in hand with this, your development process must also grow if you want to continue to deliver quality products. Automated testing, distributed version control and development workflows, agile methodologies for task management – these are all things that teams must learn and get good at as they grow.
That said, you don’t need to do this alone. The DornerWorks product development team is here to help. Growth can be both an exciting and anxiety-provoking journey. Know that we are here, and it is our job to help you navigate the scaling of your infrastructure and development team.
If you’re looking for a guide to help you get ready for growth or keep up with success, let us put our experience to work for you.
Contact us today and schedule a free meeting.