Deploying software updates to physical devices is not easy. Devices operating in the real world face unique challenges, such as:
- Inability to update during operation.
- Limitations due to low power, limited resources, and bandwidth.
- Unreliable or unstable data connections.
When things go wrong, it’s easy to point fingers. Instead, we can learn from these mistakes. Explore five deployment disasters, identify the common pitfalls, and learn how to avoid them—or catch the 5-minute lightning talk at DevOpsDays Rockies for a quick rundown.
1. Over-the-air for a fast fix
Fisker’s USB struggles
Early versions of the Fisker Ocean shipped with issues that required a field technician to update the firmware over a USB cable located behind the vehicle's kick panel. When popular YouTuber MKBHD opted not to wait for the update and reviewed the car as-is, the result was a viral video calling it "the worst car he had ever reviewed." The impact of delaying updates extended beyond technical fixes and may have damaged the brand's perception.
Lesson learned
Relying on manual updates for critical fixes can severely damage a product’s reputation. Implementing over-the-air updates ensures timely improvements and minimizes disruptions for users.
2. Don’t ignore network and device constraints
Nest’s thermostat failures
A large update pushed to Nest thermostats caused many devices to crash or lose power, leaving users without heat in the middle of winter. The update was not designed with network constraints in mind. In bandwidth-limited environments, large updates can disable devices, especially if a connection drops mid-update.
Lesson learned
Keep updates as small as possible, especially for low-bandwidth or low-power devices, and support resumable updates so that downloads can pick up where they left off without forcing users to start over or manually restart their devices.
3. Ensure comprehensive logging, monitoring, and fallbacks
Facebook’s global outage
Meta suffered a six-hour global outage caused by a failure that impacted the infrastructure backbone connection to all of its data centers worldwide. The outage took down Facebook and all of Meta's other services.
The outage also took down their internal communication tools, preventing engineers from diagnosing the issue promptly. In some cases, engineers were physically locked out of the data centers and blocked from fixing the issue right away.
Lesson learned
Without sufficient logging and monitoring, diagnosing the root cause of such outages becomes much harder, particularly in large, distributed systems. Additionally, if your primary communication channel with hardware fails, it’s crucial to have redundant systems in place.
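A simple building block for out-of-band monitoring is a heartbeat check: each device reports in periodically, and anything that has gone quiet past a timeout gets flagged through a separate alerting channel. A minimal sketch of the staleness check (names and the 300-second default are illustrative):

```python
def stale_devices(last_seen: dict, now: float, timeout_s: float = 300.0) -> list:
    """Return IDs of devices whose last heartbeat is older than timeout_s.

    last_seen maps a device ID to the Unix timestamp of its last check-in.
    """
    return sorted(dev for dev, ts in last_seen.items() if now - ts > timeout_s)
```

The important design point is that the alerting path (email, pager, SMS) must not depend on the same infrastructure being monitored, or an outage silences its own alarm.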
4. Communicate clearly with users
Rivian’s missteps
Rivian suffered a problematic update despite rigorous testing on production vehicles for months. When the update was deployed to Rivian users, an engineer mistakenly selected the wrong security certificate, impacting the first stage of a canary rollout (3% of infotainment systems). As a result, drivers couldn’t view their speed or remaining range.
Rivian’s VP of Software first let users know what was happening through a post on Reddit, but official communication from Rivian didn’t follow until the next day.
Rivian’s decision to communicate through unofficial channels, such as Reddit, led to confusion and frustration among users.
Lesson learned
When things go wrong, clear communication can make a big difference in user perception. Prepare a communication strategy ahead of time as part of your incident response plan, so that in the event of an issue, you can provide timely and transparent updates through official channels to manage expectations and maintain user trust.
5. Automate testing and roll out updates incrementally
CrowdStrike's global outage
CrowdStrike experienced a major outage after an update intended to enhance security inadvertently caused disruptions across over 8 million systems, resulting in an estimated $10 billion in financial damages worldwide. The root cause was traced to an incomplete automated test that failed to account for certain edge cases, like invalid inputs and unhappy paths.
The issue was exacerbated by the fact that the update was rolled out broadly rather than incrementally, affecting a significant number of users at once.
Lesson learned
User testing is chaos testing. Neglecting to test invalid inputs and unhappy paths can lead to major failures in production. Automate regression testing and implement progressive rollouts, such as canary deployments to test updates on a small subset of devices before a system-wide release, or blue-green deployments to quickly restore functionality without waiting for a fix. These strategies, paired with rigorous edge-case testing, can catch issues early and prevent widespread outages.
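The wave-based rollout described above can be sketched as a loop that deploys to a growing fraction of the fleet and halts before the next, larger wave if the observed failure rate crosses a threshold. The wave sizes and the 5% threshold here are illustrative, not a recommendation:

```python
from typing import Callable, Sequence, Tuple

def progressive_rollout(
    devices: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    waves: Sequence[float] = (0.01, 0.10, 1.00),
    max_failure_rate: float = 0.05,
) -> Tuple[str, int]:
    """Deploy in expanding waves, halting if updated devices fail health checks.

    Returns ("complete", n) or ("halted", n), where n is how many devices
    received the update before the rollout finished or stopped.
    """
    done = 0
    for frac in waves:
        target = max(1, int(len(devices) * frac))  # cumulative count after this wave
        for dev in devices[done:target]:
            deploy(dev)
        done = target
        failures = sum(1 for dev in devices[:done] if not healthy(dev))
        if failures / done > max_failure_rate:
            return "halted", done  # stop before exposing the rest of the fleet
    return "complete", done
```

In practice, `healthy` would be backed by real telemetry (crash reports, heartbeats, error rates), and a halt would trigger an automatic rollback of the affected canary devices.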
How Viam facilitates software deployments to physical devices
Many robotics and industrial IoT systems are composed of low-power and low-bandwidth devices, or operate in poor network conditions. Viam is an open-source software platform for the physical world, developed specifically to help smart devices, IoT systems, and robots communicate more reliably.
It uses gRPC for fast, structured client-server communication and WebRTC for direct, peer-to-peer communication between physical devices like a Linux server, Raspberry Pi, or ESP32.
Besides machine communication, Viam’s fleet management and logging capabilities were created to mitigate risks associated with large-scale updates to physical devices in the following ways:
- Over-the-air for real-time fixes: Viam’s fleet management utilities were designed to remotely deploy configuration or code changes to multiple devices across various locations efficiently, ensuring timely updates.
- Account for network and device constraints: Viam accounts for deployments made in poor network conditions with configuration options such as automatic restarts, configurable connection timeouts, and caching of previously deployed versions.
- Comprehensive logging and monitoring: For remote diagnostics, Viam offers detailed logging, the ability to query sensor data, and access to machine diagnostics for real-time insights. Additionally, you can configure a webhook for alerting when a machine is down, for better monitoring and faster issue resolution.
- Progressive rollouts to mitigate risks: Viam enables version control and semantic releases for software deployments. For example, you can pin specific versions of packages using fragments, roll back problematic updates, and automate incremental rollouts to canary locations before rolling out to the rest of the fleet for safer deployments.
While you could build out a similar infrastructure on your own or cobble together a suite of different technologies, Viam provides this capability with off-the-shelf robotics software. The on-machine functionality is open-source and free to operate. If you eventually scale up to manage a fleet, it’s usage-based billing for cloud services and data storage.
Build more resilient software deployment pipelines
No matter the architecture of your machines or the tech stack you're using, deploying updates to physical devices carries many of the same risks as any software deployment, along with unique challenges that arise because these devices operate in real-world environments.
Instead of just focusing on disaster recovery, plan ahead to design more resilient deployment processes. From implementing comprehensive testing to adopting tooling for better observability and monitoring, there’s a lot to be done to prevent these kinds of deployment disasters.
Learn more about how Viam can help you work with devices in the physical world, from fleet management to predictive maintenance. And if you’re using Viam to manage your fleet, let us know about it on the Viam Discord community in the #built-on-viam channel.