Why do companies keep losing money when their systems crash?


Opinions expressed by Digital Journal contributors are their own.

System outages are costing American businesses more money than ever before. Every minute of unplanned downtime now costs companies an average of $14,000, transforming what were once minor technical problems into major financial disasters.

This happens far more often than anyone wants to discuss. Big companies spend tons of money on computer systems and hire lots of smart people to run them. But they still get surprised when things break. The costs keep going up, too. Most companies lose between $100,000 and $500,000 every hour their systems are down. Banks can’t process payments. Online stores lose shoppers. Hospitals can’t access patient records.

Most companies handle these incidents the same way. Everyone panics when something fails, and the IT crew spends the entire night fixing it. Over time, this reactive approach only makes matters worse. People get burned out from working crazy hours. Customers get mad about bad service. And nobody ever figures out why things keep breaking.

The real problem is that computer systems today are complicated. Apps run on different servers all over the place. They talk to each other constantly. They handle thousands of requests every second. When something goes wrong, finding the cause is like looking for a lost contact lens in a swimming pool. Most monitoring tools only show you basic stuff, so teams end up guessing what broke.

Sagar Kesarpu ran into this exact problem while working at a big financial company. Through the company's Engineering Dojo program, he watched teams put out fires all day instead of preventing them. The monitoring setup they had only showed pieces of what was happening. When systems crashed, it took hours to figure out what went wrong, while customers couldn't use the service.

Building better ways to see what's happening

Kesarpu decided to start over. Rather than patching the outdated monitoring system, he built a new one that could see everything going on across the company's technology stack. He monitored server performance and user satisfaction using tools like Datadog, Instana, Splunk, Prometheus, and Grafana.
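To make that concrete, here is a minimal sketch of what instrumenting an application for Prometheus (with the results charted in Grafana) can look like, using the prometheus_client Python library. The metric and endpoint names are illustrative, not taken from the company's actual setup.

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; Prometheus scrapes these, Grafana charts them.
REQUESTS = Counter("app_requests_total", "Requests handled", ["endpoint", "status"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds", ["endpoint"])

def handle_checkout(request):
    # Time the request and count whether it succeeded or failed.
    with LATENCY.labels(endpoint="/checkout").time():
        try:
            ...  # the real work happens here
            REQUESTS.labels(endpoint="/checkout", status="ok").inc()
        except Exception:
            REQUESTS.labels(endpoint="/checkout", status="error").inc()
            raise

start_http_server(8000)  # exposes http://localhost:8000/metrics for Prometheus to scrape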

But just installing these tools wasn’t enough. The hard part was making dashboards that helped people understand what was going on. He set up the ELK Stack (that’s Elasticsearch, Logstash, and Kibana) to look at application logs as they happened. Log4j captured detailed information that helped teams trace problems back to where they started.
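For readers curious what "looking at application logs as they happened" can mean in practice, here is a minimal sketch that pulls the last fifteen minutes of error logs out of Elasticsearch, assuming the 8.x Python client; the index pattern and field names are hypothetical, not the company's.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Fetch recent ERROR-level log entries, newest first.
resp = es.search(
    index="app-logs-*",
    query={
        "bool": {
            "must": [
                {"match": {"level": "ERROR"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    },
    size=20,
    sort=[{"@timestamp": {"order": "desc"}}],
)

for hit in resp["hits"]["hits"]:
    src = hit["_source"]
    print(src["@timestamp"], src.get("service"), src.get("message"))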

The big change was moving past simple measurements like how much memory servers were using. He created Service Level Indicators and Service Level Objectives that showed how technical problems affected real business results. Instead of watching random numbers go up and down, teams could see how system performance hurt or helped customer experience and company profits.
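A simple, hypothetical example shows how an SLI and SLO connect technical numbers to customer impact; the figures below are made up for illustration.

# Minimal sketch: turning raw request counts into an SLI, an SLO check,
# and the remaining error budget. All numbers are illustrative.

total_requests = 1_200_000   # requests served this month (hypothetical)
failed_requests = 480        # requests that errored or timed out

# SLI: the fraction of requests that succeeded
sli = 1 - (failed_requests / total_requests)

# SLO: the target the team commits to, e.g. 99.9% of requests succeed
slo = 0.999

# Error budget: how much failure the SLO still allows this period
allowed_failures = total_requests * (1 - slo)   # 1,200 requests
budget_remaining = allowed_failures - failed_requests

print(f"SLI: {sli:.4%}")                        # 99.9600%
print(f"SLO met: {sli >= slo}")                 # True
print(f"Error budget remaining: {budget_remaining:.0f} requests")  # 720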

“We quit collecting information just because we could,” Sagar Kesarpu says. “Every piece of data we tracked had to help us answer real questions about how customers were affected.”

Spotting trouble before customers complain

The old way of doing things waited for customers to call and complain before anyone knew there was a problem. He turned this around by building systems that could see trouble coming before it hit users. This meant creating smart alerts that could tell the difference between normal ups and downs and real problems that needed fixing.

The platform used machine learning to figure out what normal looked like for each system. This cut way down on false alarms while making sure real problems got attention from the right people right away. Teams could fix things faster when they had good information about what was wrong.
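The article does not detail the platform's model, but the idea of learning a baseline and alerting only on real deviations can be sketched with a simple rolling statistical threshold; this is an illustration of the concept, not the actual machine-learning system.

import statistics
from collections import deque

WINDOW = 60          # number of recent samples that define "normal"
THRESHOLD_SIGMA = 4  # how far from the baseline counts as anomalous

history = deque(maxlen=WINDOW)

def check(latency_ms: float) -> bool:
    """Return True if this reading should page someone."""
    if len(history) >= WINDOW:
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history) or 1.0   # avoid divide-by-zero
        anomalous = abs(latency_ms - mean) > THRESHOLD_SIGMA * stdev
    else:
        anomalous = False   # still learning the baseline
    history.append(latency_ms)
    return anomalous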

He also started using chaos engineering with a tool called Gremlin. This meant breaking things on purpose in safe ways to see how systems would handle real problems. It sounds nuts, but this approach found weak spots before they caused real outages. Teams learned how their systems acted under pressure and figured out better ways to handle failures.
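Gremlin's own tooling handles the fault injection, but the underlying idea can be sketched in a few lines: push a controlled failure into a small slice of traffic and watch how the system copes. This is a generic illustration, not Gremlin's API.

import random
import time

FAILURE_RATE = 0.05      # inject faults into 5% of calls
EXTRA_LATENCY_S = 2.0    # simulated slow dependency

def with_chaos(call_downstream):
    """Wrap a downstream call so a small fraction of requests misbehave."""
    def wrapped(*args, **kwargs):
        if random.random() < FAILURE_RATE:
            time.sleep(EXTRA_LATENCY_S)           # simulate a slow dependency
            raise TimeoutError("injected fault")  # or an outright failure
        return call_downstream(*args, **kwargs)
    return wrapped

# Usage: fetch_prices = with_chaos(fetch_prices), then observe whether
# retries, timeouts, and fallbacks behave the way the team expects.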

ServiceNow helped organize incident tracking and meetings after problems got fixed. Instead of pointing fingers, these meetings became chances for teams to learn why problems happened and how to stop similar ones.

“We changed how people thought about system problems,” Sagar Kesarpu explains. “Teams started looking for issues so they could fix them during regular work hours instead of getting calls in the middle of the night.”

Real results that made a difference

The changes showed up in ways you could actually measure. Teams fixed problems faster because they had better information about what was broken. Systems stayed up more because problems got caught early. Customers complained less because services worked more reliably.

The monitoring setup also helped with planning. Teams could see how people used the systems instead of guessing about what they needed. This prevented both slow performance from not having enough capacity and wasted money from buying too much equipment.

The Engineering Dojo approach helped spread knowledge across different teams. This meant critical systems didn't depend on just one or two people who knew everything, which reduced the risk of getting stuck when key people quit or went on vacation.

Besides avoiding the cost of outages, being proactive saved money in other ways. Teams worked less overtime fixing emergency problems. Computer resources were used more efficiently. New software deployments worked better because teams understood how changes would affect performance.

The money saved was significant. Instead of losing hundreds of thousands of dollars every time systems went down, the company prevented most problems before they affected customers. Teams could spend time building new features instead of constantly fixing broken stuff.

Businesses that prioritize system stability gain real competitive advantages. Freed from worrying about technical constraints, they can launch new features faster, improve customer service, and adapt to business changes. These capabilities are no longer nice-to-haves but necessities as more business moves online and consumers expect everything to work flawlessly.


