My team split our Rails app into eight services last spring. We'd read all the Medium posts, set up our CI/CD pipeline, even done the database migration properly. We felt prepared.

Then production happened.

The first incident taught me something none of the blog posts mentioned: when a request fails, you need to know why across potentially a dozen services. Our old monolith logged everything to one place. Now? Error messages scattered across eight different CloudWatch streams.

We spent three weeks just getting basic observability working. Not the fancy stuff—just "which service threw this 500 error?" Distributed tracing saved us. We wired up OpenTelemetry, connected it to Jaeger, and suddenly those cryptic failures made sense. A request would touch the auth service, hit the API gateway, bounce through the inventory service, fail at the payment processor, and we could see the whole path with actual timing data.
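The mechanism that makes this work is trace context propagation: each service records its own span but forwards the same trace ID downstream, typically in a W3C `traceparent` header, so the backend can stitch the spans into one path. Here's a minimal pure-Ruby sketch of that idea; the service names mirror the example above but are illustrative, and in real code the OpenTelemetry SDK would record and export the spans rather than printing them.

```ruby
require 'securerandom'

# Build a W3C-style traceparent header: version-traceid-spanid-flags.
def make_traceparent(trace_id, span_id)
  "00-#{trace_id}-#{span_id}-01"
end

# Each hop keeps the trace ID, starts its own span, and records which
# span called it. This parent/child chain is what Jaeger renders as
# the request's full path with timing data.
def handle_request(service, traceparent)
  _version, trace_id, parent_span, _flags = traceparent.split('-')
  span_id = SecureRandom.hex(8) # this service's own span
  puts "#{service}: trace=#{trace_id} span=#{span_id} parent=#{parent_span}"
  make_traceparent(trace_id, span_id) # forwarded to the next service
end

# One trace ID is minted at the edge and survives every hop.
trace_id = SecureRandom.hex(16)
ctx = make_traceparent(trace_id, SecureRandom.hex(8))
%w[auth api-gateway inventory payments].each do |svc|
  ctx = handle_request(svc, ctx)
end
```

Because every line shares the same trace ID, grepping (or querying Jaeger) for that one ID pulls up the whole request, no matter which of the eight services logged it.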

The real cost wasn't the architecture refactor. It was the three months learning how to debug distributed systems properly. Our sprint velocity cratered. But now when something breaks at 2 AM, we don't spend four hours grepping logs. We open Jaeger, find the trace ID, and know exactly where it failed within minutes.

If you're considering microservices, budget time for this. The code migration is the easy part.