
GitLab.com: Spikes are Outages. OSD = Ceph Object Storage Daemon
After all, when you yoke a bunch of water buffalo together, your team is only as fast as the slowest buffalo.
But I find it fascinating and convenient that they’re doing all that distributed file system testing for me. Thanks, guys!
On the plus side, supporting a distributed file system is almost possible on homogeneous hardware …
Here’s some free consulting from somebody who works on x,000 to xx,000-server data centers:
- buy hardware compatible with Ceph
- use 10 Gbps switch ports
- use cluster-dedicated switches
- hire somebody already doing it now
- don’t goof up your health-checks. Include all healthy servers, not just the healthiest one
- or instead of using Ceph or Gluster, do it right. Implement Backblaze’s object store design. Invert the problem from being “the network and OSDs have to always work” to something tractable like “my HTTP API has to work most of the time”. And use a combination of Arista Clos network design and HAProxy as the mesh router to avoid network hotspots and SPOFs. Non-blocking and “Propah!” with multi-terabits per second sustained throughput! Now we’re talking!
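
To make the health-check point concrete, here is a minimal Python sketch of the difference between routing to only the healthiest server and routing to every healthy one. Everything in it is hypothetical and for illustration only: the `Backend` type, the server names, and the 200 ms latency cutoff are my assumptions, not anything from GitLab's setup or from a real load balancer's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    """Hypothetical view of one server as seen by a health checker."""
    name: str
    latency_ms: float   # last measured health-check latency
    passing: bool       # did the last health check succeed?


# Assumed cutoff: anything slower than this counts as unhealthy.
MAX_LATENCY_MS = 200.0


def is_healthy(backend: Backend) -> bool:
    """A server is healthy if its check passes and it answers fast enough."""
    return backend.passing and backend.latency_ms <= MAX_LATENCY_MS


def healthiest_only(backends: list[Backend]) -> list[Backend]:
    """The goof: send all traffic to the single best server."""
    healthy = [b for b in backends if is_healthy(b)]
    if not healthy:
        return []
    return [min(healthy, key=lambda b: b.latency_ms)]


def all_healthy(backends: list[Backend]) -> list[Backend]:
    """The fix: keep every server that clears the health bar in rotation."""
    return [b for b in backends if is_healthy(b)]


pool = [
    Backend("node-a", 12.0, True),
    Backend("node-b", 15.0, True),
    Backend("node-c", 900.0, True),   # passing, but far too slow
    Backend("node-d", 10.0, False),   # fast, but failing its check
]

print([b.name for b in healthiest_only(pool)])
print([b.name for b in all_healthy(pool)])
```

With the first function, one server takes the whole load and soon stops being the healthiest; with the second, the load spreads across everything that is good enough.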
“There is a threshold of performance on the cloud and if you need more, you will have to pay a lot more, be punished with latencies, or leave the cloud.”
GitLab.com: How We Knew It Was Time to Leave the Cloud
HN discussion (with Cloud Apologists)