ZFS has served me well at home and at work for the past ~20 years, but I’m starting to hit scaling limits with just single nodes. 2024 will likely see my office deploy our first 4-5 node Ceph Cluster and I’d like to prepare for that day in my homelab. To that end, I’ve assembled a hardware template for a 4-node system below and would welcome any thoughts or recommendations on design (or anything else).

Ideally, if we come up with a useful cost-effective hardware template, others in the community will be able to use it to build out their own clusters.

Priorities

  1. Minimize upfront hardware costs
  2. Minimize recurring operational costs (i.e., power consumption)
  3. Performance (my only use case: streaming 100-200GB files to 1 client at 80-120 MB/sec. Largely WORM workload. Goal is to saturate a single 1GbE link.)
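As rough arithmetic for priority 3, here is a minimal sketch of what the 80-120 MB/sec target means on a 1GbE link. It assumes decimal units (1 GB = 1000 MB) and ignores Ethernet/TCP overhead, so real payload throughput will sit somewhat below the 125 MB/sec line rate. The function names are illustrative only.

```python
# Back-of-the-envelope numbers for the streaming target.
# Assumption: decimal units (1 GB = 1000 MB), no protocol overhead.
LINK_GBITS = 1                              # 1GbE link
LINE_RATE_MB_S = LINK_GBITS * 1000 / 8      # raw line rate in MB/s


def stream_minutes(file_gb: float, rate_mb_s: float) -> float:
    """Minutes needed to stream a file of file_gb gigabytes at rate_mb_s."""
    return file_gb * 1000 / rate_mb_s / 60


def link_utilization(rate_mb_s: float) -> float:
    """Fraction of the raw line rate consumed by a given payload rate."""
    return rate_mb_s / LINE_RATE_MB_S


if __name__ == "__main__":
    for file_gb in (100, 200):
        for rate in (80, 120):
            print(f"{file_gb} GB at {rate} MB/s: "
                  f"{stream_minutes(file_gb, rate):.1f} min, "
                  f"{link_utilization(rate):.0%} of line rate")
```

The takeaway is that 120 MB/sec leaves very little headroom on a single 1GbE link, while 80 MB/sec uses roughly two thirds of it.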

Current Hardware

  • Data HDDs: 24x 8TB SATA, Seagate BarraCuda (ST8000DM008)
  • Data HDDs: 24x 12TB SATA, Western Digital Red Pro NAS (WD121KFBX)
  • Data HDDs: 24x 16TB SATA, Western Digital Gold Enterprise (WD161KRYZ)
  • Data HDDs: 24x 20TB SATA, Seagate Exos X20 (ST20000NM007D)
  • Chassis: 4x 24-bay 4U Hotswap NAS Case w/ 6x 6Gbps SFF-8087 backplanes, Innovision (S46524)
  • Motherboard: 4x Supermicro X11SSL-F
  • CPU: 4x Intel Xeon E3-1230 v6 (4c/8t) @ 3.50GHz
  • RAM: 4x 64GiB (4x16 kit) Supermicro DDR4 2400 VLP ECC UDIMM
  • HBA: 4x LSI 9201-16i 6Gbps 16-lane + 4x AOC-USAS2-L8i 6Gbps 8-Lane
  • OS SSDs: 8x 250GB SATA, Samsung 870 EVO (MZ-77E250B) (2x mirror per chassis)

Current Plan

Make 4x OSD nodes using the above HDDs, chassis, motherboards, CPUs, RAM, HBAs, and SSDs. Distribute the HDDs such that 6x of each drive goes into each 24-bay chassis like so: 6x 8TB, 6x 12TB, 6x 16TB, and 6x 20TB. This would ensure that each node is equally sized.
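The drive distribution above can be checked with a few lines of arithmetic. This is a minimal sketch using the vendors' decimal terabytes (1 TB = 10**12 bytes). The TiB conversion is included because Ceph tools generally report binary units. All names are illustrative.

```python
# Raw capacity per node and cluster-wide for the proposed drive layout.
# Sizes are decimal terabytes as printed on the drive labels.
DRIVE_SIZES_TB = (8, 12, 16, 20)
DRIVES_PER_SIZE_PER_NODE = 6
NODES = 4


def osds_per_node() -> int:
    """One OSD per HDD: 6 drives of each of the 4 sizes."""
    return DRIVES_PER_SIZE_PER_NODE * len(DRIVE_SIZES_TB)


def node_raw_tb() -> int:
    """Raw capacity of a single node in decimal TB."""
    return DRIVES_PER_SIZE_PER_NODE * sum(DRIVE_SIZES_TB)


def cluster_raw_tb() -> int:
    """Raw capacity of the whole cluster in decimal TB."""
    return NODES * node_raw_tb()


def tb_to_tib(tb: float) -> float:
    """Convert decimal terabytes to binary tebibytes."""
    return tb * 10**12 / 2**40


if __name__ == "__main__":
    print(f"OSDs per node: {osds_per_node()}")
    print(f"Raw per node:  {node_raw_tb()} TB")
    print(f"Raw cluster:   {cluster_raw_tb()} TB "
          f"(~{tb_to_tib(cluster_raw_tb()):.0f} TiB)")
```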

Buy 4x 10GbE PCIe cards, 1 for each node + switch + cables.

I have not yet spec’d out the Monitor and Manager Nodes and was considering running them on the same hardware as the OSD nodes. Thoughts on this are welcome.

Questions

  1. Is the above hardware capable of fully saturating a 1GbE link to 1 client? Use case: streaming a single 100-200GB file to a speciality piece of lab equipment at a rock-steady rate above 80MB/sec, without any buffering. This client has some truly awful firmware and can crash if its buffers run empty, so I’m trying to design for a constant 80+ MB/sec. Read behavior is largely linear: each file sees at most 10-15 jumps, with 10-20GB sections streamed sequentially between them.

  2. Are these CPUs (4-cores/8-threads at 3.5 GHz) enough to handle 24 OSD daemons per chassis?

  3. Is 64GiB per chassis enough for 24 OSD daemons per chassis?

  4. I’d like this design to be able to withstand the failure of any 12 drives anywhere in the cluster. It’s not clear to me how I’d specify that from a CRUSH failure domain perspective. Guidance here welcome.
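For questions 2 and 3, a quick division shows the per-OSD budget. The sketch below assumes Ceph's default `osd_memory_target` of 4 GiB per OSD is left unchanged and, unless told otherwise, reserves nothing for the OS or for co-located Monitor/Manager daemons. Both are assumptions, and the names are illustrative.

```python
# Per-OSD RAM and CPU budget for one 24-bay node.
OSDS_PER_NODE = 24
NODE_RAM_GIB = 64
CPU_THREADS = 8                 # Xeon E3-1230 v6: 4 cores / 8 threads
OSD_MEMORY_TARGET_GIB = 4       # Ceph default osd_memory_target (assumed unchanged)


def ram_per_osd_gib(node_ram_gib: float, osds: int = OSDS_PER_NODE) -> float:
    """RAM available to each OSD if the node's memory is split evenly."""
    return node_ram_gib / osds


def ram_needed_gib(osds: int = OSDS_PER_NODE,
                   target_gib: float = OSD_MEMORY_TARGET_GIB,
                   os_reserve_gib: float = 0) -> float:
    """RAM needed to give every OSD its full memory target plus a reserve."""
    return osds * target_gib + os_reserve_gib


def threads_per_osd(threads: int = CPU_THREADS, osds: int = OSDS_PER_NODE) -> float:
    """Hardware threads available per OSD daemon."""
    return threads / osds


if __name__ == "__main__":
    print(f"RAM per OSD at 64 GiB:   {ram_per_osd_gib(NODE_RAM_GIB):.2f} GiB")
    print(f"RAM for default targets: {ram_needed_gib():.0f} GiB")
    print(f"Threads per OSD:         {threads_per_osd():.2f}")
```

Under these assumptions, 64GiB gives each OSD less than the default target, so either the target would need lowering or the node would need more RAM.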

  • ZombieLinux@alien.top
    10 months ago

    This isn’t a bad plan. 64g of ram might be on the low side depending on what else is running.

    You’ll 100% saturate that 1g link. Might even saturate a 10g link. I’d recommend going with 25 or 40g links for your Ceph cluster. Some Mellanox switches can be found relatively cheap.

    Also, it should be an odd number to avoid split brain.

    • zacharyfreeman70@alien.top (OP)
      10 months ago

      Regarding 64GiB RAM being too low, I suspect you might be right. Perhaps going up to 96GiB or 128GiB would be better? The total raw storage in the cluster will be 1,344TB, but with Erasure Coding/redundancy, usable storage should fall just below the petabyte level. I’m keen to minimize buying any new hardware but if 64GiB per node just isn’t feasible, then I’m fine with spending to get to the bare minimum goal of saturating that 1GbE link.

      Speaking of which, regarding saturating the 1GbE link, that’s good to know. So long as I can do that, that’s all that really matters performance-wise.

      Regarding “it should be an odd number to avoid split brain”, does the “it” in your statement mean the number of OSD nodes, the number of Manager nodes or the number of Monitor nodes?
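The usable-capacity estimate in the comment above depends on the erasure-coding profile, which the thread does not specify. As a rough sketch: an EC pool with k data chunks and m coding chunks stores k/(k+m) of raw capacity and survives the loss of up to m chunks per placement group. The profiles below are hypothetical, chosen only to illustrate the trade-off. Using m = 12 assumes the CRUSH failure domain is `osd` and that all 12 failed drives could hold chunks of the same placement group.

```python
# Usable capacity for hypothetical erasure-coding profiles (k data + m coding chunks).
RAW_TB = 1344          # 24 drives each of 8, 12, 16 and 20 TB
TOTAL_OSDS = 96        # 4 nodes x 24 drives


def usable_tb(raw_tb: float, k: int, m: int) -> float:
    """Usable capacity of an EC pool, ignoring metadata and free-space headroom."""
    return raw_tb * k / (k + m)


def fits_cluster(k: int, m: int, osds: int = TOTAL_OSDS) -> bool:
    """With failure domain 'osd', each of the k+m chunks needs its own OSD."""
    return k + m <= osds


if __name__ == "__main__":
    # Hypothetical profiles, all tolerating 12 lost chunks per placement group.
    for k in (12, 24, 36):
        m = 12
        print(f"k={k}, m={m}: {usable_tb(RAW_TB, k, m):.0f} TB usable, "
              f"fits on {TOTAL_OSDS} OSDs: {fits_cluster(k, m)}")
```

Note that a failure domain of `host` would cap k+m at the number of hosts (4 here), so tolerating 12 arbitrary drive failures and tolerating a whole-node failure are different design targets.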