ZFS has served me well at home and at work for the past ~20 years, but I’m starting to hit scaling limits with single nodes. 2024 will likely see my office deploy our first 4-5 node Ceph cluster, and I’d like to prepare for that day in my homelab. To that end, I’ve assembled a hardware template for a 4-node system below and would welcome any thoughts or recommendations on the design (or anything else).

Ideally, if we come up with a useful cost-effective hardware template, others in the community will be able to use it to build out their own clusters.

Priorities

  1. Minimize upfront hardware costs
  2. Minimize recurring operational costs (i.e., power consumption)
  3. Performance (my only use case: streaming 100-200GB files to 1 client at 80-120 MB/sec. Largely a WORM workload. The goal is to saturate a single 1GbE link.)

Current Hardware

  • Data HDDs: 24x 8TB SATA, Seagate BarraCuda (ST8000DM008)
  • Data HDDs: 24x 12TB SATA, Western Digital Red Pro NAS (WD121KFBX)
  • Data HDDs: 24x 16TB SATA, Western Digital Gold Enterprise (WD161KRYZ)
  • Data HDDs: 24x 20TB SATA, Seagate Exos X20 (ST20000NM007D)
  • Chassis: 4x 24-bay 4U hot-swap NAS case w/ 6x 6Gbps SFF-8087 backplanes, Innovision (S46524)
  • Motherboard: 4x Supermicro X11SSL-F
  • CPU: 4x Intel Xeon E3-1230 v6 (4c/8t) @ 3.50GHz
  • RAM: 4x 64GiB (4x16GiB kit) Supermicro DDR4-2400 VLP ECC UDIMM
  • HBA: 4x LSI 9201-16i 6Gbps 16-port + 4x AOC-USAS2-L8i 6Gbps 8-port
  • OS SSDs: 8x 250GB SATA, Samsung 870 EVO (MZ-77E250B) (2x mirror per chassis)

Current Plan

Make 4x OSD nodes using the above HDDs, chassis, motherboards, CPUs, RAM, HBAs, and SSDs. Distribute the HDDs so that 6 of each drive model go into each 24-bay chassis: 6x 8TB, 6x 12TB, 6x 16TB, and 6x 20TB. This ensures each node is equally sized at 336TB raw.
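
For the OSD layer itself I’m assuming a cephadm-style deployment, where each chassis is added as a host and the orchestrator turns every unused HDD into an OSD. A minimal sketch (hostnames are placeholders):

    # Add each 24-bay chassis to the cluster (placeholder hostnames).
    ceph orch host add osd-node1
    ceph orch host add osd-node2
    ceph orch host add osd-node3
    ceph orch host add osd-node4
    # Create one OSD on every unused disk across all hosts.
    ceph orch apply osd --all-available-devices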

Buy 4x 10GbE PCIe NICs, one for each node, plus a 10GbE switch and cables.

I have not yet spec’d out the Monitor and Manager Nodes and was considering running them on the same hardware as the OSD nodes. Thoughts on this are welcome.
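
If I do co-locate them, my understanding is that with cephadm it comes down to a placement spec pinning those daemons to some of the OSD nodes, roughly like the sketch below (hostnames are placeholders):

    # Run 3 monitors and 2 managers on the OSD nodes (placeholder hostnames).
    ceph orch apply mon --placement="osd-node1 osd-node2 osd-node3"
    ceph orch apply mgr --placement="osd-node1 osd-node2"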

Questions

  1. Is the above hardware capable of fully saturating a 1GbE link to 1 client? Use case: streaming a single 100-200GB file to a speciality piece of lab equipment at a rock-steady rate above 80MB/sec, without the stream ever stalling. This client has some truly awful firmware and can crash if its buffers run empty, so I’m trying to design for a constant 80+ MB/sec. Read behavior is largely sequential: at most 10-15 seeks per file, as 10-20GB sections are streamed in order. (A rough way to benchmark this is sketched after this list.)

  2. Are these CPUs (4-cores/8-threads at 3.5 GHz) enough to handle 24 OSD daemons per chassis?

  3. Is 64GiB per chassis enough for 24 OSD daemons per chassis? (See the memory-target sketch after this list.)

  4. I’d like this design to be able to withstand the failure of any 12 drives anywhere in the cluster. It’s not clear to me how I’d specify that from a CRUSH failure-domain perspective. Guidance here is welcome.
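
For question 1, my plan for a first sanity check is a plain RADOS benchmark against a scratch pool before the lab client ever touches the cluster; the pool name is a placeholder:

    # Write benchmark objects for 60s and keep them, then read them back sequentially.
    rados bench -p testpool 60 write --no-cleanup
    rados bench -p testpool 60 seq
    # Remove the benchmark objects afterwards.
    rados -p testpool cleanup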
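
For question 3, my understanding is that the relevant knob is osd_memory_target, which defaults to 4GiB per OSD; 24 OSDs at the default would want roughly 96GiB, so staying at 64GiB per node would mean dialing it down to about 2GiB per OSD and accepting smaller BlueStore caches:

    # Cap each OSD's target memory at ~2GiB (the default is 4GiB).
    ceph config set osd osd_memory_target 2147483648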
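
For question 4, the only lever I’m aware of is the erasure-code profile and its crush-failure-domain. With 4 hosts, something like k=2, m=2 with a host failure domain survives the loss of any two whole hosts, but an arbitrary 12 drives failing in just the wrong pattern can still take out a PG; guaranteeing survival of any 12 drives would, as far as I can tell, need something impractical like m>=12 with an osd failure domain. So treat the sketch below purely as a starting point; the profile and pool names are placeholders:

    # Hypothetical EC profile: 2 data + 2 coding chunks, at most one chunk per host.
    ceph osd erasure-code-profile set ec22 k=2 m=2 crush-failure-domain=host
    # Erasure-coded pool using that profile (pool name and PG counts are placeholders).
    ceph osd pool create ecdata 256 256 erasure ec22

Usable capacity under erasure coding is k/(k+m) of raw, so k=2, m=2 keeps only half of the raw space; the profile choice is what sets the final usable figure.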

  • ZombieLinux@alien.top · 10 months ago

    This isn’t a bad plan. 64GB of RAM might be on the low side depending on what else is running.

    You’ll 100% saturate that 1GbE link. You might even saturate a 10GbE link. I’d recommend going with 25GbE or 40GbE links for your Ceph cluster; some Mellanox switches can be found relatively cheap.

    Also, it should be an odd number to avoid split brain.

    • zacharyfreeman70@alien.top (OP) · 10 months ago

      Regarding 64GiB RAM being too low, I suspect you might be right. Perhaps going up to 96GiB or 128GiB would be better? The total raw storage in the cluster will be 1,344TB, but with erasure coding/redundancy, usable storage should fall just below the petabyte level. I’m keen to minimize buying any new hardware, but if 64GiB per node just isn’t feasible, then I’m fine with spending to get to the bare-minimum goal of saturating that 1GbE link.

      Speaking of which, regarding saturating the 1GbE link, that’s good to know. So long as I can do that, that’s all that really matters performance-wise.

      Regarding “it should be an odd number to avoid split brain”, does the “it” in your statement mean the number of OSD nodes, the number of Manager nodes or the number of Monitor nodes?

  • Sporkers@alien.top · 10 months ago

    CPU seems low from what I have read in other places, and if writes matter, a few used enterprise SSDs per chassis as DB/WAL devices for the HDDs would be nice.

    • zacharyfreeman70@alien.top (OP) · 10 months ago

      In your experience, what have you found to be the bare minimum? A 4c/8t CPU at 3.5GHz does indeed sound a bit undersized for 24 HDD-based OSDs, so I’d be curious to read what others are running.

      • Sporkers@alien.top · 10 months ago

        I don’t really have a lot of experience with this; I just read a ton and built a modest 5-node homelab cluster, and 5 nodes seemed to be the minimum count you want to be at. The recommendations for Ceph are now pretty vague: the documentation has changed in recent years to talk about IOPS per core, but without hard numbers. So it depends on how much performance you really expect out of it; the higher the expectation, the more cores you give it. NVMe OSDs definitely scale with more cores: going from 2 to 4 cores shows 100% IOPS scaling in the Ceph docs, and they keep scaling decently past that when isolating a single OSD for performance testing with enterprise NVMe drives.

        But you are using HDDs, in a homelab, and on a budget. I think your 4 cores would be the extreme low-budget, not-expecting-performance option for that many OSDs; 8 cores would be the more regular budget minimum. I’d do 12-16 cores if I had heavier use/performance goals, and more than 64GB RAM per node, especially if the monitors are co-located. The next level up would be to add maybe 4-8 used enterprise-class NVMe drives per node, spread the DB/WAL for the OSDs across those NVMe drives, and add more cores to handle them (roughly what that looks like is sketched below).
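
        To be concrete, something like the following is roughly how ceph-volume lets you put the BlueStore DB/WAL for a batch of HDD OSDs on a shared NVMe; the device names are just placeholders, not a recommendation for your exact layout:

          # Create HDD-backed OSDs with their BlueStore DB/WAL carved out of a shared NVMe.
          # /dev/sda..sdf and /dev/nvme0n1 are placeholder device names.
          ceph-volume lvm batch /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
              --db-devices /dev/nvme0n1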