Are you self-hosting LLMs (AI models) on your headless servers? I’d like to hear about your hardware setup. What server do you have your GPUs in?

When I do a hardware refresh I’d like to ensure my next server can support GPU(s?) for local LLM inferencing. I figured I could put in either a 4090 or x2 3090’s(?) maybe into an R730. But I’ve only barely started to research this. Maybe it isn’t practical.

I don’t know much other hardware lineups besides the Dell R7xx lineup.

I host oobagooba on an R710 as a model server API, and host sillytavern and stable diffusion which use oobagooba as clients. I use an R710 using a CPU, so as you can imagine inferencing is so slow it’s basically unusable. But I wired it up as a proof of concept.

I’m curious what other people who self-host LLMs do. I’m aware of remote options like Mancer or Runpod. I’d like the option for purely local inferencing.

Thanks all

  • PDXSonic@alien.topB
    link
    fedilink
    English
    arrow-up
    1
    ·
    10 months ago

    https://www.ebay.com/itm/364128788438?mkcid=16&mkevt=1&mkrid=711-127632-2357-0&ssspo=XkOKzd0RR_6&sssrc=4429486&ssuid=9jfKf00cSoK&var=&widget_ver=artemis&media=COPY

    At least according to this fairly detailed eBay listing you might be limited in what GPUs you can run in an R730. It states a 300w max per card and double width, which would eliminate both the power and physical requirements of the 3090/4090. You could run say some Tesla P40s but they would be a bit slower.

    Another option would to be just buy a rack mount 4U case and say a X99 motherboard (same era CPU as the R730) which would give you a bit more flexibility in running 3090/4090 cards so long as you had a 1200w or so PSU.

    • literal_garbage_man@alien.topOPB
      link
      fedilink
      English
      arrow-up
      1
      ·
      10 months ago

      Yeah running a 4U case and assembling it with “plain desktop” hardware but rack mounted and headless is definitely an option too. I might be asking too much of server hardware to take R730s (or any racked datacenter hardware) and fit them to a role they weren’t designed for. These are good thoughts and useful links, thank you.

  • tigress667@alien.topB
    link
    fedilink
    English
    arrow-up
    1
    ·
    10 months ago

    One challenge with the 4090 specifically is I don’t believe there are any dual-slot variants out there, even my 4080 is advertised as a triple-slot card (and actually takes four because Zotac did something really, really annoying with the fan mounting)…you could liquid-cool and swap the brackets, but then you have the unenviable task of mounting sufficient radiators and support equipment (pump, res, etc) into a rackmount server. That assumes you’re looking at something 2-3U, since you mentioned an R730; if you’re willing to do a whitebox 4U build it’s a lot more doable.

    Of course if money is no object, ditch plans for the GeForce cards and get the sort of hardware that’s made to live in 2U/3U boxes, i.e. current-gen Tesla (or Quadro, if you want display outputs for whatever reason). If money is an object, get last-gen Teslas. Tossed an old Tesla P100 (Pascal/10-series) into my Proxmox server to replace a 2060S with half the VRAM, for LLMs I didn’t really notice an obvious performance decrease (i.e. still inferences faster than I can read), and in a rack server you won’t even have to mess with custom shrouds for cooling, since the fans in the server are going to provide more than enough directed airflow.