How do you find the bottleneck of a network?

wop · edit-2 2 years ago

How do you find the bottleneck of a network?

Kazaii@sh.itjust.works · 2 years ago

Pretty good suggestions here. Can’t remember the last time I saw such quality replies on r/networking .

Avian_Carrier · edit-2 2 years ago

deleted by creator

taladar@sh.itjust.works · 2 years ago

Are you sure that the download speed is 10Mbit/s and not 10Mbyte/s which would be close to saturating the 100Mbit/s link and would explain the other symptoms you are seeing?

wop · 2 years ago

Valid question. We’ve checked it multiple times, on the client and via monitoring that it is 10 Mbits. Thank you.

taladar@sh.itjust.works · 2 years ago

Have you checked for resent packets or connection resets or similar things that might use up more bandwidth than the successfully received packets? I would probably use Wireguard for that.

wop · 2 years ago

Not yet. Wouldn’t expect it tbh, but you’ll never know. How would you utilize Wirehuard for it? I’d like to hear more about it.

taladar@sh.itjust.works · 2 years ago

Oops, I meant Wireshark of course. Basically capture the packets and then check for any with errors.

wop · 2 years ago

Gotcha! - I thought Wireguard might has some logging features that could provide some insights. Thank you.

InEnduringGrowStrong@sh.itjust.works@sh.itjust.works · 2 years ago

Care to tell me those numbers?

RTT
TCP Window Size (RWND)
TCP MSS (that or MTU, inside the tunnel)

Honestly, that sounds like TCP bandwidth-delay product.
50 ms with a 65k byte RWND is just around 10Mbps.
See for yourself with your numbers:
https://wintelguy.com/wanperf.pl

InEnduringGrowStrong@sh.itjust.works@sh.itjust.works · 2 years ago

Just saw this part:

the whole location in UK […]

Some VPN solutions downgrade the MSS of all VPNs to the lowest common denominator for things like MTU/MSS. I guess that can make sense in a full-mesh, but whatever.
Take a packet capture of another client while the problem one connects, you’ll likely see something.
Decrypted traffic is usually easier to analyze.

Ohhh and you say that’s when they connect through SSH? Check that he’s not tcp forwarding all traffic through his SSH connection somehow.

wop · 2 years ago

Getting a pcap of another client could bring some insight, yeah.

SSH is used for the data transfer. Without knowing it at this moment, I’d assume scp or rsync. You mean whether all their internet traffic is routed through the active SSH session?

InEnduringGrowStrong@sh.itjust.works@sh.itjust.works · 2 years ago

I mean that in an SSH connection you can configure it to bind local/remote ports of local/remote IPs.
The user might have unknowingly or maliciously configured their stuff to either:

forward all their traffic through the ssh session, adding more bandwidth than you’re expecting
remote port forward something important that’s somehow used by all your users to his machine. This is a bit unlikely, but then your symptoms are a bit weird.

Unlikely, because they couldn’t bind a port that is already in use on the server. Still, that could technically happen if there’s a misconfigured load balancer, maybe from an old config that was never removed, that has that server as a member and just declares it down/up when that user starts listening on that port.

That last one is far-fetched.
I’d start with cpu/mem, mtu/mss, etc.

I tend to have a bit of a bias towards absolutely far-fetched things because I’m basically the last line of support where I work. This means ~~all~~ most of the “normal” problems get filtered out before they get to me, which leaves me with the stuff that’s bananas.

wop · 2 years ago

I’ll keep that in mind

wop · 2 years ago

Ping - Update 2 Your numbers are are still missing since I havent had time to look into the pcaps yet. I hope I can get it done by the end of the week, but we are a little bit wiser.

wop · edit-2 2 years ago

I haven’t had the chance to get a pcap yet. As soon as I get my fingers on the test clients, I’ll check them and additionally do testing with TCP and UDP transfers. I’ll let you know.

Just to clarify: this would be the limit for a single TCP connection and yes, could be the limit for this one download. This would not explain, why the rest of the location is affected if theoretically 90% of the bandwidth is still available, no? - Please correct me if I am wrong here.

InEnduringGrowStrong@sh.itjust.works@sh.itjust.works · 2 years ago

This would not explain, why the rest of the location is affected

Yea my bad, I missed the “whole location” bit on first read. This would be the limit for that TCP session.
Still, I’d compare MTU, MSS, RTT packet loss, RWIND, etc. everything that is a component of the actual bandwidth. Whatever happens, some of these things change when he connects.
I’ve had VPN solutions that downgrade the MTU for everyone when someone with a shit MTU connects to it.

Another thing that came to mind, since you were talking about an SSH connection, is if that user is somehow routing trafic through a tcp forward inside his tunnel when connecting. Stupid test maybe, but I’d compare the before/after TTL of packets in a flow that is known to be affected as well as a traceroute (assuming the client can even run one, because every business seems to like breaking icmp)

wop · 2 years ago

Will compare it as soon as I get my hands on the machine.

And yeah, we do tend to block ICMP over here too.

phase_change@sh.itjust.works · 2 years ago

If the bandwidth numbers you’ve described are accurate, I’d start looking at CPU and RAM usage on the network device. The Fortigates are going to be doing extra work to handle the VPN. I wouldn’t expect an IPSEC VPN on a Fortigate to top out at 10mbps, but if it’s doing a lot of other work, it’s possible. ACL’s on the Cisco devices? You run the potential of CPU/RAM exhaustion on those. Hopefully, you have remote monitoring on all network devices and you can just look at the history when these transfers are happening.

If nothing obvious there, then I’d try packet captures when this is happening, perhaps to start on the system doing the ssh and on one or two others experiencing issues. What are you seeing? Evidence of dropped packets? High latency? If dropped packets, start capturing the same traffic on the network devices it’s flowing through.

wop · 2 years ago

Good points! I’ll get access to test clients on both sites to do some testing. If I get the problem reproduced, I can take some packet captures without getting the client involved while monitoring the hardware. I’ll be smarter after that session.

Thank you!

wop · 2 years ago

Added the Update 2. Still some things to do, but we know a little bit more now. Feedback and questions are still welcome.

phase_change@sh.itjust.works · 2 years ago

Nice job. Packet loss will definitely cause these issues. Now, you just need to find the source of the packet loss.

In your situation, I’d first try to figure out if it is ISP/Internet before looking inside either network. I wouldn’t expect it to be internal at these speeds. Though, did you get CPU/RAM readings on the network equipment during these tests? Maxing out either can result in packet loss.

I’d start with two pairs of packet captures when the issue happened: endpoint to endpoint and edge router to edge router. Figure out if the packet loss is only happening in one direction or not. That is, are all the UK packets reaching DE but not all the DE making it back? You should clearly be able to narrow into a TCP conversation with dropped packets. Dropped packets aren’t ones that a system never sent, they’re ones that a system never received. Find some of those and start figuring out where the drop happened.

wop · 2 years ago

The ISPs are slow to answer if there is no active outage. Will take some time anyway.

Packets are dropped in bot directions. I am currently looking through the pcaps and will do another stress test later - got another window. MTU/MSS is the prio today.

phase_change@sh.itjust.works · 2 years ago

Just curious if you’ve had the chance to dig into this and can report anything back?

wop · 2 years ago

Ping - Update 2

wop · 2 years ago

Not yet. Just got access to the test clients and I have planned to do a troubleshooting session tomorrow in the morning. Not a big fan of stress testing the network on a working day haha

wop · 2 years ago

Update 3 - Ping