Sorry for the long post, but I could do with some advice. I have spent several days over the last month troubleshooting a weird intermittent Internet issue. I think that I have resolved it, but I would like suggestions on what further testing I can do.
Breakdown of Network
To give you a breakdown: it's a small primary school with roughly 20 computers, 45 laptops and 45 iPads. There are two servers for database storage and one proxy/caching/filtering server for the school's Internet access.
Staff reported that the Internet was periodically slow and timing out, though not all the time. Sending email also sometimes failed with a message that the DNS server could not contact servername blah blah. We have an internal mail server system.
The first thing I did was connect the BT Router directly to the fibre optic modem and connect to the Internet through a direct link. The Internet all worked. Great, so the problem must be something on the LAN.
So after much investigation and getting nowhere, I disconnected the entire network and connected up one switch, the office computers and the teachers' computers in the school. Everything else (wireless, iPads, kids' computers and laptops) was disconnected.
Internet and email seemed to work, so I connected up the kids' computers and a second switch, but left the wireless off. The Internet continued to work (in the sense that nobody reported anything, but in hindsight maybe the Internet was slow), and the email worked most days, but we did have problems sending emails on two occasions on different days.
So I suggested that we bring forward the upgrade of the existing switches, as only one switch was gigabit. I put in a brand new 48-port gigabit switch and connected the kids' computers and staff computers to it, but again no wireless and no iPads. The Internet continued to work, but we were still having the odd issue with sending emails.
I decided to add the wireless at that point as the staff had been without wireless for a couple of weeks by then. Immediately on connecting the wireless, the Internet went down or was incredibly slow! So I thought aha - wireless was the issue. The wireless crashed the Internet for everyone - computers wired in (no wireless) as well as anyone on the actual wireless.
I tested each wireless access point and disconnected the two that seemed to cause the Internet to crash completely. That left three access points. However, the next morning staff still complained of sent emails bouncing back with DNS server unable to locate server blah blah, and the Internet crashed again around 11am and nothing online worked.
That left me with three questions:
1. When staff and kids went home, the Internet worked beautifully even with every single wireless access point on. Why would this happen?
2. The BT Router did seem more sluggish with everything connected: the fewer computers and WAPs that were on, the faster the router's web admin pages loaded. Why would this happen?
3. Even with the wireless off, while the Internet seemed to work, we still had issues sending the odd email. Did we have two separate issues that occurred at the same time (the wireless crashing the Internet and something else), or one issue that was causing both the wireless to crash the Internet and the problems sending emails?
I realised that I was too close to the problem and was chasing every red herring. So I took a step back and asked: what is the problem? The Internet. So I stopped looking at the wireless and the mail server etc and started diagnostics solely on the equipment required to get the Internet to work.
I connected everything up and, naturally, the Internet went down, so I went into diagnostic mode.
I took two computers and made changes on one and compared with the other.
1. I bypassed the proxy server and the local Windows DNS and DHCP server. No change.
2. I changed the Internet DNS server in the TCP/IP settings. It appeared that OpenNIC's DNS server was faster and more reliable than BT's DNS server.
3. I changed the DNS server to the BT DNS server in the TCP/IP settings. In other words, I bypassed the BT Router as the DNS server. I changed the other computer to use OpenNIC. Interestingly enough, the BT DNS server was faster than OpenNIC's.
So it seemed that the actual BT Router itself was the problem, as things improved once I effectively bypassed the DNS routing of the BT Router and just used the router as a gateway.
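The comparison in steps 2 and 3 could be made repeatable with a small script instead of being judged by eye. What follows is only a minimal sketch in Python, not something I ran at the school. It hand-builds a DNS A-record query and times the reply from each resolver directly, so the operating system's own resolver settings are bypassed. The function names are made up for illustration, and the resolver addresses have to be supplied by whoever runs it.

```python
import socket
import struct
import time
from statistics import median


def build_query(hostname, txid=0x1234):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE = 1 (A record), QCLASS = 1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def time_query(server, hostname, timeout=2.0):
    """Return the round-trip time in milliseconds, or None on timeout/error."""
    packet = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(packet, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes socket timeouts
            return None
        elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Only count replies whose transaction ID matches the query.
    return elapsed_ms if reply[:2] == packet[:2] else None


def summarise(samples):
    """Summarise a list of timings, where None marks a failed query."""
    ok = [s for s in samples if s is not None]
    return {
        "sent": len(samples),
        "failed": len(samples) - len(ok),
        "median_ms": round(median(ok), 1) if ok else None,
    }


def compare_resolvers(servers, hostnames, rounds=5):
    """servers maps a label to a resolver IP; returns label -> summary."""
    results = {}
    for label, ip in servers.items():
        samples = [time_query(ip, h) for _ in range(rounds) for h in hostnames]
        results[label] = summarise(samples)
    return results
```

Calling compare_resolvers with a dictionary such as {"router": "192.168.1.254", "upstream": "<resolver IP>"} and a list of hostnames would give a failure count and median time per resolver. The 192.168.1.254 address is an assumption about the router's LAN address, and the upstream address is a placeholder. Resolvers cache answers, so after the first round repeated names mostly measure the cache; a longer, varied list of hostnames gives a fairer picture.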
So I swapped it over for another BT Router and everything seemed to work. Pages were loading faster on the actual BT Router and DNS resolution seemed to be working as fast as OpenNIC.
So it would appear that the BT Router had some weird issue where a lot of traffic on the LAN caused it to become unresponsive. The BT Router had received a firmware update about a week before all the problems started, had rebooted on the very day that the problems started, and was having issues with dropped ports on different days. On the face of it, it would seem that all the issues were with the BT Router.
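One way to put numbers on "the router becomes unresponsive under load" would be to log how long the router takes to accept a connection to its web admin page, with a timestamp, so slow periods can be matched against the school day. The following is a rough sketch only. It assumes the admin page listens on TCP port 80, and the function names and CSV layout are my own invention.

```python
import csv
import socket
import time
from datetime import datetime


def tcp_connect_ms(host, port=80, timeout=3.0):
    """Time a TCP connection to host:port; return ms, or None if it fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:  # refused, unreachable or timed out
        return None


def log_router_latency(host, csv_path, interval_s=30.0, samples=10, port=80):
    """Append one timestamped row per probe: time, host, latency in ms (blank = failed)."""
    with open(csv_path, "a", newline="") as fh:
        writer = csv.writer(fh)
        for i in range(samples):
            ms = tcp_connect_ms(host, port)
            writer.writerow([
                datetime.now().isoformat(timespec="seconds"),
                host,
                "" if ms is None else f"{ms:.1f}",
            ])
            fh.flush()  # keep the log usable even if the script is interrupted
            if i < samples - 1:
                time.sleep(interval_s)
```

Run from a wired machine against the router's LAN address for a full school day, the resulting file could be opened in a spreadsheet to see whether connect times climb or probes fail around the times staff report problems. A connect time only shows that the router is answering, not that DNS or NAT is healthy, so it is a coarse indicator at best.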
Where to now
My current issue is that by the time I had worked out that it must have been the router all along, the kids and the staff had left - remember everything seems to work when everyone leaves the building! The new router does seem to be faster than the old one, though. It was also the last day of school and they won't be back until September, so I am unable to get the staff to check whether everything actually works until then.
I don't understand why the wireless or any LAN traffic would cause problems for the BT Router.
I am not 100% positive that I have resolved the issue (although I now believe it is the router), but given that the school is closed until September and the problems only occur when people are using the network, I have no idea what I can do next to determine that all is well. I installed Wireshark and could not see what to do with the program - granted I was in a rush and it's not the best time to learn how to use a program that you have never used before.
Do you have any advice on whether my troubleshooting steps have correctly identified the issue, given the oddness that the wireless would cause such an outage, but only during the day when people were using the network? Is there anything further that I can do to investigate what is going on with this network?
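Since the fault only shows up when people are using the network, one possible test while the building is empty is to fake some of that load from a few machines. This is a minimal sketch under the assumption that many concurrent DNS lookups through the router are a fair stand-in for a busy school, which may well not be true. The helper names are hypothetical.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor


def resolve(hostname):
    """One unit of load: a DNS lookup through the machine's configured resolver."""
    socket.getaddrinfo(hostname, 80)


def timed(task, target):
    """Run task(target); return (ok, elapsed_ms)."""
    start = time.perf_counter()
    try:
        task(target)
        ok = True
    except OSError:  # socket.gaierror and timeouts are OSError subclasses
        ok = False
    return ok, (time.perf_counter() - start) * 1000.0


def run_load(task, targets, workers=50, repeats=20):
    """Run task against every target `repeats` times using `workers` threads."""
    jobs = [t for _ in range(repeats) for t in targets]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda t: timed(task, t), jobs))
    failures = sum(1 for ok, _ in results if not ok)
    slowest = max((ms for _, ms in results), default=0.0)
    return {"total": len(results), "failures": failures, "slowest_ms": slowest}
```

Calling run_load(resolve, hostnames) with a long list of different hostnames, on several machines at once, with the old router and then the new one in place, might show whether the old router starts failing lookups as load rises. The operating system may cache repeated lookups, so the hostname list needs to be long and varied for the requests to actually reach the router.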
I think that this is one area where my IT skills are very weak. I don't know how to monitor bandwidth traffic, identify a switch that is faulty, or identify why any given LAN is slow - all these skills would have helped with troubleshooting this issue. My skills are more in Windows and Linux servers, and hardware troubleshooting of servers and computers. Obviously I can set up a network, but troubleshooting switches etc is not a skill that I have needed to use very often in the last 15 years as my clients have small networks.
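For the bandwidth part, a crude starting point on a Linux box that the traffic passes through is to read the kernel's interface byte counters twice and work out the rate in between. The sketch below assumes such a Linux machine exists on the network (for example the proxy server, if it runs Linux) and reads /proc/net/dev; the function names are illustrative.

```python
import time


def parse_proc_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev contents."""
    counters = {}
    for line in text.splitlines()[2:]:  # first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rates_mbit(before, after, seconds):
    """Convert two counter snapshots into (rx, tx) megabits per second."""
    rates = {}
    for name, (rx1, tx1) in after.items():
        if name in before:
            rx0, tx0 = before[name]
            rates[name] = (
                (rx1 - rx0) * 8 / seconds / 1e6,
                (tx1 - tx0) * 8 / seconds / 1e6,
            )
    return rates


def sample_rates(interval_s=5.0, path="/proc/net/dev"):
    """Read the counters twice, interval_s apart, and return per-interface rates."""
    with open(path) as fh:
        before = parse_proc_net_dev(fh.read())
    time.sleep(interval_s)
    with open(path) as fh:
        after = parse_proc_net_dev(fh.read())
    return rates_mbit(before, after, interval_s)
```

This only shows traffic on the machine it runs on, so it says nothing about what individual switch ports or wireless clients are doing, and it ignores counter wrap-around. It would at least show whether the link to the Internet is anywhere near saturated when the slowdowns happen.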
I think that I need to get some network monitoring software that can alert me when there is unusual traffic or some problem.
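To make "alert me when there is some problem" concrete, this is roughly the kind of rule I have in mind: keep the last few latency measurements and raise a warning when too many probes fail or the median gets too high. It is a sketch with made-up thresholds and a hypothetical class name, not a replacement for proper monitoring software.

```python
from collections import deque
from statistics import median


class LatencyAlarm:
    """Rolling-window check over probe results (ms, or None for a failed probe)."""

    def __init__(self, window=20, max_median_ms=200.0, max_failure_ratio=0.2):
        self.samples = deque(maxlen=window)
        self.max_median_ms = max_median_ms
        self.max_failure_ratio = max_failure_ratio

    def add(self, sample_ms):
        """Record one probe and return a list of alert strings (empty if healthy)."""
        self.samples.append(sample_ms)
        if len(self.samples) < self.samples.maxlen:
            return []  # not enough data yet to judge
        ok = [s for s in self.samples if s is not None]
        alerts = []
        failure_ratio = 1 - len(ok) / len(self.samples)
        if failure_ratio > self.max_failure_ratio:
            alerts.append(f"{failure_ratio:.0%} of recent probes failed")
        if ok and median(ok) > self.max_median_ms:
            alerts.append(f"median latency {median(ok):.0f} ms")
        return alerts
```

Each new measurement (for example a DNS lookup time or a connect time to the router) would be passed to add(), and any returned strings could be emailed or written to a log. Using a window and a median instead of single readings is deliberate, so that one slow probe does not trigger an alert.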
Any advice on network monitoring software, as well as on what more I can do to troubleshoot this issue, would be appreciated. Thanks.