#210417 - 2015-07-27 02:01 PM
Advice on troubleshooting network issues
Robdutoit
Hey THIS is FUN
 
Registered: 2012-03-27
Posts: 347
Loc: London, England
Hi guys,
Sorry for the long post, but I could do with some advice. I have spent several days in the last month troubleshooting a weird intermittent Internet issue. I think that I have resolved it, but would like suggestions on what further testing I can do.
Breakdown of Network
To give you a breakdown: it's a small primary school with roughly 20 computers, 45 laptops and 45 iPads. There are two servers for database storage and one proxy/caching/filtering server for the school's Internet access.
The problem
Staff reported that the Internet was periodically slow and timing out, though not all the time. Sending email also sometimes failed with a message that the DNS server could not contact servername blah blah. We have an internal mail server system.
Resolution
The first thing I did was connect the BT Router directly to the fibre optic modem and connect to the Internet through a direct link. Internet all working. Great, so the problem must be something on the LAN.
So after much investigation and getting nowhere, I disconnected the entire network and connected up one switch, the office computers and the teachers' computers in the school. Everything else (wireless, iPads, kids' computers and laptops) was disconnected.
Internet and email seemed to work, so I connected up the kids' computers and a second switch, but left the wireless off. Internet continued to work (in the sense that nobody reported anything, but in hindsight maybe the Internet was slow), and the email worked most days, but we did have problems sending emails on two separate days.
So I suggested that we bring forward the upgrade of the existing switches, as only one switch was gigabit. I put in a brand new 48-port gigabit switch and connected the kids' computers and staff computers to the new switch, but again no wireless and no iPads. Internet continued to work, but we were still having the odd issue with sending emails.
I decided to add the wireless at that point as the staff had been without wireless for a couple of weeks by then. Immediately on connecting the wireless, the Internet went down or was incredibly slow! So I thought aha - wireless was the issue. The wireless crashed the Internet for everyone - computers wired in (no wireless) as well as anyone on the actual wireless.
However
I tested each wireless access point and disconnected the two that seemed to cause the Internet to crash completely. That left three access points. However the next morning, they still complained of sent emails bouncing back with DNS server unable to locate server blah blah and the Internet crashed again around 11am and nothing online worked.
Puzzling Factors
1. When staff and kids went home, the Internet worked beautifully even with every single wireless access point on. Why would this happen?
2. The BT Router did seem more sluggish with everything connected: the fewer computers and WAPs that were on, the faster the web admin pages on the router loaded. Why would this happen?
3. Even with the wireless off, while the Internet seemed to work, we still had issues with sending the odd email. Did we have two separate issues that occurred at the same time (the wireless crashing the Internet, plus something else), or one issue that was causing the wireless to crash the Internet and also causing the problems sending emails?
Final Resolution
I realised that I was too close to the problem and was chasing every red herring. So I took a step back and asked: what is the problem? The Internet. So I stopped looking at the wireless and the mail server etc and started diagnostics solely on the equipment required to get the Internet to work.
I connected everything up and naturally the Internet went down, so I went into diagnostic mode.
I took two computers, made changes on one and compared it with the other.
1. I bypassed the proxy server and the local Windows DNS and DHCP server. No change.
2. I changed the Internet DNS server in the TCP/IP settings. It appeared that OpenNIC's DNS server was faster and more reliable than BT's DNS server.
3. I changed the DNS server to the BT DNS server in the TCP/IP settings. In other words, I bypassed the BT Router as the DNS server. I changed the other computer to use OpenNIC. Interestingly enough, the BT DNS server was faster than OpenNIC's.
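For anyone wanting to repeat the DNS comparison above with numbers rather than by feel, here is a minimal Python sketch. It sends a hand-built A-record query straight to each resolver over UDP and records how long the reply takes. The function names are mine, and the addresses you pass in are whatever your router, ISP and public resolvers happen to be. It is only a rough timing aid under those assumptions, not a proper DNS client.

```python
import socket
import struct
import time


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet for an A record (RFC 1035 layout)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 1, 1)


def time_query(server, hostname, timeout=2.0):
    """Return seconds taken for `server` to answer, or None on timeout/error."""
    packet = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(packet, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            return None
        elapsed = time.perf_counter() - start
    # Accept only a reply that echoes our query ID.
    if len(reply) < 2 or reply[:2] != packet[:2]:
        return None
    return elapsed


def compare(servers, hostname, attempts=5):
    """Query each server several times; report failures and a middle timing."""
    results = {}
    for name, address in servers.items():
        times = [time_query(address, hostname) for _ in range(attempts)]
        good = sorted(t for t in times if t is not None)
        median = good[len(good) // 2] if good else None
        results[name] = {"failures": attempts - len(good), "median_s": median}
    return results
```

Calling compare() with a dictionary such as {"router": "<router IP>", "isp": "<ISP resolver IP>"} and a hostname gives a failure count and a typical response time per resolver. Running it once with the school empty and once under load would show whether the router's DNS relay is the part that degrades.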
So it seemed that the actual BT Router itself was the problem as I effectively bypassed the DNS routing of the BT Router - I just used the router as a gateway.
So I swapped it for another BT Router and everything seemed to work. Pages were loading faster on the actual BT Router and the DNS resolution seemed to be working as fast as OpenNIC.
So it would appear that the BT Router had some weird issue where a lot of traffic on the LAN caused it to become unresponsive. The BT Router had received a firmware update about a week before all the problems started, had rebooted on the very day that the problems started, and was having issues with dropped ports on different days. So on the face of it, it would seem that all the issues were with the BT Router.
Where to now
My current issue is that by the time I had worked out that it must have been the Router all along, the kids and the staff had left - remember everything seems to work when everyone leaves the building! Although the new router does seem to be faster than the old one. It was also the last day of school and they won't be back until September! So I am unable to get the staff to check whether everything actually works until September.
I don't understand why the wireless or any LAN traffic would cause problems for the BT Router.
I am not 100% positive that I have resolved the issue (although I now believe it is the Router), but given that the school is closed until September and the problems only occur when people are using the network, I have no idea what I can do next to determine that all is well. I installed Wireshark but could not see what to do with the program. Granted, I was in a rush and it's not the best time to learn how to use a program that you have never used before.
Do you have any advice on whether my troubleshooting steps have correctly identified the issue, given the oddness that the wireless would cause such an outage, but only during the day when people were using the network? Is there anything further that I can do to investigate what is going on with this network?
I think that this is one area where my IT skills are very weak. I don't know how to monitor bandwidth traffic or identify a switch that is faulty or identify why any given LAN is slow - all these skills would have helped with troubleshooting this issue. My skills are more in Windows and Linux Servers, hardware troubleshooting of servers and computers. Obviously I can set up a network, but troubleshooting switches etc is not a skill that I have needed to use very often in the last 15 years as my clients have small networks.
I think that I need to get some network monitoring software that can alert me when there is unusual traffic or some problem.
Any advice on network monitoring software, as well as on what more I can do to troubleshoot this issue, would be welcome. Thanks
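Since the fault only shows up when the building is full, one stopgap until proper monitoring software is in place would be a small script left running on a wired PC, logging how long it takes to reach the router and an outside host. The sketch below is a rough Python illustration only. The target names, the CSV layout and the function names are all my own assumptions, and it measures TCP connect time rather than anything a real monitoring package would report.

```python
import csv
import socket
import time
from collections import defaultdict
from datetime import datetime


def tcp_connect_time(host, port, timeout=3.0):
    """Return seconds needed to open a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        return None


def log_probes(targets, path, interval_s=60, rounds=None):
    """Append one CSV row per target per round: timestamp, name, seconds."""
    done = 0
    while rounds is None or done < rounds:
        with open(path, "a", newline="") as handle:
            writer = csv.writer(handle)
            for name, (host, port) in targets.items():
                elapsed = tcp_connect_time(host, port)
                stamp = datetime.now().isoformat(timespec="seconds")
                # An empty third column marks a failed probe.
                writer.writerow([stamp, name, "" if elapsed is None else f"{elapsed:.4f}"])
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval_s)


def failures_by_hour(rows):
    """Count failed probes per (target, hour of day) from logged rows."""
    counts = defaultdict(int)
    for stamp, name, elapsed in rows:
        if elapsed == "":
            counts[(name, datetime.fromisoformat(stamp).hour)] += 1
    return dict(counts)
```

If a target is given by hostname rather than IP address, the measured time includes the DNS lookup, so probing the same site both ways helps separate DNS trouble from plain connectivity trouble. Feeding the logged rows to failures_by_hour() would show whether failures cluster around particular times such as the 11am crash.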
#210419 - 2015-07-28 01:19 PM
Re: Advice on troubleshooting network issues
[Re: Robdutoit]
Glenn Barnas
KiX Supporter
   
Registered: 2003-01-28
Posts: 4381
Loc: New Jersey
A couple of things that I always check when working on odd performance issues:
DNS - Use an internal DNS server (Required if you use any form of dynamic hostname registration!) and DO NOT use forwarders. Forwarders are often improperly used, and when they are deployed for primary name resolution you limit yourself to just those servers, so any issues with those servers have a ripple effect in your network. Use the root hints on your internal server - that's what they're for.
Use a quality, managed switch - even if it's one you own and bring on-site for troubleshooting. The diagnostic information available is priceless. (I just bought a Cisco 4506 with dual power and 144 Gig-E POE ports for $300 USD when we had issues in our office. I have one at home as well and it interfaces well with OpenNMS - a free, commercial grade network monitor application.) With the data available, you can pinpoint the port(s) where errors are occurring.
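To illustrate how per-port diagnostic data pinpoints a bad port, here is a small Python sketch that compares two snapshots of error counters and ranks the ports whose counts grew. The snapshot format (port name mapped to a dictionary of counters) and the counter names are hypothetical. How you collect them (SNMP, the switch's web page, a monitoring tool) depends on the switch and is not shown.

```python
def error_deltas(before, after):
    """Return the per-port increase in each error counter between snapshots."""
    deltas = {}
    for port, counters in after.items():
        previous = before.get(port, {})
        deltas[port] = {
            name: value - previous.get(name, 0)
            for name, value in counters.items()
        }
    return deltas


def worst_ports(before, after, top=5):
    """Rank ports by total new errors, dropping ports with none."""
    totals = {
        port: sum(changes.values())
        for port, changes in error_deltas(before, after).items()
    }
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(port, count) for port, count in ranked[:top] if count > 0]
```

Taking one snapshot before the school day and one after the Internet has slowed down, then passing both to worst_ports(), would list the ports that accumulated errors in between. Note that a switch reboot resets the counters, which this sketch does not handle.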
Wiring - do a visual inspection and repair any termination that isn't 100%. I once visited a client who had 3 switches, 3 Internet connections (1 per switch) and 3 NICs in their server. 54 workstations and 1 server at the site. The tech told me it wasn't possible to put more than 16 hosts on a switch, hence the "3 of everything". We came in with a Cisco managed switch, moved everything over, and - sure enough - nothing worked! He also said he had one system that took 5 minutes to boot and he could not figure it out. Looking at that PC, I saw that the network cable came out of the ceiling, down the wall, and plugged right into the back of the PC's NIC. The jacket was stripped back about an inch, so no strain relief, and it used the wrong RJ45 plug. When I snipped the end off, he freaked out, saying he had just rewired the entire building. (Uh-oh!) I re-terminated the cable with the correct RJ45 plug type, plugged it in, and the system rebooted in about 40 seconds. At that point, we did a visual inspection and found that most of the cables were poorly terminated. We spent the entire day replacing the terminations with wall jacks and patch cords.
Cabling isn't magic, but it is an art, and it's easy to do things wrong. For example, the RJ45 plugs come in two types - the common (and cheap! around 5-7 cents ea) 2-point and the less common and more expensive (85 cents each) tri-point. The Tri-Point is designed for solid conductors, with two fingers on one side and one on the other side of the conductor. It will work for stranded cable as well, but that results in an expensive job. The 2-point plugs are designed for stranded conductor wire ONLY, and using them on solid conductor WILL result in a poor connection (just two tiny points of contact) that will often become a high-resistance connection if your strain-relief isn't rock solid. When it isn't, a tug on the cable can cause the wire to move, the insulation slides under one or both points, and that connection becomes poor or dead entirely. What had happened at this client was that enough bad connections were generating so much noise and retry traffic that the switch was unable to cope. After replacing the cable ends with proper terminations, we moved onto a single subnet using the new 48 and an existing 24 port switch, and eliminated 2 NICs and 2 Internet service connections.
Ideally, you use solid wire for a run from a patch panel to a wall jack, then use patch cords from the switch to patch panel and jack to computer. Those are the cables most likely to be damaged through use and thus are easy to examine and inexpensive to replace.
You seem to be on the right track for diagnosing the issue - use a single laptop (for consistent readings) and work from the Edge Router, Firewall, Switch, endpoint jack, first with no devices on the switch and then add them back. You'll need an assistant for this, but it goes quick and will help identify a specific cable or workstation that's causing the issue.
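The add-back procedure described above is really a loop: confirm a clean baseline, reconnect one device, retest, repeat. As a rough Python sketch of that logic (the function and both callbacks are hypothetical stand-ins for an assistant plugging in a cable and a test run from the reference laptop):

```python
def find_first_culprit(devices, connect, is_healthy):
    """Reconnect devices one at a time; return the first that breaks the link.

    `connect` and `is_healthy` are hypothetical callbacks: in practice the
    first is someone plugging in a cable and the second is a test run from
    the reference laptop.
    """
    if not is_healthy():
        raise ValueError("link is already unhealthy with nothing connected")
    for device in devices:
        connect(device)
        if not is_healthy():
            return device
    return None  # everything reconnected and the link still tests fine
```

This only catches a fault that a single cable or workstation triggers on its own. A problem that appears only under combined load, as in this thread, could pass every individual step and still fail with the building full.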
Glenn
_________________________
Actually I am a Rocket Scientist!
#210421 - 2015-07-28 06:19 PM
Re: Advice on troubleshooting network issues
[Re: Glenn Barnas]
Robdutoit
Hey THIS is FUN
 
Registered: 2012-03-27
Posts: 347
Loc: London, England
Hi Glenn,
Thanks for your response. Always helpful.
We have a brand new switch which I have used in several other schools, so I know it's not the switch.
I will query regarding the cabling as the school are having building works going on - so you may have something there. However, hopefully the BT router was the problem. The new Router does seem to be faster.
I was actually going to buy Cables to Go patch leads for the cabinet and for the computers to replace all the existing cabling (Cables 2 Go). What do you think of these cables? I had a look at the specs and they seem to be higher than the average standard, i.e. 350 MHz, 24 AWG, gold plating 50µm.
I will ask my cabling guy if he has anything that can test for noise on the network as a result of poor terminations. That school does have cabling that was put in some years ago that was done by a professional company - sort of a handyman's job. Not that the issue has anything to do with this I think because the only thing that was not working was the Internet. Access to the servers worked, printers worked etc and the cabling was installed years ago. It is interesting to know that poor cabling can cause that kind of problem.
With regards to the DNS - I am not sure what you are recommending here. When I started my broadband service a couple of years ago, I set it up to use the Windows DNS server's root hints servers for Internet resolution. This worked fine for a couple of months, when all of a sudden the Internet stopped working at several clients - on the same day. I changed it so that the Windows server forwards to the BT router, which then uses the BT Internet Routers, and I have never had a problem since. So I am reluctant to change it back to using the root hints given that the Internet stopped working at several clients on the same day!
At the moment the setup that I have is that the clients use the Windows server as the default DNS server. Anything that the Windows DNS server cannot resolve (i.e. non-domain computers) gets forwarded to the BT router, which then connects to the ISP DNS servers. Are you saying that you would use the internal DNS server for both internal domain name resolution and for Internet resolution?
I will have a look at this OpenNMS that you refer to. Maybe I can get it to work with the new switch we bought - a Cisco SLM2048PT-UK
I think my first port of call would be to install some network monitoring software to see if there is any traffic that is unusual. I am busy researching that now.
Thanks for the input. Helpful to know that I have not missed too much.
#210430 - 2015-07-29 12:57 PM
Re: Advice on troubleshooting network issues
[Re: Robdutoit]
Glenn Barnas
KiX Supporter
   
Registered: 2003-01-28
Posts: 4381
Loc: New Jersey
The DNS bug was the result of a bad hotfix, quite a long time ago. A fix was released within 24 hours, but the reputational damage was done, and the "fix" that most people implemented was to use their ISP's DNS servers in their forwarders list.
Even worse, I've seen IT staff deploy public DNS servers through their DHCP. Sometimes they include this in addition to their internal servers and sometimes instead. I know one IT guy that uses his old ISP's DNS servers at half a dozen or so remote sites of a company he supports and wondered why A) he could not join workstations to AD, and B) why he had to point to their RDS servers by IP. Just last month, a developer suggested that Google's DNS (8.8.8.8) be added to the internal network. They get DHCP from their firewall, and made a request to the ISP to add this to the DNS server list. Anyone want to guess how long it took to lose connection with AD? Or who they called, hopping mad, when "AD Infrastructure Failed"?
If I recall, this bad patch was shortly after Server 2003 was released, so about 11 years ago, and affected NT, Win2K, and Win2K3 platforms. This is a long time to hold a grudge. I've successfully used Windows DNS without forwarders since first publishing an article on NT Network Basics back in 1997-98.
Glenn
_________________________
Actually I am a Rocket Scientist!
#210438 - 2015-07-30 10:51 AM
Re: Advice on troubleshooting network issues
[Re: Glenn Barnas]
Robdutoit
Hey THIS is FUN
 
Registered: 2012-03-27
Posts: 347
Loc: London, England
I know where to find you if the root hints fail. It would be interesting to see if it does it again.