Sunday, February 22, 2009
After a quick look at my Exchange server, I found the email was stuck in the outbound queue; even odder, there were no delivery attempts in the SMTP log. At this point, you start walking through how mail is sent. I started NSLOOKUP and found that I could not resolve the MX record for Doug’s email domain. For those search engines out there, Doug’s email is hosted with Google Mail, GMAIL, or googlemail. I thought, “That’s odd,” so out came the packet sniffer on my internal DNS server.
My internal DNS Server was making the requests, but was never getting a reply back. I used NSLOOKUP to make the same request to a DNS server I had at a collocation and the MX record came through fine. Hmmm. OK so next was to attempt the NSLOOKUP outside my PIX firewall. Yep it worked, so something about my firewall was blocking the DNS reply and oddly, only certain DNS replies.
I checked the Cisco PIX 501 configuration and everything was fine. DNS Fixup was enabled and the allowed DNS packet size was 2000. I had already learned the hard way that Internet DNS servers were no longer limiting themselves to 512-byte messages and had increased the size.
I found some old hubs and set up packet sniffers on the inside and outside of the firewall, cleared the DNS cache, and attempted the MX lookup again. The DNS request went out, the reply came back, and the PIX was dumping the reply, never allowing it to make it to the inside network.
Turns out the outside sniffer saw the problem. The reply was malformed. The packet was being truncated and the last part of the DNS answer was being removed. To put it simply, the header of the DNS reply would say that 5 answers were coming, but only 4 would be listed in the packet. Since DNS Fixup was enabled on the PIX, the packet was dropped.
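The mismatch the sniffer showed can be checked mechanically. Here is a minimal Python sketch, assuming you have the raw UDP payload of the reply as bytes (for example, exported from a capture). The function names are my own for illustration, and this is only a rough approximation of the kind of sanity check an inspecting firewall might apply, not the PIX's actual logic.

```python
import struct

def skip_name(packet: bytes, offset: int) -> int:
    """Return the offset just past a (possibly compressed) DNS name."""
    while True:
        if offset >= len(packet):
            raise ValueError("name runs past end of packet")
        length = packet[offset]
        if length == 0:              # zero-length root label ends the name
            return offset + 1
        if length & 0xC0 == 0xC0:    # compression pointer: two bytes, ends the name
            if offset + 2 > len(packet):
                raise ValueError("pointer runs past end of packet")
            return offset + 2
        offset += 1 + length         # ordinary label: length byte plus text

def count_records(packet: bytes):
    """Return (records promised by the header, complete records present)."""
    if len(packet) < 12:
        raise ValueError("shorter than a DNS header")
    _id, _flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", packet[:12])
    promised = ancount + nscount + arcount
    found = 0
    offset = 12
    try:
        for _ in range(qdcount):
            offset = skip_name(packet, offset) + 4   # QTYPE + QCLASS
            if offset > len(packet):
                raise ValueError("question truncated")
        for _ in range(promised):
            offset = skip_name(packet, offset)
            if offset + 10 > len(packet):            # TYPE, CLASS, TTL, RDLENGTH
                raise ValueError("record header truncated")
            (rdlength,) = struct.unpack("!H", packet[offset + 8:offset + 10])
            offset += 10 + rdlength
            if offset > len(packet):
                raise ValueError("record data truncated")
            found += 1
    except ValueError:
        pass                                         # stop counting at the cut
    return promised, found
```

If the two numbers returned by count_records differ, the packet was cut short somewhere after the header was written, which is the situation described above.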
Part of the reason for the truncation was that Google Mail was returning 11 mail hosts and the DNS was hosted on 5 DNS servers. This combination was big enough to overflow a 512-byte DNS reply. Sometimes too much redundancy can be a bad thing.
What also made this interesting was how the DNS clients completed the DNS request when the first reply was returned malformed. Both the Windows DNS server and Windows XP NSLOOKUP clients would open a TCP connection to the DNS server after the malformed UDP packet was received. Since the PIX blocked the malformed UDP reply, my internal server never saw it and so would never try TCP.
One workaround was to disable DNS Fixup on the PIX, allowing the malformed packet to traverse the PIX. After the malformed DNS reply traversed the PIX, the Windows 2003 DNS server would then open a TCP connection, retry the request, and receive a valid DNS reply. Mail flowed again and there was some rejoicing.
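The UDP-then-TCP behavior can be sketched in a few lines of Python. This is an illustration only: I am assuming the retry is triggered by the truncation (TC) flag in the reply header, which is the standard DNS mechanism, but I did not confirm that is exactly what the Windows clients keyed on. The function names and the server address you pass in are placeholders.

```python
import socket
import struct

def build_query(name: str, qtype: int = 15, query_id: int = 0x1234) -> bytes:
    """Build a one-question DNS query (qtype 15 is MX, class 1 is IN)."""
    header = struct.pack("!6H", query_id, 0x0100, 1, 0, 0, 0)  # recursion desired
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!2H", qtype, 1)

def is_truncated(reply: bytes) -> bool:
    """True when the TC bit is set in the reply's flags word."""
    (flags,) = struct.unpack("!H", reply[2:4])
    return bool(flags & 0x0200)

def recv_exact(sock: socket.socket, count: int) -> bytes:
    """Read exactly count bytes from a TCP socket."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("server closed the TCP connection early")
        data += chunk
    return data

def lookup(name: str, server: str, qtype: int = 15, timeout: float = 3.0) -> bytes:
    """Ask over UDP first; retry over TCP only if the UDP reply says truncated."""
    query = build_query(name, qtype)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp:
        udp.settimeout(timeout)
        udp.sendto(query, (server, 53))
        # If a firewall drops the reply, this times out and the TCP
        # retry below is never reached.
        reply, _ = udp.recvfrom(4096)
    if not is_truncated(reply):
        return reply
    with socket.create_connection((server, 53), timeout=timeout) as tcp:
        tcp.sendall(struct.pack("!H", len(query)) + query)  # TCP adds a length prefix
        (length,) = struct.unpack("!H", recv_exact(tcp, 2))
        return recv_exact(tcp, length)
```

The comment in lookup is the whole story of this post: the retry depends on the first reply arriving, so dropping it at the firewall leaves the client with nothing but a timeout.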
Looks like it may be coming to the time where I have to upgrade my old PIX 501 running 6.3 code with something a little newer. Also if the kind people at DynDNS.com or Enom would locate the problem with their DNS servers (that’s where Doug’s DNS was being hosted), it would be appreciated by all of us on the Internet. Again for those search engines out there, the DNS servers were:
dns1.name-services.com internet address = 184.108.40.206
dns2.name-services.com internet address = 220.127.116.11
dns3.name-services.com internet address = 18.104.22.168
dns4.name-services.com internet address = 22.214.171.124
dns5.name-services.com internet address = 126.96.36.199
Hope this helped
Sunday, May 27, 2007
So today’s problem dealt with Windows Media Services. I have a friend that was looking at hosting and broadcasting video streams over the Internet and silly me said, “Hey I’ve done that and I even have a server at a collocation.”
Over the years, one of the Microsoft products I really enjoyed playing with as a hobby was Windows Media Services. It all started with a friend living overseas who wanted to watch “American” TV shows. So to help him out I started learning how to capture a TV show, encode it using Windows Media Encoder, and post it to a Windows Media Server.
A fascinating part was learning over time how to tweak the encoder properties to improve the quality. One of the best features is encoding multiple rate streams in the same file and allowing the server and player to negotiate the best stream. Towards the end, I was encoding content at multiple bit rates from 1Mb/s down to 33.6Kb/s in 100Kb/s increments. This allowed devices ranging from broadband computers to cellular PDAs to automagically watch the stream that matched the network’s and device’s capability.
So here I have my old streams, a server, and I need to get a demo up and running to show what can “easily” be done. Stupid me for using the word “easily”. With my server at a colocation site behind a SonicWall Firewall and a Windows Media Player 11 on Vista (hey, I like bleeding edge) behind a Cisco 501 PIX firewall at my house, the first test went great. The player opened the stream and everything appeared to be fine. The next test was to force the player to request a slower rate stream.
From within the options in Windows Media Player you can select different maximum bandwidths, forcing a server and client to negotiate a slower stream. Restarting the video with the new options did not go so great, with a rather bland error of “Windows Media Player encountered a problem while playing the file.” If I selected Web Help on the error message, slightly more information was displayed and the root error was listed as “80070057, one or more arguments are invalid”. So the question now was, “Why do the slower streams not work?”
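To make the idea of negotiating the best stream concrete, here is a small Python sketch. The ladder is my reading of the rates described above (1000 Kb/s down in 100 Kb/s steps, plus a 33.6 Kb/s modem-rate floor), and pick_stream is a hypothetical stand-in: the real negotiation between Windows Media Services and the player is more involved than picking the largest rate that fits.

```python
def build_ladder_kbps():
    """Rates from 1000 Kb/s down to 100 Kb/s in 100 Kb/s steps, plus 33.6 Kb/s."""
    return [float(rate) for rate in range(1000, 0, -100)] + [33.6]

def pick_stream(ladder, available_kbps):
    """Return the highest encoded rate that fits the available bandwidth."""
    for rate in sorted(ladder, reverse=True):
        if rate <= available_kbps:
            return rate
    return None  # nothing in the file is slow enough for this link

ladder = build_ladder_kbps()
print(pick_stream(ladder, 450))   # a mid-range broadband link
print(pick_stream(ladder, 56))    # a dial-up class link
```

Lowering the maximum bandwidth in the player options is, in effect, lowering available_kbps and forcing a different rung of the ladder.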
Step one, simplify the problem by removing devices that could be the problem. First went the PIX at my house. The PIX was performing Network Address Translation of the IP addresses entering my house network and since NAT can give audio and video streaming protocol fits, it seemed logical to try the client without the PIX. No change, error message still being displayed, so probably not the PIX or client side NAT.
The next test was to capture a packet trace when the Media Player worked and compare it to a packet trace when the media player did not work. This helped a little. In the trace that worked, the stream was being negotiated and transmitted via TCP. In the trace that did not work, the stream was being negotiated via UDP and a few UDP packets were showing up in the trace. The odd part was that during the negotiation, the client was replying with a FIN flag on a TCP conversation, which ends the conversation.
Hmm, so the client is negotiating with the server and suddenly the client ends everything. I started wondering if the Windows Firewall was blocking the UDP. Disabling Windows Firewall proved that theory wrong when the problem persisted.
Moving forward, I wanted to test if it was UDP related, so I changed the bandwidth setting in the Windows Media Player options to auto detect, and unchecked (disabled) TCP. Quick try and the same error code, so UDP seemed to be the problem. Conversely, I lowered the maximum bandwidth, disabled UDP and enabled TCP, and the stream came through fine. So we are definitely dealing with a UDP problem.
Next check: does it work on the same network as the Windows Media Server? I remote consoled into the media server at the colocation site, opened Windows Media Player, changed the options to disable TCP, and played the stream. No error message! The slightly bad news was this introduced a new variable into the mix. At home I was using Windows Media Player 11, but at the remote site I had version 10.
So now I am wondering if this is a difference between Windows Media Player 10 (running on Windows 2003) and Windows Media Player 11 (running on Vista). I ran some more tests from the house and it didn’t matter if I used version 11 or version 10, same error code. So this ruled out the version of Windows Media Player as the cause.
Since local worked and remote did not, I was left with the network path as the potential cause. At the collocation site, the server was firewalled from the Internet with a SonicWall firewall which was doing one to one NAT. I started looking through the firewall rule set to see if I had missed some UDP port, but everything here checked out.
So now I am left with just checking everything on the SonicWall Firewall. While going through the setup screens on the SonicWall, I found parameters for how Voice over IP was handled. Specifically there was a setting which enabled H.323 Transformations. H.323 is a standard for handling video conferencing, but both H.323 and RTSP (Real Time Streaming Protocol) use RTP (Real-time Transport Protocol) to deliver the content. Since Windows Media uses RTSP, it made me wonder if the SonicWall was mistaking the underlying RTP for H.323 and manipulating the packets.
A quick uncheck, apply, and another test was made. Holy Crap, it worked! Re-enabling the H.323 transformation on the SonicWall and the UDP streams to the media player client were broken again, confirming this was the magic parameter that needed to be changed. Since I did not have any video conferencing needs, I disabled the H.323 transformations on the SonicWall, saved the configuration, and called it quits on this problem. Beer me!
Tuesday, April 10, 2007
I have to say that was a new one. So I started by finding HHCTRL.OCX to see if the vendor was listed in the control’s properties, and noticed it was signed and published by Microsoft. Looking at the title bar of the error message, RTKHDCTL was listed as the application, and a little research showed this was part of the Realtek Audio driver.
As a simple test, I uninstalled the Realtek audio driver using Add/Remove Programs. After a reboot the error message did not return. Because most people really do want sound on their laptop, from the Device Manager, I right clicked and reinstalled the device, allowing Device Manager to install the latest driver from Windows Update. On the next reboot the error message returned. Hmm, since everything was Microsoft blessed, the next stop was http://support.microsoft.com/. A simple search there proved Melinda was not alone in her error message and it really wasn’t her fault.
It seems the problem was caused when Microsoft issued two security patches for XP, 925902 (MS07-017) and 928843 (MS07-008). The Hhctrl.ocx file that is included in security update 928843 has a conflicting base address with the User32.dll file that is included in security update 925902.
Ok, so what is a conflicting base address? Base addresses are part of the fun of writing program libraries and dynamically loading them into memory. When authoring a DLL (Dynamic-Link Library), you give the compiler a memory address, which the compiler will make the compiled library’s base loading address. Another way of looking at this: if the library was loaded somewhere in your computer’s memory, what would be the memory address of the first byte of your library? All the code execution statements that need to jump to a new memory address, and all the data being accessed, are then determined relative to this base starting memory address.
The DLL is actually loaded into RAM during program execution. If the memory you gave the compiler as your base address is free, the library loads into RAM and is ready to use. If the memory address range you gave during compilation is not free, then the loader must first load the library somewhere else in memory, and then find the places in the code that used an absolute address and change them to match the new location. Nothing really bad about doing that, but it does take time to scan through the code and make all the changes. So part of being a good developer is picking a good base address that would rarely, if ever, be in use.
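A toy Python model of that loading decision may help. It is a simplified sketch, not the Windows loader: the Library class, the fallback_base argument, and the addresses in the usage lines are all made up for illustration, and a real loader finds its own free region and reads the list of addresses to patch from the file itself.

```python
from dataclasses import dataclass, field

@dataclass
class Library:
    name: str
    preferred_base: int          # base address chosen at build time
    size: int                    # bytes of address space the library needs
    absolute_refs: list = field(default_factory=list)  # addresses baked into the code

def overlaps(base, size, used):
    """True if [base, base + size) collides with any occupied (start, end) range."""
    return any(base < end and start < base + size for start, end in used)

def load(lib, used, fallback_base):
    """Load at the preferred base if it is free, otherwise rebase and patch.

    fallback_base is assumed to be free; this sketch does not search for it.
    """
    if not overlaps(lib.preferred_base, lib.size, used):
        base = lib.preferred_base
        refs = list(lib.absolute_refs)           # nothing to patch
    else:
        base = fallback_base
        delta = base - lib.preferred_base
        refs = [addr + delta for addr in lib.absolute_refs]  # the slow part
    used.append((base, base + lib.size))
    return base, refs

used = []
first = Library("first.dll", 0x10000000, 0x1000, [0x10000010])
second = Library("second.dll", 0x10000800, 0x1000, [0x10000900])
print([hex(x) for x in load(first, used, 0x20000000)[1]])
print([hex(x) for x in load(second, used, 0x20000000)[1]])
```

The first library gets the address it asked for and its references are untouched; the second collides with it, is moved, and every absolute address inside it is shifted by the same delta.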
So getting back to Melinda’s laptop, the solution was simple. Microsoft had issued a hotfix which corrects the conflict in memory base address. If you are having this problem, check out Article 935448 at http://support.microsoft.com/, where you will find the hotfix that will fix your PC like it did Melinda’s.
Wednesday, April 4, 2007
A call came from the day manager at the Circle who said their ticket computer had gone “wacko” and he was wondering, needing, begging me to come by and take a look at it. Considering this computer was used to process all ticket sales and concession sales, and was also used for credit card processing, I kind of understood why the manager was in a bit of a panic.
So repair procedure number one is to get from the non-technical description of “wacko” to something a little more technical about what was going on. The system was your typical HP desktop computer with Windows XP, the only odd additions being a special touch screen monitor and point of sale software. After a little bit of Q&A over the phone, the deeper issues came to light. The computer acted hung or frozen after booting. The hard disk drive access light was flashing occasionally, the login screen was up, but when you clicked on a user account, nothing would happen. After repeated tries, the machine would eventually allow a user to login, but even after login the manager described it as incredibly sluggish.
Of course this gets your head spinning about all the other times you have seen slow PCs. Likely causes were a virus, zombie-ware, and worst case, a hard drive going bad. I know, somebody out there is saying, “a hard drive going bad, how could that make the system sluggish?” One of the warning signs of a hard drive starting to die is that parts of the hard disk become unwritable. When the operating system runs out of free RAM, the OS will write a chunk of what is stored in RAM out to the hard drive (technically this is called paging). If a section of the hard drive has become damaged or just worn out, a write error will occur. The operating system will be determined to page the RAM to disk and will retry a few times and then give up, mark this part of the hard drive as bad, and then try a new spot on the drive: wash, rinse, and repeat. All this time, the user of the computer has to wait until the OS gets its way with the hard drive, so the system can look frozen or hung, and then suddenly come back alive when the OS finally finds a good place on the hard disk.
After driving over to the Circle, I watched and yes, the computer acted just as described. The system would boot up fine, but take minutes to respond to a mouse click. You could keep clicking and clicking and then finally, something would load up. I headed back to my house to pick up some tools, disks, and a hard drive so I could make a complete backup of the system before trying to make any changes. My parting words to the manager were that it would be a good idea to find his old credit card imprint machine just in case we could not get this fixed today.
Made it home and started collecting things I might need: my laptop, a spare IDE hard drive, Ghost Boot Disk, screwdrivers, etc. This is the point where you have to appreciate the times when you get to look at a problem, and then walk away for a few minutes, because while I was gathering everything up, I had a thought, “We didn’t really check the mouse”, so just to be safe, I grabbed a spare and headed back to the theatre.
After returning to the theatre, I tried my mouse theory on a whim, and holy crap, that was the problem. The right mouse click button on the old mouse had worn out. We had been like anybody standing at an elevator door hitting the button over and over, thinking this time it would make a difference. The sluggish response was simply the mouse button only responding about every tenth click. Just to be safe, we checked the Windows system event logs for disk problems and none were found. We did a spyware and virus scan and the system came back clean. Then we did a small dance because it was just a bad mouse caused by age, a spilled coke, or popcorn butter instead of something major.
Moral of the story is don’t forget to check out the simple things first before you bring out the big guns and assume it is the worst.