Those of you who regularly deploy Skype for Business for customers will be all too aware that often one of the biggest challenges is not deploying the technology, but making sure that everyone understands the requirements for a successful deployment: from the technicians who build the OS on the servers, to the architects, project managers, network and security teams, and any other third parties that may need to be involved at certain stages of the deployment.
The biggest challenge, by far, is ensuring that your requirements are actually listened to and acted upon by the time you need them. If you manage these two key challenges, then you will be a Skype for Business deployment wizard. I am still learning, it seems, as this post is born out of some real-world pain in this area.
I have been deploying a “cloud first” hybrid solution for a customer. By cloud first, I mean the customer had a trial of Skype for Business Online first, without any on-premises infrastructure. Why are we moving back, you may ask? Good question; the answer is tactical and I cannot disclose it here. Anyway, one of the key elements that needs to work with any hybrid, cloud first or on-premises first, is federation.
For cloud first hybrids this is significantly more important than for on-premises first. Without properly configured Edge servers, certificates, DNS records and firewall rules, federation between on-premises and Office 365 Skype for Business (or indeed any other partner) will simply not work, or will work in one direction only.
Consider the scenario where I had 50 users consuming Skype for Business Online in a pilot deployment (which had become semi-production, it seems). I planned the process so that I could pull these users back to on-prem at a later date with minimal impact. So I had AADConnect set up with same sign-on and the UPN matching the SMTP address (because of internal change control).
Some weeks later I return to deploy the on-prem solution. I need to be especially mindful of DNS records in order not to lose automatic discovery and sign-in for the cloud users. I also need to ensure that the Edge servers are properly configured with the correct firewall rules and NAT in advance. I need to be 100% sure that it’s going to work, because when the time comes to change DNS to point from Office 365 to on-prem, that is when the proverbial will hit the fan if something is not in place as it should be.
So, the pre-DNS-change checklist:
- Edges installed –> check
- Certificate chain installed –> check
- Certificate chain built externally –> check
- Telnet to all ports from external source (443, 5061) –> check
- Telnet from edge to external partner Skype for Business (443, 5061) –> check
- Hosting Provider added for Office 365 –> check
- Tenant configured for shared SIP space –> check
- Reverse Proxy URLs working externally (using hosts file on remote machine) –> check
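The two telnet items in the checklist can be scripted rather than typed by hand. What follows is a minimal sketch in Python, not part of the original deployment; the function name `check_tcp_ports` is made up for illustration. Like telnet, it only proves that a TCP connection opens on each port and says nothing about what happens to the traffic sent afterwards.

```python
import socket
from typing import Dict, Iterable


def check_tcp_ports(host: str, ports: Iterable[int], timeout: float = 3.0) -> Dict[int, bool]:
    """Return {port: True/False} depending on whether a TCP connection opens."""
    results: Dict[int, bool] = {}
    for port in ports:
        try:
            # create_connection completes the TCP three-way handshake or raises OSError.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and name resolution failures.
            results[port] = False
    return results
```

For example, `check_tcp_ports("sip.example.com", [443, 5061])` would test both ports against a placeholder hostname; substitute your own Access Edge FQDN, or a partner's, and run it from both an external machine and the edge itself.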
With the pre-DNS checklist complete, I should be good to go, so let’s make the DNS change. OK, done. Now everything appeared to continue to work, but then it would, because DNS changes can take up to 48 hours to propagate through caches and servers worldwide. The real test would be hour 49+.
Hour 49 came. Houston, we have a problem. Cloud users cannot federate with on-prem users. Cloud users cannot federate with external partners. On-prem users cannot federate with cloud users. On-prem users cannot federate with external partners, but external partners can see on-prem users’ presence and send them an IM. External partners cannot federate with cloud users.
So clearly, federation is busted. The usual troubleshooting tools, CLSLogger and Snooper, to the rescue.
Snooper tells us that we have a 504 Server timeout error. The reason is that we failed to negotiate TLS with the peer, and therefore the peer closed the connection to port 5061. Let’s take a look at ms-diagnostics error number 1047. Consulting Paul Bloem’s (MVP) reference page (http://ucsorted.com/2013/08/28/ms-diagnostic-id-reference-page/) to find out what this means, we can see from the table that we appear to have a certificate problem.
I was not sure how I could have a certificate problem. I generated the CSR on the edge server using the deployment wizard, and I am using a tier 1 SSL provider (GlobalSign), so the certificate should be trusted by the peer. Testing on SSL Labs showed that the server was sending the certificate along with its intermediate in response to HTTPS requests, and that it was trusted. I had the private key, and I even ran DigiCert’s SSL tool, which showed no errors.
The error was leading me down the path of an incorrect certificate, but double, triple and even quadruple checking my configuration showed that I had not made a mistake with the certificate. It couldn’t be that, unless GlobalSign suddenly revoked their root certificate.
I took a step back and then it dawned on me. The SSL Labs and DigiCert tools test the certificate over HTTPS. But federation uses SIP over TLS on port 5061, not HTTPS at all. Then I remembered I could telnet to Office 365 and other partners on 5061, so this should work. Playing devil’s advocate, I installed the whole certificate chain on my lab edge server and attempted federation again. Some people have cited this as a fix for this particular error. For me it did not resolve it. Furthermore, I would not have been happy sending the intermediate and root certificates to every partner; it’s not manageable, and I would like to hear what Microsoft would have to say if we politely asked them to install “this” certificate across their entire Office 365 cloud.
So, now I have ruled out anything to do with the certificate. What else? Surely the security team have not added deep packet inspection to the outbound rule? They haven’t for the inbound rule, so why would they for the outbound? Let’s investigate.
Firstly, I don’t have access to the firewall, so I will have to do this the old-fashioned way and prove it at packet level. First, let’s take a capture on the edge external access interface.
So here I can see that the edge server is trying to complete TLS negotiation with the peer edge. But the peer edge does not respond to the TLS requests when the source edge sends its Client Hello. Instead, a RESET packet is received from the peer edge, closing the connection.
Normally, this would suggest that the problem is at the peer end, and usually that would be right. This is where the recommendation to send your certificate chain to the peer administrator to install on their edges would “get around” the problem. In my case I knew this was unreasonable, and I wanted to see what the peer edge actually received and responded with.
From my lab edge I took a simultaneous capture.
WOW, here I see the source edge sending a SYN packet, and my edge replying with its own SYN and an ACK(nowledgement). Then, nothing… and finally a RESET coming from the source edge!
Going back to the original Snooper trace and SIP error, we can see that it said that “the peer closed the connection”. In this case it seems both the source edge and the peer edge were equally responsible for tearing down the connection, as each received a RST packet from the other.
More importantly, my lab edge does not even see the TLS requests coming from the source edge.
The problem: Packet Inspection
It seems, after all, that DPI had been enabled on the outbound rule. The DPI rule stated that TLS traffic should have a destination port of 443 and use the HTTPS protocol. This explains why HTTPS certificate checking succeeded, and also why my lab edge could not see the TLS request over port 5061/SIP. The source firewall was simply refusing to forward the packet because it failed DPI.
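The reasoning behind the two simultaneous captures can be written down as a small comparison: anything one side sent that the other side never saw was dropped in transit, and anything one side received that the other side never sent was injected by a device in between. The sketch below, in Python, is only an illustration of that logic. The `Packet` type, the `compare_captures` function and the two packet lists are a hand-written, simplified model of the captures described above (including the assumption that the final ACK of the TCP handshake arrived), not the real packet data.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Packet:
    src: str   # endpoint the packet claims to come from
    kind: str  # simplified label: "SYN", "SYN-ACK", "ACK", "CLIENT_HELLO", "RST"


def compare_captures(capture_a: Sequence[Packet], capture_b: Sequence[Packet],
                     host_a: str, host_b: str) -> List[str]:
    """Compare what each host sent with what the other host actually saw."""
    findings: List[str] = []
    pairs = ((host_a, capture_a, host_b, capture_b),
             (host_b, capture_b, host_a, capture_a))
    for sender, sender_cap, receiver, receiver_cap in pairs:
        sent = [p.kind for p in sender_cap if p.src == sender]
        arrived = [p.kind for p in receiver_cap if p.src == sender]
        for kind in sent:
            if kind not in arrived:
                findings.append(f"{kind} sent by {sender} never reached {receiver}")
        for kind in arrived:
            if kind not in sent:
                findings.append(f"{kind} seen at {receiver} as from {sender} "
                                f"was never sent by {sender}")
    return findings


# Illustrative model of the capture taken on the source (customer) edge.
source_capture = [Packet("source", "SYN"), Packet("lab", "SYN-ACK"),
                  Packet("source", "ACK"), Packet("source", "CLIENT_HELLO"),
                  Packet("lab", "RST")]
# Illustrative model of the simultaneous capture taken on the lab edge.
lab_capture = [Packet("source", "SYN"), Packet("lab", "SYN-ACK"),
               Packet("source", "ACK"), Packet("source", "RST")]

findings = compare_captures(source_capture, lab_capture, "source", "lab")
```

With this model the comparison yields three findings: the Client Hello never reached the lab edge, and each side saw a RST that the other side never sent. Together these point at something on the path rather than at either edge server.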
For the record, Skype for Business does not support DPI; it should be disabled along with SIP ALG. Please ensure that you make this abundantly clear throughout the process to avoid falling into the same hole as me.
As you can probably tell, I write this post with a degree of sarcasm and frustration. However, on a serious note, this took a lot of time to find and prove, so I hope that reading this will help you save yours if you ever come across the same or similar issue.