Sniffer Placement In the Data Center
This post is a follow-up to my last post on data center sniffers.
It’s critical to place your sniffer strategically in the data center so that you get the traffic you’re looking for and so that the traffic is valid. A capture becomes invalid if your Ethernet circuit is oversubscribed, if your sniffer is oversubscribed, if you’re missing some of the traffic, or if you’re getting the wrong traffic.
Ask these questions:
- Do I want to capture all of the traffic in my data center?
- To capture all traffic you need enough disk space and SPAN circuits that are not oversubscribed. This may require multiple sniffers, which in turn require an aggregator to view the data properly.
- Do I want to capture a portion of the traffic in my data center?
- If you want to capture relevant data and disregard other data, you need to find a way to remove the unwanted data from the picture.
- You can filter at the sniffer. This is only a valid option if your sniffer is not already oversubscribed by the data being mirrored. If the SPAN ports are oversubscribed then you risk having invalid data stored on your sniffer even if you filter at the appliance level. The sniffer will ‘think’ that there is packet loss and your troubleshooting tool may be useless.
- You can filter VLANs at the switch. If the traffic you want to block is on a specific VLAN you can filter that VLAN out of your SPAN/mirror configuration. This is helpful for traffic such as database replication, data center replication, and tape backups, which can take up the bulk of your disk space. This only works if that traffic is segregated onto its own VLANs.
- You can filter with an ACL. Some core switches, such as the Cisco 6500 and the Juniper QFX3500 (and just recently the c7000), support VLAN ACLs (vACLs). These can drop traffic on your mirror port before it is forwarded across the link. My only issue with vACLs is that they are super buggy and I do not trust them. They also may or may not be enforced in hardware and could potentially
- You can filter with a traffic filtering appliance. I do not believe there is an official name for these yet. Netoptics has its SmartFilter, Gigamon has its Traffic Visibility Node, Apcon has its Inovative Solution, and Anue has its Net Optimizer. In any case these are chassis that look like switches. You connect the ports you want monitored to the ports on these ‘matrix switches’ and you connect your monitoring sniffer to them as well. The matrix switches then have a UI that allows you to configure how you want your traffic forwarded from your SPAN ports to your sniffer ports. These are very powerful devices; however, they can be as expensive as your sniffer, or more so! It’s very difficult to justify the sniffer’s cost, much less the cost of these devices. They are pretty awesome though, and they are critical if you want to keep traffic analysis on ONE sniffer while your data center pushes high traffic throughput.
- You can also just filter which ports you monitor. This may or may not be an option. Say you just want user traffic? Then you can monitor only your WAN uplink ports and not worry about your downstream data center LAN ports. This can cut the traffic significantly by leaving out server-to-server traffic through the core. This is OK when the bulk of your tickets come from the users and not the server guys. For me I’d say the ticket split is more along the lines of 70% server guys and 30% users during deployments, and 70% users and 30% server guys during production. In other words this method does not work for me.
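Before picking one of the filtering options above, it helps to put rough numbers on the two constraints: whether the SPAN destination is oversubscribed, and how long the disk will last. The following is only a back-of-the-envelope sketch in Python. The function names, the traffic rates, and the 48 TB disk size are all made-up values for illustration, not figures from any real sniffer.

```python
def span_utilization(source_rates_gbps, destination_gbps):
    """Return the mirrored load as a fraction of the SPAN destination capacity."""
    return sum(source_rates_gbps) / destination_gbps


def retention_hours(disk_tb, average_capture_gbps):
    """Hours of full packet capture that fit on disk at a steady average rate."""
    bytes_per_second = average_capture_gbps * 1e9 / 8
    return disk_tb * 1e12 / bytes_per_second / 3600


# Hypothetical example: four mirrored sources feeding one 10 Gb/s SPAN port.
sources = [4.0, 3.5, 2.0, 1.5]
utilization = span_utilization(sources, 10.0)
print(f"SPAN destination utilization: {utilization:.0%}")
if utilization > 1.0:
    print("Oversubscribed: mirrored packets will be dropped before the sniffer")

# Hypothetical example: 48 TB of capture disk at a 2 Gb/s average rate.
print(f"Retention: {retention_hours(48, 2.0):.1f} hours")
```

Peak rates matter more than averages for the oversubscription check, while averages are what drive disk retention, so the two functions should be fed different numbers.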
Do I care about getting duplicate packets?
- Modern sniffers have dedupe functionality so do I really care about getting duplicate packets? Yes! The sniffers do not dedupe the hard pcap data and duplicate packets add additional load.
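To make the dedupe point concrete, here is a minimal Python sketch of one common way to drop duplicates: hash each frame and skip any frame whose hash was already seen within a short time window. This is an illustration only, not how any particular sniffer implements it. The function name and the 5 ms window are assumptions, and real duplicates from different SPAN points may differ in VLAN tag or TTL, which a plain hash of the whole frame would not catch.

```python
import hashlib
from collections import OrderedDict


def dedupe(packets, window_seconds=0.005):
    """Yield (timestamp, frame) pairs, skipping copies seen within the window.

    `packets` is an iterable of (timestamp_seconds, frame_bytes) in time order.
    """
    seen = OrderedDict()  # digest -> timestamp of the first sighting
    for timestamp, frame in packets:
        # Expire digests older than the window so memory stays bounded.
        while seen:
            _, oldest_time = next(iter(seen.items()))
            if timestamp - oldest_time <= window_seconds:
                break
            seen.popitem(last=False)
        digest = hashlib.sha1(frame).digest()
        if digest in seen:
            continue  # same frame seen moments ago: treat as a duplicate
        seen[digest] = timestamp
        yield timestamp, frame


# Hypothetical capture: the second frame is a copy of the first.
capture = [(0.000, b"a"), (0.001, b"a"), (0.002, b"b"), (0.010, b"a")]
unique = list(dedupe(capture))
print(f"{len(capture) - len(unique)} duplicate(s) out of {len(capture)} frames")
```

Note that every duplicate still had to cross the SPAN link and be processed before it could be discarded, which is the extra load described above.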
Where should I install and configure my sniffer?
- This is the big question. You can install at many points in your network. I’ll go over a few, sans the matrix switch implementation.
Install at your core and SPAN Vlans
Here I’ve installed a sniffer in the core (or aggregation). I have configured two 10Gig uplinks, one from each core switch, to terminate on the sniffer. These are the monitor/SPAN/mirror ports. In the core switches’ configuration I’ve configured a SPAN session and selected the VLAN traffic that I want to view. I’ll select RX only so as not to receive duplicate packets. This is an easy configuration, but the core switches have limitations on how many VLANs can be added to a monitor session. The limit is only 32 in Nexus.
Here I’ll configure all uplinks to the WAN routers as my SPAN source and select both RX and TX. This will give me a picture of all traffic into and out of my data center but will not give me a picture of intra-data center traffic, i.e. server-to-server traffic. The limitation in Nexus on the number of physical source interfaces is 128. This method saves disk space but does not help with troubleshooting internal data center issues.
Here I’ve configured my LAN- and WAN-facing uplinks as my SPAN source and I’ll select RX only. I do not SPAN my inter-chassis uplinks so as not to receive duplicate packets. This implementation will give me insight into ALL of my data center traffic (except for packets originating at the core). I’ve found this to be the best model, giving me a view into everything. I’ve also found that this oversubscribes the ARX pretty quickly, and so I filter replication VLANs and tape backup VLANs within my SPAN session. This model has been good enough for most data centers I’ve supported (fewer than ~100 server racks), but the device becomes oversubscribed after ~100 racks even with no replication or backups. I then have to either just say screw it and be oversubscribed, remove interesting traffic from my capture and hope for the best, or use a matrix switch.
Recently Nexus has come out with hardware ACL capture. This allows an IP access-list to be applied to an interface for a SPAN session. I’ve not used this yet but it looks promising (and looks totally buggy). It also turns off ACL logging, so I do not really know if this is something I can use.
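The reasoning behind RX only on the edge-facing ports, with no SPAN on the inter-chassis links, can be checked with a small model. The Python sketch below counts how many copies of a single packet reach the sniffer for a given set of SPAN sources. The switch names, port names, and paths are hypothetical and only stand in for a two-core design like the one described here.

```python
def mirrored_copies(path, span_sources):
    """Count how many copies of one packet reach the sniffer.

    `path` is a list of (switch, ingress_port, egress_port) hops.
    `span_sources` is a set of (switch, port, direction), direction "rx" or "tx".
    """
    copies = 0
    for switch, ingress, egress in path:
        if (switch, ingress, "rx") in span_sources:
            copies += 1
        if (switch, egress, "tx") in span_sources:
            copies += 1
    return copies


# Hypothetical ports: "lan1" and "wan1" face the edge, "peer" is inter-chassis.
edge_ports = [(sw, port) for sw in ("core1", "core2") for port in ("lan1", "wan1")]
rx_edge = {(sw, port, "rx") for sw, port in edge_ports}
rx_tx_edge = rx_edge | {(sw, port, "tx") for sw, port in edge_ports}
rx_edge_and_peer = rx_edge | {("core1", "peer", "rx"), ("core2", "peer", "rx")}

direct = [("core1", "lan1", "wan1")]  # in and out of the same core
via_peer = [("core1", "lan1", "peer"), ("core2", "peer", "wan1")]  # crosses cores
from_core = [("core1", None, "wan1")]  # packet originated by the core itself

for name, sources in [("RX edge only", rx_edge),
                      ("RX+TX edge", rx_tx_edge),
                      ("RX edge + inter-chassis", rx_edge_and_peer)]:
    print(name, mirrored_copies(direct, sources), mirrored_copies(via_peer, sources))
print("core-originated, RX edge only:", mirrored_copies(from_core, rx_edge))
```

In this model, RX only on the edge ports yields exactly one copy on either path, RX plus TX yields two, and adding the inter-chassis link brings duplicates back for packets that cross between cores. The core-originated packet yields zero copies, which matches the exception noted above.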
Do not forget about your data center services
You’re bound to have firewalls, load balancers, and other services in your data center. Make sure you SPAN the ports facing those as well, RX only, because these data paths are always of interest. Not spanning here can leave you with a unidirectional data capture.
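One way to spot a unidirectional capture after the fact is to look for flows that only ever appear in one direction. The Python sketch below does this on plain 5-tuples. The function name and sample addresses are made up for illustration, and it assumes the addresses are not rewritten along the path. A load balancer or firewall doing NAT will break the naive reverse match.

```python
def one_way_flows(packets):
    """Return flows that appear in only one direction.

    `packets` is an iterable of (src_ip, src_port, dst_ip, dst_port, protocol).
    """
    seen = set(packets)
    return sorted(
        flow for flow in seen
        # Swap source and destination to build the expected reverse flow.
        if (flow[2], flow[3], flow[0], flow[1], flow[4]) not in seen
    )


# Hypothetical capture: the second client never shows any return traffic.
capture = [
    ("10.0.0.5", 40000, "10.1.0.10", 443, "tcp"),
    ("10.1.0.10", 443, "10.0.0.5", 40000, "tcp"),
    ("10.0.0.6", 40001, "10.1.0.10", 443, "tcp"),
]
for flow in one_way_flows(capture):
    print("no reverse direction seen for", flow)
```

A long list of one-way flows through a firewall or load balancer is a hint that one side of that service is not in the SPAN session.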
Finally I’ll leave off with a gripe. I often have users who have issues with webpages. All webpages these days are encrypted. That means that the pcap data stored on the sniffer is more or less useless for analysis above Layer 4. This is also true with Microsoft Lync, in which all SIP sessions are encrypted. For that reason I think that SSL acceleration should not be done on the load balancer, server, or WAAS. I feel that we should be given an option to terminate SSL, and other encryption methodologies, on the sniffer! Then all of the traffic in the pcap data would be unencrypted and the voice and HTTP modules would actually WORK!
I’m probably crazy suggesting such a thing but I could see how this could be so helpful to some troubleshooting I’ve done in the past.