I have always been the guy in our network analysis team responsible for the actual capture of network packets. I bought all the recording hardware we used, acquired network TAPs of all sorts and speeds, and did most of the planning of where to put which engine.
One of the most complicated analysis jobs took two weeks to plan, and involved major headaches like SSL encrypted links, a load balancer, NAT devices and a huge VMware infrastructure. The VMware part was the biggest challenge of all, because we had to find a place where we could capture the traffic of three virtual machines running inside a DRS cluster, and we had to make sure we really didn’t miss anything coming from or going to these servers.
Later, when I was teaching Wireshark courses at Fast Lane, the topic of capturing the traffic of virtual machines came up every once in a while when I spoke about data capturing methodology in class. Since I’m also a certified VMware instructor it happened more than once that another instructor teaching the Wireshark class asked me how to do this, and sometimes even pulled me into his own class to speak about capturing virtual machines for a few minutes. And since that topic seems to become more and more popular I thought it would be a good idea to write a little how-to about it.
Traditional capture setup
Usually, the first thing I do when I try to capture packets to solve a problem is determine the best location to set up the sniffer. It can either be put close to the client, or to the server, or somewhere in the network path between the two nodes. Sometimes, I use more than one capture location at the same time, for example at the client and at the server.
With purely physical networks the chances of selecting a good spot for the capture are pretty good – unless it is a very complex network with lots of redundancy and high speeds in the backbone. All you have to do is to find out which path the packets travel along and pick them up somewhere you like. Well, you need to determine if you can afford to use a SPAN port, or if you need to go by a TAP, but that’s usually it.
The virtual environment
In virtual environments things get a lot more complicated since there often is no physical spot where you can easily pick up packets from a single virtual machine. Consider the following example (and let’s assume all physical links are gigabit links):
Let’s say we want to take a look at anything the Mail server sends or receives. How do we do that? Well, we face a couple of problems here:
- We could do a SPAN port on the physical switch, but we don’t know what physical link the mail server will actually use since there are multiple network cards connecting the virtual switch to the outside world. We’d have to capture both at the same time, and not all switches allow us to do that. But let’s say ours does. We still have the problem that we copy frames from two full duplex links to an “output only” link, which means in worst case situations there would be 2 times 2 Gbps copied to one link with just 1 Gbps output capacity. Guess what? The switch will drop frames left, right and center on the SPAN session – and our capture box will not even notice. Note that capture filters won’t help either, because the frames do not get that far in the first place.
- The second problem is that not all packets the mail server might receive or send travel through the physical link. It could communicate with the web server, and none of the packets would ever have to leave the virtual environment, and you’d never capture them. They would simply travel from one server via the virtual switch to the other server.
- Let’s make things worse. Let’s assume we have an enterprise level virtualization platform running, which in my case would mean VMware vSphere. With vSphere, most big environments run clusters of virtualization hosts, and that cluster is usually DRS enabled. DRS stands for “Distributed Resource Scheduler”, which is kind of a virtual machine load balancer: it can move machines from one physical box to another on its own, at least if it is running in fully automated mode. Guess what happens to our SPAN session from problem #1? As I demonstrated live in my Sharkfest talks twice, the capture will just not see any more packets as soon as the mail server is moved away from the physical host. In the capture, it looks like the communication had been cut in a single instant. At the same time, the mail server keeps sending and receiving as if nothing ever happened, but on a different physical link. Ouch.
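To put numbers on the SPAN problem from the first bullet, here is a minimal sketch of the worst-case arithmetic. The function name and parameters are made up for illustration, and it assumes every monitored link runs at line rate in both directions at once, which is the worst case and not what you’d typically see:

```python
def span_oversubscription(monitored_links, link_gbps, span_gbps):
    """Worst-case load a SPAN session is asked to copy, in Gbit/s.

    Each full duplex link can carry link_gbps in each direction. The SPAN
    port only transmits, so both directions of every monitored link compete
    for the single output direction of the SPAN port.
    """
    worst_case_gbps = monitored_links * 2 * link_gbps
    excess_gbps = max(0.0, worst_case_gbps - span_gbps)
    # Share of the offered traffic that cannot fit on the SPAN port.
    drop_fraction = excess_gbps / worst_case_gbps if worst_case_gbps else 0.0
    return worst_case_gbps, excess_gbps, drop_fraction


# Two gigabit uplinks mirrored to one gigabit SPAN port, as in the example.
worst, excess, dropped = span_oversubscription(
    monitored_links=2, link_gbps=1.0, span_gbps=1.0
)
print(f"worst case offered load: {worst:.0f} Gbit/s")
print(f"does not fit on the SPAN port: {excess:.0f} Gbit/s ({dropped:.0%} of the traffic)")
```

In other words: at full load three out of four mirrored bits have nowhere to go, and nothing on the capture box tells you that they are missing.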
It’s time for a little bet: I bet that you’ve thought about simply installing Wireshark on the mail server at least once so far. Easy, right? No fuss with SPAN, TAP, virtual environments and whatnot. Well, you’re right. But that “easy” way of capturing the mail server’s packets has serious flaws. I won’t go into much detail here, but capturing packets on the node in question is a pretty bad idea: depending on the server setup you’ll see ghosts like tons of CRC errors and huge over-sized frames. And that does not even include the fact that you’ve drastically changed your problem environment and risk system stability/performance while capturing. Virus scanners and personal firewalls may add further strange results, so – unless you really know what you’re doing – let’s forget about capturing packets directly on a node once and for all. It leads to the dark side! 😉 You can read more about why local captures are a bad idea here.
Virtual capture setups
Okay, by now we know that capturing packets coming from or going to a virtual system most likely requires a new strategy. Somehow, we need to get our SPAN/TAP into the virtual environment, and if you’re searching the internet for this kind of thing you’ll probably find a couple of commercial products that will help you with that. And while they’re not bad I’d like to show you how to perform captures with what we already have, at least when running a VMware vSphere setup. I guess other virtual environments offer similar options but I haven’t worked with them yet, so I’ll stick to VMware here.
VMware vSphere offers two kinds of virtual switches: standard and distributed. The standard vSwitch is what every vSphere installation has, no matter what license it is running on. Distributed vSwitches are only available for those who have an Enterprise Plus license, so I’ll focus on the standard vSwitch in this post, which might look like this in a very simple environment:
You’ll notice that there are virtual machines on the left, and physical network cards on the right. There also is a so called “Port Group” called “Production”. All virtual machines have to be connected to a port group, and while some administrators think of port groups as simple VLAN groups they are more than just that. They are ports grouped together with a specific set of properties that we will take a closer look at soon.
Now, let’s assume that Server1 and Server2 are talking to each other and we need to capture what is going on, for example by capturing all packets of Server1. Fortunately there is a vSwitch feature called “Promiscuous Mode”, which should sound familiar if you’ve already captured data before. But unlike the promiscuous mode we know from Ethernet NICs (which means that the NIC accepts all frames that arrive instead of filtering on its MAC address) the promiscuous mode of a vSwitch basically means that it will become a hub (well, sort of): it will forward all packets to all ports, which means that all virtual machines will see all packets of all other machines as well. It should be obvious why vSwitches are not in promiscuous mode by default – it floods VMs with traffic they don’t care about, and it allows rogue VMs to sniff packets they should not be able to sniff. Of course sniffing packets is what we want to do, so promiscuous mode seems to be the way to get them. The only problem is that when you just do that on the vSwitch your production VMs could all drown in packets, especially if you have not just a few VMs like in the test setup above, but maybe a couple of hundred. And that’s where port groups come in.
The Port Group “Trick”
Port groups are used to partition a vSwitch, and as I already said most VMware administrators I talked to about port groups just think about them as a tool to create different VLAN groups. But they also have their own security settings, including a toggle for promiscuous mode, which means that you can enable promiscuous mode just for some VMs and avoid the huge packet flood. And port groups have a feature that often comes as a surprise: you can create multiple port groups with identical settings. So if you need to capture the traffic of a VM like “Server1” in the example setup you can do what I do:
- Create a temporary port group with settings identical to the one Server1 is connected to. This means that you’ll have to make sure that the VLAN setting is exactly the same.
- Move the Server1 VM to the temporary port group. Whenever I did this in the past I did not lose any connection the server had at that moment – but of course I can’t guarantee that you won’t in your case.
- Create a capture VM running e.g. Wireshark and connect it to the same temporary port group:
- Enable promiscuous mode on the temporary port group by setting the override checkmark for “Promiscuous Mode” and choosing “Accept” instead of “Reject”:
- Log into your capture VM and capture packets. When capturing with a Windows machine I usually disable all protocol bindings on the network card to force it to become completely passive:
- Analyze 😉
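If you like to double check a plan like this before touching production, the settings from the steps above can be written down and compared. This is only a sketch: `PortGroup` and `check_capture_plan` are hypothetical names for illustration, not part of any VMware API, and you would have to fill in the values by hand from the vSphere client:

```python
from dataclasses import dataclass


@dataclass
class PortGroup:
    """Hand-filled description of a port group (hypothetical, not a VMware type)."""
    name: str
    vlan_id: int
    promiscuous: bool = False  # False = "Reject", True = "Accept"


def check_capture_plan(original, temporary):
    """Return a list of problems with the temporary capture port group."""
    problems = []
    if temporary.vlan_id != original.vlan_id:
        problems.append(
            f"VLAN mismatch: {temporary.name} uses {temporary.vlan_id}, "
            f"{original.name} uses {original.vlan_id}"
        )
    if original.promiscuous:
        problems.append(f"{original.name} should stay on Reject to avoid flooding production VMs")
    if not temporary.promiscuous:
        problems.append(f"{temporary.name} needs Promiscuous Mode set to Accept")
    return problems


production = PortGroup("Production", vlan_id=100)
capture = PortGroup("Capture-Temp", vlan_id=100, promiscuous=True)
for problem in check_capture_plan(production, capture):
    print("PROBLEM:", problem)
```

The VLAN ID 100 and the port group name “Capture-Temp” are made-up example values. An empty result just means the plan matches the steps listed above, nothing more.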
Well, of course you’ll have to move the VM back to the original port group when you’re finished and remove the temporary port group. There are two things you need to keep in mind:
- You’ll have to consider the amount of disk I/O that your capture VM will generate by writing packets to its VMDK, so please make sure that you don’t get into trouble with your storage administrators by putting additional load on their storage system.
- Very important: the capture of traffic using port group promiscuous mode only works if the capture VM is on the same ESXi host as the VM that you want to capture the traffic of. Otherwise you’ll only see broadcast/multicast packets. So you need to make sure that you move all VMs to the same ESXi host before you start the capture.
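To get a feeling for the disk I/O mentioned in the first point, a rough estimate helps before you talk to the storage team. The sketch below assumes the capture is written in the classic pcap format (24 byte file header, 16 byte record header per packet) and that no snap length or capture filter reduces the data. pcapng has a somewhat larger per-packet overhead, so treat the result as a lower bound in that case. The function name and the example traffic figures are assumptions for illustration only:

```python
PCAP_GLOBAL_HEADER_BYTES = 24  # classic pcap file header, written once
PCAP_RECORD_HEADER_BYTES = 16  # classic pcap header in front of every packet


def capture_disk_estimate(avg_mbps, avg_frame_bytes, duration_s):
    """Estimate sustained write rate (bytes/s) and file size (bytes) of a capture.

    avg_mbps        average captured traffic in Mbit/s
    avg_frame_bytes average frame size in bytes
    duration_s      capture duration in seconds
    """
    packet_bytes_per_s = avg_mbps * 1_000_000 / 8
    frames_per_s = packet_bytes_per_s / avg_frame_bytes
    write_bytes_per_s = packet_bytes_per_s + frames_per_s * PCAP_RECORD_HEADER_BYTES
    total_bytes = PCAP_GLOBAL_HEADER_BYTES + write_bytes_per_s * duration_s
    return write_bytes_per_s, total_bytes


# Example values only: 200 Mbit/s average, 800 byte frames, 10 minutes.
rate, size = capture_disk_estimate(avg_mbps=200, avg_frame_bytes=800, duration_s=600)
print(f"sustained write rate: {rate / 1_000_000:.1f} MB/s")
print(f"capture file size:    {size / 1_000_000_000:.1f} GB")
```

Small frames make things worse, because the per-packet header becomes a bigger share of what ends up on disk.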