Trace file anonymization, trace file sanitization… it seems like I can’t decide whether to call it “Sanitization” or “Anonymization” – even in my code base it is sometimes called the former, sometimes the latter. Of course there is a small difference between the two – one removes sensitive data by cutting it away, while the other replaces it with something generic.
So far I have presented two talks at Sharkfest, the first in 2011 and the second in 2013. There are lots of programs and scripts out there that try to help with sanitizing trace files. And yet I created another one myself, for a couple of reasons. Some of them are:
- None of the existing tools can read and write the PCAPng file format, so using them would require me to convert my files to PCAP first, which is unacceptable since I would lose a lot of important details in the process.
- The most common tools like bittwiste and tcprewrite are part of tool kits created for packet replay and are basically command line packet editing tools. Both seem to patch data into existing packets, replacing details like IP addresses at specific offsets. This is very problematic when the information that should be replaced is not found at the offset where the tools think it should be, e.g. when bittwiste stumbles across VLAN tags. Neither can deal with complex protocol sequences, like IPv6 tunneled over IPv4, maybe even with protocols like AYIYA that allow NAT traversal by using a UDP layer.
- A big problem of sanitizing trace files is the fact that it is quite likely that not all sensitive information is removed or replaced, simply because it was overlooked or its location or relevance was unknown. For example IP addresses: there are so many places where they may be present – of course in the IP header, but also in the IP options, other layers like ARP, DNS, DHCP, or even in TCP options. So to prevent accidental leakage of information a sanitization process must perform a “Defensive Transformation” as described by the authors of pktanon at https://telematics.tm.kit.edu/english/article.php?publication_id=298&language_id=2. This also means that a tool needs to understand each byte of a frame to be allowed to anonymize it, and otherwise has to discard it if it can’t figure out what it stands for.
- When cutting away the layers that can’t be anonymized, many tools (or even all of them, since I haven’t found one yet that works) fail to reduce only the captured size of a frame and instead set the wire size to the remaining frame size as well. When Wireshark sees a shorter wire size than the original had, you will drown in lost segment messages because the TCP expert thinks that there is a gap in the TCP payload data.
- I want maximum control over the sanitization/anonymization while making it easy and comfortable to configure the tool. Also, a sanitization process should do everything in one go, because I hate issuing two dozen command line calls to bittwiste just to replace a couple of things one after the other.
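To make the captured size versus wire size point from the list above concrete, here is a minimal sketch of how a frame could be cut down without confusing the TCP analysis. It assumes the classic little-endian PCAP per-packet record header purely for brevity (PCAPng's Enhanced Packet Block carries the same two lengths as captured and original packet length). The function name `truncate_record` is made up for this illustration and is not taken from any of the tools mentioned.

```python
import struct

# Classic PCAP per-packet record header, little-endian variant (assumption):
# ts_sec, ts_usec, incl_len (bytes captured), orig_len (bytes on the wire)
RECORD_HEADER = struct.Struct("<IIII")


def truncate_record(header, data, keep):
    """Keep only the first `keep` bytes of a frame.

    Returns (new_header, new_data). Only the captured length is reduced;
    the original wire length is written back unchanged.
    """
    ts_sec, ts_usec, incl_len, orig_len = RECORD_HEADER.unpack(header)
    kept = data[:min(keep, incl_len)]
    # Deliberately NOT setting orig_len to len(kept): the frame really was
    # orig_len bytes long on the wire, we just no longer store all of it.
    new_header = RECORD_HEADER.pack(ts_sec, ts_usec, len(kept), orig_len)
    return new_header, kept
```

With the wire length left alone, an analyzer can still work out how much TCP payload the frame carried, even though the payload bytes themselves are gone.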
Sanitization tools for Network Analysts
There are a lot of tools out there to sanitize or anonymize trace files, but as far as I can tell none are written with the focus on preparing files for tool demonstration purposes and teaching protocol analysis. It’s great that you can replace tons of different fields and basically toggle every single bit in the TCP header, but that doesn’t help me if the resulting file is so heavily modified that I can’t use it anymore to show the things I wanted to show. For example, one of the easiest things to do is to randomize the IP address by rolling the dice and generating a new address to replace the original. But to do it right it needs to
- be consistently replaced throughout the whole file or even multiple files. Most tools can do that. But wait, there’s more.
- it should not change its type from Unicast to Broadcast or Multicast, so the replacement needs to be of the same address class as the original. This gets a bit complicated for IPv6, because you also need to keep track of multicast addresses like the solicited node multicast address, link local addresses, and other things
- it’s possible (even if unlikely) that for two different original IP addresses the randomization process generates the same random IP by sheer “luck”. This is not good since it would look like the same IP is doing things in the sanitized file that it never did in the original. So any generated address needs to be checked for its uniqueness and re-rolled if it is a duplicate.
- The IP address may appear in lots of different places, like the IP, ARP, DNS and DHCP layers, ICMP “quotes” etc., and all of those occurrences need to be replaced by the same new address
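The requirements in the list above can be sketched in a few lines of Python for the IPv4 case. This is only an illustration under assumptions, not how TraceWrangler does it: the class name `AddressMapper` is made up, the limited broadcast and the unspecified address are simply passed through, and subnet-directed broadcasts are ignored because they cannot be recognized without knowing the netmask.

```python
import ipaddress
import random

PASS_THROUGH = {
    ipaddress.IPv4Address("255.255.255.255"),  # limited broadcast
    ipaddress.IPv4Address("0.0.0.0"),          # unspecified, e.g. DHCP discover
}


class AddressMapper:
    """Hands out one random replacement per original IPv4 address."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._forward = {}  # original -> replacement
        self._used = set()  # replacements already handed out

    def _roll(self, original):
        if original.is_multicast:
            # Multicast stays multicast: 224.0.0.0/4 plus 28 random bits.
            return ipaddress.IPv4Address((0xE0 << 24) | self._rng.getrandbits(28))
        while True:
            # Unicast stays unicast: first octet 1..223, no loopback/link-local.
            first = self._rng.randint(1, 223)
            candidate = ipaddress.IPv4Address((first << 24) | self._rng.getrandbits(24))
            if not (candidate.is_loopback or candidate.is_link_local):
                return candidate

    def map(self, address):
        original = ipaddress.IPv4Address(address)
        if original in PASS_THROUGH:
            return original
        if original not in self._forward:
            candidate = self._roll(original)
            while candidate in self._used:  # re-roll duplicates
                candidate = self._roll(original)
            self._used.add(candidate)
            self._forward[original] = candidate
        return self._forward[original]
```

The important part is that every layer that may contain an address (IP, ARP, DNS, DHCP, ICMP quotes) asks the same mapper instance, so an original address always turns into the same replacement. Keeping one instance alive across several files would extend that consistency to a whole set of traces.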
It gets a lot more complicated than just replacing addresses correctly though. In network analysis, it is often important to see if data was fragmented, or what size a packet had, especially when tracking TCP sequence numbers. The problem with anonymization is that when you replace a value the resulting frame may be shorter or longer than the original. This can easily happen e.g. when removing VLAN layers from a frame, or when certain TCP options are left out or written in a different style than in the original. If you replace FQDNs in DHCP host name options or DNS packets you easily change the size of the packet – and DNS is a nightmare on its own when having to write pointers to the payload bytes. And then there are even greater challenges like EUI-64 in IPv6, where you can’t just randomize the address because it needs to keep its relation with the sanitized MAC address you don’t know yet, because you need to process the IP layer before even getting to the Ethernet layer…
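To show why an EUI-64 based IPv6 address can’t be randomized independently of the MAC address, here is a small Python sketch of the relation between the two (the modified EUI-64 format). The function names are made up for this illustration. It only shows the dependency and does not solve the ordering problem described above.

```python
import ipaddress


def eui64_interface_id(mac):
    """Build the 8-byte modified EUI-64 interface identifier from a MAC."""
    octets = bytearray(bytes.fromhex(mac.replace(":", "").replace("-", "")))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the two halves of the MAC.
    return bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])


def link_local_from_mac(mac):
    """Link-local address (fe80::/64) a host would derive from this MAC."""
    prefix = bytes.fromhex("fe80000000000000")
    return ipaddress.IPv6Address(prefix + eui64_interface_id(mac))


def looks_like_eui64(address):
    """Heuristic: the interface identifier has ff:fe in the middle."""
    return address.packed[11:13] == b"\xff\xfe"


print(link_local_from_mac("00:11:22:33:44:55"))  # fe80::211:22ff:fe33:4455
```

So if the MAC address 00:11:22:33:44:55 is replaced, an address like fe80::211:22ff:fe33:4455 has to be rebuilt from the replacement MAC instead of being randomized, otherwise the sanitized file shows a host whose IPv6 address no longer matches its hardware address.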
The Solution: TraceWrangler
The tool I wrote in the year between Sharkfest 2012 and 2013 (well, actually I have been working on the basic libraries a lot longer than just a year) is called TraceWrangler. Its primary objective is to handle PCAPng files and process them, sort of like a “Swiss army knife” of trace file manipulation.
The biggest part at the moment is the anonymization/sanitization functionality, which I demonstrated at Sharkfest 2013. It is a stand-alone Windows 32-bit or 64-bit executable that reads and writes trace files and allows you to anonymize/sanitize frames with an easy configuration dialog. I think the main advantage is that it is a tool that I want to use myself. While writing it I’m focused on keeping it easy to use while being as powerful as possible, and I always think about its usefulness from the perspective of a network analyst – and not for packet replay or other purposes where the integrity of the file isn’t that much of an issue.
TraceWrangler is still a beta version, so you should use it with caution. Things may go wrong while processing files, so you need to keep the originals – and that is a rule of thumb that is always valid anyway, because when you diagnose something in the processed file you have to verify it in the original if it is available.
Get TraceWrangler here: https://www.tracewrangler.com