import pandas as pd
import numpy as np
import functools
from matplotlib import pylab as plt
%matplotlib inline
!tshark -2 -r mads-141.0.174.38.pcap -Tfields \
-eframe.time_epoch \
-eip.id -e ip.ttl -e ip.dst \
-etcp.srcport -etcp.dstport -etcp.len -etcp.stream -etcp.seq -etcp.ack -etcp.flags.str -etcp.window_size \
-etcp.options.timestamp.tsval -etcp.options.timestamp.tsecr \
-etcp.analysis.initial_rtt -etcp.analysis.ack_rtt \
-ehttp.host -ehttp.response.code -ehttp.user_agent -ehttp.location \
> mads-141.0.174.38.txt
data = pd.read_csv('./mads-141.0.174.38.txt', delimiter='\t',
names=
('time_epoch id ttl dst srcport dstport len stream seq ack flags window_size '
'tsval tsecr initial_rtt ack_rtt host code user_agent location').split())
data['intid'] = data.id.apply(functools.partial(int, base=16))
data.time_epoch -= data.time_epoch.min()
data.sample(1).T
len(data.stream.unique())
data.host.value_counts()
The www.xnxx.com stream is not really interesting: it is just a single web page that happened to end up in the dataset.
data.location.value_counts()
There are three different sorts of streams in the dataset. Good streams get the correct redirect, bad streams get an injected redirect, and ugly streams get no redirect at all.
s_good = set(data[data.location == 'http://www.xnxx.com/'].stream)
s_bad = set(data[data.location == 'http://marketing-sv.com/mads.html'].stream)
s_ugly = set(data[data.host == 'xnxx.com'].stream) - set(data[~data.location.isnull()].stream)
print('Bad: %s' % sorted(s_bad))
print('Ugly: %s' % sorted(s_ugly))
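# `density=True` replaces the `normed=True` argument, which newer matplotlib releases no longer accept
data[(data.stream.isin(s_good)) & (~data.code.isnull())].ack_rtt.hist(color='green', density=True)
data[(data.stream.isin(s_bad)) & (~data.code.isnull())].ack_rtt.hist(color='red', density=True)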
plt.xlabel('RTT, s'); plt.ylabel('Density');
There are few streams outside the 250 ms range, so let's look at higher-resolution histograms of that range.
data[(data.stream.isin(s_good)) & (~data.code.isnull())].ack_rtt.hist(bins=np.linspace(0, 0.25, 25), color='green')
plt.xlabel('RTT, s'); plt.ylabel('Count');
data[(data.stream.isin(s_bad)) & (~data.code.isnull())].ack_rtt.hist(bins=np.linspace(0, 0.25, 25), color='red')
plt.xlabel('RTT, s'); plt.ylabel('Count');
We have only six samples of injected redirects, but five of them have a significantly lower RTT than a typical HTTP response. The ~45 ms RTT matches the latency of the last-mile ADSL link used during this analysis, so it is close to the time needed to get a packet from the ISP's network.
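To put a number on the gap between the two histograms, one could compare the median RTT of HTTP responses per class. The sketch below runs on a small made-up frame with the same `stream`, `code` and `ack_rtt` columns; the values are illustrative and not taken from the capture, and `median_response_rtt` is a hypothetical helper.

```python
import pandas as pd

def median_response_rtt(frame, streams):
    """Median ack_rtt of HTTP responses (rows with a status code) in the given streams."""
    rows = frame[frame.stream.isin(streams) & frame.code.notnull()]
    return rows.ack_rtt.median()

# Illustrative values only, not measurements from the capture.
toy = pd.DataFrame({
    'stream':  [1, 2, 3, 4, 5],
    'code':    [301, 301, 301, 302, 302],
    'ack_rtt': [0.150, 0.170, 0.160, 0.045, 0.047],
})
toy_good, toy_bad = {1, 2, 3}, {4, 5}

good_median = median_response_rtt(toy, toy_good)
bad_median = median_response_rtt(toy, toy_bad)
print('good: %.3f s, bad: %.3f s' % (good_median, bad_median))
```

On the real data the same helper could be called as `median_response_rtt(data, s_good)` and `median_response_rtt(data, s_bad)`; with only six injected samples the result is an indication rather than a statistic.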
d_good = data[data.stream.isin(s_good) & (data.srcport == 80)]
d_good[d_good.tsval.isnull()].shape
There are no rows in the slice, so good data from the server always carries a TCP timestamp.
d_good.ttl.value_counts()
So, good data from the server always has TTL=54.
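These two properties give a simple baseline that other packets claiming to come from the server can be compared against. The sketch below is one possible way to do that, assuming the most common TTL is the genuine one; `off_baseline` is a hypothetical helper and the toy values (including TTL 60) are made up.

```python
import pandas as pd

def off_baseline(frame):
    """Rows whose TTL differs from the most common TTL or that lack a TCP timestamp."""
    baseline_ttl = frame.ttl.mode().iloc[0]  # assumption: the dominant TTL is genuine
    return frame[(frame.ttl != baseline_ttl) | frame.tsval.isnull()]

# Illustrative rows only; the last one deviates in both TTL and timestamp.
toy = pd.DataFrame({
    'stream': [1, 1, 2, 2],
    'ttl':    [54, 54, 54, 60],
    'tsval':  [111.0, 112.0, 113.0, None],
})
suspects = off_baseline(toy)
print(suspects)
```

A deviation from the baseline is only a hint: route changes or middleboxes can also alter the TTL of genuine packets.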
Let's look at IP fragment ID field:
dsa = d_good[d_good.flags == '*******A**S*']
plt.scatter(dsa.time_epoch, dsa.intid, marker='.')
plt.xlabel('Time since 1st packet, s'); plt.ylabel('IP ID')
plt.title('IP ID for SYN-ACK packets from server');
dsa = d_good[d_good.flags != '*******A**S*']
fig = plt.figure(); fig.set_figwidth(15); fig.set_figheight(3)
ax = fig.add_subplot(1, 2, 1)
ax.scatter(dsa.time_epoch, dsa.intid, marker='.')
ax.set_xlabel('Time since 1st packet, s'); ax.set_ylabel('IP frag. ID')
ax.set_title('IP ID for non-SYN-ACK packets from server');
ax = fig.add_subplot(1, 2, 2)
dsa.intid.hist(bins=32, ax=ax)
ax.set_xlim(0, 2**16)
ax.set_xlabel('IP frag. ID'); ax.set_ylabel('Packets')
ax.set_title('IP ID for non-SYN-ACK packets from server');
print('Min/Max IP ID observed for non-SYN-ACK packets: %d %d' % (dsa.intid.min(), dsa.intid.max()))
The good server replies with IP-ID=0 in SYN-ACK packets and almost never uses IP-ID=0 in other packets, where the IP ID looks rather random.
plt.scatter(d_good.time_epoch, d_good.window_size)
plt.xlabel('Time since 1st packet, s'); plt.ylabel('TCP Window, bytes');
plt.title('TCP Window announced by server');
The good server announces a window size in the 25k…31k range (scaling is applied).
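"Scaling is applied" means the plotted value is the 16-bit window field from the TCP header shifted left by the scale factor negotiated in the SYN and SYN-ACK. A minimal sketch of that arithmetic follows; the raw value 229 and the shift count 7 are hypothetical and were picked only because they land in the observed range.

```python
def scaled_window(raw_window, shift_count):
    """Effective receive window: the 16-bit header field shifted by the negotiated scale."""
    if not 0 <= raw_window <= 0xFFFF:
        raise ValueError('raw window must fit in 16 bits')
    if not 0 <= shift_count <= 14:
        raise ValueError('window scale shift count is limited to 14')
    return raw_window << shift_count

# Hypothetical header values, not read from the capture.
window = scaled_window(229, 7)
print(window)
```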
print(sorted(s_bad))
d_bad = data[data.stream.isin(s_bad) & (data.srcport == 80)]
d_bad['time_epoch stream id ttl len seq ack flags window_size tsval ack_rtt'.split()].sort_values(by=['stream', 'time_epoch'])
So, all packets that look injected have:
The server also sends a 408 Request Timeout after 120 seconds. This means that the server has not seen the request at all, so the injector acts as an in-band device.
Also, the ACK confirming the FIN-ACK looks injected according to its IP ID and TTL, but it has a weird RTT (~98 ms rather than ~44 ms).
That's an injected stream that has ~200 ms latency. On the other hand, the genuine SYN-ACK from the server also has a larger-than-usual RTT, so it is probably just temporary bufferbloat lag.
d_bad[d_bad.stream == 287]['time_epoch stream id ttl len seq ack flags window_size tsval ack_rtt'.split()]
The interesting thing about this stream is that the ACK confirming the FIN-ACK has an ACK_RTT of 18 ms, which means the packet was likely sent BEFORE the FIN from the client was seen, as the last-mile RTT is ~38 ms according to mtr measurements.
If that is true, another question arises: why is the ACK confirming the FIN-ACK usually delayed by ~98 ms? Is it triggered by some packet from the original server? Is it a sort of latency camouflage? No further research has been done yet to answer these questions.
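The timing argument above is a simple lower-bound check: a reply that is a reaction to a client packet cannot arrive sooner than one round trip to the nearest possible responder. A minimal sketch, using the 18 ms, 98 ms and ~38 ms figures from the text; `sent_before_trigger` is a hypothetical helper.

```python
def sent_before_trigger(ack_rtt, min_path_rtt):
    """True if the reply arrived sooner than a round trip on the path allows,
    i.e. it cannot be a reaction to the packet it acknowledges."""
    return ack_rtt < min_path_rtt

LAST_MILE_RTT = 0.038  # seconds, the mtr figure quoted above

early = sent_before_trigger(0.018, LAST_MILE_RTT)  # the 18 ms ACK in this stream
usual = sent_before_trigger(0.098, LAST_MILE_RTT)  # the usual ~98 ms ACK
print(early, usual)
```

The check assumes the mtr figure is a reliable minimum; a single mtr run on a jittery ADSL link may overestimate it.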
d_ugly = data[data.stream.isin(s_ugly) & (data.srcport == 80)]
d_ugly.groupby(by='stream tsecr'.split()).time_epoch.agg(['count'])
This means that the remote server has seen the SYN packet and the first ACK after the SYN, but has never seen the request itself. It suggests that the ugly streams are just bad streams that got no redirection packet for some reason.
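The inference relies on the TCP timestamp option: the TSecr field of a server packet echoes a TSval that the server received from the client, so the set of echoed values shows which client packets reached it. The sketch below illustrates the idea on made-up values; `echoed_client_packets`, the `packet` labels and the timestamps are hypothetical. Client packets sent within the same timestamp tick share a TSval and cannot be told apart this way.

```python
import pandas as pd

def echoed_client_packets(client, server):
    """Client packets whose TSval shows up as TSecr in some server packet."""
    echoed = set(server.tsecr.dropna())
    return client[client.tsval.isin(echoed)]

# Illustrative timestamps only, with a distinct TSval per client packet.
client = pd.DataFrame({
    'packet': ['SYN', 'ACK', 'GET'],
    'tsval':  [1000, 1011, 1012],
})
server = pd.DataFrame({
    'packet': ['SYN-ACK', 'later packet'],
    'tsecr':  [1000, 1011],
})
seen = echoed_client_packets(client, server)
print(list(seen['packet']))
```

In this toy example the TSval of the request never comes back, which is the pattern described above for the ugly streams.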
Interestingly, only mobile User-Agents were redirected to …/mads.html. Our test sent ~33% of requests with a mobile User-Agent and 67% with a desktop User-Agent.
d_goo = data[data.stream.isin(s_bad | s_ugly)]
d_goo.user_agent.value_counts()
This explains why the OONI dataset sees no redirection. We have seen redirections only for mobile User-Agents, so the DPI probably targets mobile users.
print('Redirection happens in %.1f%% of cases' % (100.*len(s_bad|s_ugly) / len(set(data.stream))))
# na=False treats rows without a User-Agent as non-mobile (replaces the removed `as_indexer` argument)
mobile_ua = data.user_agent.str.match('.*(?:Android|RIM|Symbian|Series60|iPhone|BlackBerry|MIDP)', na=False)
print('Redirection happens in %.1f%% of mobile cases' % (100.*len(s_bad|s_ugly) / len(set(data[mobile_ua].stream))))