Extracting 'HTTP CONNECT' Requests with Python

Published: 2022-11-14. Last Updated: 2022-11-14 02:35:27 UTC
by Jesse La Grew (Version: 1)

Seeing abnormal Suricata alerts isn’t too unusual in my home environment. In many cases the trigger is a lookup for a TLD that was, at one point in time, considered very suspicious. As legitimate adoption of some of these TLDs has increased, the alerts have become less useful, although they are still interesting to investigate. I ran into a few of these alerts one night and, when diving deeper, found that their volume, frequency, and source were all unusual.

Figure 1: Suspicious Suricata Alerts

The source indicated that the alerts were coming from a dedicated internal firewall on my network, which is used to gather additional data on honeypot attack traffic. The source ended up being my DShield honeypot. These alerts have come up before, but the volume was very unusual. Since this traffic wasn’t being shown in my standard web honeypot logs, I decided to look at local PCAP captures.

Figure 2: PCAP HTTP CONNECT Requests from Wireshark

The data showed a variety of HTTP CONNECT requests arriving at the honeypot. HTTP CONNECT requests are often used with proxy servers to open a connection to a desired destination [1]. Looking into any one of the streams didn’t give much additional information, since the CONNECT requests were asking for tunnels to encrypted HTTPS destinations.
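To make the shape of these requests concrete, here is a minimal sketch of pulling the target host and port out of a raw CONNECT request. The function name `parse_connect_request` and the sample request are hypothetical and only for illustration; it assumes the usual `host:port` form of the CONNECT target and does not try to handle every edge case.

```python
def parse_connect_request(raw: bytes):
    """Return (host, port) for an HTTP CONNECT request, or None otherwise."""
    # The request line is the first line: METHOD SP request-target SP HTTP-version
    request_line = raw.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    parts = request_line.split()
    if len(parts) != 3 or parts[0] != "CONNECT":
        return None
    # CONNECT targets use the host:port form; split on the last colon
    host, sep, port = parts[1].rpartition(":")
    if not sep or not port.isdigit():
        return None
    return host, int(port)


# Hypothetical request, similar in shape to the ones seen in the capture
sample = b"CONNECT example.com:443 HTTP/1.1\r\nHost: example.com:443\r\n\r\n"
print(parse_connect_request(sample))
```

Everything after the request line and headers would be opaque tunnel data if a proxy accepted the request, which is why the individual streams reveal little beyond the requested destination.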

Figure 3: TCP Stream of HTTP CONNECT Request from Wireshark

Zeek and other data were available to summarize this information, but I decided to put together a Python script to process the PCAP files. The goal was to understand the scale of these requests and how it changed over time.

from collections import Counter
import csv
import os
import time

from scapy.all import IP, TCP, PcapReader
from scapy.layers import http


def print_header(header_text):
    print("\n\n")
    print("{:>70s}".format("//////////////////////////////////////////////"))
    print("{:>50s}  {:>10s}".format(header_text, "Count"))
    print("{:>70s}".format("//////////////////////////////////////////////"))


def decode_field(value):
    # scapy stores HTTP fields as bytes, or None when the field is absent
    return value.decode(errors="replace") if value is not None else ""


directory = os.getcwd()
csv_path = "http_connect_info.csv"

src_ips = []
dst_ports = []
connect_paths = []
connect_hosts = []

# Append so repeated runs accumulate, but only write the header to a new or empty file
write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0

with open(csv_path, "a", newline="") as csv_file:
    csv_export = csv.writer(csv_file)
    if write_header:
        csv_export.writerow(["Epoch Time", "Date", "Source IP", "Destination Port",
                             "HTTP CONNECT Path", "HTTP CONNECT Host"])

    for entry in os.scandir(directory):
        if not (entry.is_file() and ".pcap" in entry.name):
            continue
        print("Processing file: " + entry.path)
        for pkt in PcapReader(entry.path):
            # Only IPv4/TCP packets that scapy parsed as an HTTP request are of interest
            if not (pkt.haslayer(IP) and pkt.haslayer(TCP) and pkt.haslayer(http.HTTPRequest)):
                continue
            request = pkt[http.HTTPRequest]
            if decode_field(request.Method) != "CONNECT":
                continue

            src_ip = pkt[IP].src
            dst_port = pkt[TCP].dport
            connect_path = decode_field(request.Path)
            connect_host = decode_field(request.Host)
            timestamp = time.strftime('%Y-%m-%d %H:%M:%S %z', time.localtime(float(pkt.time)))

            # Print the record and save the same values to the CSV file
            row = [str(pkt.time), timestamp, src_ip, dst_port, connect_path, connect_host]
            print(", ".join(str(item) for item in row))
            csv_export.writerow(row)

            src_ips.append(src_ip)
            dst_ports.append(dst_port)
            connect_paths.append(connect_path)
            connect_hosts.append(connect_host)

src_ip_counts = Counter(src_ips)
dst_port_counts = Counter(dst_ports)
connect_path_counts = Counter(connect_paths)
connect_host_counts = Counter(connect_hosts)

print("\n\n")
print_header("Source IP")
for each_item in src_ip_counts.most_common():
    print("{:>50s}  {:10d}".format(each_item[0], each_item[1]))

print_header("Destination Port")
for each_item in dst_port_counts.most_common():
    print("{:>50d}  {:10d}".format(each_item[0], each_item[1]))

print_header("HTTP Connect Path")
for each_item in connect_path_counts.most_common():
    print("{:>50s}  {:10d}".format(each_item[0], each_item[1]))

print_header("HTTP Connect Host")
for each_item in connect_host_counts.most_common():
    print("{:>50s}  {:10d}".format(each_item[0], each_item[1]))

This script reviews all the *.pcap files in the current directory, prints out a basic summary of the HTTP CONNECT requests and also saves the data to a CSV file.

Figure 4: Destination Port and HTTP CONNECT Request Path Counts

Figure 5: HTTP CONNECT Request Host Counts

Figure 6: HTTP CONNECT Request Source IPs

For a small snapshot of a day or two, processing completed within an hour or so. I was curious how this compared to historic data, so I ran the same script against 6 months of PCAPs. This took over a day to process. Using a tool such as Zeek [2] would likely be a quicker way to get this information. Zeek's http.log file would have the information, and a utility like zeek-cut [3] could help extract the raw requests.
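As a rough illustration of the Zeek route, the sketch below counts CONNECT requests per source IP from a TSV-format http.log. It assumes the log uses Zeek's default tab-separated layout with a `#fields` header line and the default `method` and `id.orig_h` column names. The function name `count_connect_sources` and the sample lines are hypothetical, and the sample is trimmed to a handful of columns.

```python
from collections import Counter


def count_connect_sources(lines):
    """Count source IPs of CONNECT requests in a Zeek TSV-format http.log."""
    fields = []
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # Column names follow the "#fields" tag, tab separated
            fields = line.split("\t")[1:]
            continue
        if not line or line.startswith("#") or not fields:
            continue
        record = dict(zip(fields, line.split("\t")))
        if record.get("method") == "CONNECT":
            counts[record.get("id.orig_h", "-")] += 1
    return counts


# Hypothetical, heavily trimmed log lines (a real http.log has many more columns)
sample_log = [
    "#separator \\x09",
    "#fields\tts\tid.orig_h\tid.resp_p\tmethod\turi",
    "#types\ttime\taddr\tport\tstring\tstring",
    "1668384000.000000\t192.0.2.10\t8080\tCONNECT\texample.com:443",
    "1668384001.000000\t192.0.2.10\t8080\tCONNECT\texample.org:443",
    "1668384002.000000\t192.0.2.20\t80\tGET\t/",
]
print(count_connect_sources(sample_log).most_common())
```

Because the function only needs an iterable of lines, an open file handle for a real http.log could be passed in directly.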

One item that stood out in the data was that HTTP CONNECT requests had greatly increased this month, especially in the last week.

Figure 7: Graph of HTTP CONNECT Method Requests by Month Since May 2022

Figure 8: Graph of HTTP CONNECT Method Requests by Day in November 2022
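Daily counts like the ones graphed above can be derived from the exported CSV. The following is a minimal sketch, assuming the column layout written by the script (with an "Epoch Time" column) and grouping by UTC day, which may differ slightly from a grouping by local time. The function name `count_by_day` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone


def count_by_day(csv_file):
    """Count rows per UTC day using the 'Epoch Time' column."""
    per_day = Counter()
    for row in csv.DictReader(csv_file):
        try:
            epoch = float(row["Epoch Time"])
        except (TypeError, ValueError):
            # Skip repeated header rows left by appending across several runs
            continue
        day = datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d")
        per_day[day] += 1
    return per_day


# Hypothetical rows in the same column order as the exported CSV
sample_csv = io.StringIO(
    "Epoch Time,Date,Source IP,Destination Port,HTTP CONNECT Path,HTTP CONNECT Host\n"
    "1668384000.5,,192.0.2.10,8080,example.com:443,example.com:443\n"
    "1668384100,,192.0.2.20,8080,example.org:443,example.org:443\n"
    "Epoch Time,Date,Source IP,Destination Port,HTTP CONNECT Path,HTTP CONNECT Host\n"
    "1668470400,,192.0.2.10,80,example.com:443,example.com:443\n"
)
for day, count in sorted(count_by_day(sample_csv).items()):
    print(day, count)
```

Changing the `strftime` format to `"%Y-%m"` would give monthly totals instead.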

Top 10 HTTP CONNECT Path Ports

HTTP CONNECT Path Port	Count
443	64681
27115	7876
25565	1871
25900	919
30125	529
22	483
30120	468
3389	467
80	446
53	417
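The script above counts full CONNECT paths rather than ports, so a port table like this one needs an extra step. One possible way to get it, sketched here with a hypothetical helper `top_path_ports` and made-up sample paths, is to split each `host:port` path on its last colon and count the port part.

```python
from collections import Counter


def top_path_ports(paths, n=10):
    """Return the n most common ports found in 'host:port' CONNECT paths."""
    ports = Counter()
    for path in paths:
        _, sep, port = path.rpartition(":")
        # Ignore paths that do not end in a numeric port
        if sep and port.isdigit():
            ports[int(port)] += 1
    return ports.most_common(n)


# Hypothetical paths, not taken from the capture
sample_paths = ["example.com:443", "example.org:443", "192.0.2.5:25565", "no-port"]
print(top_path_ports(sample_paths))
```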

Top 10 HTTP CONNECT Source IP Addresses

HTTP CONNECT Source IP	Count
142[.]202[.]242[.]113	16164
69[.]30[.]246[.]66	11354
204[.]12[.]248[.]130	10902
65[.]109[.]19[.]42	9740
209[.]222[.]97[.]249	6747
69[.]30[.]243[.]18	3729
172[.]93[.]100[.]135	3557
142[.]202[.]243[.]109	2667
104[.]251[.]122[.]239	1759
167[.]99[.]176[.]180	1537

Top 10 HTTP CONNECT Paths

HTTP CONNECT Path	Count
28sex[.]com:443	16357
109[.]237[.]111[.]71:27115	7876
beo555[.]co:443	4620
beo333[.]com:443	4442
h5[.]xhlax[.]com:443	3764
www[.]korims[.]com:443	3119
www[.]serruriervaud[.]ch:443	1872
share[.]nuox[.]top:443	1730
18[.]140[.]35[.]119:443	1464
keokeo[.]top:443	1144

Python can be a great way to programmatically extract data from a PCAP and use that data for other purposes, such as data enrichment or summarization. It is an easy way to summarize HTTP requests when other tools are unavailable. For larger pools of data, tools such as Zeek can also be extremely useful.

The HTTP CONNECT requests may have been an attempt to relay traffic through the honeypot and hide the original source of the request. It is also possible that the traffic was funneled through multiple proxy endpoints to make the source difficult to identify. Allowing HTTP CONNECT on internet-facing resources can potentially expose internal network resources or assist in the forwarding of malicious traffic. A majority of the HTTP CONNECT requests were directed at TCP port 8080 (99.5%), with the remainder aimed at TCP port 80.

[1] https://www.rfc-editor.org/rfc/rfc9110.html#name-connect
[2] https://docs.zeek.org/en/master/about.html
[3] https://docs.zeek.org/en/v3.0.14/examples/logs/index.html

--
Jesse La Grew
Handler

Keywords: http connect pcap python
