HackTheBox Academy - Information Gathering - Web Edition

Updated 29-03-2026

This module equips learners with essential web reconnaissance skills, crucial for ethical hacking and penetration testing. It explores both active and passive techniques, including DNS enumeration, web crawling, analysis of web archives and HTTP headers, and fingerprinting web technologies.

Introduction

  • The primary goals of web recon:
    • Identifying assets
    • Discovering hidden information
    • Analyzing the attack surface
    • Gathering intelligence

Types of Reconnaissance

Active Recon

  • In active recon, the attacker directly interacts with the target system to gather info using the following methods:
    • Port scanning
    • Vulnerability scanning
    • Network mapping
    • Banner grabbing
    • OS Fingerprinting
    • Service enumeration
    • Web spidering

Passive Recon

  • In passive recon, the attacker gathers information about the target without directly interacting with it
  • This relies on publicly available information and resources, such as:
    • Search engine queries
    • WHOIS lookups
    • DNS
    • Web archive analysis
    • Social media analysis
    • Code repositories

WHOIS

  • WHOIS is a protocol used to query databases that store information about registered internet resources, such as domain names, IP address blocks, and autonomous systems
  • A WHOIS record contains the following information:
    • Domain name
    • Registrar
    • Registrant contact
    • Administrative contact
    • Technical contact
    • Creation and expiration dates
    • Name servers
  • Using WHOIS:
sudo apt update
sudo apt install whois -y

whois facebook.com

DNS & Subdomains

DNS

  • DNS translates domain names into numerical IP addresses
  • The hosts file is used to map hostnames to IP addresses and is located in C:\Windows\System32\drivers\etc\hosts on Windows and in /etc/hosts on Unix
  • In DNS, a zone is a distinct part of the domain namespace; for example, example.com, mail.example.com & blog.example.com all belong to the same DNS zone
  • A zone file is a text file that resides in the DNS server which defines the resource records within a zone (NS records, MX records, A records, etc.)
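For illustration, a minimal zone file might look like the following (all names, addresses and timer values here are hypothetical placeholders):

```
$TTL 86400
@       IN  SOA ns1.example.com. admin.example.com. (
            2024010101 ; serial
            3600       ; refresh
            900        ; retry
            604800     ; expire
            86400 )    ; minimum TTL
@       IN  NS   ns1.example.com.
@       IN  MX   10 mail.example.com.
www     IN  A    192.0.2.10
mail    IN  A    192.0.2.20
```

Each line is a resource record: the SOA record carries zone-management metadata, while the NS, MX and A records are exactly the record types enumerated above.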

Digging DNS

  • Most popular DNS recon tools:
    • dig
    • nslookup
    • host
    • dnsenum
    • fierce
    • dnsrecon
    • theHarvester
    • Online DNS Lookup Services
  • dig:
    dig google.com

    # IPv4 address
    dig domain.com A

    # use specific name server for query
    dig @1.1.1.1 domain.com

    # show full path of resolution
    dig +trace domain.com

    # reverse lookup
    dig -x 192.168.1.1

    # short answer to the query
    dig +short domain.com

    # shows answer section only
    dig +noall +answer domain.com

    # all available DNS records
    dig domain.com ANY

Subdomain Bruteforcing

  • There are several tools that excel at bruteforce enumeration:
  • dnsenum:
    # brute-force subdomains
    # -r enables recursive brute-forcing, meaning it will also enumerate subdomains of any subdomain it finds
    dnsenum --enum inlanefreight.com -f /usr/share/seclists/Discovery/DNS/subdomains-top1million-20000.txt -r
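The same idea can be sketched by hand: build candidate hostnames by combining a wordlist with the target domain, then resolve each one. The three-word list and domain below are placeholders; in practice a SecLists wordlist would be used.

```shell
# tiny placeholder wordlist; swap in a SecLists file for real use
printf 'www\nmail\ndev\n' > subs.txt

# build candidate hostnames from wordlist + target domain
awk '{ print $0 ".inlanefreight.com" }' subs.txt > candidates.txt
cat candidates.txt

# each candidate would then be resolved, keeping only names that answer, e.g.:
#   while read -r h; do dig +short "$h" A | grep -q . && echo "$h"; done < candidates.txt
```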

DNS Zone Transfers

  • A DNS zone transfer is an alternative and less invasive method for uncovering subdomains
  • It’s a wholesale copy of all DNS records within a zone from one name server to another
    dig axfr @nsztm1.digi.ninja zonetransfer.me

Virtual Hosts

  • Virtual hosting allows web servers to differentiate between domains, subdomains or separate websites with distinct content
  • It allows multiple websites or applications to be hosted on a single server
  • If a vhost doesn’t have a DNS record, it can still be accessed by modifying our hosts file to map the domain to an IP address
  • Virtual host discovery tools:
  • gobuster:
    # --append-domain is required to append base domain to each word
    gobuster vhost -u http://<target_IP_address> -w <wordlist_file> --append-domain

    # Section Solutions
    # grep every DNS wordlist for lines starting with 'web' (^ = start of line); -h hides filenames
    # sort -u (-u means unique) to remove duplicates
    grep -h ^web /usr/share/wordlists/seclists/Discovery/DNS/* | sort -u > web.txt
    gobuster vhost -u http://inlanefreight.htb:30804 -w web.txt --append-domain
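The hosts-file mapping mentioned above is a single line; the IP address and vhost name here are hypothetical:

```
# /etc/hosts — maps the vhost name to the target IP so tools send the right Host header
10.129.42.195   admin.inlanefreight.htb
```

Alternatively, curl can send the Host header directly without editing the hosts file: `curl -H "Host: admin.inlanefreight.htb" http://10.129.42.195`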

Certificate Transparency Logs

  • There are two popular options for searching CT logs:
    • crt.sh
    • Censys
  • crt.sh also offers an API for automated searches
    1
    2
    curl -s "https://crt.sh/?q=facebook.com&output=json" | jq -r '.[] | select(.name_value | contains("dev")) | .name_value' | sort -u

Fingerprinting

  • Techniques used for web server and technology fingerprinting:
    • Banner Grabbing: often reveals server software, version numbers and other details
    • Analyzing HTTP Headers: they typically disclose the web server software; the X-Powered-By header can also reveal additional info like scripting languages or frameworks
    • Probing for Specific Responses: sending specially crafted requests can elicit unique responses that reveal info
    • Analyzing Page Content: can reveal clues about the technologies used
  • Tools that automate the fingerprinting process:
    • Wappalyzer
    • BuiltWith
    • WhatWeb
    • Nmap
    • Netcraft
    • wafw00f
  • Web Application Firewalls (WAFs) are security solutions designed to protect web applications from various attacks
# -I to fetch only the HTTP headers
curl -I inlanefreight.com

# detect presence of a WAF
wafw00f inlanefreight.com

# nikto is a powerful web server scanner and
# vulnerability assessment tool with fingerprinting capabilities
nikto -h inlanefreight.com -Tuning b

Crawling

  • crawlers can be used to extract valuable information like internal and external links, comments, metadata and sensitive files

robots.txt

  • robots.txt is a text file found in the root directory of a website that contains a set of rules for crawlers
  • robots.txt can help us uncover hidden directories, map the website’s structure and detect crawler traps
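A hypothetical robots.txt illustrating how Disallow entries can leak interesting paths (all paths here are invented for the example):

```
User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /internal-docs/

Sitemap: https://www.example.com/sitemap.xml
```

Each Disallow line asks crawlers to skip a path, which is precisely why those paths are worth a manual look; the Sitemap line helps map the site's structure.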

.Well-Known URIs

  • .well-known , typically accessible via the /.well-known/ path on a web server, centralizes a website’s critical metadata, including configuration files and information related to its services, protocols, and security mechanisms.
  • .well-known can be used to discover endpoints and configuration details
  • it enables us to comprehensively map out a website’s security landscape
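As an illustration, one commonly published document is the OpenID Connect discovery file at /.well-known/openid-configuration. The sketch below parses a trimmed, hypothetical response with jq instead of fetching a live one; every URL in the sample is a placeholder.

```shell
# a live request would look like:
#   curl -s https://example.com/.well-known/openid-configuration
# here a trimmed, hypothetical response is parsed offline with jq
cat > openid.json <<'EOF'
{
  "issuer": "https://example.com",
  "authorization_endpoint": "https://example.com/oauth2/authorize",
  "token_endpoint": "https://example.com/oauth2/token",
  "jwks_uri": "https://example.com/oauth2/keys"
}
EOF

# pull out the endpoints worth probing further
jq -r '.authorization_endpoint, .token_endpoint, .jwks_uri' openid.json
```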

Creepy Crawlies

  • popular web crawlers:

    • Burp Suite Spider
    • OWASP ZAP (Zed Attack Proxy)
    • Scrapy (Python Framework)
    • Apache Nutch (Scalable Crawler)
  • using scrapy:

    pip3 install scrapy

    wget -O ReconSpider.zip https://academy.hackthebox.com/storage/modules/144/ReconSpider.v1.2.zip
    unzip ReconSpider.zip

    python3 ReconSpider.py http://inlanefreight.com

    # alternatively, install scrapy into an isolated virtual environment
    python3 -m venv scrapy-venv && source scrapy-venv/bin/activate
    pip3 install scrapy
    python3 ReconSpider.py http://inlanefreight.com

Search Engine Discovery

  • Search operators can be used to pinpoint specific types of information
  • Google Dorking is a technique that leverages search operators to uncover sensitive information, security vulnerabilities or hidden content on websites.
    # Finding Login Pages:
    site:example.com inurl:login
    site:example.com (inurl:login OR inurl:admin)

    # Identifying Exposed Files:
    example.com filetype:pdf
    site:example.com (filetype:xls OR filetype:docx)

    # Uncovering Configuration Files
    site:example.com inurl:config.php
    site:example.com (ext:conf OR ext:cnf) # extensions commonly used for configuration files

    # Locating Database Backups
    site:example.com inurl:backup
    site:example.com filetype:sql

Web Archives

  • Internet Archive’s Wayback Machine can be used to revisit snapshots of websites as they appeared at various points in their history.
  • It allows us to discover old web pages, directories, files or subdomains that are no longer accessible on the current website
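Beyond the web interface, the Wayback Machine exposes a CDX API for listing archived URLs programmatically. The query below is a sketch (endpoint and parameters as commonly documented), and the deduplication step is shown against a hypothetical sample so it runs offline:

```shell
# live query against the CDX API would look like:
#   curl -s "https://web.archive.org/cdx/search/cdx?url=inlanefreight.com/*&fl=original&collapse=urlkey"
# hypothetical sample of returned URLs
cat > cdx.txt <<'EOF'
http://inlanefreight.com/old-admin/
http://inlanefreight.com/backup.zip
http://inlanefreight.com/old-admin/
EOF

# dedupe to get unique historical paths worth probing today
sort -u cdx.txt
```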

Automating Recon

  • Frameworks that provide a complete suite of tools for web recon:
  • FinalRecon can be used for tasks like SSL certificate checking, Whois information gathering, header analysis, crawling, and DNS, subdomain and directory enumerations
    # installation
    git clone https://github.com/thewhiteh4t/FinalRecon.git
    cd FinalRecon
    pip3 install -r requirements.txt
    chmod +x ./finalrecon.py
    ./finalrecon.py --help

    # gather header info and perform a whois lookup
    ./finalrecon.py --headers --whois --url http://inlanefreight.com