HackTheBox Academy - Information Gathering - Web Edition

Updated 29-03-2026

This module equips learners with essential web reconnaissance skills, crucial for ethical hacking and penetration testing. It explores both active and passive techniques, including DNS enumeration, web crawling, analysis of web archives and HTTP headers, and fingerprinting web technologies.

Introduction

  • The primary goals of web recon:
    • Identifying assets
    • Discovering hidden information
    • Analyzing the attack surface
    • Gathering intelligence

Types of Reconnaissance

Active Recon

  • In active recon, the attacker directly interacts with the target system to gather info using the following methods:
    • Port scanning
    • Vulnerability scanning
    • Network mapping
    • Banner grabbing
    • OS Fingerprinting
    • Service enumeration
    • Web spidering

Passive Recon

  • In passive recon, the attacker gathers information about the target without directly interacting with it
  • This relies on publicly available information and resources, such as:
    • Search engine queries
    • WHOIS lookups
    • DNS
    • Web archive analysis
    • Social media analysis
    • Code repositories

WHOIS

  • WHOIS is a protocol used to query databases that store information about registered internet resources, such as domain names, IP address blocks, and autonomous systems
  • A WHOIS record contains the following information:
    • Domain name
    • Registrar
    • Registrant contact
    • Administrative contact
    • Technical contact
    • Creation and expiration dates
    • Name servers
  • Using WHOIS:
sudo apt update
sudo apt install whois -y

whois facebook.com

DNS & Subdomains

DNS

  • DNS translates domain names into numerical IP addresses
  • The hosts file is used to map hostnames to IP addresses and is located in C:\Windows\System32\drivers\etc\hosts on Windows and in /etc/hosts on Unix
  • In DNS, a zone is a distinct part of the domain namespace; for example, example.com, mail.example.com & blog.example.com all belong to the same DNS zone
  • A zone file is a text file that resides in the DNS server which defines the resource records within a zone (NS records, MX records, A records, etc.)
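For illustration, a minimal zone file might look like the following (all names, addresses and timer values here are hypothetical placeholders):

```
$TTL 86400
@       IN  SOA ns1.example.com. admin.example.com. (
            2024010101 ; serial
            3600       ; refresh
            900        ; retry
            604800     ; expire
            86400 )    ; minimum TTL
@       IN  NS   ns1.example.com.
@       IN  MX   10 mail.example.com.
www     IN  A    192.0.2.10
mail    IN  A    192.0.2.20
```

Each line is a resource record: the SOA record carries zone-management metadata, while the NS, MX and A records are exactly the record types enumerated above.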

Digging DNS

  • Most popular DNS recon tools:
    • dig
    • nslookup
    • host
    • dnsenum
    • fierce
    • dnsrecon
    • theHarvester
    • Online DNS Lookup Services
  • dig:
    dig google.com

    # IPv4 address
    dig domain.com A

    # use specific name server for query
    dig @1.1.1.1 domain.com

    # show full path of resolution
    dig +trace domain.com

    # reverse lookup
    dig -x 192.168.1.1

    # short answer to the query
    dig +short domain.com

    # shows answer section only
    dig +noall +answer domain.com

    # all available DNS records
    dig domain.com ANY

Subdomain Bruteforcing

  • There are several tools that excel at bruteforce enumeration:
  • dnsenum:
    # brute-force subdomains
    # -r enables recursive brute-forcing, meaning it will also enumerate subdomains of any subdomain it finds
    dnsenum --enum inlanefreight.com -f /usr/share/seclists/Discovery/DNS/subdomains-top1million-20000.txt -r
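The same idea can be sketched by hand: build candidate hostnames by combining a wordlist with the target domain, then resolve each one. The three-word list and domain below are placeholders; in practice a SecLists wordlist would be used.

```shell
# tiny placeholder wordlist; swap in a SecLists file for real use
printf 'www\nmail\ndev\n' > subs.txt

# build candidate hostnames from wordlist + target domain
awk '{ print $0 ".inlanefreight.com" }' subs.txt > candidates.txt
cat candidates.txt

# each candidate would then be resolved, keeping only names that answer, e.g.:
#   while read -r h; do dig +short "$h" A | grep -q . && echo "$h"; done < candidates.txt
```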

DNS Zone Transfers

  • A DNS zone transfer is an alternative and less invasive method for uncovering subdomains
  • It’s a wholesale copy of all DNS records within a zone from one name server to another
    dig axfr @nsztm1.digi.ninja zonetransfer.me

Virtual Hosts

  • Virtual hosting allows web servers to differentiate between domains, subdomains or separate websites with distinct content
  • It allows multiple websites or applications to be hosted on a single server
  • If a vhost doesn’t have a DNS record, it can still be accessed by modifying our hosts file to map the domain to an IP address
  • Virtual host discovery tools:
  • gobuster:
    # --append-domain is required to append base domain to each word
    gobuster vhost -u http://<target_IP_address> -w <wordlist_file> --append-domain

    # Section Solutions
    # grep every DNS wordlist for lines starting with 'web' (^ = start of line); -h hides filenames
    # sort -u (-u means unique) to remove duplicates
    grep -h ^web /usr/share/wordlists/seclists/Discovery/DNS/* | sort -u > web.txt
    gobuster vhost -u http://inlanefreight.htb:30804 -w web.txt --append-domain
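The hosts-file mapping mentioned above is a single line; the IP address and vhost name here are hypothetical:

```
# /etc/hosts — maps the vhost name to the target IP so tools send the right Host header
10.129.42.195   admin.inlanefreight.htb
```

Alternatively, curl can send the Host header directly without editing the hosts file: `curl -H "Host: admin.inlanefreight.htb" http://10.129.42.195`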

Certificate Transparency Logs

  • There are two popular options for searching CT logs:
    • crt.sh
    • Censys
  • crt.sh also offers an API for automated searches
    1
    2
    curl -s "https://crt.sh/?q=facebook.com&output=json" | jq -r '.[] | select(.name_value | contains("dev")) | .name_value' | sort -u

Fingerprinting

  • Techniques used for web server and technology fingerprinting:
    • Banner Grabbing: often reveals server software, version numbers and other details
    • Analyzing HTTP Headers: they typically disclose the web server software; the X-Powered-By header can also reveal additional info like scripting languages or frameworks
    • Probing for Specific Responses: sending specially crafted requests can elicit unique responses that reveal info
    • Analyzing Page Content: can reveal clues about the technologies used
  • Tools that automate the fingerprinting process:
    • Wappalyzer
    • BuiltWith
    • WhatWeb
    • Nmap
    • Netcraft
    • wafw00f
  • Web Application Firewalls (WAFs) are security solutions designed to protect web applications from various attacks
# -I to fetch only the HTTP headers
curl -I inlanefreight.com

# detect presence of a WAF
wafw00f inlanefreight.com

# nikto is a powerful web server scanner and
# vulnerability assessment tool with fingerprinting capabilities
nikto -h inlanefreight.com -Tuning b

Crawling

  • crawlers can be used to extract valuable information like internal and external links, comments, metadata and sensitive files

robots.txt

  • robots.txt is a text file found in the root directory of a website that contains a set of rules for crawlers
  • robots.txt can help us uncover hidden directories, map the website’s structure and detect crawler traps
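A hypothetical robots.txt illustrating how Disallow entries can leak interesting paths (all paths here are invented for the example):

```
User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /internal-docs/

Sitemap: https://www.example.com/sitemap.xml
```

Each Disallow line asks crawlers to skip a path, which is precisely why those paths are worth a manual look; the Sitemap line helps map the site's structure.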

.Well-Known URIs

  • .well-known , typically accessible via the /.well-known/ path on a web server, centralizes a website’s critical metadata, including configuration files and information related to its services, protocols, and security mechanisms.
  • .well-known can be used to discover endpoints and configuration details
  • it enables us to comprehensively map out a website’s security landscape
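As an illustration, one commonly published document is the OpenID Connect discovery file at /.well-known/openid-configuration. The sketch below parses a trimmed, hypothetical response with jq instead of fetching a live one; every URL in the sample is a placeholder.

```shell
# a live request would look like:
#   curl -s https://example.com/.well-known/openid-configuration
# here a trimmed, hypothetical response is parsed offline with jq
cat > openid.json <<'EOF'
{
  "issuer": "https://example.com",
  "authorization_endpoint": "https://example.com/oauth2/authorize",
  "token_endpoint": "https://example.com/oauth2/token",
  "jwks_uri": "https://example.com/oauth2/keys"
}
EOF

# pull out the endpoints worth probing further
jq -r '.authorization_endpoint, .token_endpoint, .jwks_uri' openid.json
```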

Creepy Crawlies

  • popular web crawlers:

    • Burp Suite Spider
    • OWASP ZAP (Zed Attack Proxy)
    • Scrapy (Python Framework)
    • Apache Nutch (Scalable Crawler)
  • using scrapy:

    pip3 install scrapy

    wget -O ReconSpider.zip https://academy.hackthebox.com/storage/modules/144/ReconSpider.v1.2.zip
    unzip ReconSpider.zip

    python3 ReconSpider.py http://inlanefreight.com

    # alternatively, install scrapy into an isolated virtual environment
    python3 -m venv scrapy-venv && source scrapy-venv/bin/activate
    pip3 install scrapy
    python3 ReconSpider.py http://inlanefreight.com

Search Engine Discovery

  • Search operators can be used to pinpoint specific types of information
  • Google Dorking is a technique that leverages search operators to uncover sensitive information, security vulnerabilities or hidden content on websites.
    # Finding Login Pages:
    site:example.com inurl:login
    site:example.com (inurl:login OR inurl:admin)

    # Identifying Exposed Files:
    example.com filetype:pdf
    site:example.com (filetype:xls OR filetype:docx)

    # Uncovering Configuration Files
    site:example.com inurl:config.php
    site:example.com (ext:conf OR ext:cnf) # extensions commonly used for configuration files

    # Locating Database Backups
    site:example.com inurl:backup
    site:example.com filetype:sql

Web Archives

  • Internet Archive’s Wayback Machine can be used to revisit snapshots of websites as they appeared at various points in their history.
  • It allows us to discover old web pages, directories, files or subdomains that are no longer accessible on the current website
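Beyond the web interface, the Wayback Machine exposes a CDX API for listing archived URLs programmatically. The query below is a sketch (endpoint and parameters as commonly documented), and the deduplication step is shown against a hypothetical sample so it runs offline:

```shell
# live query against the CDX API would look like:
#   curl -s "https://web.archive.org/cdx/search/cdx?url=inlanefreight.com/*&fl=original&collapse=urlkey"
# hypothetical sample of returned URLs
cat > cdx.txt <<'EOF'
http://inlanefreight.com/old-admin/
http://inlanefreight.com/backup.zip
http://inlanefreight.com/old-admin/
EOF

# dedupe to get unique historical paths worth probing today
sort -u cdx.txt
```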

Automating Recon

  • Frameworks that provide a complete suite of tools for web recon:
  • FinalRecon can be used for tasks like SSL certificate checking, Whois information gathering, header analysis, crawling, and DNS, subdomain and directory enumerations
    # installation
    git clone https://github.com/thewhiteh4t/FinalRecon.git
    cd FinalRecon
    pip3 install -r requirements.txt
    chmod +x ./finalrecon.py
    ./finalrecon.py --help

    # gather header info and perform a whois lookup
    ./finalrecon.py --headers --whois --url http://inlanefreight.com