File Scraping

  1. Set up the web browser to proxy traffic through Burp Suite.
  2. Search for documents with a search engine and capture the responses in Burp. Click through the pages of search results to capture more responses.
  3. Use the Burp extension Logger++ to grep for the file URLs in the search result responses.
  4. Cycle through the grepped URLs from the captured search results:
Loop through all the URLs from the Google search and download them.
for i in $(cat grepped_urls.lst); do wget "$i" -P ./downloaded_files/; done
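A slightly more defensive variant of the download loop can be sketched as follows. It assumes GNU wget is installed; the function names `clean_urls` and `download_all` are hypothetical helpers, not part of any tool. The list is de-duplicated and non-URL lines are dropped before anything is fetched.

```shell
# Hypothetical helper: normalise a list of grepped URLs before downloading.
clean_urls() {
  # Strip carriage returns, keep only http(s) lines, drop duplicates.
  tr -d '\r' < "$1" | grep -E '^https?://' | LC_ALL=C sort -u
}

# Hypothetical helper: download every cleaned URL into a target directory.
download_all() {
  local list="$1"
  local dest="${2:-./downloaded_files}"
  mkdir -p "$dest"
  clean_urls "$list" | while IFS= read -r url; do
    # -nc skips files that already exist; --tries/--timeout bound failed fetches.
    wget -nc --tries=2 --timeout=20 -P "$dest" "$url" || echo "failed: $url" >&2
  done
}
```

Usage: `download_all grepped_urls.lst` writes into ./downloaded_files/ and reports failed URLs on stderr.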

Interesting File Extensions

  • .pdf
  • .docx/.doc
  • .xlsx/.xls
  • .pptx/.ppt
  • .csv
  • .zip/.7z/.rar
  • .eml/.msg
  • .pst/.ost
  • .sql
  • .tif/.jpg/.png
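As a rough sketch, a URL list can be narrowed to these extensions with a single grep filter. The function name `filter_interesting` is hypothetical, and the pattern also accepts the `.tiff` and `.jpeg` spellings as an assumption about what is worth keeping.

```shell
# Hypothetical helper: keep only URLs whose path ends in an interesting extension.
# Reads URLs on stdin, one per line.
filter_interesting() {
  # -i ignores case; the trailing group allows a ?query or #fragment after the extension.
  grep -Ei '\.(pdf|docx?|xlsx?|pptx?|csv|zip|7z|rar|eml|msg|pst|ost|sql|tiff?|jpe?g|png)([?#].*)?$'
}
```

Usage: `filter_interesting < grepped_urls.lst`. The match is purely on the URL string, so files served from extensionless URLs are missed.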

Google

regex to extract search result URLs from Google web responses
(?<=\bhref=")[^"]+(?="\s+data-ved\b)
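Outside of Logger++, the same pattern can be applied to responses saved to disk. This is a minimal sketch that assumes GNU grep built with PCRE support (`-P`), since the regex relies on lookarounds; the function name `extract_google_urls` and the input file name are hypothetical.

```shell
# Hypothetical helper: pull result URLs out of a saved Google response body.
extract_google_urls() {
  # -o prints only the matched text; -P enables the PCRE lookbehind/lookahead.
  # sed decodes &amp; back to & because hrefs are HTML-escaped in the markup.
  grep -oP '(?<=\bhref=")[^"]+(?="\s+data-ved\b)' "$1" \
    | sed 's/&amp;/\&/g' \
    | LC_ALL=C sort -u
}
```

Usage: `extract_google_urls google_responses.txt > grepped_urls.lst`. The pattern depends on Google's current markup (an `href` immediately followed by `data-ved`), so it may need adjusting when that markup changes.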

File Scraping

Search for a file extension under a specified domain
site:example.com filetype:<file_extension>
Search for multiple file extensions under a specified domain.
site:example.com (ext:pdf OR ext:docx OR ext:doc OR ext:xlsx OR ext:xls OR ext:pptx OR ext:ppt OR ext:csv OR ext:zip OR ext:7z OR ext:rar OR ext:eml OR ext:msg OR ext:pst OR ext:ost OR ext:sql OR ext:tif OR ext:tiff OR ext:jpg OR ext:jpeg OR ext:png)
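Long multi-extension queries like the one above are easy to mistype, so they can be generated instead. The sketch below assumes a hypothetical helper named `build_dork` that takes the domain, the operator name (`ext` or `filetype`), and the extensions as arguments.

```shell
# Hypothetical helper: build a "site:<domain> (op:a OR op:b ...)" query string.
# $1 = domain, $2 = operator (ext or filetype), remaining args = extensions.
build_dork() {
  local domain="$1"
  local op="$2"
  local terms=""
  local e
  shift 2
  for e in "$@"; do
    if [ -z "$terms" ]; then
      terms="$op:$e"
    else
      terms="$terms OR $op:$e"
    fi
  done
  printf 'site:%s (%s)\n' "$domain" "$terms"
}
```

Usage: `build_dork example.com ext pdf docx xlsx` prints `site:example.com (ext:pdf OR ext:docx OR ext:xlsx)`; passing `filetype` as the operator produces the `filetype:` form of the same query.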

Dorks

Operator | What it does | Status (2025‑09)
"<exact phrase>" | Match an exact phrase. | Supported
word1 OR word2 | Results containing either term (use uppercase OR). | Supported
-term | Exclude a word/phrase. | Supported
(term1 OR term2) term3 | Parentheses don’t reliably control precedence in Web Search. Prefer term1 term3 OR term2 term3. | Correction
"term * term" | “Fill‑in‑the‑blank” wildcard inside quotes; * stands for one or more words. | Works (undocumented)
site:example.com | Restrict results to a domain (includes subdomains). | Supported
site:*.example.com | Unnecessary. Use site:example.com to include subdomains. | Correction
site:.gov | Restrict to a top‑level domain (TLD). | Supported
filetype:pdf | Only files of a specific type. | Supported
ext:pdf | Alias of filetype:. | Works (undocumented)
inurl:login | Term must appear in the URL. | Works (undocumented)
intitle:"annual report" | Term/phrase must be in the title. | Works (undocumented)
allintitle: term1 term2 | All listed terms must be in the title. | Works (undocumented)
intext:"error code 500" | Term/phrase must appear in page body text. | Works (undocumented)
allintext: term1 term2 | All listed terms must be in the body text. | Works (undocumented)
cache:example.com/page | View Google’s cached copy of a page. | Removed
related:example.com | Pages Google considers similar to a URL/domain. | Removed
info:example.com | Info panel for a URL/domain. | Deprecated (don’t rely on it)
define:term | Dictionary/definition results for a term. | Supported
after:2024-01-01 | Results after a specific date (YYYY‑MM‑DD). | Supported
before:2024-12-31 | Results before a specific date. | Supported
2019..2024 | Numeric range (years, prices, etc.). | Works (undocumented)
"term1" AROUND(3) "term2" | Proximity search; terms near each other (AROUND must be uppercase). | Works (undocumented / inconsistent)
source:Reuters | Filter by news source. Google News only. | Supported (News only)
imagesize:1920x1080 | Exact image size. Google Images only. | Supported (Images only)
site:example.com inurl:/docs/ | Combine domain + URL path filter. | Supported
site:example.com filetype:pdf | Combine domain + file type filter. | Supported
"error message" site:stackoverflow.com | Phrase + site filter. | Supported
"report" (quarterly OR annual) | Use ("report" quarterly) OR ("report" annual); parentheses don’t enforce precedence. | Correction
-site:example.com | Exclude a domain from results. | Supported
-intitle:draft | Exclude pages with a term in the title. | Works (undocumented)

Bing

regex to extract search result URLs from Bing web responses
??

File Scraping

Search for a file extension under a specified domain
site:example.com filetype:<file_extension>
Search for multiple file extensions under a specified domain.
site:example.com (filetype:pdf OR filetype:docx OR filetype:doc OR filetype:xlsx OR filetype:xls OR filetype:pptx OR filetype:ppt OR filetype:csv OR filetype:zip OR filetype:7z OR filetype:rar OR filetype:eml OR filetype:msg OR filetype:pst OR filetype:ost OR filetype:sql OR filetype:tif OR filetype:tiff OR filetype:jpg OR filetype:jpeg OR filetype:png)

Dorks

Dork | Description
"exact phrase" | Match an exact phrase.
word1 OR word2 | Results containing either term (OR must be uppercase).
word1 AND word2 | Results containing both terms.
-term | Exclude a word or phrase (alias: NOT).
(term1 OR term2) term3 | Group terms to control precedence.
site:example.com | Restrict results to a domain or a directory (≤2 levels deep).
site:.gov | Restrict to a top‑level domain (TLD).
filetype:pdf | Only files of a specific type.
ext:pdf | Only pages with that filename extension.
contains:pdf | Pages that link to files of that type.
intitle:term | Term in the page title (single word; chain multiples).
inbody:term | Term in page body text (single word; chain multiples).
inanchor:term | Term in anchor text (single word; chain multiples).
ip:203.0.113.10 | Sites hosted on the specified IPv4 address.
language:en | Restrict to a language (language code).
loc:US | Restrict to a country/region (alias: location:).
prefer:term | Emphasize a term to bias ranking.
url:example.com/page | Check whether a domain or full URL is indexed by Bing.
feed:term | Find RSS/Atom feeds.
hasfeed:term | Pages that contain an RSS/Atom feed.