File Scraping¶
- Set up a web browser to proxy traffic through Burp Suite.
- Search for documents with a search engine and capture the responses in Burp. Click through the pages of search results to capture more responses.
- Use the Burp extension Logger++ to grep for file URLs in the search result responses.
- Loop through the grepped URLs from the captured search results and download each file:
for i in $(cat grepped_urls.lst); do wget "$i" -P ./downloaded_files/; done
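If the captured responses are also exported from Burp to a plain text file, the grep and download steps above can be approximated offline. The following is a minimal sketch under that assumption: the helper names `extract_file_urls` and `download_urls` and the file name `search_responses.txt` are hypothetical, and the regular expression is only a rough match for the extensions listed on this page.

```bash
# Hypothetical helper: read raw response text on stdin and print the unique
# URLs that contain one of the listed file extensions.
extract_file_urls() {
    grep -Eio 'https?://[^"<>&? ]+\.(pdf|docx?|xlsx?|pptx?|csv|zip|7z|rar|eml|msg|pst|ost|sql)' | sort -u
}

# Hypothetical helper: download every URL listed in file $1 into directory $2.
download_urls() {
    mkdir -p "$2"
    while IFS= read -r url; do
        # Skip blank lines; report failures but keep going.
        [ -n "$url" ] || continue
        wget --no-verbose --tries=2 --timeout=15 \
            --directory-prefix="$2" "$url" || echo "failed: $url" >&2
    done < "$1"
}

# Example usage (file names are assumptions):
# extract_file_urls < search_responses.txt > grepped_urls.lst
# download_urls grepped_urls.lst ./downloaded_files
```

Reading the list line by line avoids the word splitting that `$(cat ...)` performs. The extension match is not anchored, so a URL such as `file.docs` would also be picked up as `.doc`; review the list before downloading. `wget -i grepped_urls.lst -P ./downloaded_files/` is a shorter alternative when per-URL error reporting is not needed.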
Interesting File Extensions¶
- .docx/.doc
- .xlsx/.xls
- .pptx/.ppt
- .csv
- .zip/.7z/.rar
- .eml/.msg
- .pst/.ost
- .sql
- .tif/.jpg/.png
Google¶
File Scraping¶
Search for multiple file extensions under a specified domain.
site:example.com (ext:pdf OR ext:docx OR ext:doc OR ext:xlsx OR ext:xls OR ext:pptx OR ext:ppt OR ext:csv OR ext:zip OR ext:7z OR ext:rar OR ext:eml OR ext:msg OR ext:pst OR ext:ost OR ext:sql OR ext:tif OR ext:tiff OR ext:jpg OR ext:jpeg OR ext:png)
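Typing the long OR list by hand is error-prone, so the query can be generated from a list of extensions. This is a small sketch, not part of any tool: `build_file_dork` is a hypothetical helper that takes the domain, the operator name (`ext` here, `filetype` for the Bing form later on this page), and the extensions.

```bash
# Hypothetical helper: build "site:<domain> (<op>:a OR <op>:b ...)".
# $1 = domain, $2 = operator (ext or filetype), remaining args = extensions.
build_file_dork() {
    domain=$1
    operator=$2
    shift 2
    terms=""
    for ext in "$@"; do
        # Join the terms with an uppercase OR.
        if [ -n "$terms" ]; then
            terms="$terms OR "
        fi
        terms="${terms}${operator}:${ext}"
    done
    printf 'site:%s (%s)\n' "$domain" "$terms"
}

build_file_dork example.com ext pdf docx xlsx
# prints: site:example.com (ext:pdf OR ext:docx OR ext:xlsx)
```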
Dorks¶
Operator | What it does | Status (2025‑09) |
---|---|---|
"<exact phrase>" | Match an exact phrase. | Supported |
word1 OR word2 | Results containing either term (use uppercase OR). | Supported |
-term | Exclude a word/phrase. | Supported |
(term1 OR term2) term3 | Parentheses don’t reliably control precedence in Web Search. Prefer term1 term3 OR term2 term3. | Correction |
"term * term" | “Fill‑in‑the‑blank” wildcard inside quotes; * stands for one or more words. | Works (undocumented) |
site:example.com | Restrict results to a domain (includes subdomains). | Supported |
site:*.example.com | Unnecessary. Use site:example.com to include subdomains. | Correction |
site:.gov | Restrict to a top‑level domain (TLD). | Supported |
filetype:pdf | Only files of a specific type. | Supported |
ext:pdf | Alias of filetype:. | Works (undocumented) |
inurl:login | Term must appear in the URL. | Works (undocumented) |
intitle:"annual report" | Term/phrase must be in the title. | Works (undocumented) |
allintitle: term1 term2 | All listed terms must be in the title. | Works (undocumented) |
intext:"error code 500" | Term/phrase must appear in page body text. | Works (undocumented) |
allintext: term1 term2 | All listed terms must be in the body text. | Works (undocumented) |
cache:example.com/page | View Google’s cached copy of a page. | Removed |
related:example.com | Pages Google considers similar to a URL/domain. | Removed |
info:example.com | Info panel for a URL/domain. | Deprecated (don’t rely on it) |
define:term | Dictionary/definition results for a term. | Supported |
after:2024-01-01 | Results after a specific date (YYYY‑MM‑DD). | Supported |
before:2024-12-31 | Results before a specific date. | Supported |
2019..2024 | Numeric range (years, prices, etc.). | Works (undocumented) |
"term1" AROUND(3) "term2" | Proximity search; terms near each other (AROUND must be uppercase). | Works (undocumented / inconsistent) |
source:Reuters | Filter by news source. Google News only. | Supported (News only) |
imagesize:1920x1080 | Exact image size. Google Images only. | Supported (Images only) |
site:example.com inurl:/docs/ | Combine domain + URL path filter. | Supported |
site:example.com filetype:pdf | Combine domain + file type filter. | Supported |
"error message" site:stackoverflow.com | Phrase + site filter. | Supported |
"report" (quarterly OR annual) | Use ("report" quarterly) OR ("report" annual); parentheses don’t enforce precedence. | Correction |
-site:example.com | Exclude a domain from results. | Supported |
-intitle:draft | Exclude pages with a term in the title. | Works (undocumented) |
Bing¶
File Scraping¶
Search for multiple file extensions under a specified domain.
site:example.com (filetype:pdf OR filetype:docx OR filetype:doc OR filetype:xlsx OR filetype:xls OR filetype:pptx OR filetype:ppt OR filetype:csv OR filetype:zip OR filetype:7z OR filetype:rar OR filetype:eml OR filetype:msg OR filetype:pst OR filetype:ost OR filetype:sql OR filetype:tif OR filetype:tiff OR filetype:jpg OR filetype:jpeg OR filetype:png)
Dorks¶
Dork | Description |
---|---|
"exact phrase" | Match an exact phrase. |
word1 OR word2 | Results containing either term (OR must be uppercase). |
word1 AND word2 | Results containing both terms. |
-term | Exclude a word or phrase (alias: NOT). |
(term1 OR term2) term3 | Group terms to control precedence. |
site:example.com | Restrict results to a domain or a directory (≤2 levels deep). |
site:.gov | Restrict to a top‑level domain (TLD). |
filetype:pdf | Only files of a specific type. |
ext:pdf | Only pages with that filename extension. |
contains:pdf | Pages that link to files of that type. |
intitle:term | Term in the page title (single word; chain multiples). |
inbody:term | Term in page body text (single word; chain multiples). |
inanchor:term | Term in anchor text (single word; chain multiples). |
ip:203.0.113.10 | Sites hosted on the specified IPv4 address. |
language:en | Restrict to a language (language code). |
loc:US | Restrict to a country/region (alias: location:). |
prefer:term | Emphasize a term to bias ranking. |
url:example.com/page | Check whether a domain or full URL is indexed by Bing. |
feed:term | Find RSS/Atom feeds. |
hasfeed:term | Pages that contain an RSS/Atom feed. |