SecretScraper is a highly configurable web crawler that collects links from target websites and uses regular expressions to extract sensitive data.
Requirements
Python >= 3.9
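Features
- Web crawler: extracts links through the DOM hierarchy and regular expressions
- Supports domain whitelists and blacklists
- Supports multiple targets, read from a file of target URLs
- Supports scanning local files
- Extensible customization: headers, proxy, timeout, cookie, crawl depth, following redirects, and more
- Built-in regular expressions for finding sensitive information
- Flexible configuration in YAML format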
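To make the first feature concrete, here is a minimal sketch of collecting links both from DOM attributes and from a regular expression over the raw page text. It is an illustration only, not SecretScraper's implementation: `URL_RE`, `LinkParser`, and `extract_links` are hypothetical names, and the pattern is a simplified stand-in for the tool's own `urlFind` rules.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical, simplified pattern; the tool ships its own urlFind rules.
URL_RE = re.compile(r"""["'](https?://[^\s"'<>]+)["']""")


class LinkParser(HTMLParser):
    """Collect href/src/action attribute values while walking the DOM."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src", "action") and value:
                self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs found via the DOM plus quoted URLs found by regex."""
    parser = LinkParser()
    parser.feed(html)
    found = {urljoin(base_url, link) for link in parser.links}
    found.update(URL_RE.findall(html))
    return sorted(found)


page = '<a href="/login">Login</a><script>var api = "https://api.example.com/v1";</script>'
links = extract_links("https://example.com/", page)
print(links)
```

The regex pass is what picks up URLs embedded in inline JavaScript, which a DOM walk alone would miss.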
Installation
Install:
    pip install secretscraper
Upgrade:
    pip install --upgrade secretscraper
Basic usage
Single target:
    secretscraper -u https://xxxxxxx.com/
Multiple targets:
    secretscraper -f urls
Contents of the urls file, one target per line:
    http://xxxxxxx.com/1
    http://xxxxxxx.com/2
    http://xxxxxxx.com/3
    http://xxxxxxx.com/4
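If the target list is generated by another script, the URL file and the command line can be assembled programmatically. The sketch below is one possible way to do that; `write_url_file` and `build_command` are hypothetical helpers, and only the `-f` and `-o` flags listed in the tool's help output are used. It builds the argument list without running the tool.

```python
import tempfile
from pathlib import Path


def write_url_file(urls, directory):
    """Write one target URL per line, the format expected by -f."""
    path = Path(directory) / "urls"
    path.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return path


def build_command(url_file, outfile=None):
    """Assemble the argument list; pass it to subprocess.run to execute."""
    cmd = ["secretscraper", "-f", str(url_file)]
    if outfile:
        cmd += ["-o", str(outfile)]
    return cmd


targets = [f"http://xxxxxxx.com/{i}" for i in range(1, 5)]
with tempfile.TemporaryDirectory() as tmp:
    url_file = write_url_file(targets, tmp)
    file_lines = url_file.read_text(encoding="utf-8").splitlines()
    command = build_command(url_file, outfile="result.csv")
print(command)
```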
Sample output
[Figure: sample output]
All supported options:
    > secretscraper --help
    Usage: secretscraper [OPTIONS]

      Main commands

    Options:
      -V, --version                Show version and exit.
      --debug                      Enable debug.
      -a, --ua TEXT                Set User-Agent
      -c, --cookie TEXT            Set cookie
      -d, --allow-domains TEXT     Domain white list, wildcard(*) is supported,
                                   separated by commas, e.g. *.example.com,
                                   example*
      -D, --disallow-domains TEXT  Domain black list, wildcard(*) is supported,
                                   separated by commas, e.g. *.example.com,
                                   example*
      -f, --url-file FILE          Target urls file, separated by line break
      -i, --config FILE            Set config file, defaults to settings.yml
      -m, --mode [1|2]             Set crawl mode, 1(normal) for max_depth=1,
                                   2(thorough) for max_depth=2, default 1
      --max-page INTEGER           Max page number to crawl, default 100000
      --max-depth INTEGER          Max depth to crawl, default 1
      -o, --outfile FILE           Output result to specified file in csv format
      -s, --status TEXT            Filter response status to display, seperated by
                                   commas, e.g. 200,300-400
      -x, --proxy TEXT             Set proxy, e.g. http://127.0.0.1:8080,
                                   socks5://127.0.0.1:7890
      -H, --hide-regex             Hide regex search result
      -F, --follow-redirects       Follow redirects
      -u, --url TEXT               Target url
      --detail                     Show detailed result
      --validate                   Validate the status of found urls
      -l, --local PATH             Local file or directory, scan local
                                   file/directory recursively
      --help                       Show this message and exit.
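The `--allow-domains`, `--disallow-domains`, and `--status` options take small comma-separated mini-languages (wildcards such as `*.example.com`, ranges such as `300-400`). The sketch below shows one plausible way to interpret such values using Python's standard `fnmatch` module. It is an assumption about the semantics, not the tool's code: the helper names are hypothetical, and the rule that the blacklist takes precedence over the whitelist is this example's choice.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse


def domain_allowed(url, allow=(), deny=()):
    """Check the deny list first, then the allow list (empty allow = allow all)."""
    host = urlparse(url).hostname or ""
    if any(fnmatch(host, pattern) for pattern in deny):
        return False
    return not allow or any(fnmatch(host, pattern) for pattern in allow)


def parse_status_filter(spec):
    """Turn a value like '200,300-400' into inclusive (low, high) ranges."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        low, _, high = part.partition("-")
        ranges.append((int(low), int(high or low)))
    return ranges


def status_matches(status, ranges):
    """Return True if the status code falls inside any range."""
    return any(low <= status <= high for low, high in ranges)


ranges = parse_status_filter("200,300-400")
print(domain_allowed("https://api.example.com/x", allow=["*.example.com"]))
print(ranges, status_matches(302, ranges), status_matches(404, ranges))
```

Note that under `fnmatch` semantics the pattern `*.example.com` matches subdomains but not the bare `example.com`; whether SecretScraper behaves the same way should be checked against the tool itself.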
Customization
The built-in configuration is shown below. You can supply a custom configuration with -i settings.yml.
    verbose: false
    debug: false
    loglevel: critical
    logpath: log
    handler_type: re

    proxy: "" # http://127.0.0.1:7890
    max_depth: 1 # 0 for no limit
    max_page_num: 1000 # 0 for no limit
    timeout: 5
    follow_redirects: true
    workers_num: 1000
    headers:
      Accept: "*/*"
      Cookie: ""
      User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 SE 2.X MetaSr 1.0

    urlFind:
      - "[\"''"`]\\s{0,6}(https{0,1}:[-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250}?)\\s{0,6}[\"''"`]"
      - "=\\s{0,6}(https{0,1}:[-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250})"
      - "[\"''"`]\\s{0,6}([#,.]{0,2}/[-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250}?)\\s{0,6}[\"''"`]"
      - "\"([-a-zA-Z0-9()@:%_\\+.~#?&//={}]+?[/]{1}[-a-zA-Z0-9()@:%_\\+.~#?&//={}]+?)\""
      - "href\\s{0,6}=\\s{0,6}[\"''"`]{0,1}\\s{0,6}([-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250})|action\\s{0,6}=\\s{0,6}[\"''"`]{0,1}\\s{0,6}([-a-zA-Z0-9()@:%_\\+.~#?&//={}]{2,250})"
    jsFind:
      - (https{0,1}:[-a-zA-Z0-9()@:%_\+.~#?&//=]{2,100}?[-a-zA-Z0-9()@:%_\+.~#?&//=]{3}[.]js)
      - '["'''"`]\s{0,6}(/{0,1}[-a-zA-Z0-9()@:%_\+.~#?&//=]{2,100}?[-a-zA-Z0-9()@:%_\+.~#?&//=]{3}[.]js)'
      - =\s{0,6}[",',',"]{0,1}\s{0,6}(/{0,1}[-a-zA-Z0-9()@:%_\+.~#?&//=]{2,100}?[-a-zA-Z0-9()@:%_\+.~#?&//=]{3}[.]js)

    dangerousPath:
      - logout
      - update
      - remove
      - insert
      - delete

    rules:
      - name: Swagger
        regex: \b[\w/]+?((swagger-ui.html)|(\"swagger\":)|(Swagger UI)|(swaggerUi)|(swaggerVersion))\b
        loaded: true
      - name: ID Card
        regex: \b((\d{8}(0\d|10|11|12)([0-2]\d|30|31)\d{3})|(\d{6}(18|19|20)\d{2}(0[1-9]|10|11|12)([0-2]\d|30|31)\d{3}(\d|X|x)))\b
        loaded: true
      - name: Phone
        regex: "['\"](1(3([0-35-9]\\d|4[1-8])|4[14-9]\\d|5([\\d]\\d|7[1-79])|66\\d|7[2-35-8]\\d|8\\d{2}|9[89]\\d)\\d{7})['\"]"
        loaded: true
      - name: JS Map
        regex: \b([\w/]+?\.js\.map)
        loaded: true
      - name: URL as a Value
        regex: (\b\w+?=(https?)(://|%3a%2f%2f))
        loaded: false
      - name: Email
        regex: "['\"]([\\w]+(?:\\.[\\w]+)*@(?:[\\w](?:[\\w-]*[\\w])?\\.)+[\\w](?:[\\w-]*[\\w])?)['\"]"
        loaded: true
      - name: Internal IP
        regex: '[^0-9]((127\.0\.0\.1)|(10\.\d{1,3}\.\d{1,3}\.\d{1,3})|(172\.((1[6-9])|(2\d)|(3[01]))\.\d{1,3}\.\d{1,3})|(192\.168\.\d{1,3}\.\d{1,3}))'
        loaded: true
      - name: Cloud Key
        regex: \b((accesskeyid)|(accesskeysecret)|\b(LTAI[a-z0-9]{12,20}))\b
        loaded: true
      - name: Shiro
        regex: (=deleteMe|rememberMe=)
        loaded: true
      - name: Suspicious API Key
        regex: "[\"'][0-9a-zA-Z]{32}['\"]"
        loaded: true
      - name: Jwt
        regex: "['\"](ey[A-Za-z0-9_-]{10,}\\.[A-Za-z0-9._-]{10,}|ey[A-Za-z0-9_\\/+-]{10,}\\.[A-Za-z0-9._\\/+-]{10,})['\"]"
        loaded: true
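Each entry under `rules` pairs a name with a regular expression and a `loaded` switch. The sketch below illustrates how such a rule list could be applied to a piece of text with Python's `re` module. It is not the tool's own matching code: the `scan` function and the `sample` text are hypothetical, and the three regexes are copied from the configuration above (rewritten as Python raw strings).

```python
import re

# Three rules copied from the built-in configuration; "loaded" toggles a rule.
RULES = [
    {"name": "JS Map", "regex": r"\b([\w/]+?\.js\.map)", "loaded": True},
    {"name": "Shiro", "regex": r"(=deleteMe|rememberMe=)", "loaded": True},
    {"name": "URL as a Value", "regex": r"(\b\w+?=(https?)(://|%3a%2f%2f))", "loaded": False},
]


def scan(text, rules):
    """Return {rule name: [matched strings]} for every loaded rule that hits."""
    results = {}
    for rule in rules:
        if not rule["loaded"]:
            continue  # disabled rules are skipped entirely
        hits = [match.group(0) for match in re.finditer(rule["regex"], text)]
        if hits:
            results[rule["name"]] = hits
    return results


sample = (
    "Set-Cookie: rememberMe=abc123\n"
    "//# sourceMappingURL=app.js.map\n"
    "next=https://example.com/"
)
results = scan(sample, RULES)
print(results)
```

Because `URL as a Value` has `loaded: false`, the `next=https://` fragment is not reported, even though its regex would match. In a custom settings.yml, enabling it should only require setting `loaded: true`.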
Project repository (stars welcome):
https://github.com/PadishahIII/SecretScraper