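A grep-like tool for web pages, with additional features such as JavaScript deobfuscation and easy extensibility
webgrep-tool: detailed Python project description
Table of Contents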
- Introduction
- System Requirements
- Installation
- Quick Start
- Design Principles
- Resource Handlers
- Issues management
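Introduction
This self-contained tool relies on the well-known grep tool to grep web pages. It binds nearly every option of the original tool and also provides additional features, such as deobfuscating JavaScript or applying OCR on images before grepping the downloaded resources.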
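To picture what relying on grep can look like from Python, here is a minimal sketch (not webgrep's actual code) that feeds already-downloaded text to the grep binary on standard input. The function names `build_grep_command` and `grep_text` are hypothetical, only two of the options from the help text are mapped, and Python 3.5+ is assumed for `subprocess.run`.

```python
import subprocess


def build_grep_command(pattern, extended=False, ignore_case=False):
    """Build the argument list for a grep call that reads standard input."""
    cmd = ["grep"]
    if extended:
        cmd.append("-E")  # PATTERN is an extended regular expression (ERE)
    if ignore_case:
        cmd.append("-i")  # ignore case distinctions
    # "-e" keeps a pattern starting with "-" from being parsed as an option
    cmd += ["-e", pattern]
    return cmd


def grep_text(pattern, text, **options):
    """Run grep on already-downloaded text and return the matching lines."""
    result = subprocess.run(
        build_grep_command(pattern, **options),
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # grep exits with 0 on a match, 1 on no match and 2 on error
    if result.returncode > 1:
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8", "replace").splitlines()
```

Passing the command as a list rather than a shell string avoids quoting problems when the pattern contains spaces or shell metacharacters.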
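System Requirements
This script was tested on Ubuntu 16.04 with Python 2.7 and Python 3.5.
Its Python logic mainly uses standard built-in modules, but also some specific tools and preprocessor-related modules. It makes calls to grep.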
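Because the tool mixes standard modules with non-standard modules and external programs, it can help to check up front which ones are present. The following is a rough sketch under stated assumptions, not the tool's own logic: `missing_dependencies` and `report` are hypothetical names, the install commands are supplied by the caller rather than guessed, and Python 3 is assumed for `shutil.which`.

```python
import importlib
import shutil
import sys


def missing_dependencies(modules, tools):
    """Return {name: install_command} for every dependency that is absent.

    `modules` maps importable Python module names to an install command,
    `tools` maps external program names to an install command.
    """
    missing = {}
    for name, install_cmd in modules.items():
        try:
            importlib.import_module(name)
        except ImportError:
            missing[name] = install_cmd
    for name, install_cmd in tools.items():
        # shutil.which returns None when the program is not on the PATH
        if shutil.which(name) is None:
            missing[name] = install_cmd
    return missing


def report(missing, fatal=False):
    """Print an install hint per missing dependency; exit only if fatal."""
    for name, install_cmd in sorted(missing.items()):
        print("{} is not installed; try: {}".format(name, install_cmd),
              file=sys.stderr)
    if missing and fatal:
        sys.exit(1)
```

A caller could use `fatal=True` for something the tool cannot work without, such as grep, and `fatal=False` for anything that only enables an extra feature, so that execution continues without it.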
Installation
$ sudo pip install webgrep-tool
Behind a proxy? Do not forget to add the option --proxy=http://[user]:[pwd]@[host]:[port] to your pip command.
Quick Start
- Help
$ webgrep --help
usage: webgrep [OPTION]... PATTERN [URL]...

Search for PATTERN in each input URL and its related resources (images, scripts and style sheets).
By default,
 - resources are NOT downloaded
 - response HTTP headers are NOT included in grepping ; use '--include-headers'
 - PATTERN is a basic regular expression (BRE) ; use '-E' for extended (ERE)
Important note: webgrep does not handle recursion (in other words, it does not spider additional web pages).

Examples:
  webgrep example http://www.example.com     # will only grep on HTML code
  webgrep -r example http://www.example.com  # will only grep on LOCAL images, ...
  webgrep -R example http://www.example.com  # will only grep on ALL images, ...

Regexp selection and interpretation:
  -e REGEXP, --regexp REGEXP    use PATTERN for matching
  -f FILE, --file FILE          obtain PATTERN from FILE
  -E, --extended-regexp         PATTERN is an extended regular expression (ERE)
  -F, --fixed-strings           PATTERN is a set of newline-separated fixed strings
  -G, --basic-regexp            PATTERN is a basic regular expression (BRE)
  -P, --perl-regexp             PATTERN is a Perl regular expression
  -i, --ignore-case             ignore case distinctions
  -w, --word-regexp             force PATTERN to match only whole words
  -x, --line-regexp             force PATTERN to match only whole lines
  -z, --null-data               a data line ends in 0 byte, not newline

Miscellaneous:
  -s, --no-messages             suppress error messages
  -v, --invert-match            select non-matching lines
  -V, --version                 print version information and exit
  --help                        display this help and exit
  --verbose                     verbose mode
  --keep-files                  keep temporary files in the temporary directory
  --temp-dir TMP                define the temporary directory (default: /tmp/webgrep)

Output control:
  -m NUM, --max-count NUM       stop after NUM matches
  -b, --byte-offset             print the byte offset with output lines
  -n, --line-number             print line number with output lines
  --line-buffered               flush output on every line
  -H, --with-filename           print the file name for each match
  -h, --no-filename             suppress the file name prefix on output
  --label LABEL                 use LABEL as the standard input filename prefix
  -o, --only-matching           show only the part of a line matching PATTERN
  -q, --quiet, --silent         suppress all normal output
  --binary-files TYPE           assume that binary files are TYPE;
                                TYPE is 'binary', 'text', or 'without-match'
  -a, --text                    equivalent to --binary-files=text
  -I                            equivalent to --binary-files=without-match
  -L, --files-without-match     print only names of FILEs containing no match
  -l, --files-with-match        print only names of FILEs containing matches
  -c, --count                   print only a count of matching lines per FILE
  -T, --initial-tab             make tabs line up (if needed)
  -Z, --null                    print 0 byte after FILE name

Context control:
  -B NUM, --before-context NUM  print NUM lines of leading context
  -A NUM, --after-context NUM   print NUM lines of trailing context
  -C NUM, --context NUM         print NUM lines of output context

Web options:
  -r, --local-resources         also grep local resources (same-origin)
  -R, --all-resources           also grep all resources (even non-same-origin)
  --include-headers             also grep HTTP headers
  --cookie COOKIE               use a session cookie in the HTTP headers
  --referer REFERER             provide the referer in the HTTP headers

Proxy settings (by default, system proxy settings are used):
  -d, --disable-proxy           manually disable proxy
  --http-proxy HTTP             manually set the HTTP proxy
  --https-proxy HTTPS           manually set the HTTPS proxy

Please report bugs on GitHub: https://github.com/dhondta/webgrep
- Example
$ ./webgrep -R Welcome https://github.com
Welcome home, <br>developers
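The -r and -R options in the help text separate local (same-origin) resources from all resources. The sketch below illustrates one possible reading of that distinction. It assumes that same-origin means identical scheme, host and port, which may differ from the check webgrep actually performs; `origin` and `select_resources` are hypothetical names.

```python
from urllib.parse import urljoin, urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that identifies an origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def select_resources(page_url, links, mode="none"):
    """Pick the resource URLs to grep in addition to the page itself.

    mode is "none" (default behaviour), "local" (like -r) or "all" (like -R).
    """
    if mode == "none":
        return []
    # Resolve relative links against the URL of the page that contains them
    absolute = [urljoin(page_url, link) for link in links]
    if mode == "all":
        return absolute
    return [url for url in absolute if origin(url) == origin(page_url)]


page = "http://www.example.com/index.html"
links = ["/img/logo.png", "http://cdn.example.net/app.js", "style.css"]
local_only = select_resources(page, links, mode="local")
everything = select_resources(page, links, mode="all")
```

Here `local_only` keeps the logo and the style sheet, while `everything` also includes the script hosted on the other domain.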
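Design Principles:
- For non-standard imports:
  - trigger an exit if they are not installed and display the command for installing them
  - do not trigger an exit if they are not installed; display the command for installing them and continue execution without the related functionality
- The tool is self-contained, so it can simply be copied into /usr/bin, with no dependencies other than the non-standard imports.
Resource Handlers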
Definitions:
- resource (what is processed): web page, images, JavaScript, CSS
- handler (how a resource is processed): CSS unminifying, OCR, deobfuscation, EXIF data retrieval, ...
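The two definitions above can be pictured as a table that maps each kind of resource to the handlers applied before grepping. This is only an illustrative sketch: the `HANDLERS` table, the `preprocess` function and both handler functions are hypothetical, and `printable_strings` is a pure-Python stand-in for a strings-like handler, not a call to any external tool.

```python
import re


def as_text(data):
    """Decode raw bytes so that the content can be searched as text."""
    return data.decode("utf-8", "replace")


def printable_strings(data, min_length=4):
    """Extract runs of printable ASCII characters from binary data."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return "\n".join(m.decode("ascii") for m in re.findall(pattern, data))


# resource kind -> handlers producing text that can then be grepped
HANDLERS = {
    "page": [as_text],
    "script": [as_text],
    "style": [as_text],
    "image": [printable_strings],
}


def preprocess(kind, data):
    """Apply every handler registered for this kind of resource."""
    return [handler(data) for handler in HANDLERS.get(kind, [])]
```

With such a layout, adding a handler amounts to writing one function and appending it to the list for the relevant resource kind.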
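Handlers are defined in the # --...-- HANDLERS SECTION --...-- section of the code. Currently available handlers:
- Images
  - EXIF: using exiftool
  - Steganography: using steghide (with a blank password)
  - Strings: using strings
  - OCR: using tesseract
- Scripts
  - JavaScript beautifying and deobfuscation: using jsbeautifier
- Styles
  - Unminifying: using regular expressions
Note: images in CSS files are also processed.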
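Since grep matches line by line, unminifying a style sheet first makes each match point at a single declaration instead of the whole file. The following is a deliberately naive sketch of regex-based unminifying, not the expressions webgrep uses; `unminify_css` is a hypothetical name, and the sketch would wrongly split on semicolons or braces that appear inside strings or data URIs.

```python
import re


def unminify_css(css):
    """Put each rule and each declaration of minified CSS on its own line."""
    css = re.sub(r"\s*\{\s*", " {\n  ", css)  # open a block, indent its body
    css = re.sub(r"\s*;\s*", ";\n  ", css)    # one declaration per line
    css = re.sub(r"\s*\}\s*", "\n}\n", css)   # close the block on its own line
    return css


print(unminify_css("a{color:red;margin:0}b{top:1px}"))
```

After this step, a search for `margin` would return only the line holding that declaration.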
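Issues management
If you want to contribute or submit suggestions, please open an Issue.
If you want to build and submit new handlers, please open a Pull Request.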