28 Commits

Author SHA1 Message Date
King_DuckZ c31d317d51 Print a notice in the --help view in debug builds. 2015-10-01 15:37:09 +02:00
King_DuckZ 41b0f59039 Bump version to 0.2.1b 2015-10-01 15:32:30 +02:00
King_DuckZ 5d0a895978 Attach GPLv3. 2015-10-01 15:31:58 +02:00
King_DuckZ bdd50d2267 Refactor xpath query into a separate function. 2015-10-01 14:18:02 +02:00
King_DuckZ c9db1d8ba3 Wrap the unique_ptr so that dtor is called from the cpp.
This makes it unnecessary to include scrapast.hpp in whatever
cpp takes ownership of the unique_ptr. Without this, though,
the destructor of unique_ptr would force you to include scrapast.hpp.
2015-10-01 01:45:42 +02:00
King_DuckZ dfd0ec343e Implement parsing of scraplang. 2015-10-01 01:32:27 +02:00
King_DuckZ 7dfd1f4a38 Update tidy (fixes the build with tidy as a submodule) 2015-09-30 16:33:52 +02:00
King_DuckZ bf3b85498b Add an option to customize the user agent at runtime. 2015-09-30 01:27:28 +02:00
King_DuckZ c947eab83f Show some readable message when being passed an unknown option. 2015-09-30 01:14:47 +02:00
King_DuckZ 05af365c58 Move command line parsing code to a new file. 2015-09-30 01:13:48 +02:00
King_DuckZ c304ffbbf0 Don't detect if it's a tty - only read from stdin when url is - 2015-09-29 21:04:28 +02:00
King_DuckZ db1311839d Check fstats instead of using isatty(). 2015-09-29 17:40:01 +02:00
King_DuckZ c69252604c Default to static tidy-html5, but let the user configure this. 2015-09-28 23:44:11 +02:00
King_DuckZ 49aa62815a Allow piping.
At the moment you still need to specify some parameter for the url,
even if it's not needed. The good news is that the value doesn't have
to be a valid URL, so any string will do.
2015-09-28 23:37:42 +02:00
King_DuckZ 943e760ffd Add dump parameters.
Allows dumping both raw and cleaned-up html.
2015-09-28 23:24:23 +02:00
King_DuckZ 00150938dd Fix the html cleaning code that was not really cleaning. 2015-09-28 22:59:09 +02:00
King_DuckZ 8e517e5de9 Parse options through boost program_options. 2015-09-28 21:48:46 +02:00
King_DuckZ 4f85fa01a9 Update libtidy and curlcpp. 2015-09-28 15:30:09 +02:00
King_DuckZ 44992458ac Quick dirty fix to avoid invalid characters in scripts.
Note that with this change scripts are stripped away, so
you won't find any <script></script> pair in the html.

Also print some more detailed info about errors.
2015-03-01 05:03:12 +01:00
King_DuckZ 3bfea89568 Drop tidy from the repo and import it as submodule. 2015-03-01 03:17:47 +01:00
King_DuckZ 0e077a4930 Refactoring to put html retrieval & cleaning into a separate file.
This version should also be capable of retrieving data from https urls.
2014-06-07 22:07:13 +02:00
King_DuckZ cb00e484fa Working example.
Invoke it with, e.g.:
./scraper http://www.dilbert.com '//div[@class='\''STR_Image'\'']/a/img/@src'
2014-06-07 20:44:43 +02:00
King_DuckZ aa015ddd6a Working example.
Tested with:
./scraper //meta[@name]
Note that libtidy adds a meta name=generator tag.
2014-06-07 01:15:06 +02:00
King_DuckZ c9de3d3389 Updating to tidy-html5.
See http://w3c.github.io/tidy-html5/
and https://github.com/w3c/tidy-html5.
2014-06-06 22:33:27 +02:00
King_DuckZ e2d74fd092 Trying to use libtidy but it throws. 2014-06-06 22:22:12 +02:00
King_DuckZ 56f0736d1a Move headers into tidy/ subdirectory. 2014-06-06 21:34:01 +02:00
King_DuckZ 3182e098bb Import of libtidy with custom cmake file. 2014-06-06 21:18:30 +02:00
King_DuckZ f213ce5411 First import 2014-06-06 20:24:24 +02:00
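
Commit c9db1d8ba3 above describes wrapping a unique_ptr so that its destructor is emitted from the cpp that knows the full type. The following is a minimal single-file sketch of that general pattern, not the project's actual code: the names `Ast`, `AstOwner` and `live_ast_count` are hypothetical, and the "header part" and "implementation part" would normally sit in separate files. The key point is that every special member of the wrapper is only declared where `Ast` is incomplete and defined where it is complete.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// ---- header part: what other .cpp files would include ----
struct Ast; // hypothetical stand-in for the type in scrapast.hpp; incomplete here

class AstOwner {
public:
    explicit AstOwner(const std::string& source);
    AstOwner(AstOwner&&) noexcept;
    AstOwner& operator=(AstOwner&&) noexcept;
    ~AstOwner(); // only declared: unique_ptr<Ast>'s deleter is not instantiated here
    const std::string& text() const;

private:
    std::unique_ptr<Ast> m_ast; // a member of incomplete type is allowed
};

// ---- implementation part: the one .cpp that sees the full type ----
struct Ast {
    std::string text;
    static int live; // counts live instances, for demonstration only

    explicit Ast(std::string t) : text(std::move(t)) { ++live; }
    ~Ast() { --live; }
};
int Ast::live = 0;

AstOwner::AstOwner(const std::string& source) : m_ast(new Ast(source)) {}
AstOwner::AstOwner(AstOwner&&) noexcept = default;
AstOwner& AstOwner::operator=(AstOwner&&) noexcept = default;
AstOwner::~AstOwner() = default; // Ast is complete here, so delete is well-formed

const std::string& AstOwner::text() const {
    assert(m_ast && "text() called on a moved-from AstOwner");
    return m_ast->text;
}

// Helper so callers can observe destruction without seeing Ast itself.
int live_ast_count() { return Ast::live; }
```

Code that only sees the header part can create, move and destroy an `AstOwner` without ever including the definition of `Ast`, because the compiler never has to generate `unique_ptr<Ast>`'s destructor in that translation unit.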