Commit graph

15 commits

Author SHA1 Message Date
49aa62815a Allow piping.
At the moment you still need to specify some value for the URL
parameter, even if it's not needed. The good news is that the value
doesn't have to be a valid URL, so any string will do.
2015-09-28 23:37:42 +02:00
943e760ffd Add dump parameters.
Allows dumping both raw and cleaned-up HTML.
2015-09-28 23:24:23 +02:00
00150938dd Fix the html cleaning code that was not really cleaning. 2015-09-28 22:59:09 +02:00
8e517e5de9 Parse options through boost program_options. 2015-09-28 21:48:46 +02:00
4f85fa01a9 Update libtidy and curlcpp. 2015-09-28 15:30:09 +02:00
44992458ac Quick dirty fix to avoid invalid characters in scripts.
Note that with this change scripts are stripped away, so
you won't find any <script></script> pair in the html.

Also print some more detailed info about errors.
2015-03-01 05:03:12 +01:00
3bfea89568 Drop tidy from the repo and import it as submodule. 2015-03-01 03:17:47 +01:00
0e077a4930 Refactoring to put html retrieval & cleaning into a separate file.
This version should also be capable of retrieving data from HTTPS URLs.
2014-06-07 22:07:13 +02:00
cb00e484fa Working example.
Invoke it with e.g.:
./scraper http://www.dilbert.com '//div[@class='\''STR_Image'\'']/a/img/@src'
2014-06-07 20:44:43 +02:00
aa015ddd6a Working example.
Tested with:
./scraper //meta[@name]
Note that libtidy adds a meta name=generator tag.
2014-06-07 01:15:06 +02:00
c9de3d3389 Updating to tidy-html5.
See http://w3c.github.io/tidy-html5/
and https://github.com/w3c/tidy-html5.
2014-06-06 22:33:27 +02:00
e2d74fd092 Trying to use libtidy but it throws. 2014-06-06 22:22:12 +02:00
56f0736d1a Move headers into tidy/ subdirectory. 2014-06-06 21:34:01 +02:00
3182e098bb Import of libtidy with custom cmake file. 2014-06-06 21:18:30 +02:00
f213ce5411 First import 2014-06-06 20:24:24 +02:00