44992458ac
Quick and dirty fix to avoid invalid characters in scripts.
...
Note that with this change scripts are stripped away, so
you won't find any <script></script> pair in the HTML.
Also print some more detailed info about errors.
2015-03-01 05:03:12 +01:00
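To illustrate what stripping scripts means for the XPath queries that run afterwards, here is a minimal sketch in Python. It is not the code from this commit: the helper name strip_scripts and the regex-based approach are assumptions chosen for brevity, and the actual fix may work differently.

```python
import re

# Hypothetical helper, not taken from the repository: removes every
# <script>...</script> pair, including its contents, from an HTML string.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                       re.IGNORECASE | re.DOTALL)


def strip_scripts(html: str) -> str:
    """Return html with all script elements removed."""
    return SCRIPT_RE.sub("", html)


page = ('<html><body><p>Hi</p>'
        '<script type="text/javascript">var a = 1 < 2;</script>'
        '</body></html>')
cleaned = strip_scripts(page)
print(cleaned)
```

After this step an XPath expression such as //script matches nothing, which is the behaviour the note above describes.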
3bfea89568
Drop tidy from the repo and import it as submodule.
2015-03-01 03:17:47 +01:00
0e077a4930
Refactoring to put html retrieval & cleaning into a separate file.
...
This version should also be capable of retrieving data from HTTPS URLs.
2014-06-07 22:07:13 +02:00
cb00e484fa
Working example.
...
Invoke it with, e.g.:
./scraper http://www.dilbert.com '//div[@class='\''STR_Image'\'']/a/img/@src'
2014-06-07 20:44:43 +02:00
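The '\'' sequences in the invocation above are shell quoting: each one closes the single-quoted string, adds a literal single quote, and reopens the string. As a small sketch (using Python's shlex module to mimic POSIX shell word splitting, not anything from the repository), this shows the arguments the program actually receives:

```python
import shlex

# The command line exactly as typed in the shell (raw string, so the
# backslashes are kept as written).
command = r"./scraper http://www.dilbert.com '//div[@class='\''STR_Image'\'']/a/img/@src'"

# shlex.split applies POSIX-style quoting rules, like the shell does.
argv = shlex.split(command)

for i, arg in enumerate(argv):
    print(i, arg)
```

The third argument comes out as //div[@class='STR_Image']/a/img/@src, i.e. a plain XPath expression with single-quoted attribute value.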
aa015ddd6a
Working example.
...
Tested with:
./scraper //meta[@name]
Note that libtidy adds a meta name=generator tag.
2014-06-07 01:15:06 +02:00
c9de3d3389
Updating to tidy-html5.
...
See http://w3c.github.io/tidy-html5/
and https://github.com/w3c/tidy-html5 .
2014-06-06 22:33:27 +02:00
e2d74fd092
Trying to use libtidy but it throws.
2014-06-06 22:22:12 +02:00
56f0736d1a
Move headers into tidy/ subdirectory.
2014-06-06 21:34:01 +02:00
3182e098bb
Import of libtidy with custom cmake file.
2014-06-06 21:18:30 +02:00
f213ce5411
First import
2014-06-06 20:24:24 +02:00