bdd50d2267
Refactor xpath query into a separate function.
2015-10-01 14:18:02 +02:00
c9db1d8ba3
Wrap the unique_ptr so that dtor is called from the cpp.
...
This makes it unnecessary to include scrapast.hpp in whatever
cpp takes ownership of the unique_ptr. Without this, though,
the destructor of unique_ptr would force you to include scrapast.hpp.
2015-10-01 01:45:42 +02:00
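A minimal sketch of the idea behind the commit above, assuming a wrapper class whose destructor is only declared in the header and defined in the cpp. The names Ast and AstHandle are hypothetical and not taken from the repository; the two halves are shown in one file only to keep the example self-contained.

```cpp
#include <memory>
#include <string>
#include <utility>

// ---- what would live in the header (hypothetical names) ----
struct Ast; // forward declaration only: the full type is not visible here

class AstHandle {
public:
    explicit AstHandle(std::string source);
    ~AstHandle(); // declared here, defined where Ast is a complete type
    AstHandle(AstHandle&&) noexcept;
    AstHandle& operator=(AstHandle&&) noexcept;
    const std::string& source() const;

private:
    std::unique_ptr<Ast> m_ast; // fine with an incomplete type as a member
};

// ---- what would live in the cpp, where the full definition is included ----
struct Ast {
    std::string source;
};

AstHandle::AstHandle(std::string source) : m_ast(new Ast{std::move(source)}) {}

// The unique_ptr deleter is instantiated here, so only this file needs
// to see the complete definition of Ast.
AstHandle::~AstHandle() = default;
AstHandle::AstHandle(AstHandle&&) noexcept = default;
AstHandle& AstHandle::operator=(AstHandle&&) noexcept = default;

const std::string& AstHandle::source() const { return m_ast->source; }
```

Code that takes ownership of an AstHandle only needs the header part, because it never instantiates the unique_ptr destructor itself.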
dfd0ec343e
Implement parsing of scraplang.
2015-10-01 01:32:27 +02:00
7dfd1f4a38
Update tidy (fixes the build with tidy as a submodule)
2015-09-30 16:33:52 +02:00
bf3b85498b
Add an option to customize the user agent at runtime.
2015-09-30 01:27:28 +02:00
c947eab83f
Show some readable message when being passed an unknown option.
2015-09-30 01:14:47 +02:00
05af365c58
Move command line parsing code to a new file.
2015-09-30 01:13:48 +02:00
c304ffbbf0
Don't detect if it's a tty - only read from stdin when url is -
2015-09-29 21:04:28 +02:00
db1311839d
Check fstat() instead of using isatty().
2015-09-29 17:40:01 +02:00
c69252604c
Default to static tidy-html5, but let the user configure this.
2015-09-28 23:44:11 +02:00
49aa62815a
Allow piping.
...
At the moment you still need to specify some parameter for the url, even if
it's not needed. The good news is that the value doesn't have to
be a valid URL, so any string will do.
2015-09-28 23:37:42 +02:00
943e760ffd
Add dump parameters.
...
Allows dumping both raw and cleaned-up html.
2015-09-28 23:24:23 +02:00
00150938dd
Fix the html cleaning code that was not really cleaning.
2015-09-28 22:59:09 +02:00
8e517e5de9
Parse options through boost program_options.
2015-09-28 21:48:46 +02:00
4f85fa01a9
Update libtidy and curlcpp.
2015-09-28 15:30:09 +02:00
44992458ac
Quick and dirty fix to avoid invalid characters in scripts.
...
Note that with this change scripts are stripped away, so
you won't find any <script></script> pair in the html.
Also print some more detailed info about errors.
2015-03-01 05:03:12 +01:00
3bfea89568
Drop tidy from the repo and import it as submodule.
2015-03-01 03:17:47 +01:00
0e077a4930
Refactoring to put html retrieval & cleaning into a separate file.
...
This version should also be capable of retrieving data from https urls.
2014-06-07 22:07:13 +02:00
cb00e484fa
Working example.
...
Invoke it with e.g.:
./scraper http://www.dilbert.com '//div[@class='\''STR_Image'\'']/a/img/@src'
2014-06-07 20:44:43 +02:00
aa015ddd6a
Working example.
...
Tested with:
./scraper //meta[@name]
Note that libtidy adds a meta name=generator tag.
2014-06-07 01:15:06 +02:00
c9de3d3389
Updating to tidy-html5.
...
See http://w3c.github.io/tidy-html5/
and https://github.com/w3c/tidy-html5 .
2014-06-06 22:33:27 +02:00
e2d74fd092
Trying to use libtidy but it throws.
2014-06-06 22:22:12 +02:00
56f0736d1a
Move headers into tidy/ subdirectory.
2014-06-06 21:34:01 +02:00
3182e098bb
Import of libtidy with custom cmake file.
2014-06-06 21:18:30 +02:00
f213ce5411
First import
2014-06-06 20:24:24 +02:00