King_DuckZ
3572803f66
Allow nesting of structs.
2018-01-16 10:36:53 +00:00
King_DuckZ
fcb25ed456
WiP reworking the AST interpreter.
2018-01-13 18:16:11 +00:00
King_DuckZ
29f8fe299e
Still trying to get the AST interpreted without much luck.
...
I may be going down the wrong path so I'm committing and
scrapping everything to try a different approach.
2018-01-13 02:03:01 +00:00
King_DuckZ
f0e7a1d136
Trying to get scraplang implemented
...
Lots of changes I made on the train, with little
time to tidy them up.
Use c++17 (for std::optional)
Clean up the cmake script a bit
Get rid of unused stuff
Skeleton implementation of some classes for scraplang
2018-01-10 20:25:19 +00:00
King_DuckZ
c31d317d51
Print a notice in the --help view in debug builds.
2015-10-01 15:37:09 +02:00
King_DuckZ
41b0f59039
Bump version to 0.2.1b
2015-10-01 15:32:30 +02:00
King_DuckZ
5d0a895978
Attach GPLv3.
2015-10-01 15:31:58 +02:00
King_DuckZ
bdd50d2267
Refactor xpath query into a separate function.
2015-10-01 14:18:02 +02:00
King_DuckZ
c9db1d8ba3
Wrap the unique_ptr so that dtor is called from the cpp.
...
This makes it unnecessary to include scrapast.hpp in whatever
cpp takes ownership of the unique_ptr. Without this, though,
the destructor of unique_ptr would force you to include scrapast.hpp.
2015-10-01 01:45:42 +02:00
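A minimal sketch of the idea behind the unique_ptr wrapper, squeezed into one file for brevity. The names ScrapNode, AstHandle and client_reads_value are hypothetical and not taken from the repository; the point is only that the wrapper's destructor is declared where the pointee is incomplete and defined where it is complete.

```cpp
#include <cassert>
#include <memory>

// --- what would live in the wrapper's header (hypothetical names) ---
struct ScrapNode;                      // forward declaration only: incomplete type

class AstHandle {
public:
    explicit AstHandle(int value);
    AstHandle(AstHandle&&) noexcept;
    ~AstHandle();                      // declared here, defined where ScrapNode is complete
    int value() const;
private:
    std::unique_ptr<ScrapNode> m_node;
};

// --- a client that owns the handle without seeing ScrapNode's definition ---
int client_reads_value() {
    AstHandle h(42);                   // destroying h calls ~AstHandle(), which lives in the "cpp"
    return h.value();
}

// --- what would live in the wrapper's cpp, next to the full AST header ---
struct ScrapNode { int value; };

AstHandle::AstHandle(int value) : m_node(std::make_unique<ScrapNode>(ScrapNode{value})) {}
AstHandle::AstHandle(AstHandle&&) noexcept = default;
AstHandle::~AstHandle() = default;     // unique_ptr's deleter is instantiated here
int AstHandle::value() const {
    assert(m_node && "handle was moved from");
    return m_node->value;
}
```

Had the client held a std::unique_ptr<ScrapNode> directly, its destructor would be instantiated in the client's translation unit and would require the complete type there.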
King_DuckZ
dfd0ec343e
Implement parsing of scraplang.
2015-10-01 01:32:27 +02:00
King_DuckZ
7dfd1f4a38
Update tidy (fixes the build with tidy as a submodule)
2015-09-30 16:33:52 +02:00
King_DuckZ
bf3b85498b
Add an option to customize the user agent at runtime.
2015-09-30 01:27:28 +02:00
King_DuckZ
c947eab83f
Show some readable message when being passed an unknown option.
2015-09-30 01:14:47 +02:00
King_DuckZ
05af365c58
Move command line parsing code to a new file.
2015-09-30 01:13:48 +02:00
King_DuckZ
c304ffbbf0
Don't detect if it's a tty - only read from stdin when url is -
2015-09-29 21:04:28 +02:00
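A rough sketch of the "url is -" convention described above, assuming a hypothetical load_html helper; the real fetch is replaced by a placeholder string, and none of these function names come from the repository.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <string_view>

// True only for the conventional "-" placeholder; no tty detection involved.
bool wants_stdin(std::string_view url) {
    return url == "-";
}

// Read the whole stream into a string.
std::string read_all(std::istream& in) {
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical loader: reads from `in` when url is "-", otherwise returns a
// stand-in for whatever the network fetch would produce.
std::string load_html(std::string_view url, std::istream& in) {
    if (wants_stdin(url)) {
        return read_all(in);
    }
    return "<!-- would fetch " + std::string(url) + " -->";
}
```

In a real program the stream argument would be std::cin; taking it as a parameter just keeps the sketch easy to exercise.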
King_DuckZ
db1311839d
Check fstat() instead of using isatty().
2015-09-29 17:40:01 +02:00
King_DuckZ
c69252604c
Default to static tidy-html5, but let the user configure this.
2015-09-28 23:44:11 +02:00
King_DuckZ
49aa62815a
Allow piping.
...
At the moment you still need to specify some parameter for the url, even if
it's not needed. The good news is that the value doesn't have to
be a valid URL, so any string will do.
2015-09-28 23:37:42 +02:00
King_DuckZ
943e760ffd
Add dump parameters.
...
Allows dumping both raw and cleaned-up html.
2015-09-28 23:24:23 +02:00
King_DuckZ
00150938dd
Fix the html cleaning code that was not really cleaning.
2015-09-28 22:59:09 +02:00
King_DuckZ
8e517e5de9
Parse options through boost program_options.
2015-09-28 21:48:46 +02:00
King_DuckZ
4f85fa01a9
Update libtidy and curlcpp.
2015-09-28 15:30:09 +02:00
King_DuckZ
44992458ac
Quick dirty fix to avoid invalid characters in scripts.
...
Note that with this change scripts are stripped away, so
you won't find any <script></script> pair in the html.
Also print some more detailed info about errors.
2015-03-01 05:03:12 +01:00
King_DuckZ
3bfea89568
Drop tidy from the repo and import it as submodule.
2015-03-01 03:17:47 +01:00
King_DuckZ
0e077a4930
Refactoring to put html retrieval & cleaning into a separate file.
...
This version should also be capable of retrieving data from https urls.
2014-06-07 22:07:13 +02:00
King_DuckZ
cb00e484fa
Working example.
...
Invoke it with e.g.:
./scraper http://www.dilbert.com '//div[@class='\''STR_Image'\'']/a/img/@src'
2014-06-07 20:44:43 +02:00
King_DuckZ
aa015ddd6a
Working example.
...
Tested with:
./scraper //meta[@name]
Note that libtidy adds a meta name=generator tag.
2014-06-07 01:15:06 +02:00
King_DuckZ
c9de3d3389
Updating to tidy-html5.
...
See http://w3c.github.io/tidy-html5/
and https://github.com/w3c/tidy-html5 .
2014-06-06 22:33:27 +02:00
King_DuckZ
e2d74fd092
Trying to use libtidy but it throws.
2014-06-06 22:22:12 +02:00
King_DuckZ
56f0736d1a
Move headers into tidy/ subdirectory.
2014-06-06 21:34:01 +02:00
King_DuckZ
3182e098bb
Import of libtidy with custom cmake file.
2014-06-06 21:18:30 +02:00
King_DuckZ
f213ce5411
First import
2014-06-06 20:24:24 +02:00