35 commits

SHA1 Message Date
41bb315b02 Store the same tree for ApplyBlocks too. 2018-01-17 01:29:46 +00:00
2fd4daf52c Keep FromBlock data in tree form. 2018-01-16 19:50:35 +00:00
26b912d66c Allow dots in scraplang identifiers. 2018-01-16 10:42:25 +00:00
3572803f66 Allow nesting of structs. 2018-01-16 10:36:53 +00:00
fcb25ed456 WiP reworking the AST interpreter. 2018-01-13 18:16:11 +00:00
29f8fe299e Still trying to get the AST interpreted without much luck.
I may be going down the wrong path so I'm committing and
scrapping everything to try a different approach.
2018-01-13 02:03:01 +00:00
f0e7a1d136 Trying to get scraplang implemented
Lots of changes I made on the train and had little
time to tidy up.
Use C++17 (for std::optional)
Clean up the CMake script a bit
Get rid of unused stuff
Skeleton implementation of some classes for scraplang
2018-01-10 20:25:19 +00:00
c31d317d51 Print a notice in the --help view in debug builds. 2015-10-01 15:37:09 +02:00
41b0f59039 Bump version to 0.2.1b 2015-10-01 15:32:30 +02:00
5d0a895978 Attach GPLv3. 2015-10-01 15:31:58 +02:00
bdd50d2267 Refactor xpath query into a separate function. 2015-10-01 14:18:02 +02:00
c9db1d8ba3 Wrap the unique_ptr so that dtor is called from the cpp.
This makes it unnecessary to include scrapast.hpp in whatever
cpp takes ownership of the unique_ptr. Without this, though,
the destructor of unique_ptr would force you to include scrapast.
2015-10-01 01:45:42 +02:00
dfd0ec343e Implement parsing of scraplang. 2015-10-01 01:32:27 +02:00
7dfd1f4a38 Update tidy (fixes the build with tidy as a submodule) 2015-09-30 16:33:52 +02:00
bf3b85498b Add an option to customize the user agent at runtime. 2015-09-30 01:27:28 +02:00
c947eab83f Show some readable message when being passed an unknown option. 2015-09-30 01:14:47 +02:00
05af365c58 Move command line parsing code to a new file. 2015-09-30 01:13:48 +02:00
c304ffbbf0 Don't detect if it's a tty - only read from stdin when url is - 2015-09-29 21:04:28 +02:00
db1311839d Check fstat() instead of using isatty(). 2015-09-29 17:40:01 +02:00
c69252604c Default to static tidy-html5, but let the user configure this. 2015-09-28 23:44:11 +02:00
49aa62815a Allow piping.
At the moment you still need to specify some parameter for the url, even if
it's not needed. The good news is that the value doesn't have to
be a valid URL, so any string will do.
2015-09-28 23:37:42 +02:00
943e760ffd Add dump parameters.
Allows dumping both raw and cleaned-up HTML.
2015-09-28 23:24:23 +02:00
00150938dd Fix the html cleaning code that was not really cleaning. 2015-09-28 22:59:09 +02:00
8e517e5de9 Parse options through boost program_options. 2015-09-28 21:48:46 +02:00
4f85fa01a9 Update libtidy and curlcpp. 2015-09-28 15:30:09 +02:00
44992458ac Quick dirty fix to avoid invalid characters in scripts.
Note that with this change scripts are stripped away, so
you won't find any <script></script> pair in the html.

Also print some more detailed info about errors.
2015-03-01 05:03:12 +01:00
3bfea89568 Drop tidy from the repo and import it as submodule. 2015-03-01 03:17:47 +01:00
0e077a4930 Refactoring to put html retrieval & cleaning into a separate file.
This version should also be capable of retrieving data from https urls.
2014-06-07 22:07:13 +02:00
cb00e484fa Working example.
Invoke it with, e.g.:
./scraper http://www.dilbert.com '//div[@class='\''STR_Image'\'']/a/img/@src'
2014-06-07 20:44:43 +02:00
aa015ddd6a Working example.
Tested with:
./scraper //meta[@name]
Note that libtidy adds a meta name=generator tag.
2014-06-07 01:15:06 +02:00
c9de3d3389 Updating to tidy-html5.
See http://w3c.github.io/tidy-html5/
and https://github.com/w3c/tidy-html5.
2014-06-06 22:33:27 +02:00
e2d74fd092 Trying to use libtidy but it throws. 2014-06-06 22:22:12 +02:00
56f0736d1a Move headers into tidy/ subdirectory. 2014-06-06 21:34:01 +02:00
3182e098bb Import of libtidy with custom cmake file. 2014-06-06 21:18:30 +02:00
f213ce5411 First import 2014-06-06 20:24:24 +02:00
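
Note on c9db1d8ba3 (wrapping the unique_ptr so the dtor is called from the cpp): the idea can be sketched in a single file as below. This is only an illustration under assumptions. The names AstHandle, Ast, parse and node_count are hypothetical and are not taken from the repository, and the token-counting "parser" is a stand-in for the real one. The point it shows is that a std::unique_ptr to an incomplete type can be a class member as long as the owning class's destructor is defined where the type is complete.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>

// ---- What would live in a public header (hypothetical) ----

struct Ast;  // forward declaration only; the full definition stays private

// Thin wrapper around the unique_ptr. Its destructor is only declared here,
// so code that owns an AstHandle never needs the definition of Ast.
class AstHandle {
public:
    explicit AstHandle(std::unique_ptr<Ast> ast);
    AstHandle(AstHandle&& other) noexcept;
    AstHandle& operator=(AstHandle&& other) noexcept;
    ~AstHandle();
    const Ast& get() const;

private:
    std::unique_ptr<Ast> m_ast;
};

AstHandle parse(const std::string& source);
std::size_t node_count(const AstHandle& handle);

// ---- What would live in the .cpp (hypothetical) ----

struct Ast {
    std::size_t nodes = 0;  // placeholder for a real syntax tree
};

AstHandle::AstHandle(std::unique_ptr<Ast> ast) : m_ast(std::move(ast)) {}

// Defaulted here, after Ast is complete, so unique_ptr's deleter compiles.
AstHandle::AstHandle(AstHandle&& other) noexcept = default;
AstHandle& AstHandle::operator=(AstHandle&& other) noexcept = default;
AstHandle::~AstHandle() = default;

const Ast& AstHandle::get() const {
    assert(m_ast && "handle was moved from");
    return *m_ast;
}

// Stand-in parser: counts whitespace-separated tokens as "nodes".
AstHandle parse(const std::string& source) {
    auto ast = std::make_unique<Ast>();
    std::istringstream tokens(source);
    std::string token;
    while (tokens >> token) {
        ++ast->nodes;
    }
    return AstHandle(std::move(ast));
}

std::size_t node_count(const AstHandle& handle) {
    return handle.get().nodes;
}
```

A caller would only see the header part: it can hold and move the result of parse() and pass it to node_count() without ever including the file that defines Ast.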