7170347969
Update submodules
2020-02-18 10:43:07 +01:00
60d6c2cb61
Working on better scraplang support, still not there tho.
2020-02-18 10:27:52 +01:00
430886085c
Use XQilla and Xerces-c from the system instead of pugixml.
...
I don't think this commit works or even compiles, I have too many
changes and I have to start committing from somewhere. At the same
time I don't want to make a "lots of changes here and there" kind
of commit.
2020-02-18 10:19:51 +01:00
9dba8043f1
Set more options on tidyhtml.
2018-02-19 20:11:17 +00:00
494364c22e
Put the always-stdin symbol in the right place.
2018-02-15 10:39:46 +00:00
5d2c5863a5
Making ApplyBlocks work with {{variable}} sources.
2018-02-15 10:29:05 +00:00
b028e8c492
Lots of crap but it works. I'll improve code as I go.
2018-02-08 00:57:16 +00:00
a6916f6179
Read from stdin when source is -
2018-02-08 00:56:04 +00:00
1d750ad2f9
Accept - as a valid source URL.
2018-02-08 00:54:33 +00:00
76f403b3ce
Extract read_all() functions into a separate file.
2018-02-08 00:54:17 +00:00
84a599e771
Fix the code so it builds & runs.
...
But I'm not sure the result is correct, and
some implementations are still missing.
2018-02-05 21:41:38 +00:00
79ac7534f2
markdown code formatting
2018-02-05 21:40:36 +00:00
a9ff092401
WiP - do item counting in mstch variants correctly.
2018-01-30 10:39:33 +00:00
8d2c9f9013
Update tidy and curlcpp submodules.
2018-01-18 14:06:27 +00:00
b39621ea51
Fix the build but the code is still untested.
2018-01-18 00:16:17 +00:00
6dffe9b848
Writing the code to go from tree to mustache dictionary.
2018-01-17 23:24:35 +00:00
41bb315b02
Store the same tree for ApplyBlocks too.
2018-01-17 01:29:46 +00:00
2fd4daf52c
Keep FromBlock data in tree form.
2018-01-16 19:50:35 +00:00
26b912d66c
Allow dots in scraplang identifiers.
2018-01-16 10:42:25 +00:00
3572803f66
Allow nesting of structs.
2018-01-16 10:36:53 +00:00
fcb25ed456
WiP reworking the AST interpreter.
2018-01-13 18:16:11 +00:00
29f8fe299e
Still trying to get the AST interpreted without much luck.
...
I may be going down the wrong path so I'm committing and
scrapping everything to try a different approach.
2018-01-13 02:03:01 +00:00
f0e7a1d136
Trying to get scraplang implemented
...
Lots of changes I made on the train, with little
time to make them tidy.
Use c++17 (for std::optional)
Clean up the cmake script a bit
Get rid of unused stuff
Skeleton implementation of some classes for scraplang
2018-01-10 20:25:19 +00:00
c31d317d51
Print a notice in the --help view in debug builds.
2015-10-01 15:37:09 +02:00
41b0f59039
Bump version to 0.2.1b
2015-10-01 15:32:30 +02:00
5d0a895978
Attach GPLv3.
2015-10-01 15:31:58 +02:00
bdd50d2267
Refactor xpath query into a separate function.
2015-10-01 14:18:02 +02:00
c9db1d8ba3
Wrap the unique_ptr so that dtor is called from the cpp.
...
This makes it unnecessary to include scrapast.hpp in whatever
cpp takes ownership of the unique_ptr. Without this tho,
the destructor of unique_ptr would force you to include scrapast.
2015-10-01 01:45:42 +02:00
dfd0ec343e
Implement parsing of scraplang.
2015-10-01 01:32:27 +02:00
7dfd1f4a38
Update tidy (fixes the build with tidy as a submodule)
2015-09-30 16:33:52 +02:00
bf3b85498b
Add an option to customize the user agent at runtime.
2015-09-30 01:27:28 +02:00
c947eab83f
Show some readable message when being passed an unknown option.
2015-09-30 01:14:47 +02:00
05af365c58
Move command line parsing code to a new file.
2015-09-30 01:13:48 +02:00
c304ffbbf0
Don't detect if it's a tty - only read from stdin when url is -
2015-09-29 21:04:28 +02:00
db1311839d
Check fstats instead of using isatty().
2015-09-29 17:40:01 +02:00
c69252604c
Default to static tidy-html5, but let the user configure this.
2015-09-28 23:44:11 +02:00
49aa62815a
Allow piping.
...
Atm you still need to specify some parameter for the url, even if
it's not needed. The good news is that the value doesn't have to
be a valid URL, so any string will do.
2015-09-28 23:37:42 +02:00
943e760ffd
Add dump parameters.
...
Allows dumping both raw and cleaned-up html.
2015-09-28 23:24:23 +02:00
00150938dd
Fix the html cleaning code that was not really cleaning.
2015-09-28 22:59:09 +02:00
8e517e5de9
Parse options through boost program_options.
2015-09-28 21:48:46 +02:00
4f85fa01a9
Update libtidy and curlcpp.
2015-09-28 15:30:09 +02:00
44992458ac
Quick dirty fix to avoid invalid characters in scripts.
...
Note that with this change scripts are stripped away, so
you won't find any <script></script> pair in the html.
Also print some more detailed info about errors.
2015-03-01 05:03:12 +01:00
3bfea89568
Drop tidy from the repo and import it as submodule.
2015-03-01 03:17:47 +01:00
0e077a4930
Refactoring to put html retrieval & cleaning into a separate file.
...
This version should also be capable of retrieving data from https urls.
2014-06-07 22:07:13 +02:00
cb00e484fa
Working example.
...
Invoke it with, e.g.:
./scraper http://www.dilbert.com '//div[@class='\''STR_Image'\'']/a/img/@src'
2014-06-07 20:44:43 +02:00
aa015ddd6a
Working example.
...
Tested with:
./scraper //meta[@name]
Note that libtidy adds a meta name=generator tag.
2014-06-07 01:15:06 +02:00
c9de3d3389
Updating to tidy-html5.
...
See http://w3c.github.io/tidy-html5/
and https://github.com/w3c/tidy-html5 .
2014-06-06 22:33:27 +02:00
e2d74fd092
Trying to use libtidy but it throws.
2014-06-06 22:22:12 +02:00
56f0736d1a
Move headers into tidy/ subdirectory.
2014-06-06 21:34:01 +02:00
3182e098bb
Import of libtidy with custom cmake file.
2014-06-06 21:18:30 +02:00