fa08abd00d
Use iconv for converting HTML *before* passing input to tidyhtml.
...
Tidyhtml seems unable to convert from ISO-8859-1, and I suspect
there will be many more failures in the future. So instead, just
make sure all input to it is UTF-8 and tell tidy to assume its
input is always UTF-8.
2020-04-02 19:50:28 +02:00
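The conversion step described in fa08abd00d could look roughly like the sketch below. It is not the code from that commit: `to_utf8` is a hypothetical helper name, and the sketch assumes a POSIX iconv (as in glibc) whose `iconv()` takes a non-const `char**` input pointer. It transcodes a whole buffer to UTF-8 in fixed-size chunks.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the repository): transcode `input`
// from `from_charset` (e.g. "ISO-8859-1") to UTF-8 using iconv.
std::string to_utf8(const std::string& input, const char* from_charset) {
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == reinterpret_cast<iconv_t>(-1)) {
        throw std::runtime_error("iconv_open failed");
    }

    std::string output;
    char buffer[4096];
    // POSIX iconv wants a mutable pointer, although it only reads the input.
    char* in_ptr = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out_ptr = buffer;
        std::size_t out_left = sizeof(buffer);
        const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        const int err = errno;  // save before any other call can change it

        // Keep whatever was converted in this round.
        output.append(buffer, sizeof(buffer) - out_left);

        // E2BIG only means the chunk buffer filled up, so loop again.
        // Anything else (EILSEQ, EINVAL) is a real conversion error.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv conversion failed");
        }
    }

    iconv_close(cd);
    return output;
}
```

With something like this in place, the fetched page would be passed through `to_utf8(html, "ISO-8859-1")` (or whatever charset the server reported) before parsing. The tidy document can then be told to expect UTF-8, for instance with libtidy's `tidySetCharEncoding(doc, "utf8")`. Whether the commit uses that exact call is an assumption.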
54ac44b81d
Remove --dump-raw option.
...
It doesn't work, and users can easily fetch the raw HTML
with wget, curl or even the browser anyway.
2020-02-19 17:13:34 +01:00
60d6c2cb61
Working on better scraplang support, still not there though.
2020-02-18 10:27:52 +01:00
430886085c
Use XQilla and Xerces-c from the system instead of pugixml.
...
I don't think this commit works or even compiles; I have too many
changes and I have to start committing from somewhere. At the same
time, I don't want to make a "lots of changes here and there" kind
of commit.
2020-02-18 10:19:51 +01:00
76f403b3ce
Extract read_all() functions into a separate file.
2018-02-08 00:54:17 +00:00
6dffe9b848
Writing the code to go from tree to mustache dictionary.
2018-01-17 23:24:35 +00:00
fcb25ed456
WiP reworking the AST interpreter.
2018-01-13 18:16:11 +00:00
f0e7a1d136
Trying to get scraplang implemented
...
Lots of changes that I made on the train and had little
time to tidy up.
Use C++17 (for std::optional)
Clean up the cmake script a bit
Get rid of unused stuff
Skeleton implementation of some classes for scraplang
2018-01-10 20:25:19 +00:00
41b0f59039
Bump version to 0.2.1b
2015-10-01 15:32:30 +02:00
bdd50d2267
Refactor xpath query into a separate function.
2015-10-01 14:18:02 +02:00
dfd0ec343e
Implement parsing of scraplang.
2015-10-01 01:32:27 +02:00
05af365c58
Move command line parsing code to a new file.
2015-09-30 01:13:48 +02:00
c69252604c
Default to static tidy-html5, but let the user configure this.
2015-09-28 23:44:11 +02:00
8e517e5de9
Parse options through boost program_options.
2015-09-28 21:48:46 +02:00
4f85fa01a9
Update libtidy and curlcpp.
2015-09-28 15:30:09 +02:00
3bfea89568
Drop tidy from the repo and import it as submodule.
2015-03-01 03:17:47 +01:00
0e077a4930
Refactoring to put HTML retrieval & cleaning into a separate file.
...
This version should also be capable of retrieving data from HTTPS URLs.
2014-06-07 22:07:13 +02:00
cb00e484fa
Working example.
...
Invoke it with, e.g.:
./scraper http://www.dilbert.com '//div[@class='\''STR_Image'\'']/a/img/@src'
2014-06-07 20:44:43 +02:00
aa015ddd6a
Working example.
...
Tested with:
./scraper //meta[@name]
Note that libtidy adds a meta name=generator tag.
2014-06-07 01:15:06 +02:00
e2d74fd092
Trying to use libtidy but it throws.
2014-06-06 22:22:12 +02:00
f213ce5411
First import
2014-06-06 20:24:24 +02:00