21 commits

Author SHA1 Message Date
fa08abd00d Use iconv for converting html *before* passing input to tidyhtml.
Tidyhtml seems to be unable to convert from iso-8859-1 and I suspect
there will be many more failures in the future. So instead just
make sure all input to it is utf-8 and tell tidy to assume its
input is always utf-8.
2020-04-02 19:50:28 +02:00
54ac44b81d Remove --dump-raw option.
It doesn't work, and users can easily fetch the raw HTML
with wget, curl or even the browser anyway.
2020-02-19 17:13:34 +01:00
60d6c2cb61 Working on better scraplang support, still not there though. 2020-02-18 10:27:52 +01:00
430886085c Use XQilla and Xerces-c from the system instead of pugixml.
I don't think this commit works or even compiles, I have too many
changes and I have to start committing from somewhere. At the same
time I don't want to make a "lots of changes here and there" kind
of commit.
2020-02-18 10:19:51 +01:00
76f403b3ce Extract read_all() functions into a separate file. 2018-02-08 00:54:17 +00:00
6dffe9b848 Writing the code to go from tree to mustache dictionary. 2018-01-17 23:24:35 +00:00
fcb25ed456 WiP reworking the AST interpreter. 2018-01-13 18:16:11 +00:00
f0e7a1d136 Trying to get scraplang implemented
Lots of changes I made on the train, with little
time to tidy them up.
Use c++17 (for std::optional)
Clean up the cmake script a bit
Get rid of unused stuff
Skeleton implementation of some classes for scraplang
2018-01-10 20:25:19 +00:00
41b0f59039 Bump version to 0.2.1b 2015-10-01 15:32:30 +02:00
bdd50d2267 Refactor xpath query into a separate function. 2015-10-01 14:18:02 +02:00
dfd0ec343e Implement parsing of scraplang. 2015-10-01 01:32:27 +02:00
05af365c58 Move command line parsing code to a new file. 2015-09-30 01:13:48 +02:00
c69252604c Default to static tidy-html5, but let the user configure this. 2015-09-28 23:44:11 +02:00
8e517e5de9 Parse options through boost program_options. 2015-09-28 21:48:46 +02:00
4f85fa01a9 Update libtidy and curlcpp. 2015-09-28 15:30:09 +02:00
3bfea89568 Drop tidy from the repo and import it as submodule. 2015-03-01 03:17:47 +01:00
0e077a4930 Refactoring to put html retrieval & cleaning into a separate file.
This version should also be capable of retrieving data from https urls.
2014-06-07 22:07:13 +02:00
cb00e484fa Working example.
Invoke it with, e.g.:
./scraper http://www.dilbert.com '//div[@class='\''STR_Image'\'']/a/img/@src'
2014-06-07 20:44:43 +02:00
aa015ddd6a Working example.
Tested with:
./scraper //meta[@name]
Note that libtidy adds a meta name=generator tag.
2014-06-07 01:15:06 +02:00
e2d74fd092 Trying to use libtidy but it throws. 2014-06-06 22:22:12 +02:00
f213ce5411 First import 2014-06-06 20:24:24 +02:00
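
Commit fa08abd00d describes converting input to utf-8 with iconv before handing it to tidy. The following is a minimal sketch of that conversion step only, assuming the POSIX iconv interface as provided by glibc. The helper name to_utf8 and the 4096-byte buffer size are hypothetical and not taken from the repository, and the tidy step itself is left out.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the repository): convert `input` from
// `from_charset` (e.g. "ISO-8859-1") to UTF-8 using POSIX iconv.
std::string to_utf8(const std::string& input, const char* from_charset) {
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed");
    }

    std::string output;
    char buffer[4096];  // arbitrary chunk size, an assumption of this sketch
    // iconv takes a non-const char** for its input but does not modify the bytes.
    char* in_ptr = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out_ptr = buffer;
        std::size_t out_left = sizeof(buffer);
        const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        const int err = errno;  // save before any other call can change it
        // Keep whatever was converted in this round.
        output.append(buffer, sizeof(buffer) - out_left);
        // E2BIG only means the output buffer filled up, so loop again.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }

    iconv_close(cd);
    return output;
}
```

For example, to_utf8("caf\xE9", "ISO-8859-1") yields the UTF-8 bytes "caf\xC3\xA9". The result could then be passed to tidy with its input encoding fixed to utf-8, as the commit message describes.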