duckscraper

Author	SHA1	Message	Date
King_DuckZ	fa08abd00d	Use iconv for converting html before passing input to tidyhtml. Tidyhtml seems to be unable to convert from iso-8859-1 and I suspect there will be many more failures in the future. So instead just make sure all input to it is utf-8 and tell tidy to assume its input is always utf-8.	2020-04-02 19:50:28 +02:00
King_DuckZ	54ac44b81d	Remove --dump-raw option. It doesn't work and users can easily fetch the raw html with wget, curl or even the browser anyways.	2020-02-19 17:13:34 +01:00
King_DuckZ	60d6c2cb61	Working on better scraplang support, still not there tho.	2020-02-18 10:27:52 +01:00
King_DuckZ	430886085c	Use XQilla and Xerces-c from the system instead of pugixml. I don't think this commit works or even compiles, I have too many changes and I have to start committing from somewhere. At the same time I don't want to make a "lots of changes here and there" kind of commit.	2020-02-18 10:19:51 +01:00
King_DuckZ	76f403b3ce	Extract read_all() functions into a separate file.	2018-02-08 00:54:17 +00:00
King_DuckZ	6dffe9b848	Writing the code to go from tree to mustache dictionary.	2018-01-17 23:24:35 +00:00
King_DuckZ	fcb25ed456	WiP reworking the AST interpreter.	2018-01-13 18:16:11 +00:00
King_DuckZ	f0e7a1d136	Trying to get scraplang implemented. Lots of changes I made on the train and had little time to tidy up. Use c++17 (for std::optional). Clean up the cmake script a bit. Get rid of unused stuff. Skeleton implementation of some classes for scraplang.	2018-01-10 20:25:19 +00:00
King_DuckZ	41b0f59039	Bump version to 0.2.1b	2015-10-01 15:32:30 +02:00
King_DuckZ	bdd50d2267	Refactor xpath query into a separate function.	2015-10-01 14:18:02 +02:00
King_DuckZ	dfd0ec343e	Implement parsing of scraplang.	2015-10-01 01:32:27 +02:00
King_DuckZ	05af365c58	Move command line parsing code to a new file.	2015-09-30 01:13:48 +02:00
King_DuckZ	c69252604c	Default to static tidy-html5, but let the user configure this.	2015-09-28 23:44:11 +02:00
King_DuckZ	8e517e5de9	Parse options through boost program_options.	2015-09-28 21:48:46 +02:00
King_DuckZ	4f85fa01a9	Update libtidy and curlcpp.	2015-09-28 15:30:09 +02:00
King_DuckZ	3bfea89568	Drop tidy from the repo and import it as submodule.	2015-03-01 03:17:47 +01:00
King_DuckZ	0e077a4930	Refactoring to put html retrieval & cleaning into a separate file. This version should also be capable of retrieving data from https urls.	2014-06-07 22:07:13 +02:00
King_DuckZ	cb00e484fa	Working example. Invoke it with, e.g.: ./scraper http://www.dilbert.com '//div[@class='\''STR_Image'\'']/a/img/@src'	2014-06-07 20:44:43 +02:00
King_DuckZ	aa015ddd6a	Working example. Tested with: ./scraper //meta[@name] (note that libtidy adds a meta name=generator tag).	2014-06-07 01:15:06 +02:00
King_DuckZ	e2d74fd092	Trying to use libtidy but it throws.	2014-06-06 22:22:12 +02:00
King_DuckZ	f213ce5411	First import	2014-06-06 20:24:24 +02:00

21 commits