69 commits

Author SHA1 Message Date
fa08abd00d Use iconv for converting html *before* passing input to tidyhtml.
Tidyhtml seems to be unable to convert from iso-8859-1 and I suspect
there will be many more failures in the future. So instead just
make sure all input to it is utf-8 and tell tidy to assume its
input is always utf-8.
2020-04-02 19:50:28 +02:00
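A minimal sketch of the conversion step this commit describes, assuming POSIX `iconv` is available. The function name `to_utf8` and the buffer size are hypothetical and not taken from the repository; the result would then be handed to tidy with its input encoding set to UTF-8.

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert `input` from `from_code` (e.g. "ISO-8859-1")
// to UTF-8 before any HTML clean-up runs.
std::string to_utf8(const std::string& input, const char* from_code) {
    iconv_t cd = iconv_open("UTF-8", from_code);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("unsupported source charset");

    std::string out;
    char buffer[4096];
    char* in_ptr = const_cast<char*>(input.data()); // iconv does not modify the input
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out_ptr = buffer;
        std::size_t out_left = sizeof(buffer);
        const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        const int err = errno;
        out.append(buffer, sizeof(buffer) - out_left);
        // E2BIG only means the output buffer filled up; loop again to continue.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("invalid or truncated input sequence");
        }
    }
    // Stateful source encodings would need one more flush call here;
    // this sketch leaves that out.
    iconv_close(cd);
    return out;
}
```

For example, `to_utf8(html, "ISO-8859-1")` would turn the single byte `0xE9` into the two-byte UTF-8 sequence `0xC3 0xA9`.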
d64a4af105 Bugfix - clone last entry, not first 2020-04-02 19:47:28 +02:00
9cd2608406 Allow comments in scraplang.
Re-use the ini comment skipper I wrote for kamokan. Comments
in scraplang work the same as in ini files: they start with #
2020-04-02 16:42:35 +02:00
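The commit reuses a parser-level comment skipper whose code is not shown here. As a rough illustration of the behaviour only, the following sketch strips `#` comments from a whole script as a pre-processing step. The function name `strip_hash_comments` is hypothetical, and the sketch assumes `#` never appears inside a quoted string.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: drop everything from '#' up to (not including) the
// end of the line. Newlines are kept so line numbers stay stable.
std::string strip_hash_comments(std::string_view text) {
    std::string out;
    out.reserve(text.size());
    bool in_comment = false;
    for (char ch : text) {
        if (ch == '\n')
            in_comment = false;      // a comment ends at the newline
        else if (ch == '#')
            in_comment = true;       // assumption: '#' is never quoted
        if (!in_comment)
            out.push_back(ch);
    }
    return out;
}
```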
5a9e4e09a4 Remove the DOCTYPE comment some html pages have since it causes xqilla to freeze 2020-04-02 16:12:37 +02:00
329ccef6ef Slightly improve code. 2020-04-02 02:28:30 +02:00
4958a83ddb Trying to get default value filling fixed.
I'm not sure it always works, but the idea is that this should
find the maximum size of all arrays resulting from a scraplang
struct query and make all entries the same length filling them
with:
1) the default value if one was given in scraplang
2) the value of the last entry in the array being filled
3) just empty values

It seems to work but I can't say I ran very extensive tests.
I'll test more in the future and fix as needed.
2020-04-02 02:28:10 +02:00
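The filling rule in this commit message can be sketched as follows. The `Column` type and the `pad_columns` function are hypothetical names for illustration, not the repository's actual code; the sketch only assumes that each struct field yields an array of strings and may carry a default.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical column: the values one query produced, plus the default
// given in scraplang (if any).
struct Column {
    std::vector<std::string> values;
    std::optional<std::string> default_value;
};

// Pad every column to the length of the longest one, choosing the filler
// in the order the commit message lists.
void pad_columns(std::vector<Column>& columns) {
    std::size_t max_len = 0;
    for (const Column& c : columns)
        max_len = std::max(max_len, c.values.size());

    for (Column& c : columns) {
        if (c.values.size() >= max_len)
            continue;
        std::string filler;                      // 3) empty value
        if (c.default_value)
            filler = *c.default_value;           // 1) explicit default
        else if (!c.values.empty())
            filler = c.values.back();            // 2) last entry in the array
        c.values.resize(max_len, filler);
    }
}
```

With columns of lengths 3, 1 (with a default) and 2 (without), the second is padded with its default and the third with a copy of its last entry.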
830ab42c49 Remove useless namespace thing 2020-04-01 17:32:12 +02:00
32f87e5185 Add using std::string to reduce clutter. 2020-04-01 03:16:11 +02:00
b79d758e8e Fix error when running multiple xpaths or something like that.
This fixes the xqilla exception being thrown:
"It is an error for the context item to be undefined when using it"
2020-04-01 03:14:26 +02:00
b536026f58 Setting a default namespace breaks queries when namespace is empty, so make it a parameter.
The new --namespace (-n) parameter defaults to http://www.w3.org/1999/xhtml
because it's easier to set it to "" on the command line than to that
long string.
2020-04-01 03:09:45 +02:00
55eb7c1fc0 XQUERY3 seems to work, go for it!! 2020-03-31 20:11:30 +02:00
bdb858de5a Ignore the stupid http://www.w3.org/1999/xhtml namespace some html have
I think this also fixes a memory leak with the xpath wide string.
2020-03-31 20:11:03 +02:00
6e35c880a4 Ask server to gzip data. 2020-02-19 17:22:08 +01:00
33866b3d6b Add --from-code option for users to force the source charset. 2020-02-19 17:21:20 +01:00
54ac44b81d Remove --dump-raw option.
It doesn't work and users can easily fetch the raw html
with wget, curl or even the browser anyways.
2020-02-19 17:13:34 +01:00
3dcbd48067 Fix hardcoded "always read from stdin" problem 2020-02-18 14:58:25 +01:00
5de2dfbe70 Allow empty default string values 2020-02-18 11:53:04 +01:00
d97cf03a34 Fix some exceptions crashing the program 2020-02-18 11:52:44 +01:00
7170347969 Update submodules 2020-02-18 10:43:07 +01:00
60d6c2cb61 Working on better scraplang support, still not there tho. 2020-02-18 10:27:52 +01:00
430886085c Use XQilla and Xerces-c from the system instead of pugixml.
I don't think this commit works or even compiles, I have too many
changes and I have to start committing from somewhere. At the same
time I don't want to make a "lots of changes here and there" kind
of commit.
2020-02-18 10:19:51 +01:00
9dba8043f1 Set more options on tidyhtml. 2018-02-19 20:11:17 +00:00
494364c22e Put the always-stdin symbol in the right place. 2018-02-15 10:39:46 +00:00
5d2c5863a5 Making ApplyBlocks work with {{variable}} sources. 2018-02-15 10:29:05 +00:00
b028e8c492 Lots of crap but it works. I'll improve code as I go. 2018-02-08 00:57:16 +00:00
a6916f6179 Read from stdin when source is - 2018-02-08 00:56:04 +00:00
1d750ad2f9 Accept - as a valid source URL. 2018-02-08 00:54:33 +00:00
76f403b3ce Extract read_all() functions into a separate file. 2018-02-08 00:54:17 +00:00
84a599e771 Fix the code so it builds & runs.
But I'm not sure the result is correct, and
some implementations are still missing.
2018-02-05 21:41:38 +00:00
79ac7534f2 markdown code formatting 2018-02-05 21:40:36 +00:00
a9ff092401 WiP - do item counting in mstch variants correctly. 2018-01-30 10:39:33 +00:00
8d2c9f9013 Update tidy and curlcpp submodules. 2018-01-18 14:06:27 +00:00
b39621ea51 Fix the build but the code is still untested. 2018-01-18 00:16:17 +00:00
6dffe9b848 Writing the code to go from tree to mustache dictionary. 2018-01-17 23:24:35 +00:00
41bb315b02 Store the same tree for ApplyBlocks too. 2018-01-17 01:29:46 +00:00
2fd4daf52c Keep FromBlock data in tree form. 2018-01-16 19:50:35 +00:00
26b912d66c Allow dots in scraplang identifiers. 2018-01-16 10:42:25 +00:00
3572803f66 Allow nesting of structs. 2018-01-16 10:36:53 +00:00
fcb25ed456 WiP reworking the AST interpreter. 2018-01-13 18:16:11 +00:00
29f8fe299e Still trying to get the AST interpreted without much luck.
I may be going down the wrong path so I'm committing and
scrapping everything to try a different approach.
2018-01-13 02:03:01 +00:00
f0e7a1d136 Trying to get scraplang implemented
Lots of changes I made on the train and had little
time to tidy up.
Use c++17 (for std::optional)
Clean up the cmake script a bit
Get rid of unused stuff
Skeleton implementation of some classes for scraplang
2018-01-10 20:25:19 +00:00
c31d317d51 Print a notice in the --help view in debug builds. 2015-10-01 15:37:09 +02:00
41b0f59039 Bump version to 0.2.1b 2015-10-01 15:32:30 +02:00
5d0a895978 Attach GPLv3. 2015-10-01 15:31:58 +02:00
bdd50d2267 Refactor xpath query into a separate function. 2015-10-01 14:18:02 +02:00
c9db1d8ba3 Wrap the unique_ptr so that dtor is called from the cpp.
This makes it unnecessary to include scrapast.hpp in whatever
cpp takes ownership of the unique_ptr. Without this tho,
the destructor of unique_ptr would force you to include scrapast.
2015-10-01 01:45:42 +02:00
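The pattern behind this commit can be sketched in a single file as below. The names `Ast` and `AstHolder` are hypothetical stand-ins, not the types from scrapast.hpp. The point is that the wrapper's destructor is only declared next to the forward declaration and defined where the type is complete, so code that owns the wrapper never needs the full definition.

```cpp
#include <memory>
#include <string>
#include <utility>

// --- what would live in the public header ---
struct Ast; // forward declaration only; the definition stays out of the header

class AstHolder {
public:
    AstHolder();
    ~AstHolder();                                // declared here, defined below
    AstHolder(AstHolder&&) noexcept;
    AstHolder& operator=(AstHolder&&) noexcept;
    const std::string& root_name() const;
private:
    std::unique_ptr<Ast> m_ast;                  // fine with an incomplete type
};

// --- what would live in the .cpp, where Ast is complete ---
struct Ast {
    std::string root{"root"};
};

AstHolder::AstHolder() : m_ast(std::make_unique<Ast>()) {}
AstHolder::~AstHolder() = default;               // unique_ptr's deleter is instantiated here
AstHolder::AstHolder(AstHolder&&) noexcept = default;
AstHolder& AstHolder::operator=(AstHolder&&) noexcept = default;
const std::string& AstHolder::root_name() const { return m_ast->root; }
```

Had the destructor been left implicit, it would be generated in every translation unit that destroys an `AstHolder`, and each of those would need the complete `Ast` type.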
dfd0ec343e Implement parsing of scraplang. 2015-10-01 01:32:27 +02:00
7dfd1f4a38 Update tidy (fixes the build with tidy as a submodule) 2015-09-30 16:33:52 +02:00
bf3b85498b Add an option to customize the user agent at runtime. 2015-09-30 01:27:28 +02:00
c947eab83f Show some readable message when being passed an unknown option. 2015-09-30 01:14:47 +02:00