fa08abd00d
Use iconv for converting html *before* passing input to tidyhtml.
...
Tidyhtml seems to be unable to convert from iso-8859-1 and I suspect
there will be many more failures in the future. So instead just
make sure all input to it is utf-8 and tell tidy to assume its
input is always utf-8.
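
A minimal sketch of the "convert first, then hand utf-8 to tidy" step, using the POSIX iconv(3) API. The helper name to_utf8 and the buffer size are assumptions made for illustration, not code from this commit, and the tidy call itself is left out.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from the charset `from_code`
// (for example "ISO-8859-1") to UTF-8, so that the HTML cleaner only
// ever sees UTF-8.
std::string to_utf8(const std::string& input, const char* from_code) {
    iconv_t cd = iconv_open("UTF-8", from_code);
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open failed");

    std::string output;
    char buffer[4096];
    // iconv() takes a non-const pointer but does not modify the input bytes.
    char* in = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out = buffer;
        std::size_t out_left = sizeof(buffer);
        const std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
        const int err = errno;  // save before anything else can change it
        output.append(buffer, sizeof(buffer) - out_left);
        // E2BIG only means the output buffer filled up: loop and continue.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed on invalid or truncated input");
        }
    }
    iconv_close(cd);
    return output;
}
```

With this in place, the result of to_utf8(page, "ISO-8859-1") would be what gets passed on to tidy. UTF-8 is a stateless target encoding, so the sketch skips the final flush call that stateful encodings would need.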
2020-04-02 19:50:28 +02:00
d64a4af105
Bugfix - clone last entry, not first
2020-04-02 19:47:28 +02:00
9cd2608406
Allow comments in scraplang.
...
Re-use the ini comment skipper I wrote for kamokan. Comments in
scraplang work the same as in ini files: use #
2020-04-02 16:42:35 +02:00
5a9e4e09a4
Remove the DOCTYPE comment some html pages have, since it causes xqilla to freeze
2020-04-02 16:12:37 +02:00
329ccef6ef
Slightly improve code.
2020-04-02 02:28:30 +02:00
4958a83ddb
Trying to get default value filling fixed.
...
I'm not sure it always works, but the idea is that this should
find the maximum size of all arrays resulting from a scraplang
struct query and make all entries the same length filling them
with:
1) the default value if one was given in scraplang
2) the value of the last entry in the array being filled
3) just empty values
It seems to work but I can't say I ran very extensive tests.
I'll test more in the future and fix as needed.
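
The filling idea described above could be sketched roughly as follows. The Column type and pad_columns function are hypothetical names, and reading the numbered list as a priority order (default first, then last entry, then empty) is an assumption; the real code may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for one field of a scraplang struct query result.
struct Column {
    std::vector<std::string> values;      // what the query returned for this field
    std::optional<std::string> fallback;  // default value given in scraplang, if any
};

// Pads every column to the length of the longest one.
void pad_columns(std::vector<Column>& columns) {
    // Find the maximum size of all arrays.
    std::size_t longest = 0;
    for (const Column& col : columns)
        longest = std::max(longest, col.values.size());

    for (Column& col : columns) {
        std::string filler;                 // 3) just an empty value
        if (col.fallback)
            filler = *col.fallback;         // 1) the default from scraplang
        else if (!col.values.empty())
            filler = col.values.back();     // 2) clone the last entry
        // resize() never shrinks here because longest >= values.size().
        col.values.resize(longest, filler);
    }
}
```

The filler is copied into a separate string before resize() so it does not alias an element of the vector being grown.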
2020-04-02 02:28:10 +02:00
830ab42c49
Remove useless namespace thing
2020-04-01 17:32:12 +02:00
32f87e5185
Add using std::string to reduce clutter.
2020-04-01 03:16:11 +02:00
b79d758e8e
Fix error when running multiple xpaths or something like that.
...
This fixes the xqilla exception being thrown:
"It is an error for the context item to be undefined when using it"
2020-04-01 03:14:26 +02:00
b536026f58
Setting a default namespace breaks queries when namespace is empty, so make it a parameter.
...
The new --namespace (-n) parameter defaults to http://www.w3.org/1999/xhtml
because it's easier to set it to "" on the command line than to that
long string.
2020-04-01 03:09:45 +02:00
55eb7c1fc0
XQUERY3 seems to work, go for it!!
2020-03-31 20:11:30 +02:00
bdb858de5a
Ignore the stupid http://www.w3.org/1999/xhtml namespace some html pages have
...
I think this also fixes a memory leak with the xpath wide string.
2020-03-31 20:11:03 +02:00
6e35c880a4
Ask server to gzip data.
2020-02-19 17:22:08 +01:00
33866b3d6b
Add --from-code option for users to force the source charset.
2020-02-19 17:21:20 +01:00
54ac44b81d
Remove --dump-raw option.
...
It doesn't work and users can easily fetch the raw html
with wget, curl or even the browser anyways.
2020-02-19 17:13:34 +01:00
3dcbd48067
Fix hardcoded "always read from stdin" problem
2020-02-18 14:58:25 +01:00
5de2dfbe70
Allow empty default string values
2020-02-18 11:53:04 +01:00
d97cf03a34
Fix some exceptions crashing the program
2020-02-18 11:52:44 +01:00
7170347969
Update submodules
2020-02-18 10:43:07 +01:00
60d6c2cb61
Working on better scraplang support, still not there tho.
2020-02-18 10:27:52 +01:00
430886085c
Use XQilla and Xerces-c from the system instead of pugixml.
...
I don't think this commit works or even compiles, I have too many
changes and I have to start committing from somewhere. At the same
time I don't want to make a "lots of changes here and there" kind
of commit.
2020-02-18 10:19:51 +01:00
9dba8043f1
Set more options on tidyhtml.
2018-02-19 20:11:17 +00:00
494364c22e
Put the always-stdin symbol in the right place.
2018-02-15 10:39:46 +00:00
5d2c5863a5
Making ApplyBlocks work with {{variable}} sources.
2018-02-15 10:29:05 +00:00
b028e8c492
Lots of crap but it works. I'll improve code as I go.
2018-02-08 00:57:16 +00:00
a6916f6179
Read from stdin when source is -
2018-02-08 00:56:04 +00:00
1d750ad2f9
Accept - as a valid source URL.
2018-02-08 00:54:33 +00:00
76f403b3ce
Extract read_all() functions into a separate file.
2018-02-08 00:54:17 +00:00
84a599e771
Fix the code so it builds & runs.
...
But I'm not sure the result is correct, and
some implementations are still missing.
2018-02-05 21:41:38 +00:00
79ac7534f2
markdown code formatting
2018-02-05 21:40:36 +00:00
a9ff092401
WiP - do item counting in mstch variants correctly.
2018-01-30 10:39:33 +00:00
8d2c9f9013
Update tidy and curlcpp submodules.
2018-01-18 14:06:27 +00:00
b39621ea51
Fix the build but the code is still untested.
2018-01-18 00:16:17 +00:00
6dffe9b848
Writing the code to go from tree to mustache dictionary.
2018-01-17 23:24:35 +00:00
41bb315b02
Store the same tree for ApplyBlocks too.
2018-01-17 01:29:46 +00:00
2fd4daf52c
Keep FromBlock data in tree form.
2018-01-16 19:50:35 +00:00
26b912d66c
Allow dots in scraplang identifiers.
2018-01-16 10:42:25 +00:00
3572803f66
Allow nesting of structs.
2018-01-16 10:36:53 +00:00
fcb25ed456
WiP reworking the AST interpreter.
2018-01-13 18:16:11 +00:00
29f8fe299e
Still trying to get the AST interpreted without much luck.
...
I may be going down the wrong path so I'm committing and
scrapping everything to try a different approach.
2018-01-13 02:03:01 +00:00
f0e7a1d136
Trying to get scraplang implemented
...
Lots of changes I made on the train and had little
time to tidy up.
Use c++17 (for std::optional)
Clean up the cmake script a bit
Get rid of unused stuff
Skeleton implementation of some classes for scraplang
2018-01-10 20:25:19 +00:00
c31d317d51
Print a notice in the --help view in debug builds.
2015-10-01 15:37:09 +02:00
41b0f59039
Bump version to 0.2.1b
2015-10-01 15:32:30 +02:00
5d0a895978
Attach GPLv3.
2015-10-01 15:31:58 +02:00
bdd50d2267
Refactor xpath query into a separate function.
2015-10-01 14:18:02 +02:00
c9db1d8ba3
Wrap the unique_ptr so that dtor is called from the cpp.
...
This makes it unnecessary to include scrapast.hpp in whatever
cpp takes ownership of the unique_ptr. Without this tho,
the destructor of unique_ptr would force you to include scrapast.
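
A single-file sketch of the general technique, with the header part and the cpp part marked by comments. AstHolder and ScrapAst are hypothetical names chosen for illustration; only the pattern (declare the destructor in the header, define it where the type is complete) is what this commit describes.

```cpp
#include <memory>
#include <utility>

// ---- what would live in the wrapper's header ----
struct ScrapAst;  // forward declaration only, no scrapast.hpp needed here

class AstHolder {
public:
    explicit AstHolder(std::unique_ptr<ScrapAst> ast);
    ~AstHolder();  // declared here, defined where ScrapAst is complete
    AstHolder(AstHolder&&) noexcept;
    AstHolder& operator=(AstHolder&&) noexcept;

    const ScrapAst* get() const;

private:
    std::unique_ptr<ScrapAst> m_ast;
};

// ---- what would live in the wrapper's cpp, after including the AST header ----
struct ScrapAst {
    int node_count = 0;  // placeholder content for the sketch
};

AstHolder::AstHolder(std::unique_ptr<ScrapAst> ast) : m_ast(std::move(ast)) {}
// unique_ptr's deleter is instantiated here, where ScrapAst is a complete type.
AstHolder::~AstHolder() = default;
AstHolder::AstHolder(AstHolder&&) noexcept = default;
AstHolder& AstHolder::operator=(AstHolder&&) noexcept = default;

const ScrapAst* AstHolder::get() const { return m_ast.get(); }
```

Code that stores or moves an AstHolder then only needs the wrapper's header, because the special member functions that touch the deleter are all compiled in one translation unit.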
2015-10-01 01:45:42 +02:00
dfd0ec343e
Implement parsing of scraplang.
2015-10-01 01:32:27 +02:00
7dfd1f4a38
Update tidy (fixes the build with tidy as a submodule)
2015-09-30 16:33:52 +02:00
bf3b85498b
Add an option to customize the user agent at runtime.
2015-09-30 01:27:28 +02:00
c947eab83f
Show some readable message when being passed an unknown option.
2015-09-30 01:14:47 +02:00