GUI editor to tame mod_security rules

⌈⌋ branch:  modseccfg


Check-in [1aa3926f1f]


Overview
Comment: html2mallard 0.2 with material theme recognition, slightly more structured regex rules
SHA3-256: 1aa3926f1f9a63cf1687f3c505e4367585a340964fb4c9e2f8222f0caca69841
User & Date: mario 2021-01-12 22:48:34
Context
2021-01-12
22:50
Updated man pages for logfmt1 check-in: 6a4153d11f user: mario tags: trunk
22:48
html2mallard 0.2 with material theme recognition, slightly more structured regex rules check-in: 1aa3926f1f user: mario tags: trunk
2021-01-03
20:07
html2mallard split up rules for different templates, add --debug flag check-in: b7a9065f17 user: mario tags: trunk
Changes

Changes to html2mallard/README.md.

## html2mallard / mkdocs-mallard

Extremely crude HTML to [mallard help](http://projectmallard.org/)
conversion.  Specifically for output from [mkdocs 1.x](https://www.mkdocs.org/)


with RTD theme.  It's a very basic regex extraction (→*I'm looking forward
to your letters!*) and filtering process.  It only retains some structural
elements (headlines, paragraphs, tables, lists, notes).  Doesn't even
attempt to gather any topic relation/structure from the navigation list.

 * Really only suitable for one-time/initial conversion.
 * Requires some editing to get pages to validate.  (Though they probably
   "work" in yelp as is).
 * Links and image references certainly require manual cleanup. Nested
   lists or tables are likely to cause issues.
 * And API docs are least convertible (only tested mkdocstrings, source
   dump is omitted, and there's obviously no syntax colorization in yelp;
   alternatively try [mkgendocs](https://pypi.org/project/mkgendocs/)).
 * Primarily designed for mkdocs' HTML output.  But also contains some
   cleanup rules for [fossil](https://fossil-scm.org/) wiki pages (with
   [github](https://fossil.include-once.org/fossil-skins/wiki/GitHub) skin).

 * Conversion doesn't work well for sphinx output (not consistent enough).


## html2mallard

Simple command line tool to convert a single .html file:


    html2mallard site/index.html > help/index.page

Add a `-d`/`--debug` flag after the filename for details on the shortening
process.






## mkdocs-mallard

Converts a list of mkdocs output files to *.page files.

    mkdocs-mallard




## html2mallard / mkdocs-mallard

Extremely crude HTML to [mallard help](http://projectmallard.org/)
conversion.  Specifically for output from [mkdocs](https://www.mkdocs.org/)
with RTD or Material theme.

It's a very basic regex extraction (→*I'm looking forward to your letters!*)
and filtering process.  It only retains some structural elements (headlines,
paragraphs, tables, lists, notes).  Doesn't even attempt to gather any topic
relation/structure from the navigation list.

 * Really just intended for one-time/initial conversion.
 * Requires some editing to get pages to validate.  (Though they probably
   "work" in yelp as is).
 * Links and image references certainly require manual cleanup. Nested
   lists or tables are likely to cause issues.
 * And API docs are least convertible (only tested mkdocstrings, source
   dump is omitted, and there's obviously no syntax colorization in yelp;
   alternatively try [mkgendocs](https://pypi.org/project/mkgendocs/)).
 * Primarily designed for mkdocs' HTML output.  But also contains some
   cleanup rules for [fossil](https://fossil-scm.org/) wiki pages (with
   [github](https://fossil.include-once.org/fossil-skins/wiki/GitHub) skin),
   and `yelp-build` HTML output.
 * Conversion doesn't work well for sphinx output (not consistent enough).
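
To give a feel for the "regex extraction and filtering" idea, here is a minimal sketch that applies a handful of tag rewrites in the style of `html2mallard.py`. The rule table is trimmed for illustration, `rewrite_fragment` and `sample` are made-up names, and the lookahead in the `<li` pattern is an adjustment for this sketch (so it can't also match the `<list>` tag written by the rule before it), not necessarily what the tool ships.

```python
import re

# A few structural rewrites in the style of html2mallard.py (illustrative subset).
RULES = {
    r"<(?:h1|h2)[^>]*>(.+?)</(?:h1|h2)>": r"<title>\1</title>",
    r"<ul[^>]*>(.+?)</ul>": r"<list>\1</list>",
    # The lookahead keeps `<li` from matching the `<list>` tag produced above.
    r"<li(?=[\s>])[^>]*>(.+?)</li>": r"<item><p>\1</p></item>",
    r"<strong>(.+?)</strong>": r'<em style="strong">\1</em>',
}


def rewrite_fragment(fragment):
    """Apply every rule in insertion order; later rules see earlier output."""
    for pattern, replacement in RULES.items():
        fragment = re.sub(pattern, replacement, fragment, flags=re.S | re.I)
    return fragment


sample = "<h1 id='top'>Usage</h1><ul><li>Run <strong>once</strong></li></ul>"
print(rewrite_fragment(sample))
```

Because the rules run in sequence on the whole string, their order matters, which is also why nested lists and tables tend to need manual cleanup afterwards.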


## html2mallard

Simple command line tool to convert a single .html file:


    html2mallard site/index.html > help/index.page

Add a `-d`/`--debug` flag after the filename for details on the shortening
process.

    html2mallard in.html --debug | xmllint - --recover > out.page

With [xmllint](http://xmlsoft.org/xmllint.html) to fix some unmatched tags.
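
A quick way to tell whether a converted page still has unmatched tags (and thus wants the `xmllint --recover` pass) is to try parsing it. This is only a sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and both sample strings are made up for illustration, and well-formedness is not the same as validating against the mallard schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(page_text):
    """Return True if the text parses as XML, False on unmatched tags and the like."""
    try:
        ET.fromstring(page_text)
    except ET.ParseError:
        return False
    return True


good = '<page xmlns="http://projectmallard.org/1.0/" id="index"><title>Index</title></page>'
bad = '<page xmlns="http://projectmallard.org/1.0/" id="index"><title>Index</page>'
print(is_well_formed(good), is_well_formed(bad))
```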


## mkdocs-mallard

Converts a list of mkdocs output files to *.page files.

    mkdocs-mallard

one level of `*.html` files.


## Adaption

The first two `rewrite` rules likely require changes for other HTML sources
or templates. Specifically, `"^.+?</nav>"` should strip the initial
boilerplate; if it doesn't, the pattern may need expanding.



### from `project` import `meta`

| meta           | info                                                            |
|:---------------|:----------------------------------------------------------------|
| depends        | -                                                               |
| compat         | Python ≥3.6                                                     |
| compliancy     | !pep8, ~mallard, !doap                                          |
| system usage   | -                                                               |
| paths          | -                                                               |
| testing        | -                                                               |
| docs           | -                                                               |
| activity       | abandoned                                                       |
| state          | alpha                                                           |
| support        | -                                                               |
| contrib        | -                                                               |
| announce       | -                                                               |









one level of `*.html` files.


## Adaption

The first two `rewrite` rules likely require changes for other HTML sources
or templates. Specifically, `"^.+?</nav>"` should strip the initial
boilerplate; if it doesn't, the pattern may need expanding (either in the
`GENERAL HTML` group or in a new rewrite collection).
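
As a rough sketch of what a new rewrite collection could look like: one flag pattern in `extract` switches a rule group in `rewrite` on. The `ACME` group, the `is_acme` flag, the `acme-docs` marker and the helper `strip_boilerplate` are all hypothetical; only the `(group, flag)` keying mirrors the script.

```python
import re

# Hypothetical flag pattern for an imaginary "acme" HTML template.
extract = {
    r'(<div id="acme-docs")': "is_acme",
}

# Hypothetical rule group, keyed like the script's groups: (name, flag).
rewrite = {
    ("ACME", "is_acme"): {
        r"\A.+?<article>": "",   # strip everything before the content
        r"</article>.+\Z": "",   # strip everything after the content
    },
}


def strip_boilerplate(page):
    """Apply only those rule groups whose flag pattern occurs in the page."""
    flags = {name for rx, name in extract.items() if re.search(rx, page)}
    for (group, flag), patterns in rewrite.items():
        if flag not in flags:
            continue
        for rx, repl in patterns.items():
            page = re.sub(rx, repl, page, flags=re.S)
    return page


page = ('<html><div id="acme-docs"><nav>menu</nav><article><p>Body</p>'
        '</article><footer>(c)</footer></div></html>')
print(strip_boilerplate(page))
```

Pages that never match the flag pattern pass through this group untouched, so template-specific rules don't interfere with each other.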


### from `project` import `meta`

| meta           | info                                                            |
|:---------------|:----------------------------------------------------------------|
| depends        | -                                                               |
| compat         | Python ≥3.6, mkdocs 1.x                                         |
| compliancy     | !pep8, mallard, manpage, !doap, !xdg                            |
| system usage   | -                                                               |
| paths          | -                                                               |
| testing        | -                                                               |
| docs           | -                                                               |
| activity       | abandoned                                                       |
| state          | alpha                                                           |
| support        | -                                                               |
| contrib        | -                                                               |
| announce       | -                                                               |


Changes to html2mallard/html2mallard.py.

#!/usr/bin/env python3
# api: cli
# encoding: utf-8
# type: transform
# title: HTML to mallard
# description: convert mkdocs' html output to mallard/yelp xml
# category: documentation
# keywords: mkdocs mallard
# version: 0.2
# depends: python (>= 3.6), python:PyYAML (>= 5.0)
# license: Public Domain
# url: https://fossil.include-once.org/modseccfg/wiki/html2mallard
# 
# Poor transformation approach, mostly salvaging some HTML structures
# and reshuffling document body into mallard <page> with allowed
# inline markup.

# XSLT might have been easier, but doesn't work on most HTML.
# BS/lxml is way overkill for this task (hence zero such tools).
# No one's doing a markdown to ducktype/mallard converter either.






import os, sys
import re, html
from textwrap import dedent, indent
from glob import glob
import yaml


# output
template = dedent("""
    <page
        xmlns="http://projectmallard.org/1.0/"
        type="guide"
        id="{id}">

        <info>
            <link type="guide" xref="index#nav"/>
    {links}
            <desc>{desc}</desc>

        </info>

        <title>{title}</title>

        {body}

    </page>
""").lstrip()

# regex all the way
extract = {
    # meta info
    "mkdocs_page_name = \"(.*?)\";": "title",
    "<title>(.+?)</title>": "title",

    '<a class="reference internal" href="(\w+).html">.+?</a>': "links",

    # flags
    '(<.+>)': "is_html",
    '(mkdocs)': "is_mkdocs",

    '(fossil|timeline)': "is_fossil",
    "(SphinxRtdTheme|readthedocs-doc-embed.js|aria-label=)": "is_sphinx",

     "(&\w+;)": "has_entities",
    '(<p>|<div|<table|<li>|<img|<strong|<em|<h\d|<span|<code)': "convert",
}
rewrite = {
    # trim and cleanup
    ("GENERAL HTML", "is_html"): {
        "<script.+?</script>": "",
        "<head>.+?</head>": "",
        "<!DOCTYPE[^>]+>|<html[^>]*>|</body>|</html>": "",
        "<span></span>": "",
    },
    ("MKDOCS", "is_mkdocs"): {
        "^.+?</nav>": "",   # might strip too much for any bottom-navigation templates
        "^.+?<div\srole=\"main\">": "",   # mkdocs RTD template
        '<footer>.+\\Z': "",    # mkdocs footer
        'Next\s<span\sclass="icon\sicon-circle-arrow-right"></span>.+\\Z': "",   # mkdocs RTD theme
    },




    ("FOSSIL WIKI", "is_fossil"): {
        '<footer\sid="fossil-footer">.+\\Z': "", # fossil footer
        '\\A.+<main[^>]*>': "", # wiki header



    },
    ("RTD.IO/SPHINX", "is_sphinx"): {
        "^.+?</nav>": "",   # might strip too much for any bottom-navigation templates
        '<footer>.+\\Z': "", 
        '<div\srole="navigation"\saria-label="breadcrumbs\snavigation">.+?</div>': "",  # RTD.io
    },



    ("ENTITIES", "has_entities"): {
        "&rarrq;": "→",
        "&nbsp;": "␣",
        "&mdash;": "–",
        "&(?!lt|gt|amp)\w+;": lambda m: html.unescape(m[0]),
    },









#!/usr/bin/env python3
# api: cli
# encoding: utf-8
# type: transform
# title: HTML to mallard
# description: convert mkdocs' html output to mallard/yelp xml
# category: documentation
# keywords: mkdocs mallard
# version: 0.2.0
# depends: python (>= 3.6), python:PyYAML (>= 5.0)
# license: Public Domain
# url: https://fossil.include-once.org/modseccfg/wiki/html2mallard
# 
# Poor transformation approach, mostly salvaging some HTML structures
# and reshuffling document body into mallard <page> with allowed
# inline markup.
#
# XSLT might have been easier, but doesn't work on most HTML.
# BS/lxml is way overkill for this task (hence zero such tools).
# No one's doing a markdown to ducktype/mallard converter either.
#
# Kinda only works because the mkdocs/markdown-generated HTML is
# fairly consistent. It's best combined with a `xmllint --recover`
# pipe anyhow.


import os, sys
import re, html
from textwrap import dedent, indent
from glob import glob

debug = True and re.search(" -+de?b?u?g?\\b", " ".join(sys.argv), re.I)

# output
template = dedent("""

    <page xmlns="http://projectmallard.org/1.0/"
     type="guide" id="{id}">


    <info>
        <link type="guide" xref="index#nav"/>
    {links}
        <desc>{desc}</desc>
        <?http header="X-Generator: html2mallard" ?>
    </info>

    <title>{title}</title>

    {body}

    </page>
""").lstrip()

# regex all the way
extract = {
    # meta info
    "mkdocs_page_name = \"(.*?)\";": "title",
    "<title>(.+?)</title>": "title",
    '<meta name="description" content="(.+?)"[^>]*>': "desc",
    '<a class="reference internal" href="(\w+).html">.+?</a>': "links",
    '<a class="trail" href="(\w+).html(#.+?)?" title=".+?">': "links",
    # flags
    '(<.+>)': "is_html",
    '(mkdocs)': "is_mkdocs",
    'data-target="[#.]navbar-(collapse)"': "is_material",
    '(fossil|timeline)': "is_fossil",
    "(SphinxRtdTheme|readthedocs-doc-embed.js|aria-label=)": "is_sphinx",
    '(<div class="inner pagewide">)': "is_yelphtml",
     "(&\w+;)": "has_entities",
    '(<p>|<div|<table|<li>|<img|<strong|<em|<h\d|<span|<code)': "convert",
}
rewrite = {
    # trim and cleanup
    ("GENERAL HTML", "is_html"): {
        "<script.+?</script>": "",
        "<head>.+?</head>": "",
        "<!DOCTYPE[^>]+>|<html[^>]*>|</body>|</html>": "",
        "<span></span>": "",
    },
    ("MKDOCS", "is_mkdocs"): {
        "\\A.+?</nav>": "",   # might strip too much for any bottom-navigation templates
        "\\A.+?<div[^>]+role=\"main\">": "",   # mkdocs RTD template
        '<footer>.+\\Z': "",    # mkdocs footer
        'Next\s<span\sclass="icon\sicon-circle-arrow-right"></span>.+\\Z': "",   # mkdocs RTD theme
    },
    ("MATERIAL", "is_material"): {
        '\\A.+<div[^>]+role="main">': "",
        '<div\sclass="modal"\sid="mkdocs_search_modal".+\\Z': "",
    },
    ("FOSSIL WIKI", "is_fossil"): {

        '\\A.+<main[^>]*>': "", # wiki header
        '<div\sclass="submenu">.+?</div>': "", # page header
        '<footer\sid=fossil-footer>.+\\Z': "", # fossil footer
        '<h2>Attachments:</h2><ul>.+\\Z': "", # page footer
    },
    ("RTD.IO/SPHINX", "is_sphinx"): {
        "\\A.+?</nav>": "",   # might strip too much for any bottom-navigation templates
        '<footer>.+\\Z': "", 
        '<div\srole="navigation"\saria-label="breadcrumbs\snavigation">.+?</div>': "",  # RTD.io
    },
    ("YELPHTML", "is_yelphtml"): {
        "\\A.+?</header><article>": "",
    },
    ("ENTITIES", "has_entities"): {
        "&rarrq;": "→",
        "&nbsp;": "␣",
        "&mdash;": "–",
        "&(?!lt|gt|amp)\w+;": lambda m: html.unescape(m[0]),
    },

        # headlines
        "(<h\d[^>]*>.+?(?<!\s))\s*(?=<h\d|<footer|</body|\Z)": "\n<section>\n\\1\n</section>\n",
        "<(?:h1|h2)[^>]*>(.+?)</(?:h1|h2)>": "<title>\\1</title>",
        "<(?:h3|h4)[^>]*>(.+?)</(?:h3|h4)>": "<subtitle>\\1</subtitle>",
        "<(?:h5|h6)[^>]*>(.+?)</(?:h5|h6)>": "<em>\\1</em>",
        "<strong>(.+?)</strong>": "<em style=\"strong\">\\1</em>",
        # lists
        "<ol>(.+?)</ol>": "<steps>\\1</steps>",
        "<ul>(.+?)</ul>": "<list>\\1</list>",
        "<li>(.+?)</li>": "<item><p>\\1</p></item>",
        "<dl>(.+?)</dl>": "<terms>\\1</terms>",
        "<dt>(.+?)</dt>": "<item><title>\\1</title>",
        "<dd>(.+?)</dd>": "<p>\\1</p></item>",
        # fix nested list   \1         \2                 \3                      \4    
        "(<(?:item|steps|terms)>)<p> ([^<]+(?<!\s)) \s* <(list|steps|terms)> \s* (.+?) </\\3>":
            "\\1<p>\\2</p>\n <\\3>\n<item><p>\\4 </\\3>\n</item>",
        # links
        "<a\shref=\"([^\">]+)\.html\">(.+?)</a>": "<link type=\"seealso\" xref=\"\\1\">\\2</link>",
        "<a\shref=\"(\w+://[^\">]+)\">(.+?)</a>": "<link type=\"seealso\" href=\"\\1\">\\2</link>",

        # media
        "<img[^>]+src=\"(.+?)\"[^>]*>": "<media type=\"image\" src=\"\\1\" mime=\"image/png\" />",
        # tables
        "</?tbody>": "",
        "<table[^>]*>": "<table shade=\"rows cols\" rules=\"rows cols\"><tbody>",
        "</table>": "</tbody></table>",
        "<tr[^>]*>": "<tr>",







        # headlines
        "(<h\d[^>]*>.+?(?<!\s))\s*(?=<h\d|<footer|</body|\Z)": "\n<section>\n\\1\n</section>\n",
        "<(?:h1|h2)[^>]*>(.+?)</(?:h1|h2)>": "<title>\\1</title>",
        "<(?:h3|h4)[^>]*>(.+?)</(?:h3|h4)>": "<subtitle>\\1</subtitle>",
        "<(?:h5|h6)[^>]*>(.+?)</(?:h5|h6)>": "<em>\\1</em>",
        "<strong>(.+?)</strong>": "<em style=\"strong\">\\1</em>",
        # lists
        "<ol[^>]*>(.+?)</ol>": "<steps>\\1</steps>",
        "<ul[^>]*>(.+?)</ul>": "<list>\\1</list>",
        # lookahead so `<li` can't also match the <list> tag emitted by the rule above
        "<li(?=[\s>])[^>]*>(.+?)</li>": "<item><p>\\1</p></item>",
        "<dl[^>]*>(.+?)</dl>": "<terms>\\1</terms>",
        "<dt[^>]*>(.+?)</dt>": "<item><title>\\1</title>",
        "<dd[^>]*>(.+?)</dd>": "<p>\\1</p></item>",
        # fix nested list   \1         \2                 \3                      \4    
        "(<(?:item|steps|terms)>)<p> ([^<]+(?<!\s)) \s* <(list|steps|terms)> \s* (.+?) </\\3>":
            "\\1<p>\\2</p>\n <\\3>\n<item><p>\\4 </\\3>\n</item>",
        # links
        "<a\shref=\"([^\">]+)\.html\"[^>]*>(.+?)</a>": "<link type=\"seealso\" xref=\"\\1\">\\2</link>",
        "<a\shref=\"(\w+://[^\">]+)\"[^>]*>(.+?)</a>": "<link type=\"seealso\" href=\"\\1\">\\2</link>",
        "<a\shref=\"(\#[\w\-]+)\"[^>]*>(.+?)</a>": "<link xref=\"\\1\">\\2</link>",
        # media
        "<img[^>]+src=\"(.+?)\"[^>]*>": "<media type=\"image\" src=\"\\1\" mime=\"image/png\" />",
        # tables
        "</?tbody>": "",
        "<table[^>]*>": "<table shade=\"rows cols\" rules=\"rows cols\"><tbody>",
        "</table>": "</tbody></table>",
        "<tr[^>]*>": "<tr>",

    ("PRETTIFY", "is_html"): {
        # prettify sections
        "(<section>)(.+?)(</section>)": lambda m: f"{m[1]}\n{indent(m[2].strip(), prefix=' ')}\n{m[3]}",
        # strip lone </section>, empty spans
        "(<section>.+?</section>)|</section>": "\\1",
        "(<span[^>]*></span>)": "",

    }
}


def convert(html, fn, debug=False):

    # prepare snippets for .format kwargs
    kw = {
        "id": re.sub("^.+/|\.\w+$", "", fn),
        "desc": "",
        "title": "",
        "body": "",
        "links": "",
    }
    for rx, name in extract.items():
        m = re.search(rx, html)
        if m and (not name in kw or not kw[name]):
            if name == "links":
                kw[name] = re.findall(rx, html)
            else:
                kw[name] = re.sub("&\w+;|<.+?>", "", m.group(1))
    if kw["links"]:
        kw["links"] = indent("\n".join(f"<link type=\"guide\" xref=\"{id}\"/>" for id in kw["links"]), prefix=" "*8)
        
    # simplify/convert html
    for (group, flag), patterns in rewrite.items():
        if not flag in kw: # possibly skip rule group
            continue
        elif debug:
            sys.stderr.write(f"group: {group}\n")
        for rx, repl in patterns.items():
            l = len(html)
            html = re.sub(rx, repl, html, 0, re.X|re.M|re.S|re.I)
            if debug and l != len(html):
                sys.stderr.write(f"rewrite: {len(html) - l} bytes, pattern: ~{rx}~\n")
    kw["body"] = html
    
    # return converted
    return template.format(**kw)

# single file
def convert_file(fn, debug=0):
    with open(fn, "r", encoding="utf-8") as f:   # → html2mallard "site/index.html"
        return convert(f.read(), fn, debug)

# process directory
def mkdocs():

    src = open("mkdocs.yml", "r")   # → ought to be in current directory
    cfg = yaml.load(src, Loader=yaml.Loader)
    srcdir = cfg["site_dir"]
    target = cfg["mallard_dir"]    # → required param in mkdocs.yml
    if not os.path.exists(target):
        os.makedirs(target)
    for fn in glob(f"{srcdir}/*.html"):


        page = convert_file(fn)
        fn = re.sub(".+/", "", fn)
        fn = re.sub("\.html", ".page", fn)
        with open(f"{target}/{fn}", "w", encoding="utf-8") as f:
            f.write(page)

# entry_points
def main():
    if len(sys.argv) >= 2:
        dbg = set(["-d", "--debug", "-D"]) & set(sys.argv)
        print(convert_file(sys.argv[1], debug=dbg)) # first argument as input file
    else:
        mkdocs() # iterate through site/*html

if __name__ == "__main__":
    main()
    








    ("PRETTIFY", "is_html"): {
        # prettify sections
        "(<section>)(.+?)(</section>)": lambda m: f"{m[1]}\n{indent(m[2].strip(), prefix=' ')}\n{m[3]}",
        # strip lone </section>, empty spans
        "(<section>.+?</section>)|</section>": "\\1",
        "(<span[^>]*></span>)": "",
        "(<p[^>]*><p[^>]*>)(.+?)(</p></p>)": "<p>\\2</p>",
    }
}


def convert(html, fn):

    # prepare snippets for .format kwargs
    kw = {
        "id": re.sub("\W+", "_", re.sub("^.+/|\.\w+$", "", fn)).lower(),
        "desc": "",
        "title": "",
        "body": "",
        "links": "",
    }
    for rx, name in extract.items():
        m = re.search(rx, html)
        if m and (not name in kw or not kw[name]):
            if name == "links":
                kw[name] = ["".join(row) for row in re.findall(rx, html)]
            else:
                kw[name] = re.sub("&\w+;|<.+?>", "", m.group(1))
    if kw["links"]:
        kw["links"] = indent("\n".join(f"<link type=\"guide\" xref=\"{id}\"/>" for id in kw["links"]), prefix="    ")
        
    # simplify/convert html
    for (group, flag), patterns in rewrite.items():
        if not flag in kw: # possibly skip rule group
            continue
        elif debug:
            sys.stderr.write(f"group: {group}\n")
        for rx, repl in patterns.items():
            l = len(html)
            html = re.sub(rx, repl, html, 0, re.X|re.M|re.S|re.I)
            if debug and l != len(html):
                sys.stderr.write(f"rewrite: {len(html) - l} bytes, pattern: ~{rx}~\n")
    kw["body"] = html
    
    # return converted
    return template.format(**kw)

# single file
def convert_file(fn):
    with open(fn, "r", encoding="utf-8") as f:   # → html2mallard "site/index.html"
        return convert(f.read(), fn)

# process directory
def mkdocs():
    import yaml
    with open("mkdocs.yml", "r", encoding="utf-8") as src:   # → ought to be in current directory
        cfg = yaml.load(src, Loader=yaml.Loader)
    srcdir = cfg["site_dir"]
    target = cfg["mallard_dir"]    # → required param in mkdocs.yml
    if not os.path.exists(target):
        os.makedirs(target)
    for fn in glob(f"{srcdir}/*.html"):
        if debug:
            sys.stderr.write(f"--\nFILE: '{fn}' to {target}/*.page\n")
        page = convert_file(fn)
        fn = re.sub(".+/", "", fn)
        fn = re.sub("\.html", ".page", fn)
        with open(f"{target}/{fn}", "w", encoding="utf-8") as f:
            f.write(page)

# entry_points
def main():
    if len(sys.argv) >= 2:

        print(convert_file(sys.argv[1])) # first argument as input file
    else:
        mkdocs() # iterate through site/*html

if __name__ == "__main__":
    main()
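
The page `id` derivation changed in this check-in: the path and extension are stripped, then any run of non-word characters becomes an underscore and the result is lowercased. A small sketch of that expression in isolation; the wrapper name `page_id` and the sample paths are made up here.

```python
import re


def page_id(filename):
    """Derive a page id from a file path, mirroring the expression in convert()."""
    stem = re.sub(r"^.+/|\.\w+$", "", filename)   # drop directories and extension
    return re.sub(r"\W+", "_", stem).lower()      # collapse non-word runs, lowercase


print(page_id("site/Getting-Started.html"))
print(page_id("docs/api/v1.2 notes.html"))
```

This keeps ids free of dashes, dots and spaces, which presumably makes the resulting `xref` targets less fragile.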
    

Added html2mallard/man/html2mallard.1.

















































































.\" Automatically generated by Pandoc 2.5
.\"
.TH "html2mallard" "1" "" "from modseccfg" "Version 0.2"
.hy
.SH NAME
.PP
\f[B]html2mallard\f[R] \[em] rough help conversion tool
.SH SYNOPSIS
.PP
\f[B]html2mallard\f[R] \f[I]input.html\f[R] \f[B]>\f[R]
\f[I]output.page\f[R]
.PP
\f[B]html2mallard\f[R] \f[I]input.html\f[R]
[\f[B]\-D\f[R]|\f[B]\-d\f[R]|\f[B]\-\-debug\f[R]]
.PP
\f[B]html2mallard\f[R] \f[I]input.html\f[R] | \f[B]xmllint\f[R]
\f[I]\-\f[R] \f[B]\-\-recover\f[R]
.PP
\f[B]mkdocs\-mallard\f[R]
.SH DESCRIPTION
.PP
Provides a rough conversion from mkdocs\-generated HTML to mallard/yelp
files.
.PP
\f[B]mkdocs\-mallard\f[R] reads a bunch of files from the
\f[I]site_dir\f[R] defined in \f[I]mkdocs.yml\f[R], and writes them to
\f[I]mallard_dir\f[R].
.SH PURPOSE
.PP
Poor transformation approach, mostly salvaging some HTML structures and
reshuffling document body into mallard with allowed inline markup.
.IP \[bu] 2
XSLT might have been easier, but doesn\[cq]t work on most HTML.
.IP \[bu] 2
BS/lxml is way overkill for this task (hence zero such tools).
.IP \[bu] 2
No one\[cq]s doing a markdown to ducktype/mallard converter either.
.SH SEE ALSO
.PP
\f[B]https://pypi.org/project/html2mallard/\f[R], \f[B]xmllint\f[R](1)

Added html2mallard/man/html2mallard.md.

































































































% html2mallard(1) from modseccfg | Version 0.2


NAME
====

**html2mallard** — rough help conversion tool

SYNOPSIS
========

  **html2mallard** *input.html* **>** *output.page*

  **html2mallard** *input.html* \[**-D**|**-d**|**\-\-debug**\]

  **html2mallard** *input.html* \| **xmllint** *\-* **\-\-recover**

  **mkdocs-mallard**


DESCRIPTION
===========

Provides a rough conversion from mkdocs-generated HTML to
mallard/yelp files.

**mkdocs-mallard** reads a bunch of files from the *site_dir*
defined in *mkdocs.yml*, and writes them to *mallard_dir*.
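
The file mapping behind **mkdocs-mallard** can be sketched roughly as below. It assumes the config has already been loaded into a plain dict with `site_dir` and `mallard_dir` keys (the script itself reads them from *mkdocs.yml* via PyYAML); the helper name `plan_pages` and the temporary demo tree are made up, and only one directory level of `*.html` is considered.

```python
import tempfile
from pathlib import Path


def plan_pages(cfg):
    """Pair each top-level *.html in site_dir with its *.page target."""
    site_dir = Path(cfg["site_dir"])
    target = Path(cfg["mallard_dir"])
    return [(src, target / (src.stem + ".page"))
            for src in sorted(site_dir.glob("*.html"))]


# Demo on a throwaway directory tree; nested files are deliberately ignored.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp, "site")
    (site / "sub").mkdir(parents=True)
    for name in ("index.html", "usage.html", "sub/deep.html"):
        (site / name).write_text("<html></html>", encoding="utf-8")
    cfg = {"site_dir": str(site), "mallard_dir": str(Path(tmp, "help"))}
    planned = [(src.name, dst.name) for src, dst in plan_pages(cfg)]

print(planned)
```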


PURPOSE
=======

Poor transformation approach, mostly salvaging some HTML structures
and reshuffling document body into mallard <page> with allowed
inline markup.

 * XSLT might have been easier, but doesn't work on most HTML.
 * BS/lxml is way overkill for this task (hence zero such tools).
 * No one's doing a markdown to ducktype/mallard converter either.



SEE ALSO
========

**https://pypi.org/project/html2mallard/**, **xmllint**(1)