GUI editor to tame mod_security rules

⌈⌋ branch:  modseccfg


Check-in [b7a9065f17]


Overview
Comment: html2mallard split up rules for different templates, add --debug flag
SHA3-256: b7a9065f17f70ea48d900a821786b33bbb2558f4d632f055df53be23515f20cd
User & Date: mario 2021-01-03 20:07:21
Context
2021-01-12 22:48: html2mallard 0.2 with material theme recognition, slightly more structured regex rules (check-in: 1aa3926f1f, user: mario, tags: trunk)
2021-01-03 20:07: html2mallard split up rules for different templates, add --debug flag (check-in: b7a9065f17, user: mario, tags: trunk)
2021-01-03 20:06: Stub manpage for logfmt(5) (check-in: 9be300bfed, user: mario, tags: trunk)
Changes

Changes to html2mallard/README.md.

Old version (lines 1-37):

## html2mallard / mkdocs-mallard

Extremely crude HTML to [mallard help](http://projectmallard.org/) conversion.
Specifically for output from mkdocs 1.x with RTD theme. It's a very basic regex
extraction and filtering process, that only retains some structural
elements (headlines, paragraphs, tables, lists, notes). Doesn't even
attempt to gather any topic relation/structure from the navigation list.

Really only suitable for one-time/initial conversion, and requires some
editing to get pages to validate. (Though they probably "work" in yelp
as is). Links certainly require manual cleanup. And API docs are least
convertible.


## html2mallard

Simple command line tool to convert a single .html file:


    html2mallard site/index.html > help/index.page





## mkdocs-mallard

Converts a list of mkdocs output files to *.page files. Requires an extra
`mallard_dir` in the `mallard.xml` config.

    mkdocs-mallard

Sample config:

    site_name: logfmt1
    docs_dir: docs
    site_dir: html
    mallard_dir: mallard
    use_directory_urls: false
    nav:


New version (lines 1-48):

## html2mallard / mkdocs-mallard

Extremely crude HTML to [mallard help](http://projectmallard.org/)
conversion.  Specifically for output from [mkdocs 1.x](https://www.mkdocs.org/)
with RTD theme.  It's a very basic regex extraction (→*I'm looking forward
to your letters!*) and filtering process.  It only retains some structural
elements (headlines, paragraphs, tables, lists, notes).  Doesn't even
attempt to gather any topic relation/structure from the navigation list.

 * Really only suitable for one-time/initial conversion.
 * Requires some editing to get pages to validate.  (Though they probably
   "work" in yelp as is.)
 * Links and image references certainly require manual cleanup.  Nested
   lists or tables are likely to cause issues.
 * API docs are the least convertible (only tested with mkdocstrings; the
   source dump is omitted, and there's obviously no syntax colorization in
   yelp; alternatively try [mkgendocs](https://pypi.org/project/mkgendocs/)).
 * Primarily designed for mkdocs' HTML output.  But also contains some
   cleanup rules for [fossil](https://fossil-scm.org/) wiki pages (with
   [github](https://fossil.include-once.org/fossil-skins/wiki/GitHub) skin).
 * Conversion doesn't work well for sphinx output (not consistent enough).


## html2mallard

Simple command line tool to convert a single .html file:


    html2mallard site/index.html > help/index.page

Add a `-d`/`--debug` flag after the filename for details on the shortening
process.
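
A minimal sketch of why the flag goes after the filename, assuming the entry point takes the first argument as the input file and looks for the debug switches anywhere in the argument list (which is how `main()` in this check-in reads `sys.argv`). `parse_args` is a hypothetical helper name used only for illustration; it is not part of the tool.

```python
DEBUG_FLAGS = {"-d", "--debug", "-D"}

def parse_args(argv):
    """Return (input_file, debug) from an argv-style list (hypothetical helper)."""
    if len(argv) < 2:
        return None, False              # no arguments: directory mode instead
    input_file = argv[1]                # the first argument is always the input file
    debug = bool(DEBUG_FLAGS & set(argv))
    return input_file, debug

# flag after the filename: works as intended
print(parse_args(["html2mallard", "site/index.html", "--debug"]))
# flag before the filename: the flag itself would be opened as the input file
print(parse_args(["html2mallard", "--debug", "site/index.html"]))
```

With the flag in first position, the flag string ends up as the file name, so the order matters.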


## mkdocs-mallard

Converts a list of mkdocs output files to *.page files.


    mkdocs-mallard

Requires an extra **`mallard_dir`** in the `mkdocs.yml` config:

    site_name: logfmt1
    docs_dir: docs
    site_dir: html
    mallard_dir: mallard
    use_directory_urls: false
    nav:

Old version (lines 47-82):

      - def_list
      - tables
      - markdown.extensions.codehilite:
          guess_lang: true
    plugins:
      - mkdocstrings

Note the `mallard_dir` and `use_directory_urls`. The script only scans
one level of `*.html` files.


## Adaption

The first two `rewrite` rules likely require changes for other HTML sources
or templates. Specifically `"^.+?</nav>"` should strip the initial
boilerplate, else might need expansion.


### from `project` import `meta`

| meta           | info                                                            |
|:---------------|:----------------------------------------------------------------|
| depends        | -                                                               |
| compat         | Python ≥3.6                                                     |
| compliancy     | !pep8, ~mallard, !doap                                          |
| system usage   | -                                                               |
| paths          | -                                                               |
| testing        | `None`                                                          |
| docs           | -                                                               |
| activity       | abandoned                                                       |
| state          | alpha                                                           |
| support        | `None`                                                          |
| contrib        | -                                                               |
| announce       | -                                                               |









New version (lines 58-93):

      - def_list
      - tables
      - markdown.extensions.codehilite:
          guess_lang: true
    plugins:
      - mkdocstrings

Also depends on `use_directory_urls: false`, since the script only `glob()`s
one level of `*.html` files.
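
A small sketch of the one-level `glob()` behaviour, assuming a throwaway site directory built with `tempfile`. The file names and the `list_pages` helper are made up for illustration. Pages that mkdocs would nest as `install/index.html` (the `use_directory_urls: true` layout) are not picked up.

```python
import os
import tempfile
from glob import glob

def list_pages(site_dir):
    """Return the *.html basenames directly inside site_dir (no recursion)."""
    return sorted(os.path.basename(fn) for fn in glob(f"{site_dir}/*.html"))

with tempfile.TemporaryDirectory() as site_dir:
    # flat layout, as produced with use_directory_urls: false
    open(os.path.join(site_dir, "index.html"), "w").close()
    open(os.path.join(site_dir, "usage.html"), "w").close()
    # nested layout, as produced with use_directory_urls: true
    os.makedirs(os.path.join(site_dir, "install"))
    open(os.path.join(site_dir, "install", "index.html"), "w").close()

    found = list_pages(site_dir)
    print(found)   # the nested install/index.html is missing from the result
```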


## Adaption

The first two `rewrite` rules likely require changes for other HTML sources
or templates. Specifically `"^.+?</nav>"` should strip the initial
boilerplate, else might need expansion.
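
To check whether the `"^.+?</nav>"` rule suits another template, it can be tried in isolation. This is a rough sketch using the same flag combination the script passes to `re.sub`. The two sample pages and the `strip_leading_nav` helper are invented for illustration.

```python
import re

FLAGS = re.X | re.M | re.S | re.I   # same flag set the script uses

def strip_leading_nav(page):
    """Drop everything up to and including the first </nav> (hypothetical helper)."""
    return re.sub(r"^.+?</nav>", "", page, flags=FLAGS)

# navigation at the top: only the boilerplate is removed
top_nav = '<html>\n<nav>menu</nav>\n<div role="main"><p>Body</p></div>\n'
# navigation at the bottom: the rule eats the page content as well
bottom_nav = '<html>\n<div role="main"><p>Body</p></div>\n<nav>prev/next</nav>\n'

print(repr(strip_leading_nav(top_nav)))
print(repr(strip_leading_nav(bottom_nav)))
```

If the second case applies to a template, the rule needs a narrower pattern for that source.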


### from `project` import `meta`

| meta           | info                                                            |
|:---------------|:----------------------------------------------------------------|
| depends        | -                                                               |
| compat         | Python ≥3.6                                                     |
| compliancy     | !pep8, ~mallard, !doap                                          |
| system usage   | -                                                               |
| paths          | -                                                               |
| testing        | -                                                               |
| docs           | -                                                               |
| activity       | abandoned                                                       |
| state          | alpha                                                           |
| support        | -                                                               |
| contrib        | -                                                               |
| announce       | -                                                               |


Changes to html2mallard/html2mallard.py.

Old version (lines 1-187):

#!/usr/bin/env python3
# api: cli
# encoding: utf-8
# type: transform
# title: HTML to mallard
# description: convert mkdocs´ html output to mallard/yelp xml
# category: documentation
# keywords: mkdocs mallard
# version: 0.2
# depends: python (>= 3.6)
# license: Public Domain
# url: https://fossil.include-once.org/modseccfg/wiki/html2mallard
# 
# Poor transformation approach, mostly salvaging some HTML structures
# and reshuffling document body into mallard <page> and allowed
# inline markup.
# XSLT might have been easier, but doesn't work on HTML.
#



import os, sys
import re
from textwrap import dedent, indent
from glob import glob
import yaml


# output
template = dedent("""
    <page
        xmlns="http://projectmallard.org/1.0/"
        type="guide"
        id="{id}">

        <info>
            <link type="guide" xref="index#nav"/>
            {links}
            <desc>{desc}</desc>
        </info>

        <title>{title}</title>

        {body}

    </page>
""").lstrip()

# regex all the way
extract = {

    "mkdocs_page_name = \"(.*?)\";": "title",
    "<title>(.+?)</title>": "title",
    '<a class="reference internal" href="(\w+).html">.+?</a>': "links",







}
rewrite = {
    # trim and cleanup
    "^.+?</nav>": "",
    "^.+?<div\srole=\"main\">": "",   # mkdocs RTD template
    "<script.+?</script>": "",
    "<head>.+?</head>": "",
    "</body>|</html>": "",
    "<span></span>": "",




    '<footer>.+\\Z': "",    # mkdocs footer



    '<footer\sid="fossil-footer">.+\\Z': "", # fossil footer
    '\\A.+<main[^>]*>': "", # strip fossil wiki header



    'Next\s<span\sclass="icon\sicon-circle-arrow-right"></span>.+\\Z': "",



    "&rarrq;": "→",
    "&nbsp;": "␣",
    #"&quot;": "\"",
    #"&apos;": "\'"",
    "&(?!lt|gt|amp)\w+;": "",

    
    # actual conversions

    "<div\sclass=\"admonition\s(?:note|abstract|summary|tldr)\">(.+?)</div>": "<note style=\"tip\">\\1</note>",
    "<div\sclass=\"admonition\s(?:todo|seealso)\">(.+?)</div>": "<note style=\"advanced\">\\1</note>",
    "<div\sclass=\"admonition\s(?:danger|error|failure|fail|missing|bug)\">(.+?)</div>": "<note style=\"bug\">\\1</note>",
    "<div\sclass=\"admonition\s(?:info|todo)\">(.+?)</div>": "<note style=\"important\">\\1</note>",
    "<div\sclass=\"admonition\s(?:example|quote|cite)\">(.+?)</div>": "<note style=\"plain\">\\1</note>",
    "<div\sclass=\"admonition\s(?:question|help|faq)\">(.+?)</div>": "<note style=\"sidebar\">\\1</note>",
    "<div\sclass=\"admonition\s(?:notes|tip|hint|important)\">(.+?)</div>": "<note style=\"tip\">\\1</note>",
    "<div\sclass=\"admonition\s(?:warning|caution|attention)\">(.+?)</div>": "<note style=\"warning\">\\1</note>",
    "<div\sclass=\"admonition(?:\s\w+)?\">(.+?)</div>": "<note style=\"tip\">\\1</note>",
    "<p\sclass=\"admonition-title\">(.+?)</p>": "<subtitle>\\1</subtitle>",
    # headlines
    "(<h\d[^>]*>.+?(?<!\s))\s*(?=<h\d|<footer|</body|\Z)": "\n<section>\n\\1\n</section>\n",
    "<(?:h1|h2)[^>]*>(.+?)</(?:h1|h2)>": "<title>\\1</title>",
    "<(?:h3|h4)[^>]*>(.+?)</(?:h3|h4)>": "<subtitle>\\1</subtitle>",
    "<(?:h5|h6)[^>]*>(.+?)</(?:h5|h6)>": "<em>\\1</em>",
    "<strong>(.+?)</strong>": "<em style=\"strong\">\\1</em>",
    # lists
    "<ol>(.+?)</ol>": "<steps>\\1</steps>",
    "<ul>(.+?)</ul>": "<list>\\1</list>",
    "<li>(.+?)</li>": "<item><p>\\1</p></item>",
    "<dl>(.+?)</dl>": "<terms>\\1</terms>",
    "<dt>(.+?)</dt>": "<item><title>\\1</title>",
    "<dd>(.+?)</dd>": "<p>\\1</p></item>",
    # fix nested list   \1         \2                 \3                      \4    
    "(<(?:item|steps|terms)>)<p> ([^<]+(?<!\s)) \s* <(list|steps|terms)> \s* (.+?) </\\3>":
        "\\1<p>\\2</p>\n <\\3>\n<item><p>\\4 </\\3>\n</item>",
    # links
    "<a\shref=\"([^\">]+)\.html\">(.+?)</a>": "<link type=\"seealso\" xref=\"\\1\">\\2</link>",
    "<a\shref=\"(\w+://[^\">]+)\">(.+?)</a>": "<link type=\"seealso\" href=\"\\1\">\\2</link>",
    # media
    "<img[^>]+src=\"(.+?)\"[^>]*>": "<media type=\"image\" src=\"\\1\" mime=\"image/png\" />",
    # tables
    "</?tbody>": "",
    "<table[^>]*>": "<table shade=\"rows cols\" rules=\"rows cols\"><tbody>",
    "</table>": "</tbody></table>",
    "<tr[^>]*>": "<tr>",
    "<(td|th)\\b[^>]*>": "    <td><p>",
    "</(td|th)\\b[^>]*>": "</p></td>",

    # strip codehilite markup
    "<span\sclass=\"\w{1,2}\">(.+?)</span>": "<span>\\1</span>",



    

    # strip any remaining non-mallard tags, except: |div|revision|thead
    """</? 
       (?!(?:page|section|info|credit|link|link|title|desc|title|keywords|license|desc|
       years|email|name|links|code|media|p|screen|quote|comment|example|figure|listing|
       note|synopsis|list|item|steps|item|terms|item|tree|item|table|col|colgroup|tr|
       tbody|tfoot|td|th|title|subtitle|desc|cite|app|code|cmd|output|em|file|gui|guiseq|hi|
       link|media|keyseq|key|span|sys|input|var)\\b)
       \w+[^>]* >""": "",



    # prettify sections
    "(<section>)(.+?)(</section>)": lambda m: f"{m[1]}\n{indent(m[2].strip(), prefix=' ')}\n{m[3]}",
    # strip lone </section>, empty spans
    "(<section>.+?</section>)|</section>": "\\1",
    "(<span[^>]*></span>)": "",
}



def convert(html, fn):

    # prepare snippets for .format kwargs
    kw = {
        "id": re.sub("^.+/|\.\w+$", "", fn),
        "desc": "",
        "title": "",
        "body": "",
        "links": "",
    }
    for rx, name in extract.items():
        m = re.search(rx, html)
        if m and not kw[name]:
            if name == "links":
                kw[name] = re.findall(rx, html)
            else:
                kw[name] = m.group(1)
    if kw["links"]:
        kw["links"] = "\n        ".join(f"<link type=\"guide\" xref=\"{id}\"/>" for id in kw["links"])
        
    # simplify/convert html





    for rx, repl in rewrite.items():

        html = re.sub(rx, repl, html, 0, re.X|re.M|re.S|re.I)


    kw["body"] = html
    
    # return converted
    return template.format(**kw)


def cnv_file(fn):
    with open(fn, "r", encoding="utf-8") as f:
        return convert(f.read(), fn)


def mkdocs():
    src = open("mkdocs.yml", "r") # mkdocs config should be in current directory
    cfg = yaml.load(src, Loader=yaml.Loader)
    srcdir = cfg["site_dir"]
    target = cfg["mallard_dir"]
    if not os.path.exists(target):
        os.makedirs(target)
    for fn in glob(f"{srcdir}/*.html"):
        page = cnv_file(fn)
        fn = re.sub(".+/", "", fn)
        fn = re.sub("\.html", ".page", fn)
        with open(f"{target}/{fn}", "w", encoding="utf-8") as f:
            f.write(page)


def main():
    if len(sys.argv) == 2:

        print(cnv_file(sys.argv[1])) # e.g. "site/index.html"
    else:
        mkdocs() # iterate through site/*html

if __name__ == "__main__":
    main()
    









New version (lines 1-227):

#!/usr/bin/env python3
# api: cli
# encoding: utf-8
# type: transform
# title: HTML to mallard
# description: convert mkdocs´ html output to mallard/yelp xml
# category: documentation
# keywords: mkdocs mallard
# version: 0.2
# depends: python (>= 3.6), python:PyYAML (>= 5.0)
# license: Public Domain
# url: https://fossil.include-once.org/modseccfg/wiki/html2mallard
# 
# Poor transformation approach, mostly salvaging some HTML structures
# and reshuffling document body into mallard <page> with allowed
# inline markup.
# XSLT might have been easier, but doesn't work on most HTML.
# BS/lxml is way overkill for this task (hence zero such tools).
# No one's doing a markdown to ducktype/mallard converter either.


import os, sys
import re, html
from textwrap import dedent, indent
from glob import glob
import yaml


# output
template = dedent("""
    <page
        xmlns="http://projectmallard.org/1.0/"
        type="guide"
        id="{id}">

        <info>
            <link type="guide" xref="index#nav"/>
    {links}
            <desc>{desc}</desc>
        </info>

        <title>{title}</title>

        {body}

    </page>
""").lstrip()

# regex all the way
extract = {
    # meta info
    "mkdocs_page_name = \"(.*?)\";": "title",
    "<title>(.+?)</title>": "title",
    '<a class="reference internal" href="(\w+).html">.+?</a>': "links",
    # flags
    '(<.+>)': "is_html",
    '(mkdocs)': "is_mkdocs",
    '(fossil|timeline)': "is_fossil",
    "(SphinxRtdTheme|readthedocs-doc-embed.js|aria-label=)": "is_sphinx",
     "(&\w+;)": "has_entities",
    '(<p>|<div|<table|<li>|<img|<strong|<em|<h\d|<span|<code)': "convert",
}
rewrite = {
    # trim and cleanup
    ("GENERAL HTML", "is_html"): {

        "<script.+?</script>": "",
        "<head>.+?</head>": "",
        "<!DOCTYPE[^>]+>|<html[^>]*>|</body>|</html>": "",
        "<span></span>": "",
    },
    ("MKDOCS", "is_mkdocs"): {
        "^.+?</nav>": "",   # might strip too much for any bottom-navigation templates
        "^.+?<div\srole=\"main\">": "",   # mkdocs RTD template
        '<footer>.+\\Z': "",    # mkdocs footer
        'Next\s<span\sclass="icon\sicon-circle-arrow-right"></span>.+\\Z': "",   # mkdocs RTD theme
    },
    ("FOSSIL WIKI", "is_fossil"): {
        '<footer\sid="fossil-footer">.+\\Z': "", # fossil footer
        '\\A.+<main[^>]*>': "", # wiki header
    },
    ("RTD.IO/SPHINX", "is_sphinx"): {
        "^.+?</nav>": "",   # might strip too much for any bottom-navigation templates
        '<footer>.+\\Z': "", 
        '<div\srole="navigation"\saria-label="breadcrumbs\snavigation">.+?</div>': "",  # RTD.io
    },
    ("ENTITIES", "has_entities"): {
        "&rarrq;": "→",
        "&nbsp;": "␣",
        "&mdash;": "",

        "&(?!lt|gt|amp)\w+;": lambda m: html.unescape(m[0]),
    },

    # actual conversions
    ("CONVERSIONS", "convert"): {
        "<div\sclass=\"admonition\s(?:note|abstract|summary|tldr)\">(.+?)</div>": "<note style=\"tip\">\\1</note>",
        "<div\sclass=\"admonition\s(?:todo|seealso)\">(.+?)</div>": "<note style=\"advanced\">\\1</note>",
        "<div\sclass=\"admonition\s(?:danger|error|failure|fail|missing|bug)\">(.+?)</div>": "<note style=\"bug\">\\1</note>",
        "<div\sclass=\"admonition\s(?:info|todo)\">(.+?)</div>": "<note style=\"important\">\\1</note>",
        "<div\sclass=\"admonition\s(?:example|quote|cite)\">(.+?)</div>": "<note style=\"plain\">\\1</note>",
        "<div\sclass=\"admonition\s(?:question|help|faq)\">(.+?)</div>": "<note style=\"sidebar\">\\1</note>",
        "<div\sclass=\"admonition\s(?:notes|tip|hint|important)\">(.+?)</div>": "<note style=\"tip\">\\1</note>",
        "<div\sclass=\"admonition\s(?:warning|caution|attention)\">(.+?)</div>": "<note style=\"warning\">\\1</note>",
        "<div\sclass=\"admonition(?:\s\w+)?\">(.+?)</div>": "<note style=\"tip\">\\1</note>",
        "<p\sclass=\"admonition-title\">(.+?)</p>": "<subtitle>\\1</subtitle>",
        # headlines
        "(<h\d[^>]*>.+?(?<!\s))\s*(?=<h\d|<footer|</body|\Z)": "\n<section>\n\\1\n</section>\n",
        "<(?:h1|h2)[^>]*>(.+?)</(?:h1|h2)>": "<title>\\1</title>",
        "<(?:h3|h4)[^>]*>(.+?)</(?:h3|h4)>": "<subtitle>\\1</subtitle>",
        "<(?:h5|h6)[^>]*>(.+?)</(?:h5|h6)>": "<em>\\1</em>",
        "<strong>(.+?)</strong>": "<em style=\"strong\">\\1</em>",
        # lists
        "<ol>(.+?)</ol>": "<steps>\\1</steps>",
        "<ul>(.+?)</ul>": "<list>\\1</list>",
        "<li>(.+?)</li>": "<item><p>\\1</p></item>",
        "<dl>(.+?)</dl>": "<terms>\\1</terms>",
        "<dt>(.+?)</dt>": "<item><title>\\1</title>",
        "<dd>(.+?)</dd>": "<p>\\1</p></item>",
        # fix nested list   \1         \2                 \3                      \4    
        "(<(?:item|steps|terms)>)<p> ([^<]+(?<!\s)) \s* <(list|steps|terms)> \s* (.+?) </\\3>":
            "\\1<p>\\2</p>\n <\\3>\n<item><p>\\4 </\\3>\n</item>",
        # links
        "<a\shref=\"([^\">]+)\.html\">(.+?)</a>": "<link type=\"seealso\" xref=\"\\1\">\\2</link>",
        "<a\shref=\"(\w+://[^\">]+)\">(.+?)</a>": "<link type=\"seealso\" href=\"\\1\">\\2</link>",
        # media
        "<img[^>]+src=\"(.+?)\"[^>]*>": "<media type=\"image\" src=\"\\1\" mime=\"image/png\" />",
        # tables
        "</?tbody>": "",
        "<table[^>]*>": "<table shade=\"rows cols\" rules=\"rows cols\"><tbody>",
        "</table>": "</tbody></table>",
        "<tr[^>]*>": "<tr>",
        "<(td|th)\\b[^>]*>": "    <td><p>",
        "</(td|th)\\b[^>]*>": "</p></td>",

        # strip codehilite markup
        "<span\sclass=\"\w{1,2}\">(.+?)</span>": "<span>\\1</span>",
        "<span\sclass=\"([\w\-\s]+)\">(.+?)</span>": "<span style=\"\\1\">\\2</span>",
        "<code\sclass=\"([\w\-\s]+)\">(.+?)</code>": "<code><span style=\"\\1\">\\2</span></code>",
    },
     
    ("HTML BEGONE", "is_html"): { 
        # strip any remaining non-mallard tags, except: |div|revision|thead
        """</? 
           (?!(?:page|section|info|credit|link|link|title|desc|title|keywords|license|desc|
           years|email|name|links|code|media|p|screen|quote|comment|example|figure|listing|
           note|synopsis|list|item|steps|item|terms|item|tree|item|table|col|colgroup|tr|
           tbody|tfoot|td|th|title|subtitle|desc|cite|app|code|cmd|output|em|file|gui|guiseq|hi|
           link|media|keyseq|key|span|sys|input|var)\\b)
           \w+[^>]* >""": "",
    },

    ("PRETTIFY", "is_html"): {
        # prettify sections
        "(<section>)(.+?)(</section>)": lambda m: f"{m[1]}\n{indent(m[2].strip(), prefix=' ')}\n{m[3]}",
        # strip lone </section>, empty spans
        "(<section>.+?</section>)|</section>": "\\1",
        "(<span[^>]*></span>)": "",
    }
}


def convert(html, fn, debug=False):

    # prepare snippets for .format kwargs
    kw = {
        "id": re.sub("^.+/|\.\w+$", "", fn),
        "desc": "",
        "title": "",
        "body": "",
        "links": "",
    }
    for rx, name in extract.items():
        m = re.search(rx, html)
        if m and (not name in kw or not kw[name]):
            if name == "links":
                kw[name] = re.findall(rx, html)
            else:
                kw[name] = re.sub("&\w+;|<.+?>", "", m.group(1))
    if kw["links"]:
        kw["links"] = indent("\n".join(f"<link type=\"guide\" xref=\"{id}\"/>" for id in kw["links"]), prefix=" "*8)
        
    # simplify/convert html
    for (group, flag), patterns in rewrite.items():
        if not flag in kw: # possibly skip rule group
            continue
        elif debug:
            sys.stderr.write(f"group: {group}\n")
        for rx, repl in patterns.items():
            l = len(html)
            html = re.sub(rx, repl, html, 0, re.X|re.M|re.S|re.I)
            if debug and l != len(html):
                sys.stderr.write(f"rewrite: {len(html) - l} bytes, pattern: ~{rx}~\n")
    kw["body"] = html
    
    # return converted
    return template.format(**kw)

# single file
def convert_file(fn, debug=0):
    with open(fn, "r", encoding="utf-8") as f:   # → html2mallard "site/index.html"
        return convert(f.read(), fn, debug)

# process directory
def mkdocs():
    src = open("mkdocs.yml", "r")   # ought to be in current directory
    cfg = yaml.load(src, Loader=yaml.Loader)
    srcdir = cfg["site_dir"]
    target = cfg["mallard_dir"]    # → required param in mkdocs.yml
    if not os.path.exists(target):
        os.makedirs(target)
    for fn in glob(f"{srcdir}/*.html"):
        page = convert_file(fn)
        fn = re.sub(".+/", "", fn)
        fn = re.sub("\.html", ".page", fn)
        with open(f"{target}/{fn}", "w", encoding="utf-8") as f:
            f.write(page)

# entry_points
def main():
    if len(sys.argv) >= 2:
        dbg = set(["-d", "--debug", "-D"]) & set(sys.argv)
        print(convert_file(sys.argv[1], debug=dbg)) # first argument as input file
    else:
        mkdocs() # iterate through site/*html

if __name__ == "__main__":
    main()
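
The core idea of this check-in (rule groups that only run when a matching flag was detected in the input, plus optional debug reporting on stderr) can be sketched in a reduced form. This is not the script itself: `DETECT`, `RULES` and `rewrite_page` are illustrative names, and the rule tables are cut down to a handful of patterns.

```python
import re
import sys

FLAGS = re.X | re.M | re.S | re.I

# detection pattern -> flag name (reduced, illustrative subset)
DETECT = {
    r"(<.+>)": "is_html",
    r"(mkdocs)": "is_mkdocs",
}

# (group label, required flag) -> {pattern: replacement}
RULES = {
    ("GENERAL HTML", "is_html"): {
        r"<script.+?</script>": "",
    },
    ("MKDOCS", "is_mkdocs"): {
        r"^.+?</nav>": "",
    },
    ("CONVERSIONS", "is_html"): {
        r"<strong>(.+?)</strong>": r'<em style="strong">\1</em>',
    },
}

def rewrite_page(page, debug=False):
    """Apply only the rule groups whose flag was detected in the input."""
    flags = {name for rx, name in DETECT.items() if re.search(rx, page)}
    applied = []
    for (group, flag), patterns in RULES.items():
        if flag not in flags:
            continue                      # skip the whole rule group
        applied.append(group)
        for rx, repl in patterns.items():
            before = len(page)
            page = re.sub(rx, repl, page, flags=FLAGS)
            if debug and len(page) != before:
                sys.stderr.write(f"{group}: {len(page) - before:+d} chars, pattern ~{rx}~\n")
    return page, applied

out, groups = rewrite_page("<nav>menu</nav><p><strong>Hi</strong></p>", debug=True)
print(groups)
print(out)
```

Since the sample input never mentions mkdocs, the `MKDOCS` group is skipped and the `<nav>` block survives. This skipping of unrelated cleanup rules is what the split into groups allows.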
    

Changes to html2mallard/setup.py.

Old version (lines 11-24):

setup(
    debug=1,
    fn="html2mallard.py",
    long_description="README.md",
    packages=[""],
    package_dir={"": "."},

    entry_points={
        "console_scripts": [
            "html2mallard=html2mallard:main",
            "mkdocs-mallard=html2mallard:mkdocs",
        ]
    }
)







New version (lines 11-25):

setup(
    debug=1,
    fn="html2mallard.py",
    long_description="README.md",
    packages=[""],
    package_dir={"": "."},
    py_modules=['html2mallard'],
    entry_points={
        "console_scripts": [
            "html2mallard=html2mallard:main",
            "mkdocs-mallard=html2mallard:mkdocs",
        ]
    }
)