modseccfg: Check-in [ada19bd287]


Overview

Comment:	html2mallard update: support direct .md conversion, and http:// url params, doc updates.
SHA3-256:	ada19bd287ce6efb72c815cc299114f72c0cc82b998aeb86d0e4273b09e5f891
User & Date:	mario 2021-03-26 12:13:35

Context

2021-03-29
21:23		Performance fix for pyrewrite in range() check. check-in: 950ea0eb0c user: mario tags: trunk
2021-03-26
12:13		html2mallard update: support direct .md conversion, and http:// url params, doc updates. check-in: ada19bd287 user: mario tags: trunk
2021-03-06
21:46		Requires msc_pyparser >= 1.1 (for the CRS 3.2 line continuation issue in 901.conf) check-in: 6d5c87b143 user: mario tags: trunk

Changes


Changes to html2mallard/README.md.

Changes to html2mallard/html2mallard.py.

Changes to html2mallard/man/html2mallard.1.

Changes to html2mallard/man/html2mallard.md.

︙			︙
@@ old lines 17-46, new lines 17-67 @@
  * And API docs are least convertible (only tested mkdocstrings, source dump is omitted, and there's obviously no syntax colorization in yelp; alternatively try [mkgendocs](https://pypi.org/project/mkgendocs/)).
  * Primarily designed for mkdocs´ HTML output. But also contains some cleanup rules for [fossil](https://fossil-scm.org/) wiki pages (with [github](https://fossil.include-once.org/fossil-skins/wiki/GitHub) skin), and yelp-builds` html.
- * Conversion doesn't work well for sphinx output (not consistent enough).
+ * Conversion doesn't work well for sphinx output (not consistent enough), nor GitHub wiki pages.

 ## html2mallard

 Simple command line tool to convert a single .html file:

     html2mallard site/index.html > help/index.page

 Add a `-d`/`--debug` flag after the filename for details on the shortening process.

     html2mallard in.html --debug | xmllint - --recover > out.page

 With [xmllint](http://xmlsoft.org/xmllint.html) to fix some unmatched tags.

+Now also supports http:// urls for conversion:
+
+    html2mallard http://wiki/index.html > index.page
+
+And directly converting from markdown:
+
+    html2mallard index.md > index.page
+
+## API
+
+There's basically just one main function in html2mallard:
+
+    import html2mallard
+    page = html2mallard.convert(html_file_content, fn)
+
+The filename parameter is just used to deduce id and/or title from.
+
+As convenience method there is also `page = html2mallard.convert_file(fn)`, which would also automatically invoke `markdown` conversion given such an extension, or even resolve an url as parameter.
+
 ## mkdocs-mallard

 Converts a list of mkdocs output files to *.page files.

     mkdocs-mallard
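The filename handling mentioned in the new API section can be tried in isolation. The sketch below copies the two expressions that `convert()` in this check-in uses for the page id and the fallback title. The helper names `page_id` and `fallback_title` are made up for this illustration and are not part of html2mallard.

```python
import re

def page_id(fn):
    # Drop the directory part and the extension, then squash runs of
    # non-word characters into "_" and lowercase the result.
    return re.sub(r"\W+", "_", re.sub(r"^.+/|\.\w+$", "", fn)).lower()

def fallback_title(fn):
    # Only relevant when no title could be extracted from the page itself.
    return re.sub(r"^.+/|\.\w+$", "", fn).title()

print(page_id("site/index.html"))
print(page_id("docs/Getting Started.md"))
print(fallback_title("docs/usage.html"))
```

Because the id comes from the basename alone, two input files with the same name in different directories would end up with the same page id.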
︙			︙
@@ old lines 67-80, new lines 88-112 @@
           guess_lang: true
     plugins:
       - mkdocstrings

 Also depends on `use_directory_urls: false`, since the script only `glob()`s one level of `*.html` files.

+## Nav links
+
+Ensure the `index.page` contains a section like:
+
+    <section id="nav" style="2column">
+      <subtitle>Topics</subtitle>
+    </section>
+
+But not the recursive self-reference `<link type="guide" xref="index#nav"/>`.
+
 ## Adaption

 The first two `rewrite` rules likely require changes for other HTML sources or templates. Specifically `"^.+?</nav>"` should strip the initial boilerplate, else might need expansion. (Either in the `GENERAL HTML` or a new rewrite collection.)
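The "Nav links" requirement corresponds to a branch that this check-in adds to `convert()`: every page except `index` gets a guide link to `index#nav`, and the index page itself gets the `nav` section instead. A minimal sketch of that branch follows. The function name `add_nav` and the two constants are hypothetical, and the whitespace inside the snippets is a guess.

```python
GUIDE_LINK = '<link type="guide" xref="index#nav"/>\n'
NAV_SECTION = '<section id="nav">\n  <!--<title>Topics</title>-->\n</section>\n'

def add_nav(page_id, links, body):
    """Return (links, body) with the navigation pieces added."""
    if page_id != "index":
        # Ordinary pages point back at the index page's nav section.
        links = GUIDE_LINK + links
    else:
        # The index page hosts the section and must not link to itself.
        body = NAV_SECTION + body
    return links, body

links, body = add_nav("usage", "", "<p>text</p>")
print(links, body)
```

Keeping the two cases exclusive is what avoids the recursive self-reference the README warns about.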
︙			︙

@@ old lines 1-17, new lines 1-17 @@
 #!/usr/bin/env python3
 # api: cli
 # encoding: utf-8
 # type: transform
 # title: HTML to mallard
 # description: convert mkdocs´ html output to mallard/yelp xml
 # category: documentation
 # keywords: mkdocs mallard
-# version: 0.2.0
-# depends: python (>= 3.6), python:PyYAML (>= 5.0)
+# version: 0.3.0
+# depends: python (>= 3.6), python:PyYAML (>= 5.0), python:markdown
 # license: Public Domain
 # url: https://fossil.include-once.org/modseccfg/wiki/html2mallard
 #
 # Poor transformation approach, mostly salvaging some HTML structures
 # and reshuffling document body into mallard <page> with allowed
 # inline markup.
 #
︙			︙
@@ old lines 32-63, new lines 32-63 @@
 # output
 template = dedent("""
     <page
         xmlns="http://projectmallard.org/1.0/"
         type="guide"
         id="{id}">
         <info>
-            <link type="guide" xref="index#nav"/>
 {links}
             <desc>{desc}</desc>
             <?http header="X-Generator: html2mallard" ?>
         </info>
         <title>{title}</title>

     {body}

     </page>
-""").lstrip()
+""").strip()


 # regex all the way
 extract = {
     # meta info
     "mkdocs_page_name = \"(.*?)\";": "title",
-    "<title>(.+?)</title>": "title",
+    "<title>(?:\w+:\s)?(.+?)</title>": "title",
+    "<(?:h1|h2)>([^<]+)</(?:h1|h2)>": "title",
     '<meta name="description" content="(.+?)"[^>]*>': "desc",
     '<a class="reference internal" href="(\w+).html">.+?</a>': "links",
     '<a class="trail" href="(\w+).html(#.+?)?" title=".+?">': "links",
     # flags
     '(<.+>)': "is_html",
     '(mkdocs)': "is_mkdocs",
     'data-target="[#.]navbar-(collapse)"': "is_material",
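The effect of the revised title patterns is easiest to see on a small input. The sketch below compares the old and new `<title>` expressions and the added heading fallback. The sample strings and the helper `first_title` are invented for this illustration, and the entity and tag stripping that `convert()` applies to the captured text is left out.

```python
import re

OLD_TITLE = r"<title>(.+?)</title>"
NEW_TITLE = r"<title>(?:\w+:\s)?(.+?)</title>"   # skips a "project: " prefix
HEADING = r"<(?:h1|h2)>([^<]+)</(?:h1|h2)>"      # fallback for bare HTML

def first_title(html, patterns):
    # Return the first capture that any pattern yields, in list order.
    for rx in patterns:
        m = re.search(rx, html)
        if m:
            return m.group(1)
    return ""

page = "<title>modseccfg: html2mallard</title>"
print(first_title(page, [OLD_TITLE]))
print(first_title(page, [NEW_TITLE]))
print(first_title("<h1>Usage</h1><p>text</p>", [NEW_TITLE, HEADING]))
```

The heading fallback presumably matters most for markdown input, where the converted HTML fragment has no `<title>` element at all.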
︙			︙
@@ old lines 123-137, new lines 123-137 @@
         "<(?:h1|h2)[^>]*>(.+?)</(?:h1|h2)>": "<title>\\1</title>",
         "<(?:h3|h4)[^>]*>(.+?)</(?:h3|h4)>": "<subtitle>\\1</subtitle>",
         "<(?:h5|h6)[^>]*>(.+?)</(?:h5|h7)>": "<em>\\1</em>",
         "<strong>(.+?)</strong>": "<em style=\"strong\">\\1</em>",
         # lists
         "<ol[^>]*>(.+?)</ol>": "<steps>\\1</steps>",
         "<ul[^>]*>(.+?)</ul>": "<list>\\1</list>",
-        "<li[^>]*>(.+?)</li>": "<item><p>\\1</p></item>",
+        "<li\\b[^>]*>(.+?)</li>": "<item><p>\\1</p></item>",
         "<dl[^>]*>(.+?)</dl>": "<terms>\\1</terms>",
         "<dt[^>]*>(.+?)</dt>": "<item><title>\\1</title>",
         "<dd[^>]*>(.+?)</dd>": "<p>\\1</p></item>",
         # fix nested list \1 \2 \3 \4
         "(<(?:item|steps|terms)>)<p> ([^<]+(?<!\s)) \s <(list|steps|terms)> \s* (.+?) </\\3>":
             "\\1<p>\\2</p>\n <\\3>\n<item><p>\\4 </\\3>\n</item>",
         # links
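The single `\b` added to the `<li` rule guards against tags that merely start with "li", such as `<link>`. A small sketch of the difference, using the same flags as the rewrite loop. The `[^>]*` spelling is an assumption about the intended pattern, and the sample HTML is invented.

```python
import re

FLAGS = re.X | re.M | re.S | re.I
LOOSE = r"<li[^>]*>(.+?)</li>"       # rule before this check-in
STRICT = r"<li\b[^>]*>(.+?)</li>"    # rule after this check-in
REPL = r"<item><p>\1</p></item>"

html = '<link rel="x"/><ul><li>one</li></ul>'

# The loose rule starts matching at "<link" and swallows the <ul> opener.
loose = re.sub(LOOSE, REPL, html, flags=FLAGS)
# The strict rule needs a word boundary after "li", so "<link" is skipped.
strict = re.sub(STRICT, REPL, html, flags=FLAGS)

print(loose)
print(strict)
```

Mallard `<link>` elements produced by earlier rules would be exposed to the same mismatch, which may be what prompted the fix.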
︙			︙
@@ old lines 171-225, new lines 171-255 @@
         # strip lone </section>, empty spans
         "(<section>.+?</section>)|</section>": "\\1",
         "(<span[^>]*></span>)": "",
         "(<p[^>]*><p[^>]*>)(.+?)(</p></p>)": "<p>\\2</p>",
     }
 }


 def convert(html, fn):
+    """
+    Convert HTML to mallard page document.
+
+    Parameters
+    ----------
+    html : str
+        HTML page source (`<html>...`)
+    fn : str
+        Original filename (`index.html`)
+
+    Returns
+    -------
+    str
+        Converted mallard xml .page source
+    """
     # prepare snippets for .format kwargs
     kw = {
         "id": re.sub("\W+", "_", re.sub("^.+/|\.\w+$", "", fn)).lower(),
         "desc": "",
         "title": "",
         "body": "",
         "links": "",
     }
     for rx, name in extract.items():
         m = re.search(rx, html)
         if m and (not name in kw or not kw[name]):
             if name == "links":
                 kw[name] = ["".join(row) for row in re.findall(rx, html)]
             else:
                 kw[name] = re.sub("&\w+;|<.+?>", "", m.group(1))
     if kw["links"]:
         kw["links"] = indent("\n".join(f"<link type=\"guide\" xref=\"{id}\"/>" for id in kw["links"]), prefix=" ")
+    if kw["id"] != "index":
+        kw["links"] = """<link type="guide" xref="index#nav"/>\n""" + kw["links"]
+    if not kw["title"]:
+        kw["title"] = re.sub("^.+/|\.\w+$", "", fn).title()

     # simplify/convert html
     for (group, flag), patterns in rewrite.items():
         if not flag in kw:  # possibly skip rule group
             continue
         elif debug:
             sys.stderr.write(f"group: {group}\n")
         for rx, repl in patterns.items():
             l = len(html)
             html = re.sub(rx, repl, html, 0, re.X|re.M|re.S|re.I)
             if debug and l != len(html):
                 sys.stderr.write(f"rewrite: {len(html) - l} bytes, pattern: ~{rx}~\n")
     kw["body"] = html

     # return converted
+    if kw["id"] == "index":
+        kw["body"] = """<section id="nav">\n <!--<title>Topics</title>-->\n</section>\n""" + kw["body"]
     return template.format(**kw)


 # single file
 def convert_file(fn):
-    with open(fn, "r", encoding="utf-8") as f:   # → html2mallard "site/index.html"
-        return convert(f.read(), fn)
+    html = ""
+    if re.match("https?://.+", fn):   # → html2mallard http://page.html
+        import requests
+        html = requests.get(fn).text
+        fn = re.sub(".+/", "", fn)
+    else:                             # → html2mallard "site/index.html"
+        with open(fn, "r", encoding="utf-8") as f:
+            html = f.read()
+    if re.search("\.md$", fn):        # → html2mallard page.md
+        import markdown
+        html = markdown.markdown(html)
+    return convert(html, fn)

 # process directory
 def mkdocs():
     import yaml
     src = open("mkdocs.yml", "r")   # → ought to be in current directory
     cfg = yaml.load(src, Loader=yaml.Loader)
     srcdir = cfg["site_dir"]
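The input handling in the new `convert_file()` boils down to two independent checks: whether the argument is a URL, and whether its basename ends in `.md`. The sketch below reproduces only that decision logic, without network or file access. The function `plan` and its return tuple are hypothetical, and it assumes the `.md` check also applies to fetched URLs, which depends on indentation the flattened diff does not preserve.

```python
import re

def plan(fn):
    """Return (loader, name passed on to convert(), needs markdown step)."""
    if re.match(r"https?://.+", fn):
        loader = "http"
        # Only the basename of the URL is kept for id/title deduction.
        fn = re.sub(r".+/", "", fn)
    else:
        loader = "file"
    needs_markdown = bool(re.search(r"\.md$", fn))
    return loader, fn, needs_markdown

for arg in ("site/index.html", "index.md", "http://wiki/index.html"):
    print(arg, "->", plan(arg))
```

Local paths are passed on unchanged, since `convert()` strips the directory part itself when it derives the page id.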
︙			︙