Check-in [ada19bd287]
Overview
Comment: | html2mallard update: support direct .md conversion, and http:// url params, doc updates. |
---|---|
SHA3-256: | ada19bd287ce6efb72c815cc299114f7 |
User & Date: | mario 2021-03-26 12:13:35 |
Context
2021-03-29
21:23 | Performance fix for pyrewrite in range() check. check-in: 950ea0eb0c user: mario tags: trunk | |
2021-03-26
12:13 | html2mallard update: support direct .md conversion, and http:// url params, doc updates. check-in: ada19bd287 user: mario tags: trunk | |
2021-03-06
21:46 | Requires msc_pyparser >= 1.1 (for the CRS 3.2 line continuation issue in 901.conf) check-in: 6d5c87b143 user: mario tags: trunk | |
Changes
Changes to html2mallard/README.md.
︙ | ︙ | |||
* And API docs are least convertible (only tested mkdocstrings, source dump is omitted, and there's obviously no syntax colorization in yelp; alternatively try [mkgendocs](https://pypi.org/project/mkgendocs/)).
* Primarily designed for mkdocs' HTML output. But also contains some cleanup rules for [fossil](https://fossil-scm.org/) wiki pages (with [github](https://fossil.include-once.org/fossil-skins/wiki/GitHub) skin), and yelp-build's HTML.
* Conversion doesn't work well for sphinx output (not consistent enough), nor GitHub wiki pages.

## html2mallard

Simple command line tool to convert a single .html file:

    html2mallard site/index.html > help/index.page

Add a `-d`/`--debug` flag after the filename for details on the shortening process.

    html2mallard in.html --debug | xmllint - --recover > out.page

With [xmllint](http://xmlsoft.org/xmllint.html) to fix some unmatched tags.

Now also supports http:// URLs for conversion:

    html2mallard http://wiki/index.html > index.page

And directly converting from markdown:

    html2mallard index.md > index.page

## API

There's basically just one main function in html2mallard:

    import html2mallard
    page = html2mallard.convert(html_file_content, fn)

The filename parameter is only used to deduce the page id and/or title.
As a convenience method there is also `page = html2mallard.convert_file(fn)`, which automatically invokes `markdown` conversion for a `.md` extension, and can also resolve a URL given as parameter.

## mkdocs-mallard

Converts a list of mkdocs output files to *.page files.

    mkdocs-mallard
︙ | ︙ | |||
            guess_lang: true
    plugins:
      - mkdocstrings

Also depends on `use_directory_urls: false`, since the script only `glob()`s one level of `*.html` files.

## Nav links

Ensure the `index.page` contains a section like:

    <section id="nav" style="2column">
      <subtitle>Topics</subtitle>
    </section>

But not the recursive self-reference `<link type="guide" xref="index#nav"/>`.

## Adaption

The first two `rewrite` rules likely require changes for other HTML sources or templates. Specifically `"^.+?</nav>"` should strip the initial boilerplate, else might need expansion. (Either in the `GENERAL HTML` or a new rewrite collection.)
︙ | ︙ |
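The README states that the filename passed to `convert()` is only used to deduce the page id and title. A minimal sketch of that derivation follows, reusing the two regular expressions that appear in `convert()` in this check-in. The helper names `page_id` and `fallback_title` are hypothetical and not part of html2mallard.

```python
import re


def page_id(fn: str) -> str:
    """Hypothetical helper: derive a page id the way convert() builds kw["id"]."""
    # Drop leading directories and the trailing extension.
    stem = re.sub(r"^.+/|\.\w+$", "", fn)
    # Collapse each run of non-word characters into one underscore.
    return re.sub(r"\W+", "_", stem).lower()


def fallback_title(fn: str) -> str:
    """Hypothetical helper: title used when no <title> or heading was extracted."""
    return re.sub(r"^.+/|\.\w+$", "", fn).title()


if __name__ == "__main__":
    print(page_id("site/Getting-Started.html"))
    print(fallback_title("site/index.html"))
```

Ids derived this way only contain word characters, so a page named `Getting-Started.html` would be referenced as `getting_started` in `xref` attributes.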
Changes to html2mallard/html2mallard.py.
#!/usr/bin/env python3
# api: cli
# encoding: utf-8
# type: transform
# title: HTML to mallard
# description: convert mkdocs´ html output to mallard/yelp xml
# category: documentation
# keywords: mkdocs mallard
# version: 0.3.0
# depends: python (>= 3.6), python:PyYAML (>= 5.0), python:markdown
# license: Public Domain
# url: https://fossil.include-once.org/modseccfg/wiki/html2mallard
#
# Poor transformation approach, mostly salvaging some HTML structures
# and reshuffling document body into mallard <page> with allowed
# inline markup.
#
︙ | ︙ | |||
# output
template = dedent("""
    <page
        xmlns="http://projectmallard.org/1.0/"
        type="guide"
        id="{id}">
        <info>
            {links}
            <desc>{desc}</desc>
            <?http header="X-Generator: html2mallard" ?>
        </info>
        <title>{title}</title>
        {body}
    </page>
""").strip()

# regex all the way
extract = {
    # meta info
    "mkdocs_page_name = \"(.*?)\";": "title",
    "<title>(?:\w+:\s)?(.+?)</title>": "title",
    "<(?:h1|h2)>([^<]+)</(?:h1|h2)>": "title",
    '<meta name="description" content="(.+?)"[^>]*>': "desc",
    '<a class="reference internal" href="(\w+).html">.+?</a>': "links",
    '<a class="trail" href="(\w+).html(#.+?)?" title=".+?">': "links",
    # flags
    '(<.+>)': "is_html",
    '(mkdocs)': "is_mkdocs",
    'data-target="[#.]navbar-(collapse)"': "is_material",
︙ | ︙ | |||
        "<(?:h1|h2)[^>]*>(.+?)</(?:h1|h2)>": "<title>\\1</title>",
        "<(?:h3|h4)[^>]*>(.+?)</(?:h3|h4)>": "<subtitle>\\1</subtitle>",
        "<(?:h5|h6)[^>]*>(.+?)</(?:h5|h6)>": "<em>\\1</em>",
        "<strong>(.+?)</strong>": "<em style=\"strong\">\\1</em>",
        # lists
        "<ol[^>]*>(.+?)</ol>": "<steps>\\1</steps>",
        "<ul[^>]*>(.+?)</ul>": "<list>\\1</list>",
        "<li\\b[^>]*>(.+?)</li>": "<item><p>\\1</p></item>",
        "<dl[^>]*>(.+?)</dl>": "<terms>\\1</terms>",
        "<dt[^>]*>(.+?)</dt>": "<item><title>\\1</title>",
        "<dd[^>]*>(.+?)</dd>": "<p>\\1</p></item>",
        # fix nested list \1 \2 \3 \4
        "(<(?:item|steps|terms)>)<p> ([^<]+(?<!\s)) \s* <(list|steps|terms)> \s* (.+?) </\\3>": "\\1<p>\\2</p>\n <\\3>\n<item><p>\\4 </\\3>\n</item>",
        # links
︙ | ︙ | |||
        # strip lone </section>, empty spans
        "(<section>.+?</section>)|</section>": "\\1",
        "(<span[^>]*></span>)": "",
        "(<p[^>]*><p[^>]*>)(.+?)(</p></p>)": "<p>\\2</p>",
    }
}


def convert(html, fn):
    """
        Convert HTML to mallard page document.

        Parameters
        ----------
        html : str
            HTML page source (`<html>...`)
        fn : str
            Original filename (`index.html`)

        Returns
        -------
        str
            Converted mallard xml .page source
    """

    # prepare snippets for .format kwargs
    kw = {
        "id": re.sub("\W+", "_", re.sub("^.+/|\.\w+$", "", fn)).lower(),
        "desc": "",
        "title": "",
        "body": "",
        "links": "",
    }
    for rx, name in extract.items():
        m = re.search(rx, html)
        if m and (not name in kw or not kw[name]):
            if name == "links":
                kw[name] = ["".join(row) for row in re.findall(rx, html)]
            else:
                kw[name] = re.sub("&\w+;|<.+?>", "", m.group(1))
    if kw["links"]:
        kw["links"] = indent("\n".join(f"<link type=\"guide\" xref=\"{id}\"/>" for id in kw["links"]), prefix=" ")
    if kw["id"] != "index":
        kw["links"] = """<link type="guide" xref="index#nav"/>\n""" + kw["links"]
    if not kw["title"]:
        kw["title"] = re.sub("^.+/|\.\w+$", "", fn).title()

    # simplify/convert html
    for (group, flag), patterns in rewrite.items():
        if not flag in kw:  # possibly skip rule group
            continue
        elif debug:
            sys.stderr.write(f"group: {group}\n")
        for rx, repl in patterns.items():
            l = len(html)
            html = re.sub(rx, repl, html, 0, re.X|re.M|re.S|re.I)
            if debug and l != len(html):
                sys.stderr.write(f"rewrite: {len(html) - l} bytes, pattern: ~{rx}~\n")
    kw["body"] = html

    # return converted
    if kw["id"] == "index":
        kw["body"] = """<section id="nav">\n <!--<title>Topics</title>-->\n</section>\n""" + kw["body"]
    return template.format(**kw)

# single file
def convert_file(fn):
    html = ""
    if re.match("https?://.+", fn):  # → html2mallard http://page.html
        import requests
        html = requests.get(fn).text
        fn = re.sub(".+/", "", fn)
    else:  # → html2mallard "site/index.html"
        with open(fn, "r", encoding="utf-8") as f:
            html = f.read()
    if re.search("\.md$", fn):  # → html2mallard page.md
        import markdown
        html = markdown.markdown(html)
    return convert(html, fn)

# process directory
def mkdocs():
    import yaml
    src = open("mkdocs.yml", "r")  # → ought to be in current directory
    cfg = yaml.load(src, Loader=yaml.Loader)
    srcdir = cfg["site_dir"]
︙ | ︙ |
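The core of `convert()` is a loop that applies each regex rule in turn with `re.sub`. The sketch below runs four rules copied from the `rewrite` table in this check-in over a small HTML fragment, using the same flags as the diff. `RULES` and `apply_rules` are hypothetical names for illustration; the real table is grouped by flag and is much longer.

```python
import re

# Small subset of the rewrite rules shown in the diff (pattern -> replacement).
RULES = {
    "<(?:h1|h2)[^>]*>(.+?)</(?:h1|h2)>": "<title>\\1</title>",
    "<strong>(.+?)</strong>": "<em style=\"strong\">\\1</em>",
    "<ul[^>]*>(.+?)</ul>": "<list>\\1</list>",
    "<li\\b[^>]*>(.+?)</li>": "<item><p>\\1</p></item>",
}


def apply_rules(html: str, rules: dict = RULES) -> str:
    """Apply every rule in insertion order, as the loop in convert() does."""
    for rx, repl in rules.items():
        html = re.sub(rx, repl, html, count=0, flags=re.X | re.M | re.S | re.I)
    return html


if __name__ == "__main__":
    sample = '<h1 id="top">Intro</h1><ul><li>One <strong>bold</strong></li><li>Two</li></ul>'
    print(apply_rules(sample))
```

Rule order matters here: `<ul>` has already become `<list>` by the time the `<li` rule runs, and the `\b` in that pattern keeps `<li` from matching the freshly generated `<list>` tag.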
Changes to html2mallard/man/html2mallard.1.
.\" Automatically generated by Pandoc 2.5
.\"
.TH "html2mallard" "1" "" "from modseccfg" "Version 0.3"
.hy
.SH NAME
.PP
\f[B]html2mallard\f[R] \[em] rough help conversion tool
.SH SYNOPSIS
.PP
\f[B]html2mallard\f[R] [ \f[I]input.html\f[R] | \f[I]input.md\f[R] |
\f[I]http://example.com/input.html\f[R] ] \f[B]>\f[R]
\f[I]output.page\f[R]
.PP
\f[B]html2mallard\f[R] \f[I]input.html\f[R]
[\f[B]\-D\f[R]|\f[B]\-d\f[R]|\f[B]\-\-debug\f[R]] | \f[B]xmllint\f[R]
\f[I]\-\f[R] \f[B]\-\-recover\f[R]
.PP
\f[B]mkdocs\-mallard\f[R]
.SH DESCRIPTION
.PP
Provides a rough conversion between mkdocs\-generated HTML and
mallard/yelp files.
Also accepts \f[I]*.md\f[R] input files (converted per
markdown.markdown), or even remote *.html files (per requests).
.PP
Whereas \f[B]mkdocs\-mallard\f[R] reads a bunch of files from the
\f[I]site_dir\f[R] defined in \f[I]mkdocs.yml\f[R], and writes them to
\f[I]mallard_dir\f[R].
.SH PURPOSE
.PP
Poor transformation approach, mostly salvaging some HTML structures and
reshuffling document body into mallard <page> with allowed inline
markup.
.IP \[bu] 2
XSLT might have been easier, but doesn\[cq]t work on most HTML.
.IP \[bu] 2
BS/lxml is way overkill for this task (hence zero such tools).
.IP \[bu] 2
No one\[cq]s doing a markdown to ducktype/mallard converter either.
.PP
Generated pages often require some post\-editing, such as removing
duplicate \f[B]<title>s\f[R] or empty \f[B]<section>s\f[R], or adding a
\f[B]<desc>\f[R].
Mallard help also requires an \f[I]index.page\f[R], ideally with a
\f[B]<section id=\[lq]nav\[rq]>\f[R], so other pages automatically link
there.
(The index.page itself should not carry the \f[B]<link
type=\[lq]guide\[rq] xref=\[lq]index#nav\[rq]/>\f[R], as that would be
recursive.)
.SH SEE ALSO
.PP
\f[B]https://pypi.org/project/html2mallard/\f[R], \f[B]xmllint\f[R](1)
Changes to html2mallard/man/html2mallard.md.
% html2mallard(1) from modseccfg | Version 0.3

NAME
====

**html2mallard** — rough help conversion tool

SYNOPSIS
========

**html2mallard** \[ *input.html* | *input.md* | *http://example.com/input.html* ] **>** *output.page*

**html2mallard** *input.html* \[**-D**|**-d**|**\-\-debug**\] \| **xmllint** *\-* **\-\-recover**

**mkdocs-mallard**

DESCRIPTION
===========

Provides a rough conversion between mkdocs-generated HTML and mallard/yelp files.
Also accepts *\*.md* input files (converted per markdown.markdown), or even remote
\*.html files (per requests).

Whereas **mkdocs-mallard** reads a bunch of files from the *site_dir* defined in
*mkdocs.yml*, and writes them to *mallard_dir*.

PURPOSE
=======

Poor transformation approach, mostly salvaging some HTML structures
and reshuffling document body into mallard \<page> with allowed
inline markup.

 * XSLT might have been easier, but doesn't work on most HTML.
 * BS/lxml is way overkill for this task (hence zero such tools).
 * No one's doing a markdown to ducktype/mallard converter either.

Generated pages often require some post-editing, such as removing duplicate
**\<title>s** or empty **\<section>s**, or adding a **\<desc>**.
Mallard help also requires an *index.page*, ideally with a **\<section id="nav">**,
so other pages automatically link there. (The index.page itself should not carry
the **\<link type="guide" xref="index#nav"/>**, as that would be recursive.)

SEE ALSO
========

**https://pypi.org/project/html2mallard/**, **xmllint**(1)
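The note about `index#nav` corresponds to two small branches in `convert()`: every page except the index gets a guide link to `index#nav`, and the index page gets an empty `nav` section prepended to its body. A minimal sketch under those assumptions follows. `guide_links` and `with_nav` are hypothetical helpers, the indentation that the real code applies to extracted links is omitted, and adding the index link even when no other links were extracted is an assumption, since the flattened diff does not show the nesting.

```python
def guide_links(page_id: str, xrefs: list) -> str:
    """Hypothetical helper: build the <link> lines for a page's <info> block."""
    lines = [f'<link type="guide" xref="{x}"/>' for x in xrefs]
    # The index page must not link to its own nav section (that would be recursive).
    if page_id != "index":
        lines.insert(0, '<link type="guide" xref="index#nav"/>')
    return "\n".join(lines)


def with_nav(page_id: str, body: str) -> str:
    """Hypothetical helper: prepend the nav section to the index page body only."""
    if page_id == "index":
        return '<section id="nav">\n <!--<title>Topics</title>-->\n</section>\n' + body
    return body


if __name__ == "__main__":
    print(guide_links("usage", ["install"]))
    print(with_nav("index", "<p>Welcome</p>"))
```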