Artifact b6ddd8d5a8c53fe080ff80fa0c6a4dc5c93e9917:

Wiki page [AutoupdateRegex] by mario 2014-10-12 04:25:19.
D 2014-10-12T04:25:19.625
L AutoupdateRegex
N text/x-markdown
P c24bbc1b08d86e26959b70fc6a3a9413cee9cd5d
U mario
W 5495
<img src="http://freshcode.club/img/drchangelog.png" align=right height=150 width=150 alt=birdy>
The [Autoupdate](wiki/Autoupdate) "regex" module is the most versatile one for collecting release information from project pages.  Besides **RegExp** matching (for text sources), it also supports **XPath** and **jQuery**-style selectors, which ease scraping of HTML project websites.

See also <a href="http://freshcode.club/drchangelog">Dr. Changelog</a> to try it out.

### Field Rules

It is configured in the *Autoupdate Rules/Regex* project field, which expects a list of `key = ...` entries. Each key can list a URL and one or more RegExp, XPath or jQuery expressions.

    version = http://example.com/download.html
    version = /(\d+\.\d+(\.\d+)+)/

    changes = http://example.com/news.html
    changes = $("#main .release div.current")
    changes = /Summary:\s*(.+?)\R\R/smix

    scope = ~((minor|major) (bugfix|cleanup|security))~
    state = ~(stable|beta|prerelease)~i
    download = $("a.download").attr("href")

It does not update general project descriptions; it only fills `version=` and `changes=`, and optionally `scope=`, `state=` and `download=`.

  *  URLs should precede the extraction expressions.
  *  For regex rules the first capture group `(..)` will be used as the result.
  *  All regex flags `/Umixus` are allowed, and a special `/*` match-all flag is provided.
  *  Use line breaks to separate rule assignments. Comments in between will effectively be ignored.
  *  XPath expressions take a form such as `changes = (//ul)[1]/li`
  *  jQuery-style selectors can chain multiple selector functions, as in `$("div").find("#first")`.
  *  Field/key names may be prefixed with `$` or `%` as in `$version = /([\d.]+)/`.
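
To make the rule format concrete, here is a minimal Python sketch of how such a `key = ...` list could be split into ordered rules. It is not the module's actual parser: the names `parse_rules` and `classify` are hypothetical, the type detection is a crude guess based only on the examples on this page, and multi-line `/x` regexes are not handled.

```python
import re

# Optional "$" or "%" prefix, then the key, "=", and the rule value.
RULE_LINE = re.compile(r"^\s*[$%]?(\w+)\s*=\s*(.+?)\s*$")

def classify(value):
    """Crude guess at the rule type (an assumption, not the module's logic)."""
    if re.match(r"https?://", value):
        return "url"
    if value.startswith("$("):
        return "jquery"
    if value[0] in "/~" and not value.startswith("//"):
        return "regex"
    return "xpath"

def parse_rules(text):
    """Return ordered (key, type, value) tuples; other lines are skipped."""
    rules = []
    for line in text.splitlines():
        m = RULE_LINE.match(line)
        if m:
            rules.append((m.group(1), classify(m.group(2)), m.group(2)))
    return rules

rules = parse_rules(r"""
    version = http://example.com/download.html
    $version = /(\d+\.\d+(\.\d+)+)/
    # lines like this one are skipped
    changes = (//ul)[1]/li
    download = $("a.download").attr("href")
""")

for key, kind, value in rules:
    print(key, kind, value)
```

Keeping the rules as an ordered list rather than a dictionary matters here, because the same key may appear several times and the order decides which URL an expression applies to.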

### URL sources

Initially the primary *Autoupdate URL* is used as the source for extraction, which is equivalent to listing a URL for `version =`. Each subsequent field extraction reuses the most recently retrieved page. Like-named URL entries in *[Other URLs](wiki/Other+URLs)* are also recognized.
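
The page reuse can be pictured with a small Python sketch. Everything in it is an assumption for illustration: `run_rules` and the injected `fetch` callable are hypothetical names, rule values are plain Python regexes without delimiters, and resetting a field when a new URL is listed is a guess at sensible behaviour, not documented.

```python
import re

def run_rules(rules, fetch, autoupdate_url):
    """rules: ordered (key, value) pairs; fetch: callable mapping URL to text."""
    page = fetch(autoupdate_url)          # primary Autoupdate URL is the first source
    fields = {}
    for key, value in rules:
        if value.startswith(("http://", "https://")):
            page = fetch(value)           # later rules read from this page
            fields.pop(key, None)         # assumption: a new source resets the field
        else:
            source = fields.get(key, page)
            m = re.search(value, source)
            fields[key] = m.group(1) if m else ""
    return fields

# Canned pages stand in for real HTTP requests.
pages = {
    "http://example.com/": "Foo 1.2.3 released",
    "http://example.com/news.html": "Summary: faster startup\n\nStatus: stable",
}
fetched = []

def fake_fetch(url):
    fetched.append(url)
    return pages[url]

fields = run_rules(
    [
        ("version", r"(\d+\.\d+\.\d+)"),
        ("changes", "http://example.com/news.html"),
        ("changes", r"Summary:\s*(.+)"),
        ("state", r"(stable|beta)"),      # no URL of its own: reuses news.html
    ],
    fake_fetch,
    "http://example.com/",
)
print(fields)
print(fetched)
```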

### Regex multi-match /* flag

There's a special regex flag `/*` for a `preg_match_all` mode. It's used by the listing for the Linux kernel (which is a git log) for instance:

    changes = /^Date:.+\R\R\s+(.+)\s+[ ]commit/m*

Here multiple occurrences will be found, and merged into a changelog list. (So it's somewhat like the `/g` flag in JavaScript.)
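
For comparison, the same match-all idea looks like this in Python, where `re.findall` plays the role of `preg_match_all`. This is only an illustration: the log text is made up, the pattern is simplified from the rule above, and since Python's `re` has no `\R`, an explicit alternation stands in for it.

```python
import re

LOG = """commit 1a2b3c
Author: Jane Doe <jane@example.org>
Date:   Sun Oct 12 04:25:19 2014 +0200

    Fix buffer overflow in parser

commit 4d5e6f
Author: John Roe <john@example.org>
Date:   Sat Oct 11 18:02:44 2014 +0200

    Add --verbose option
"""

NL = r"(?:\r\n|\n|\r)"   # stand-in for PCRE's \R
pattern = re.compile(r"^Date:.+" + NL + NL + r"\s+(.+)", re.M)

# findall returns the first capture group of every match, in order.
changes = pattern.findall(LOG)
print("\n".join(changes))
```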

### Slicing

Often it's simpler to narrow down the extraction area step by step, however. Repeating `key=/regex/` specifiers is useful for that:

    changes = /Changelog(.+?)\Z/s
    changes = /(.+)---/

It's sometimes sensible to mix XPath/jQuery extractions first and a regex thereafter to cut out the actual result:

    version = $("article h4")
    version = ~Version ([\d.]+)~

Matching rules thus iteratively isolate the field to be populated.
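
The iterative narrowing can be sketched in a few lines of Python. The helper name `slice_field`, the sample page and the two patterns are illustrative assumptions, adapted for Python's `re` rather than copied from the rules above.

```python
import re

def slice_field(text, patterns):
    """Apply each (pattern, flags) in turn; group 1 of a match feeds the next."""
    for pattern, flags in patterns:
        m = re.search(pattern, text, flags)
        if m is None:
            return None
        text = m.group(1)
    return text

page = "Intro\nChangelog\n* fixed crash\n* new icons\n---\nLicense: MIT\n"

result = slice_field(page, [
    (r"Changelog\n(.+)", re.S),   # step 1: everything after the heading
    (r"(.+?)\n---", re.S),        # step 2: cut off at the separator line
])
print(result)
```

Each step only has to solve a small part of the problem, which usually keeps the individual expressions short and less brittle than one large pattern.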

### jQuery-style selector chaining

Often it suffices to call the main `$()` CSS selector function. One could again use multiple slicing rules, but many jQuery-style subfunctions can also be chained in one line:

    changes = $(".article .first").next().find("li")

XPath and jQuery rule assignments can only be single-line directives. (Unlike RegExps with the `/x` flag, which can wrap around line breaks.)

### References

See [regular-expressions.info](http://www.regular-expressions.info/) for a simple RegExp introduction. Otherwise check out [jQ & CSS selectors](http://standardista.com/jquery/) and the [w3.org spec](http://www.w3.org/TR/CSS2/selector.html) or [jQuery pseudo selectors](http://api.jquery.com/category/selectors/) for CSS selectors. And the [XPath / Selenium cheat sheet](https://www.simple-talk.com/dotnet/.net-framework/xpath,-css,-dom-and-selenium-the-rosetta-stone/) or an [XPath/Regex overview](http://xpath.alephzarro.com/content/cheatsheet.html) for XPath examples.

### Examples Regex

If you use semantic versioning, then you can keep the `\d+.\d+.\d+` version= field. To allow for `-beta` or `-dev.2` suffixes as well:

    version = /(\d+\.\d+(\.\d+)+(-\w+(?:\.\w+)*)*)/
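
A quick way to sanity-check such a pattern is to run it against a few sample strings. The sketch below uses Python's `re` instead of PCRE and a form of the pattern with balanced parentheses, where the outer group captures the whole version string. The helper name `find_version` and the sample inputs are made up for illustration.

```python
import re

# Outer group = full version; the optional tail allows "-beta" or "-dev.2".
VERSION = re.compile(r"(\d+\.\d+(\.\d+)+(-\w+(?:\.\w+)*)*)")

def find_version(text):
    """Return the first version-like string in text, or None."""
    m = VERSION.search(text)
    return m.group(1) if m else None

for sample in [
    "Download foo-1.4.2.tar.gz",
    "Release 2.0.0-beta",
    "v3.1.4-dev.2 is out",
    "only 1.2 here",
]:
    print(sample, "->", find_version(sample))
```

Note that the pattern requires at least three numeric components, so a two-part version such as `1.2` is not matched.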

You can of course precede this regex with more concrete context matches. If for example you were to use metadata comments:

    version = ~ ^\h* [/#*]+ \h*version:\h*  (\d+(?:\.\d+)+[-.\w]+) ~mix

Extracting a Changelog summary is more difficult. If you want to eschew manual release submissions on *freshcode.club* you may wish to adopt a coherent README or CHANGELOG scheme.

For example, I use a `history\n------\n` marker in the README, where it's easy to match the pre-summarized changes:

    changes = /history\R-----+\R+[\d.]+\R(.+?)\R\R/s

The `\R` is a linebreak placeholder (all CR, LF, CRLF variants), and `\R\R` hence an empty line.

Incidentally, for the `changes` field any `-`, `#` or `*` at the start of lines gets stripped.
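
Put together, the extraction and the marker stripping might look roughly like this in Python. It is a sketch under assumptions: the README text is invented, an alternation stands in for `\R` (which Python's `re` lacks), and the stripping regex is a guess at the described behaviour rather than the module's code.

```python
import re

NL = r"(?:\r\n|\n|\r)"   # stand-in for PCRE's \R

# history, dashed underline, version line, then the summary up to an empty line.
HISTORY = re.compile(
    r"history" + NL + r"-----+" + NL + r"+[\d.]+" + NL + r"(.+?)" + NL + NL,
    re.S,
)

README = (
    "MyTool\n======\n\n"
    "history\n-------\n"
    "1.2.0\n- fixed crash on start\n* faster search\n\n"
    "1.1.0\n- first release\n\n"
)

m = HISTORY.search(README)
raw = m.group(1) if m else ""

# Drop leading "-", "#" or "*" list markers on each line (assumed behaviour).
changes = re.sub(r"^[ \t]*[-#*]+[ \t]*", "", raw, flags=re.M)
print(changes)
```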

You still ought to keep the changelog in an end-user approachable writing style.

### hidden releases

If you can't uncover a suitable source for `$changes=` then your automated release submission will be classified as *hidden*. Thus the project entry will stay current, but no frontpage listing (or notification) will occur.

The regex module will also likely be rate limited, so it won't rescan your website daily.

### interval= rule

All Autoupdate modules additionally support the `interval = 7` rule, where the number specifies the minimum number of days before any new release lookup is attempted.
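
In other words, the rule acts as a simple time gate. A minimal Python sketch of that check, with the hypothetical helper name `lookup_due` and made-up timestamps, could be:

```python
from datetime import datetime, timedelta

def lookup_due(last_lookup, interval_days, now):
    """True once at least interval_days have passed since the last lookup."""
    return now - last_lookup >= timedelta(days=interval_days)

last = datetime(2014, 10, 12, 4, 25)
print(lookup_due(last, 7, datetime(2014, 10, 15)))   # three days later: not due
print(lookup_due(last, 7, datetime(2014, 10, 20)))   # eight days later: due
```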

Z e6343c9428edbadb68a87b3f95ff013e