

Update of "AutoupdateRegex"


Overview

Artifact ID: b6ddd8d5a8c53fe080ff80fa0c6a4dc5c93e9917
Page Name: AutoupdateRegex
Date: 2014-10-12 04:25:19
Original User: mario
Mimetype: text/x-markdown
Parent: c24bbc1b08d86e26959b70fc6a3a9413cee9cd5d
Content

The Autoupdate "regex" module is the most versatile option for collecting release information from project pages. Besides RegExp matching (for text sources), it now also supports XPath and jQuery-style selectors, which ease scraping of HTML project websites.

See also Dr. Changelog to try it out.

Field Rules

It is configured in the Autoupdate Rules/Regex project field, which expects a list of key = ... entries. Each key can list a URL and one or more RegExp, XPath or jQuery expressions.

version = http://example.com/download.html
version = /(\d+\.\d+(\.\d+)+)/

changes = http://example.com/news.html
changes = $("#main .release div.current")
changes = /Summary:\s*(.+?)\R\R/smix

scope = ~((minor|major) (bugfix|cleanup|security))~
state = ~(stable|beta|prerelease)~i
download = $("a.download").attr("href")

It does not update general project descriptions; it only sets version= and changes=, and optionally scope=, state= and download=.

  • URLs should precede the extraction expressions.
  • For regex rules, the first capture group (..) is used as the result.
  • All regex flags /Umixus are allowed, and a special /* match-all flag is provided.
  • Use line breaks to separate rule assignments. Comments in between are effectively ignored.
  • XPath expressions take a form such as changes = (//ul)[1]/li
  • jQuery-style selectors can chain multiple selector functions, as in $("div").find("#first").
  • Field/key names may be prefixed with $ or %, as in $version = /([\d.]+)/.
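To make the rule format more tangible, here is a rough Python sketch of how such a list could be read. It is not the module's actual parser: the function names `parse_rules` and `classify` are invented for illustration, and the type detection is a guess based purely on the first characters of each value.

```python
import re

def parse_rules(text):
    """Group 'key = value' lines by key, keeping the order of the values."""
    rules = {}
    for line in text.splitlines():
        # Only lines shaped like "key = value" count; an optional $ or %
        # prefix is dropped. Anything else (e.g. comments) is skipped.
        m = re.match(r"\s*[$%]?(\w+)\s*=\s*(.+?)\s*$", line)
        if not m:
            continue
        key, value = m.group(1), m.group(2)
        rules.setdefault(key, []).append((classify(value), value))
    return rules

def classify(value):
    """Guess the rule type from its first characters (heuristic assumption)."""
    if re.match(r"https?://", value):
        return "url"
    if value.startswith("$("):
        return "jquery"
    # A single leading / or ~ is taken as a regex delimiter; "//" is left
    # to XPath. An absolute XPath like /html/body would be misread here.
    if value[0] in "/~" and not value.startswith("//"):
        return "regex"
    return "xpath"
```

Fed the example rules from above, `parse_rules` would return something like `{"version": [("url", ...), ("regex", ...)], ...}`, which keeps the "URL first, expressions after" order intact per key.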

URL sources

Initially the primary Autoupdate URL is used as the source for extraction. That's equivalent to listing a URL for version =. Each subsequent field extraction reuses the most recently retrieved page. Like-named URL entries in Other URLs are also recognized.
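The source selection can be pictured as a small loop over the rules. The following Python sketch only mirrors the behaviour described here; `resolve_sources` and the `(key, value)` tuple layout are hypothetical, nothing is actually downloaded, and the Other URLs lookup is left out.

```python
def resolve_sources(rules, primary_url):
    """Pair each extraction rule with the URL whose page it would run on.

    `rules` is an ordered list of (key, value) tuples (assumed layout).
    """
    current = primary_url  # the Autoupdate URL is the initial source
    plan = []
    for key, value in rules:
        if value.startswith(("http://", "https://")):
            current = value  # a URL entry switches the source page
        else:
            plan.append((key, value, current))  # reuse the last page
    return plan
```

With a `changes = http://example.com/news.html` entry in the middle, every later field (such as `state`) would be extracted from the news page rather than from the primary URL.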

Regex multi-match /* flag

There's a special regex flag /* for a preg_match_all mode. It's used by the listing for the Linux kernel (which is a git log) for instance:

changes = /^Date:.+\R\R\s+(.+)\s+[ ]commit/m*

Here multiple occurrences are found and merged into a changelog list. (So it's somewhat like the /g flag in JavaScript.)
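The difference between a single match and the match-all mode can be tried out in Python, where `re.findall` plays a role comparable to preg_match_all. This is only an analogy: the log excerpt is made-up sample data, the pattern is simplified, and since Python's `re` has no \R, a plain \n\n stands in for the blank line.

```python
import re

# Hypothetical excerpt in `git log` format (sample data, not real history).
LOG = "\n".join([
    "commit 1111111",
    "Author: A <a@example.org>",
    "Date:   Mon Oct 6 10:00:00 2014",
    "",
    "    Fix buffer handling in the foo driver",
    "",
    "commit 2222222",
    "Author: B <b@example.org>",
    "Date:   Sun Oct 5 09:00:00 2014",
    "",
    "    Add support for the bar device",
])

# Capture the subject line that follows each "Date:" header.
SUBJECT = re.compile(r"^Date:.+\n\n\s+(.+)$", re.MULTILINE)

def first_match(text):
    """Default behaviour: only the first capture is returned."""
    m = SUBJECT.search(text)
    return m.group(1) if m else None

def all_matches(text):
    """Match-all behaviour: every capture, in document order."""
    return SUBJECT.findall(text)
```

`first_match(LOG)` yields only the newest subject line, while `all_matches(LOG)` returns both, which is the list that would end up as the changelog.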

Slicing

Often it's simpler to narrow down the extraction area step by step, however. Repeating key=/regex/ specifiers is therefore often useful:

changes = /Changelog(.+?)\Z/s
changes = /(.+)---/

It's sometimes sensible to run an XPath/jQuery extraction first and a regex afterwards to cut out the actual result:

version = $("article h4")
version = ~Version ([\d.]+)~

Matching rules thus iteratively isolate the field to be populated.
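The iterative narrowing can be sketched in a few lines of Python. This is an illustration of the idea rather than the module's code: `slice_text`, the sample text and the three patterns are invented for this example, and only regex steps are covered.

```python
import re

def slice_text(text, steps):
    """Apply each (pattern, flags) step to the previous step's first capture."""
    for pattern, flags in steps:
        m = re.search(pattern, text, flags)
        if m is None:
            return None  # a failed step leaves the field unpopulated
        text = m.group(1)
    return text

# Made-up page content for demonstration purposes.
SAMPLE = "Intro\nChangelog\nVersion 2.1.3 released --- older entries follow\n"

STEPS = [
    (r"Changelog(.+?)\Z", re.DOTALL),  # everything after the heading
    (r"(.+)---", 0),                   # cut off at the separator
    (r"Version ([\d.]+)", 0),          # isolate the number itself
]
```

Calling `slice_text(SAMPLE, STEPS)` walks from the whole page down to the bare version string; each pattern stays simple because it only has to deal with what the previous one left over.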

jQuery-style selector chaining

Often it suffices to call the main $() CSS selector function. One could again use multiple slicing rules, but many jQuery-style subfunctions can also be chained in one line:

changes = $(".article .first").next().find("li")

XPath and jQuery rule assignments can only be single-line directives, unlike RegExps with the /x flag, which can wrap across line breaks.
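To show what such a one-line chain consists of, the following Python sketch splits a rule into its individual calls. It is a toy tokenizer under simplifying assumptions (double-quoted arguments only, at most one argument per call); `parse_chain` is a hypothetical name, and no selector is actually evaluated.

```python
import re

# One call: either the leading "$" or ".name", then an optional "quoted" argument.
CALL = re.compile(r'(\$|\.\w+)\(\s*(?:"([^"]*)")?\s*\)')

def parse_chain(rule):
    """Split a jQuery-style rule into (function, argument) pairs."""
    calls = []
    pos = 0
    for m in CALL.finditer(rule):
        if m.start() != pos:
            raise ValueError("unexpected text at offset %d" % pos)
        calls.append((m.group(1).lstrip("."), m.group(2)))
        pos = m.end()
    # The chain must start with $() and cover the whole line.
    if pos != len(rule) or not calls or calls[0][0] != "$":
        raise ValueError("not a complete $() chain")
    return calls
```

For the rule above this gives the `$` selector call followed by `next` (without argument) and `find` with `"li"`, i.e. the same steps that would otherwise need separate slicing rules.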

References

See regular-expressions.info for a simple RegExp introduction. For CSS selectors, check out jQ & CSS selectors, the w3.org spec, or jQuery pseudo selectors. For XPath examples, see the XPath / Selenium cheat sheet or an XPath/Regex overview.

Examples Regex

If you use semantic versioning, you can keep the \d+.\d+.\d+ pattern for the version= field. To also allow -beta or -dev.2 suffixes:

version = /(\d+\.\d+(\.\d+)+(-\w+(?:\.\w+)*)*)/
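A pattern like this is easy to check against a few sample strings before putting it into the rules field. The Python snippet below uses a balanced form of the semantic-version pattern, which Python's `re` accepts unchanged; `find_version` and the sample strings are made up for the test.

```python
import re

# Three or more numeric parts, optionally followed by -suffix(.part)* groups.
VERSION = re.compile(r"(\d+\.\d+(\.\d+)+(-\w+(?:\.\w+)*)*)")

def find_version(text):
    """Return the first capture group, as the regex rules do, or None."""
    m = VERSION.search(text)
    return m.group(1) if m else None
```

`find_version("tag v1.2.3-dev.2 pushed")` picks up the full `1.2.3-dev.2`, whereas a two-part `1.2` is not matched at all. One caveat worth knowing: on a file name like `foo-1.4.2-rc1.tar.gz` the suffix part also swallows `.tar.gz`, so file listings may need a preceding slicing rule.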

You can of course precede this regex with more concrete context matches. If, for example, you were to use meta data comments:

version = ~ ^\h* [/#*]+ \h*version:\h*  (\d+(?:\.\d+)+[-.\w]+) ~mix

Extracting a Changelog summary is more difficult. If you want to eschew manual release submissions on freshcode.club you may wish to adopt a coherent README or CHANGELOG scheme.

For example I use a history\n------\n marker in the README, where it's easy to match the pre-summarized changes:

changes = /history\R-----+\R+[\d.]+\R(.+?)\R\R/s

The \R is a linebreak placeholder (all CR, LF, CRLF variants), and \R\R hence an empty line.

For the changes field any - or # and * at the start of lines get stripped, btw.
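The history-marker extraction and the bullet stripping can be approximated in Python as follows. Treat it as a sketch under assumptions: Python's `re` lacks \R, so a CR/LF alternation stands in for it (not an exact equivalent); the precise stripping rules of the site are not documented here, so the cleanup regex is a guess; and the README content is invented.

```python
import re

NL = r"(?:\r\n|\r|\n)"  # rough stand-in for PCRE's \R

HISTORY = re.compile(
    rf"history{NL}-----+{NL}+[\d.]+{NL}(.+?){NL}{NL}", re.DOTALL
)

def extract_changes(readme):
    """Return the first changelog block below the marker, bullets removed."""
    m = HISTORY.search(readme)
    if m is None:
        return None
    # Assumed cleanup: drop leading -, # or * markers on every line.
    return re.sub(r"^[ \t]*[-#*]+[ \t]*", "", m.group(1), flags=re.MULTILINE)

# Hypothetical README following the history/------ convention.
README = "\n".join([
    "mytool", "",
    "history", "-------", "",
    "0.9.1",
    "- fixed crash on empty input",
    "* added --verbose flag",
    "",
    "0.9.0",
    "- initial release",
    "",
])
```

`extract_changes(README)` returns just the two lines for 0.9.1, without their list markers; the `--verbose` in the middle of a line is left alone because only line starts are cleaned.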

You still ought to keep the changelog in an end-user approachable writing style.

Hidden releases

If you can't uncover a suitable source for $changes= then your automated release submission will be classified as hidden. Thus the project entry will stay current, but no frontpage listing (or notification) will occur.

The regex module will also likely be rate-limited, so it won't rescan your website daily.

interval= rule

All Autoupdate modules additionally support the interval = 7 rule; the number specifies the minimum number of days that must pass before a new release lookup is attempted.
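In effect the interval acts as a simple time gate. A minimal Python sketch of that check, with `lookup_due` and its parameters being hypothetical names rather than anything from the actual implementation:

```python
from datetime import datetime, timedelta

def lookup_due(last_check, now, interval_days=7):
    """True once at least `interval_days` have passed since the last lookup."""
    return now - last_check >= timedelta(days=interval_days)
```

With the default of 7, a project checked on 2014-10-01 would not be looked up again before 2014-10-08.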