File logfmt1/manpage/logfmt.md — part of check-in [6a4153d11f] at 2021-01-12 22:50:29 on branch trunk — Updated man pages for logfmt1 (user: mario size: 11828)

% log.fmt(5) logfmt1 tools	Version 0.3

.LOG.FMT

A .log.fmt for each log file

logfmt1 aims to have a descriptor for each log file, in order to make it parseable. Until you know what's in a file, anything you attempt is guesswork.

So the idea is to have a *.fmt next to each *.log file, with a descriptor such as:

{
   "class": "apache combined",
   "record": "%h %l %u %t \"%r\" %>s %b"
}

Notably, the "record" field should be the most current format string that the application itself uses. In order to resolve the placeholders, an application reference is kept in "class". This allows combining the format string with placeholder field definitions from the global .fmt database (/usr/share/logfmt).
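A minimal sketch of how a log processor might pick up such a descriptor, assuming nothing more than the naming rule above (the .fmt suffix is appended to the full log file name). The function names `descriptor_path` and `load_descriptor` are made up for illustration and are not part of logfmt1.

```python
import json
import os


def descriptor_path(log_path):
    """Return the expected descriptor name: access.log -> access.log.fmt"""
    return log_path + ".fmt"


def load_descriptor(log_path):
    """Load the .fmt descriptor stored next to a log file, or None if absent."""
    path = descriptor_path(log_path)
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

A caller would then read "class" and "record" from the returned dict, and fall back to guesswork (or a glob lookup) when None comes back.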

common classes

There aren't many predefined classes yet, but special values that could work without a current "record": declaration might be:

"class": "grok syslog" : Reads the corresponding definition from a .grok (or perhaps preconverted) pattern definition. These are largely static patterns.

"class": "inilog" : For Heroku/Go "logfmt" style logs comprised of only key=value fields

"class": "json appmoniker" : For real JSON logs, with an application identifier here (for decoration)

"class": "apache common" : Reads a predefined/static record: definition from the global apache.common.fmt. Which of course means it would fail to parse if the user customized the LogFormat declaration in Apache.

Note that predefined classes undermine the purpose of logfmt1, in that they're only suitable for static/non-variant log formats.

additional fields

The *.log.fmt itself might declare definitions such as aliases and more specific/custom placeholders.

{
   "class": "apache cust3",
   "record": "%a %h %{iso}t '%r' %s",
   "fields": {
       "%{iso}t": { "id": "datetime", "rx": "..." }
   },
   "alias": {
       "iso8601": "datetime"
   }
}

These ought to be merged with the global fmt definitions, overriding them where they conflict. Though such user customizations are more likely to be applied there anyway. Care should be taken by update-logfmt or applications to not jettison user-customized *.log.fmt options.

rationale

Having the .fmt files adjacent to log files seems the most convenient option.

Appending a .fmt suffix to the ….log filename doesn't obstruct tab completion as much as .fmt substituting .log.
Doesn't require a lookup table or directory, with additional permission or updating woes.
And (over time) it enables applications themselves to create a .log.fmt for each log file. (That's kinda the goal. The update-logfmt scripts are a stop-gap workaround.)

GLOBAL .fmt DATABASE

While each log file should be accompanied by a .fmt descriptor, the global database in /usr/share/logfmt/ contains a full .fmt field definition for each class. Combining both allows constructing a regex.

Most notably the "fields": and "placeholder": are used to turn the "record": string definition into a capture pattern.

.fmt Example

The Apache format definition (apache.fmt) contains:

{
    "class": "apache generic",
    "separator": " ",
    "rewrite": {
        "%[\\d!,+\\-]+": "%",
        "%%": "%"
    },
    "placeholder": "%[<>]?(?:\\w*\\{[^\\}]+\\})?\\^?\\w+",
    "fields": {
        "%a": {  "id": "remote_addr",    "rx": "[\\d.:a-f]+"  },
        "%h": {  "id": "remote_host",    "rx": "[\\w\\-.:]+"  },
        "%{c}h": { "id": "remote_host",  "rx": "[\\w\\-.:]+"  },
        "%A": {  "id": "local_address",  "rx": "[\\d.:a-f]+"  },
        "%u": {  "id": "remote_user",    "rx": "[\\-\\w@.]+"  },
        "%t": {  "id": "request_time",   "rx": "\\[?(\\d[\\d:\\w\\s:./\\-+,;]+)\\]?"  },
        …
    },
    "alias": {
        "remote_address": "remote_addr",
        "ip": "remote_addr",
        "file": "request_file",
        "size": "bytes_sent",
        …
    },
    "expand": {
        "%\\{([^{}]+)\\}t": {
            "id": "request_time",
            "class": "strftime",
            "record": "$1"
        }
    },
    "container": {
        "message": {
            "id": "$1",
            "value": "$2",
            "rx": "\\[(\\w+) \"(.*?)\"\\]",
            "class": "apache mod_security"
        }
    },
    "glob": ["/var/log/apache*/*acc*.log"]
}

It usually does not describe a default "record" format (like the local .log.fmt descriptors do).

class:

The class in the global database is largely decorative. The filenames instead define the heritage of rules/fields. The "class" as declared by a .log.fmt is mapped onto /usr/share/logfmt/application.variant.fmt.

Usually there's just one variant level per log type. But the lookup is supposed to be mildly recursive.
Essentially it should merge *.log.fmt with appclass.variant.fmt and appclass.fmt applied last, so the most specific definitions are retained.
There's also a generic "grok" class. But the patterns therein are largely static (not built from variable format strings).
Some special classes like "json" might exist. (Not supported by logfmt1)
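The lookup and merge order described above could be sketched like this. It's an illustration under assumptions, not the logfmt1 implementation: `class_files` and `merge` are hypothetical helpers, and the one-level merge of nested maps (such as "fields" or "alias") is a guess at what "most specific definitions are retained" means in practice.

```python
def class_files(cls, base="/usr/share/logfmt"):
    """Map "apache combined" onto candidate files, most specific first."""
    parts = cls.split()
    return [
        "{}/{}.fmt".format(base, ".".join(parts[:n]))
        for n in range(len(parts), 0, -1)
    ]


def merge(specific, generic):
    """Merge two descriptors; entries from `specific` take precedence."""
    out = dict(generic)
    for key, value in specific.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            # Nested maps (fields, alias, ...) are joined key by key.
            out[key] = {**out[key], **value}
        else:
            out[key] = value
    return out


# Made-up descriptors to show the precedence.
local_fmt = {"class": "apache cust3", "fields": {"%{iso}t": {"id": "datetime"}}}
global_fmt = {"separator": " ", "fields": {"%h": {"id": "remote_host"}}}
merged = merge(local_fmt, global_fmt)
files = class_files("apache combined")
```

Applying `merge` once per file from `class_files`, starting with the *.log.fmt, keeps local settings on top of the variant and the variant on top of the generic application file.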

record:

The "record" entry is not usually present in the global .fmt definition. Some super specific variant definitions (for example apache.error.fmt) or static formats (syslog.fmt) might include one, however.

separator:

Most log formats use spaces for separating %placeholder fields. And simpler implementations might just split up the "record" declaration on this.

placeholder:

logfmt1 instead uses a regex definition of possible %placeholder strings to map them onto fields. It should account for prefixes/suffixes, unless those got cleared by the rewrite map.

Not all format strings use %\w+ to signal placeholders. In nginx for instance the sigil $\w+ introduces placeholders (variable names, really).
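To compare the two approaches, here's a small sketch that takes the "apache combined" record from the first example and tokenizes it both ways: a plain split on the separator, and a scan with the placeholder regex from apache.fmt. Only the variable names are invented.

```python
import re

record = '%h %l %u %t "%r" %>s %b'

# "placeholder" regex from apache.fmt, with the JSON escaping removed
placeholder_rx = r"%[<>]?(?:\w*\{[^\}]+\})?\^?\w+"

# Simple approach: split on the separator (the quotes stay attached).
tokens = record.split(" ")

# Regex approach: pick out the placeholders only.
placeholders = re.findall(placeholder_rx, record)
```

The split leaves `"%r"` with its quotes, whereas the regex scan yields the bare `%r` and treats the quotes as literal text around it.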

rewrite:

A list/map of regex to apply before any transformations or field lookups. Which can be used to mask or simplify placeholder definitions (for instance clean up the Apache conditional prefixes) or regex meta characters.

The record field starts as a static string, but is meant to be turned into a regex.
Therefore meta characters (such as | or []) have to be taken care of. Which is what the rewrite map is lazily used for.
Better implementations might look up the placeholders, and automatically escape the rest of the "record" format string.
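As a rough sketch, the rewrite map from apache.fmt can be applied as a series of regex substitutions before anything else happens. The helper name `apply_rewrite` and the sample record are assumptions for illustration; the sample uses Apache's conditional status prefixes, which the first rule is meant to clean up.

```python
import re

# "rewrite" map from apache.fmt, with the JSON escaping removed
rewrite = {
    r"%[\d!,+\-]+": "%",   # strip conditional prefixes such as %!200,304
    r"%%": "%",            # literal percent sign
}


def apply_rewrite(record, rules):
    """Apply each rewrite regex to the record string, in declaration order."""
    for rx, replacement in rules.items():
        record = re.sub(rx, replacement, record)
    return record


cleaned = apply_rewrite("%!200,304{User-agent}i %400,501{Referer}i %>s", rewrite)
```

Note that `re.sub` interprets backslashes in the replacement string, so replacement values containing backslashes would need extra care in this sketch.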

fields:

The core of the global .fmt definitions are the field lists. Each defines a static %F placeholder and associates it with a default field name (id:) and regex (rx:) or even a grok definition (grok:).

key	purpose
`%F`	JSON key: static placeholder string (not a regex itself)
id	field identifier, as specified by the application (internal name)
rx	regex which %F placeholder gets replaced with
grok	alternatively to regex, %F might be turned into %PATTERN:id
type	"int" and "float" could designate strictly numeric fields

As part of the regex transformation, a %F could be turned into (?<id>\S+) for instance.
If there's any unnamed capture group (…), it should be augmented into a named capture group - instead of the whole match. (To account for implicit wrapping.)
The rx itself might however specify named subgroups (like request_line in Apache logs, itself comprised of _method, _path, _protocol, or the datetime made up of tm_wday, tm_year, tm_whatever).
\S+ is also used as fallback for entirely undefined placeholders (no expand definition matched) in logfmt1.
grok isn't currently used, but might allow for simpler transformations (indirectly into a grok pattern, and later a regex).
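The transformation above might look roughly like the following sketch. It is not the logfmt1 code: `field_regex` and `record_to_regex` are hypothetical names, the fields are a small excerpt of apache.fmt, and the scan for unnamed groups is naive (it ignores parentheses inside character classes). Python spells named groups `(?P<id>…)` rather than `(?<id>…)`. Literal text between placeholders is escaped with `re.escape`, in the spirit of the "better implementations" remark in the rewrite section.

```python
import re

PLACEHOLDER = r"%[<>]?(?:\w*\{[^\}]+\})?\^?\w+"
FIELDS = {
    "%h": {"id": "remote_host", "rx": r"[\w\-.:]+"},
    "%u": {"id": "remote_user", "rx": r"[\-\w@.]+"},
    "%t": {"id": "request_time", "rx": r"\[?(\d[\d:\w\s:./\-+,;]+)\]?"},
}


def field_regex(placeholder, fields):
    """Turn one %F placeholder into a named capture group."""
    field = fields.get(placeholder)
    if field is None:
        # Undefined placeholder: \S+ fallback, name derived from the placeholder.
        name = re.sub(r"\W", "", placeholder).lower()
        return rf"(?P<{name}>\S+)"
    name, rx = field["id"], field["rx"]
    # Promote the first unnamed "(" group to the named group, if there is one.
    named, count = re.subn(r"(?<!\\)\((?!\?)", lambda m: f"(?P<{name}>", rx, count=1)
    return named if count else f"(?P<{name}>{rx})"


def record_to_regex(record, fields, placeholder_rx=PLACEHOLDER):
    """Replace placeholders with capture groups and escape everything else."""
    out, pos = [], 0
    for m in re.finditer(placeholder_rx, record):
        out.append(re.escape(record[pos:m.start()]))
        out.append(field_regex(m.group(0), fields))
        pos = m.end()
    out.append(re.escape(record[pos:]))
    return "".join(out)


rx = record_to_regex("%h %l %u %t", FIELDS)
row = re.match(rx, "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700]")
```

Here `%l` has no definition in the excerpt, so it ends up as a `\S+` group named `l`. Duplicate placeholders would need the enumeration suffix mentioned further down, which this sketch leaves out.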

expand:

The expand declarations are used to construct unknown fields/placeholders. Instead of static %placeholders, each entry describes a regex to detect new/variant placeholders. Thus it can simply be applied before separator/placeholder are looked at, to augment the known fields list.

key	purpose
`%\{(\w+)\}t`	JSON key: a regex to detect mutable placeholders
id	name for newly created fields entry, might use a capture such as $1
rx	for static definitions (often just \S+)
if_quoted	alternative regex, if placeholder was enclosed in "%\w+" quotes
class	recurse into other .fmt types
record	can be set to $2 if class: recursion is defined

Typically it suffices to specify the id and rx field.
If no id is given, then the regex capture is normalized into an identifier (non-alphanumerics stripped, all lowercased).
But the id or record value might be set with regex captures (e.g. $1 or $2) or compound values ("id": "newfield_$1").
And logfmt1 allows recursing into other format types per class (which is used to expand the captured "record": "$1" into regex tokens).
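A sketch of the id handling described above, under a few assumptions: the two expand rules are invented for illustration (they are not taken from apache.fmt), `expand_fields` is a hypothetical helper, and class recursion is left out entirely.

```python
import re

# Hypothetical expand rules, for illustration only.
EXPAND = {
    r"%\{([^{}]+)\}i": {"rx": r"\S+"},                   # no id: derived from capture
    r"%\{([^{}]+)\}e": {"id": "env_$1", "rx": r"\S+"},   # compound id
}


def normalize(name):
    """Strip non-alphanumerics and lowercase, as described for missing ids."""
    return re.sub(r"[^A-Za-z0-9]", "", name).lower()


def substitute(template, match):
    """Replace $1, $2, ... in a template with the regex captures."""
    return re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), template)


def expand_fields(record, fields, expand):
    """Add a fields entry for every placeholder matched by an expand rule."""
    fields = dict(fields)
    for rx, rule in expand.items():
        for m in re.finditer(rx, record):
            if m.group(0) in fields:
                continue  # static definitions win
            if "id" in rule:
                field_id = substitute(rule["id"], m)
            else:
                field_id = normalize(m.group(1) if m.re.groups else m.group(0))
            fields[m.group(0)] = {"id": field_id, "rx": rule.get("rx", r"\S+")}
    return fields


known = {"%h": {"id": "remote_host", "rx": r"[\w\-.:]+"}}
fields = expand_fields('%h "%{User-Agent}i" %{UNIQUE_ID}e', known, EXPAND)
```

The result is a regular fields map, so the later placeholder lookup doesn't need to know which entries were static and which were generated.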

alias:

Maps alternative/more common field names onto the declared field ids.

To get to some state of standardization, the field ids usually refer to application-internal names. (For instance log_pfn_register(…,…,cb_id) names in Apache). And those aren't always the more commonly used identifiers.

Thus aliases make sense not just for convenience, but also to be compatible with other common names (e.g. W3C extended log format names like cs-time).
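For illustration, an alias map could be applied to a parsed row like this. The helper `with_aliases` is hypothetical, and whether aliases are copied into the row or resolved on access is an implementation choice this sketch doesn't settle.

```python
# "alias" map excerpt from apache.fmt
ALIAS = {
    "remote_address": "remote_addr",
    "ip": "remote_addr",
    "file": "request_file",
    "size": "bytes_sent",
}


def with_aliases(row, alias):
    """Return a copy of the row with alias names added for present fields."""
    out = dict(row)
    for alt_name, field_id in alias.items():
        if field_id in row and alt_name not in out:
            out[alt_name] = row[field_id]
    return out


row = with_aliases({"remote_addr": "127.0.0.1", "bytes_sent": "2326"}, ALIAS)
```

Aliases whose target field isn't in the row (here "file") are simply skipped.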

container:

Is utilized by logopen() to extract additional fields (lists even) from one of the existing fields. This is usually done at row traversal. And makes sense for application-specific subformats in logs. Such as any key=value lists in the main message field.

key	purpose
`message`	JSON key: from which field to extract
rx	regex to detect and capture (key)=(value) fields
id	unpacked field name (usually just `$1` from the rx capture)
value	value from capture (so `$2` typically)
class	decorative description (no .fmt recursion supported in logfmt1)

The entries here might become lists, since commonly there's just one message field in logs, yet multiple key:value schemes might be utilized within.
Or the target field might become a "extract_from": property, and container a list itself.
Still not sure if automatic list conversion is a good idea. - Standard fields get an enumeration suffix (?<request_uri2>…) if duplicated.
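A sketch of the unpacking step, using the container entry from apache.fmt. This is not how logopen() is written; `unpack` is a hypothetical helper, the sample message is made up, and duplicates are handled by keeping the first value, which sidesteps the list question rather than answering it.

```python
import re

# "container" entry from apache.fmt, with the JSON escaping removed
CONTAINER = {
    "message": {
        "id": "$1",
        "value": "$2",
        "rx": r'\[(\w+) "(.*?)"\]',
        "class": "apache mod_security",
    }
}


def substitute(template, match):
    """Replace $1, $2, ... in a template with the regex captures."""
    return re.sub(r"\$(\d+)", lambda ref: match.group(int(ref.group(1))), template)


def unpack(row, container):
    """Extract extra key/value fields from the configured source fields."""
    out = dict(row)
    for source, rule in container.items():
        for m in re.finditer(rule["rx"], row.get(source, "")):
            key = substitute(rule["id"], m)
            value = substitute(rule["value"], m)
            out.setdefault(key, value)  # first occurrence wins in this sketch
    return out


message = 'ModSecurity: Warning. [file "/etc/modsec/rules.conf"] [line "42"] [id "960015"]'
row = unpack({"message": message}, CONTAINER)
```

The original message field is kept as-is, with the extracted keys added alongside it.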

glob:

Might be used by log processors to look up a log class, based on file names, if no .log.fmt is declared.
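A possible lookup, sketched with Python's fnmatch module. The function `guess_class` and the in-memory `database` list are assumptions; a real processor would first load the .fmt files from /usr/share/logfmt/.

```python
from fnmatch import fnmatchcase


def guess_class(log_path, database):
    """Return the class of the first .fmt definition whose glob matches."""
    for fmt in database:
        for pattern in fmt.get("glob", []):
            if fnmatchcase(log_path, pattern):
                return fmt["class"]
    return None


database = [
    {"class": "apache generic", "glob": ["/var/log/apache*/*acc*.log"]},
]
found = guess_class("/var/log/apache2/access.log", database)
missing = guess_class("/var/log/syslog", database)
```

Note that fnmatch-style `*` also matches across `/`, which is looser than a shell glob.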

#comment: fields

Documentation entries in the .fmt files have keys starting with #. For example "#license": or "#origin":. Which is simpler than using JSON with comments (JSOL/JSON5).
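Since these entries are ordinary JSON keys, a consumer that doesn't care about them can drop them after loading. A minimal sketch (the helper name `strip_comments` and the sample values are made up):

```python
def strip_comments(fmt):
    """Recursively drop documentation keys (those starting with '#')."""
    return {
        key: strip_comments(value) if isinstance(value, dict) else value
        for key, value in fmt.items()
        if not key.startswith("#")
    }


cleaned = strip_comments({
    "#license": "example value",
    "class": "apache generic",
    "fields": {"#note": "example value", "%h": {"id": "remote_host"}},
})
```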

Other format files

!!! Note This section is about fictional features.

.grok definitions

Not implemented yet.

The logfmt/ directory might also contain .grok files, which get transformed into .fmt structures. (Probably with the grok: parameter for fields, and a grok: pattern table alongside regular fields:).

There's already a pretransformed grok.fmt, which however requires %{GROK:%{PATTERN:id}} references currently.

.lnav formats