global .fmt db

Warning

❮❗❯ This is all very provisional. (First draft. Names might still change.)

Global .fmt database

While each log file should be accompanied by a .fmt descriptor, the global database in /usr/share/logfmt/ contains a full .fmt field definition for each class. Combining the two makes it possible to construct a regex.

Most notably, the "fields": and "placeholder": entries are used to turn the "record": string definition into a capture pattern.

.fmt Example

The Apache format definition (apache.fmt) contains:

{
    "class": "apache generic",
    "separator": " ",
    "rewrite": {
        "%[\\d!,+\\-]+": "%",
        "%%": "%"
    },
    "placeholder": "%[<>]?(?:\\w*\\{[^\\}]+\\})?\\^?\\w+",
    "fields": {
        "%a": {  "id": "remote_addr",    "rx": "[\\d.:a-f]+"  },
        "%h": {  "id": "remote_host",    "rx": "[\\w\\-.:]+"  },
        "%{c}h": { "id": "remote_host",  "rx": "[\\w\\-.:]+"  },
        "%A": {  "id": "local_address",  "rx": "[\\d.:a-f]+"  },
        "%u": {  "id": "remote_user",    "rx": "[\\-\\w@.]+"  },
        "%t": {  "id": "request_time",   "rx": "\\[?(\\d[\\d:\\w\\s:./\\-+,;]+)\\]?"  },
        …
    },
    "alias": {
        "remote_address": "remote_addr",
        "ip": "remote_addr",
        "file": "request_file",
        "size": "bytes_sent",
        …
    },
    "expand": {
        "%\\{([^{}]+)\\}t": {
            "id": "request_time",
            "class": "strftime",
            "record": "$1"
        }
    },
    "container": {
        "message": {
            "id": "$1",
            "value": "$2",
            "rx": "\\[(\\w+) \"(.*?)\"\\]",
            "class": "apache mod_security"
        }
    },
    "glob": ["/var/log/apache*/*acc*.log"]
}

It usually does not describe a default "record" format (like the local .log.fmt descriptors do).

class:

The class in the global database is largely decorative. Instead, the filenames define the inheritance of rules/fields. The "class" as declared by a .log.fmt is mapped onto /usr/share/logfmt/application.variant.fmt.

Usually there's just one variant level per log type. But the lookup is supposed to be mildly recursive.
Essentially it should merge *.log.fmt first, then appclass.variant.fmt, with appclass.fmt applied last, so the most specific definitions are retained.
There's also a generic "grok" class. But the patterns therein are largely static (not built from variable format strings).
Some special classes like "json" might exist. (Not supported by logfmt1.)
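
The merge order described above could be sketched roughly as follows. This is only an illustration, not the logfmt1 implementation: the helper names `candidate_files` and `merge_fmt`, the rule "entries from the more specific layer win", and the sample dictionaries are all assumptions made for the example.

```python
def candidate_files(cls: str) -> list[str]:
    """Map a class like "apache combined" onto file names, most specific first.
    (Assumed naming scheme: words of the class joined by dots.)"""
    parts = cls.split()
    return [".".join(parts[:n]) + ".fmt" for n in range(len(parts), 0, -1)]


def merge_fmt(specific: dict, generic: dict) -> dict:
    """Merge `generic` into `specific`; anything already set in `specific` is kept."""
    merged = dict(specific)
    for key, value in generic.items():
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_fmt(merged[key], value)
    return merged


# Hypothetical sample layers: local descriptor, variant file, base file
log_fmt = {"class": "apache combined", "record": "%h %l %u %t"}
variant_fmt = {"fields": {"%h": {"id": "client", "rx": "\\S+"}}}
base_fmt = {
    "separator": " ",
    "fields": {
        "%h": {"id": "remote_host", "rx": "[\\w\\-.:]+"},
        "%u": {"id": "remote_user", "rx": "[\\-\\w@.]+"},
    },
}

merged = log_fmt
for layer in (variant_fmt, base_fmt):  # the least specific layer is applied last
    merged = merge_fmt(merged, layer)

print(candidate_files(merged["class"]))
print(merged["fields"]["%h"]["id"])
```

Because each later layer only fills in what is still missing, the `%h` definition from the variant layer survives, while `%u` and the separator are inherited from the base layer.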

record:

The "record" entry is not usually present in the global .fmt definition. Some very specific variant definitions (for example apache.error.fmt) or static formats (syslog.fmt) might include one, however.

separator:

Most log formats use spaces to separate %placeholder fields. Simpler implementations might just split the "record" declaration on this separator.

placeholder:

logfmt1 instead uses a regex definition of possible %placeholder strings to map them onto fields. The regex should account for prefixes/suffixes, unless those were already cleared by the rewrite map.

Not all format strings use %\w+ to signal placeholders. In nginx, for instance, the sigil in $\w+ introduces placeholders (variable names, really).
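
To illustrate the difference between splitting on the separator and scanning with the placeholder regex, here is a small sketch. The Apache pattern is copied from the apache.fmt example above; the nginx pattern and both sample record strings are assumptions for illustration only.

```python
import re

# Placeholder regex as given in the apache.fmt example
APACHE_PLACEHOLDER = r"%[<>]?(?:\w*\{[^\}]+\})?\^?\w+"
# Assumed nginx counterpart, based on the $\w+ sigil
NGINX_PLACEHOLDER = r"\$\w+"


def find_placeholders(record: str, placeholder: str) -> list[str]:
    """Return all placeholder tokens found in a record format string."""
    return re.findall(placeholder, record)


# Sample record strings (made up for this example)
apache_record = '%h %l %u %t "%r" %>s %b "%{User-Agent}i"'
nginx_record = "$remote_addr - $remote_user [$time_local]"

# Naive approach: splitting on the separator leaves quotes attached
print(apache_record.split(" "))

# Regex approach: only the placeholder tokens themselves are returned
print(find_placeholders(apache_record, APACHE_PLACEHOLDER))
print(find_placeholders(nginx_record, NGINX_PLACEHOLDER))
```

The naive split yields tokens such as `"%r"` with the quotes still attached, whereas the regex scan picks out `%r` and `%{User-Agent}i` cleanly.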

rewrite:

A list/map of regexes to apply before any transformations or field lookups. It can be used to mask or simplify placeholder definitions (for instance to clean up the Apache conditional prefixes) or to take care of regex meta characters.

The record field starts as a static string, but is meant to be turned into a regex.
Therefore meta characters (such as | or []) have to be taken care of, which is what the rewrite map is lazily used for.
Better implementations might look up the placeholders, and automatically escape the rest of the "record" format string.
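
Both ideas from this section can be sketched in a few lines: applying the rewrite map, and the "better" variant that escapes everything between placeholders. The rewrite map and placeholder regex are taken from the apache.fmt example; the function names `apply_rewrite` and `escape_literals` are hypothetical and not part of logfmt1.

```python
import re

# Taken from the apache.fmt example above
REWRITE = {
    r"%[\d!,+\-]+": "%",
    r"%%": "%",
}
PLACEHOLDER = r"%[<>]?(?:\w*\{[^\}]+\})?\^?\w+"


def apply_rewrite(record: str, rewrite: dict) -> str:
    """Apply each rewrite regex in declaration order."""
    for pattern, replacement in rewrite.items():
        record = re.sub(pattern, replacement, record)
    return record


def escape_literals(record: str, placeholder: str) -> str:
    """Keep placeholders untouched, regex-escape all text in between."""
    out, pos = [], 0
    for m in re.finditer(placeholder, record):
        out.append(re.escape(record[pos:m.start()]))
        out.append(m.group(0))
        pos = m.end()
    out.append(re.escape(record[pos:]))
    return "".join(out)


# Conditional prefixes such as !200,304 are stripped by the first rule
print(apply_rewrite("%!200,304{Referer}i %400,501{User-Agent}i", REWRITE))

# The literal brackets get escaped, the placeholders stay as they are
print(escape_literals("[%t] %h", PLACEHOLDER))
```

With automatic escaping in place, the rewrite map would only be needed for simplifying placeholders, not for masking meta characters.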

fields:

The core of the global .fmt definitions are the field lists. Each defines a static %F placeholder and associates it with a default field name (id:) and regex (rx:) or even a grok definition (grok:).

key	purpose
`%F`	JSON key: static placeholder string (not a regex itself)
id	field identifier, as specified by the application (internal name)
rx	regex which %F placeholder gets replaced with
grok	alternatively to regex, %F might be turned into %PATTERN:id
type	"int" and "float" could designate strictly numeric fields

Notes

As part of the regex transformation, a %F could be turned into (?<id>\S+) for instance.
If the rx contains an unnamed capture group (…), that group should be turned into the named capture group, instead of wrapping the whole match. (To account for implicit wrapping.)
The rx itself might however specify named subgroups (like request_line in Apache logs, itself comprised of _method, _path, _protocol, or the datetime made up of tm_wday, tm_year, tm_whatever).
\S+ is also used as fallback for entirely undefined placeholders (no expand definition matched) in logfmt1.
grok isn't currently used, but might allow for simpler transformations (indirectly into a grok pattern, and later a regex).
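
The notes above can be condensed into a short sketch that turns a "record" string into a regex with named groups. It is an approximation under stated assumptions, not the logfmt1 code: the helper names are invented, the detection of unnamed groups is simplistic (it ignores parentheses inside character classes), and the name given to undefined placeholders is derived from the placeholder itself purely for this example. Note that Python's `re` module spells named groups `(?P<id>…)` rather than `(?<id>…)`.

```python
import re

PLACEHOLDER = r"%[<>]?(?:\w*\{[^\}]+\})?\^?\w+"   # from the apache.fmt example
FIELDS = {                                        # subset of the apache.fmt fields
    "%h": {"id": "remote_host", "rx": r"[\w\-.:]+"},
    "%u": {"id": "remote_user", "rx": r"[\-\w@.]+"},
    "%t": {"id": "request_time", "rx": r"\[?(\d[\d:\w\s:./\-+,;]+)\]?"},
}

# "(" that is neither escaped nor the start of a (?...) construct
UNNAMED_GROUP = re.compile(r"(?<!\\)\((?!\?)")


def named_group(ident: str, rx: str) -> str:
    """Name the first unnamed group if there is one, else wrap the whole rx."""
    if UNNAMED_GROUP.search(rx):
        return UNNAMED_GROUP.sub(lambda m: f"(?P<{ident}>", rx, count=1)
    return f"(?P<{ident}>{rx})"


def record_to_regex(record: str, placeholder: str, fields: dict) -> str:
    # Splitting on a captured pattern yields literal, placeholder, literal, ...
    parts = re.split(f"({placeholder})", record)
    out = []
    for i, part in enumerate(parts):
        if i % 2:  # odd positions hold the placeholders
            field = fields.get(part, {})
            # Assumption: undefined placeholders get a name derived from their text
            ident = field.get("id") or re.sub(r"\W+", "", part).lower()
            out.append(named_group(ident, field.get("rx", r"\S+")))
        else:
            out.append(re.escape(part))
    return "^" + "".join(out) + "$"


regex = record_to_regex("%h %l %u %t", PLACEHOLDER, FIELDS)
row = re.match(regex, "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700]")
print(row.groupdict())
```

Here `%t` ends up capturing only the timestamp without the surrounding brackets, because the existing unnamed group in its rx is the one that gets named.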

expand:

The expand declarations are used to construct unknown fields/placeholders. Instead of static %placeholders, each entry describes a regex to detect new/variant placeholders. Thus it can simply be applied before separator/placeholder are looked at, to augment the known fields list.

key	purpose
`%\{(\w+)\}t`	JSON key: a regex to detect mutable placeholders
id	name for newly created fields entry, might use the capture's $1
rx	for static definitions (often just \S+)
if_quoted	alternative regex, if placeholder was enclosed in "%\w+" quotes
class	recurse into other .fmt types
record	can be set to $2 if class: recursion is defined

Notes

Typically it suffices to specify the id and rx field.
If no id is given, then the regex capture is normalized into an identifier (non-alphanumerics stripped, all lowercased).
But the id or record value might be set with regex captures (e.g. $1 or $2) or compound values ("id": "newfield_$1").
And logfmt1 allows recursing into other format types per class (which is used to expand the captured "record": "$1" into regex tokens).
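
A rough sketch of how expand rules might augment the fields list follows. The two expand entries (for `%{…}i` and `%{…}e`), the sample record and the helper names are hypothetical; only the id handling follows the notes above. The class: recursion is deliberately left out, and rules without an id are assumed to contain at least one capture group.

```python
import re


def substitute(template: str, match: re.Match) -> str:
    """Replace $1, $2, ... in a template with the captures of `match`."""
    return re.sub(r"\$(\d)", lambda m: match.group(int(m.group(1))), template)


def normalize(name: str) -> str:
    """Lowercase and strip everything that is not alphanumeric."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def expand_fields(record: str, expand: dict, fields: dict) -> dict:
    """Return a copy of `fields`, extended by placeholders that expand rules detect."""
    new_fields = dict(fields)
    for pattern, rule in expand.items():
        for m in re.finditer(pattern, record):
            if m.group(0) in new_fields:
                continue  # statically declared fields take precedence
            if "id" in rule:
                ident = substitute(rule["id"], m)
            else:
                ident = normalize(m.group(1))
            new_fields[m.group(0)] = {"id": ident, "rx": rule.get("rx", r"\S+")}
    return new_fields


# Hypothetical expand rules, not copied from a real .fmt file
EXPAND = {
    r"%\{([^{}]+)\}i": {"rx": r"\S+"},       # no id: derived from the capture
    r"%\{([^{}]+)\}e": {"id": "env_$1"},     # compound id, rx falls back to \S+
}

record = '%h "%{User-Agent}i" %{UNIQUE_ID}e'
fields = expand_fields(record, EXPAND, {"%h": {"id": "remote_host", "rx": r"[\w\-.:]+"}})
for placeholder, definition in fields.items():
    print(placeholder, definition)
```

The resulting dictionary can then be used like any static fields list when the record is turned into a regex.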

alias:

Maps alternative/more common field names onto the declared field ids.

To get to some state of standardization, the field ids usually refer to application-internal names. (For instance log_pfn_register(…,…,cb_id) names in Apache). And those aren't always the more commonly used identifiers.

Thus aliases make sense not just for convenience, but also for compatibility with other common names (e.g. W3C extended log format names like cs-time).
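
As a minimal sketch of how a log reader might honor aliases: the alias map below is the excerpt from the apache.fmt example, while the `get_field` helper and the sample row are made up for illustration.

```python
# Excerpt of the alias map from the apache.fmt example
ALIAS = {
    "remote_address": "remote_addr",
    "ip": "remote_addr",
    "file": "request_file",
    "size": "bytes_sent",
}


def get_field(row: dict, name: str, alias: dict):
    """Look up a field by its declared id or by any alias pointing to it."""
    return row.get(alias.get(name, name))


# Hypothetical parsed log row, keyed by the declared field ids
row = {"remote_addr": "127.0.0.1", "bytes_sent": "2326"}

print(get_field(row, "ip", ALIAS))
print(get_field(row, "size", ALIAS))
print(get_field(row, "remote_addr", ALIAS))
```

Resolving aliases at lookup time keeps the row itself keyed by the application-internal ids only.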

container:

Utilized by logopen() to extract additional fields (lists even) from one of the existing fields. This is usually done during row traversal, and makes sense for application-specific subformats in logs, such as any key=value lists in the main message field.

key	purpose
`message`	JSON key: from which field to extract
rx	regex to detect and capture (key)=(value) fields
id	unpacked field name (usually just `$1` from the rx capture)
value	value from capture (so `$2` typically)
class	decorative description (no .fmt recursion supported in logfmt1)

Notes

The entries here might become lists, since commonly there's just one message field in logs, yet multiple key:value schemes might be utilized within.
Or the target field might become an "extract_from": property, and container a list itself.
Still not sure if automatic list conversion is a good idea. - Standard fields get an enumeration suffix (?<request_uri2>…) if duplicated.
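
The container extraction could look roughly like the sketch below. The container rule is the one from the apache.fmt example; the function names and the sample message are assumptions, and this is not the actual logopen() code. Repeated keys simply overwrite each other here, since the handling of duplicates is still undecided.

```python
import re

# Container definition as in the apache.fmt example
CONTAINER = {
    "message": {
        "id": "$1",
        "value": "$2",
        "rx": r'\[(\w+) "(.*?)"\]',
        "class": "apache mod_security",
    }
}


def substitute(template: str, match: re.Match) -> str:
    """Replace $1, $2, ... in a template with the captures of `match`."""
    return re.sub(r"\$(\d)", lambda m: match.group(int(m.group(1))), template)


def unpack_container(row: dict, container: dict) -> dict:
    """Return the row extended by key/value pairs found in the container fields."""
    extra = {}
    for source, rule in container.items():
        for m in re.finditer(rule["rx"], row.get(source, "")):
            extra[substitute(rule["id"], m)] = substitute(rule["value"], m)
    return {**row, **extra}


# Made-up row with a mod_security style message
row = {"message": 'Access denied [id "960015"] [msg "Missing Accept Header"] [severity "NOTICE"]'}
unpacked = unpack_container(row, CONTAINER)
print(unpacked["id"], unpacked["severity"])
```

Extracted keys that collide with an existing field of the row would replace it in this sketch, which is one more reason the final design might differ.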

glob:

Might be used by log processors to look up a log class, based on file names, if no .log.fmt is declared.
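
A possible lookup based on the glob: list, using Python's standard `fnmatch` module, is sketched below. The `class_for_file` helper and the in-memory `database` dictionary are hypothetical; a real implementation would read the .fmt files from /usr/share/logfmt/ first.

```python
from fnmatch import fnmatch


def class_for_file(path: str, database: dict):
    """Return the class of the first .fmt definition whose glob matches the path."""
    for name, fmt in database.items():
        if any(fnmatch(path, pattern) for pattern in fmt.get("glob", [])):
            return fmt.get("class", name)
    return None


# Hypothetical, already loaded global database (only the relevant keys shown)
database = {
    "apache": {
        "class": "apache generic",
        "glob": ["/var/log/apache*/*acc*.log"],
    },
}

print(class_for_file("/var/log/apache2/access.log", database))
print(class_for_file("/var/log/nginx/access.log", database))
```

Note that `fnmatch` lets `*` match across directory separators, so the patterns are somewhat more permissive than shell globbing.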

#comment: fields

Documentation entries in the .fmt files have keys starting with #, for example "#license": or "#origin":. This is simpler than using JSON with comments (JSOL/JSON5).
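
Since the documentation entries are ordinary JSON keys, a reader can separate them from the actual rules after a plain `json.loads()`. The following sketch assumes a made-up .fmt snippet; the `split_comments` helper and the values of the # keys are purely illustrative.

```python
import json


def split_comments(fmt: dict):
    """Separate "#..." documentation keys from the actual rule entries."""
    docs = {key[1:]: value for key, value in fmt.items() if key.startswith("#")}
    rules = {key: value for key, value in fmt.items() if not key.startswith("#")}
    return rules, docs


# Made-up .fmt content; the "#" values are placeholders for this example
source = """
{
    "#license": "example license text",
    "#origin": "example origin note",
    "class": "apache generic",
    "separator": " "
}
"""

rules, docs = split_comments(json.loads(source))
print(sorted(rules))
print(sorted(docs))
```

No special parser is needed, which is the advantage over comment-capable JSON dialects.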

Other format files

Note

This section is about fictional features.

.grok definitions

Not implemented yet.

The logfmt/ directory might also contain .grok files, which get transformed into .fmt structures. (Probably with the grok: parameter for fields, and a grok: pattern table alongside regular fields:).

There's already a pretransformed grok.fmt, which however requires %{GROK:%{PATTERN:id}} references currently.

.lnav formats

Not implemented yet.

Likewise, lnav .json format definitions could be used. Those are static too, however.