modseccfg: Artifact [d20d412496]

Artifact d20d41249614a1a8fca0b52eb4dbae487acf09c215084fb9756321e76223885a:

File logfmt1/docs/fmt.md — part of check-in [9a5ae7b93b] at 2020-12-30 22:54:29 on branch trunk — logfmt1 manual changes (user: mario size: 9202)
!!! Warning
    ❮❗❯ This is all very provisional. (First draft. Names might still change.)


## Global .fmt database

While each log file should be accompanied by a [.fmt descriptor](log.fmt.md),
the global database in `/usr/share/logfmt/` contains a full .fmt field
definition for each class. Combining the two allows a regex to be
constructed.

Most notably, the `"fields":` and `"placeholder":` entries are used to turn
the `"record":` string definition into a capture pattern.


### .fmt Example

The Apache format definition (apache.fmt) contains:

```json
{
    "class": "apache generic",
    "separator": " ",
    "rewrite": {
        "%[\\d!,+\\-]+": "%",
        "%%": "%"
    },
    "placeholder": "%[<>]?(?:\\w*\\{[^\\}]+\\})?\\^?\\w+",
    "fields": {
        "%a": {  "id": "remote_addr",    "rx": "[\\d.:a-f]+"  },
        "%h": {  "id": "remote_host",    "rx": "[\\w\\-.:]+"  },
        "%{c}h": { "id": "remote_host",  "rx": "[\\w\\-.:]+"  },
        "%A": {  "id": "local_address",  "rx": "[\\d.:a-f]+"  },
        "%u": {  "id": "remote_user",    "rx": "[\\-\\w@.]+"  },
        "%t": {  "id": "request_time",   "rx": "\\[?(\\d[\\d:\\w\\s:./\\-+,;]+)\\]?"  },
        …
    },
    "alias": {
        "remote_address": "remote_addr",
        "ip": "remote_addr",
        "file": "request_file",
        "size": "bytes_sent",
        …
    },
    "expand": {
        "%\\{([^{}]+)\\}t": {
            "id": "request_time",
            "class": "strftime",
            "record": "$1"
        }
    },
    "container": {
        "message": {
            "id": "$1",
            "value": "$2",
            "rx": "\\[(\\w+) \"(.*?)\"\\]",
            "class": "apache mod_security"
        }
    },
    "glob": ["/var/log/apache*/*acc*.log"]
}
```

It usually does not describe a default "record" format (like the local .log.fmt descriptors do).


### class:

The class in the global database is largely decorative.  The filenames
instead define the inheritance of rules/fields.  The "class" as declared by
a .log.fmt is mapped onto `/usr/share/logfmt/application.variant.fmt`.

 * Usually there's just one variant level per log type. But the lookup is
   supposed to be mildly recursive.
 * Essentially it should merge `*.log.fmt` with `appclass.variant.fmt`, and
   apply `appclass.fmt` last, so the most specific definitions are retained.
 * There's also a generic "grok" class. But the patterns therein are largely
   static (not built from variable format strings).
 * Some special classes like "json" might exist. (Not supported by logfmt1.)
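
The lookup and merge order described above could be sketched roughly as follows. The helper names `candidate_files` and `merge` are made up for illustration, and the one-level merge of nested maps (such as `fields`) is an assumption, not necessarily what logfmt1 does.

```python
def candidate_files(cls):
    """Map a class string onto .fmt filenames, most specific first.

    "apache combined" -> ["apache.combined.fmt", "apache.fmt"]
    """
    parts = cls.split()
    return [".".join(parts[:n]) + ".fmt" for n in range(len(parts), 0, -1)]


def merge(*definitions):
    """Merge definitions given most-specific first; earlier values win.

    Nested maps (e.g. "fields") are combined one level deep (assumption).
    """
    merged = {}
    for definition in definitions:
        for key, value in definition.items():
            if key not in merged:
                merged[key] = value
            elif isinstance(value, dict) and isinstance(merged[key], dict):
                # keep the more specific entries, add the missing generic ones
                merged[key] = {**value, **merged[key]}
    return merged


# Hypothetical contents of a local .log.fmt, a variant .fmt and the base .fmt
local = {"class": "apache combined", "record": "%h %l"}
variant = {"fields": {"%h": {"id": "client"}}}
base = {"separator": " ",
        "fields": {"%h": {"id": "remote_host"}, "%l": {"id": "remote_logname"}}}

combined = merge(local, variant, base)
```

Here `combined` keeps the local `record`, the variant's override for `%h`, and everything else from the base definition.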


### record:

The "record" entry is not usually present in the global .fmt definition.
Some very specific variant definitions (for example apache.error.fmt) or
static formats (syslog.fmt) might include one, however.


### separator:

Most log formats use spaces for separating %placeholder fields.  Simpler
implementations might just split up the "record" declaration on this
separator.


### placeholder:

Rather than splitting on the separator, logfmt1 uses a regex definition of
possible %placeholder strings to map them onto fields. The regex should
account for prefixes/suffixes, unless those got cleared by the `rewrite` map.

Not all format strings use `%\w+` to signal placeholders. In nginx for instance
the sigil `$\w+` introduces placeholders (variable names, really).
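
To illustrate the difference, here is a small sketch that applies the `placeholder` regex from apache.fmt to a sample format string, next to a plain separator split. The sample `record` strings and the nginx pattern `\$\w+` are assumptions for demonstration only.

```python
import re

# "placeholder" regex as given in the apache.fmt example
APACHE_PLACEHOLDER = r"%[<>]?(?:\w*\{[^\}]+\})?\^?\w+"
# Assumed equivalent for nginx-style $variable placeholders
NGINX_PLACEHOLDER = r"\$\w+"

record = '%h %l %u %t "%r" %>s %b "%{Referer}i"'

# Simple approach: split on the separator (quotes stay attached)
by_separator = record.split(" ")

# Regex approach: only the placeholders themselves are extracted
by_regex = re.findall(APACHE_PLACEHOLDER, record)

nginx_record = '$remote_addr - $remote_user [$time_local] "$request"'
nginx_found = re.findall(NGINX_PLACEHOLDER, nginx_record)
```

The separator split yields tokens like `"%r"` with the quotes still attached, whereas the regex isolates `%r` and `%{Referer}i` on their own.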


### rewrite:

A map of regexes to apply before any transformations or field lookups.
It can be used to mask or simplify placeholder definitions (for instance
to clean up the Apache conditional prefixes) or regex meta characters.

 * The `record` field starts as a static string, but is meant to be turned
   into a regex.
 * Therefore meta characters (such as `|` or `[]`) have to be
   taken care of, which is what the `rewrite` map is lazily used for.
 * Better implementations might look up the placeholders, and automatically
   escape the rest of the "record" format string.
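
A minimal sketch of applying the `rewrite` map, using the two entries from the apache.fmt example. The function name `apply_rewrite` and the sample format string are illustrative assumptions; the entries are applied in the order they appear in the map.

```python
import re

# "rewrite" map as given in the apache.fmt example
REWRITE = {
    r"%[\d!,+\-]+": "%",   # strip conditional prefixes such as %!200,304
    "%%": "%",
}


def apply_rewrite(record, rewrite):
    """Run every regex substitution over the record string, in map order."""
    for pattern, replacement in rewrite.items():
        record = re.sub(pattern, replacement, record)
    return record


cleaned = apply_rewrite("%h %!200,304{Referer}i %400,501{User-agent}i", REWRITE)
```

After this step the conditional status lists are gone, so the remaining placeholders can be looked up in `fields` or matched by `expand`.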


### fields:

The core of the global .fmt definitions are the field lists.  Each defines a
static %F placeholder and associates it with a default field name (id:) and
regex (rx:) or even a grok definition (grok:).

| key | purpose |
|-----|---------|
|`%F`| **JSON key**: static placeholder string (not a regex itself) |
| id | field identifier, as specified by the application (internal name) |
| rx | regex which %F placeholder gets replaced with |
| grok | as an alternative to the regex, %F might be turned into %PATTERN:id |
| type |  "int" and "float" could designate strictly numeric fields |

!!! Notes
    * As part of the regex transformation, a `%F` could be turned into
      `(?<id>\S+)` for instance.
    * If the `rx` contains an unnamed capture group `(…)`, that group should
      be turned into the named capture group, instead of the whole match
      being wrapped. (To account for implicit wrapping.)
    * The `rx` itself might however specify named subgroups (like request_line
      in Apache logs, itself comprised of _method, _path, _protocol, or the
      datetime made up of tm_wday, tm_year, tm_whatever).
    * `\S+` is also used as fallback for entirely undefined placeholders
      (no expand definition matched) in logfmt1.
    * `grok` isn't currently used, but might allow for simpler transformations
      (indirectly into a grok pattern, and later a regex).
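
The notes above can be sketched as a small record-to-regex conversion. This is only an approximation of the idea, not the logfmt1 code: the function names are made up, Python requires the `(?P<id>…)` spelling for named groups, the name chosen for undefined placeholders is an assumption, and duplicate field names are not handled.

```python
import re

PLACEHOLDER = r"%[<>]?(?:\w*\{[^\}]+\})?\^?\w+"
FIELDS = {  # subset of the apache.fmt example
    "%h": {"id": "remote_host", "rx": r"[\w\-.:]+"},
    "%u": {"id": "remote_user", "rx": r"[\-\w@.]+"},
    "%t": {"id": "request_time", "rx": r"\[?(\d[\d:\w\s:./\-+,;]+)\]?"},
}
# An opening paren that is neither escaped nor a (?...) construct
UNNAMED_GROUP = re.compile(r"(?<!\\)\((?!\?)")


def field_regex(placeholder, fields):
    """Turn one placeholder into a named capture group."""
    field = fields.get(placeholder)
    if field is None:
        # fallback for undefined placeholders; deriving the name from the
        # placeholder characters is an assumption
        name = re.sub(r"\W+", "", placeholder)
        return rf"(?P<{name}>\S+)"
    if UNNAMED_GROUP.search(field["rx"]):
        # name the existing group instead of wrapping the whole match
        return UNNAMED_GROUP.sub(lambda m: f"(?P<{field['id']}>", field["rx"], count=1)
    return f"(?P<{field['id']}>{field['rx']})"


def record_to_regex(record, fields, placeholder=PLACEHOLDER):
    """Replace placeholders with capture groups, escape everything else."""
    parts = re.split(f"({placeholder})", record)
    # odd indexes hold the placeholders, even indexes the literal text
    return "".join(
        field_regex(part, fields) if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    )


pattern = record_to_regex("%h %u %t %>s", FIELDS)
row = re.match(pattern, "192.168.0.1 frank [10/Oct/2000:13:55:36 -0700] 200")
```

Because `%t` already carries an unnamed group, the brackets around the timestamp stay outside the `request_time` capture.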


### expand:

The expand declarations are used to construct unknown fields/placeholders.
Instead of static %placeholders, each entry describes a regex to detect
new/variant placeholders.  Thus it can simply be applied before
separator/placeholder are looked at, to augment the known `fields` list.

| key      |  purpose                                               |
|----------|--------------------------------------------------------|
| `%\{(\w+)\}t` | **JSON key**: a regex to detect mutable placeholders  |
| id     | name for the newly created fields entry, might use a capture such as `$1` |
| rx     | for static definitions (often just \S+)                  |
|if_quoted| alternative regex, if placeholder was enclosed in "%\w+" quotes|
| class  | recurse into other .fmt types                            |
| record | can be set to $2 if class: recursion is defined          |


!!! Notes
    * Typically it suffices to specify the `id` and `rx` fields.
    * If no `id` is given, then the regex capture is normalized into
      an identifier (non-alphanumerics stripped, all lowercased).
    * But the `id` or `record` value might be set with regex captures
      (e.g. `$1` or `$2`) or compound values (`"id": "newfield_$1"`).
    * And logfmt1 allows recursing into other format types per `class`
      (which is used to expand the captured `"record": "$1"` into regex
      tokens).
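
A rough sketch of how an `expand` entry might produce a new `fields` entry. Only the first rule is taken from the apache.fmt example; the `%{…}i` and `%{…}o` rules, and all function names, are hypothetical. The actual recursion into the `strftime` class is left out.

```python
import re

EXPAND = {
    # from the apache.fmt example
    r"%\{([^{}]+)\}t": {"id": "request_time", "class": "strftime", "record": "$1"},
    # hypothetical entries, for illustration only
    r"%\{([^{}]+)\}i": {"id": "http_$1", "rx": r"\S+"},
    r"%\{([^{}]+)\}o": {"rx": r"\S+"},
}


def substitute(template, match):
    """Replace $1, $2, … in the template with the regex captures."""
    return re.sub(r"\$(\d)", lambda ref: match.group(int(ref.group(1))), template)


def normalize(text):
    """Lowercase and strip non-alphanumerics, as described in the notes."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def expand_placeholder(placeholder, expand):
    """Return a new fields entry for the placeholder, or None if no rule fits."""
    for pattern, entry in expand.items():
        match = re.fullmatch(pattern, placeholder)
        if match is None:
            continue
        field = dict(entry)
        for key in ("id", "record"):
            if key in field:
                field[key] = substitute(field[key], match)
        # no id declared: derive one from the first capture
        field.setdefault("id", normalize(match.group(1)))
        return field
    return None


time_field = expand_placeholder("%{%d/%b/%Y}t", EXPAND)
```

The resulting `time_field` carries the captured strftime string as its `record`, which a fuller implementation would then expand through the `strftime` class.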


### alias:

Maps alternative/more common field names onto the declared field `id`s.

To get to some state of standardization, the field ids usually refer
to application-internal names (for instance the `log_pfn_register(…,…,cb_id)`
names in Apache). And those aren't always the more commonly used identifiers.

Thus aliases make sense not just for convenience, but also for compatibility
with other common names (e.g. W3C extended log format names like `cs-time`).
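
As a sketch, aliases could be applied to an already parsed row like this. The function name `with_aliases` and the sample row are assumptions; existing keys are never overwritten.

```python
# subset of the "alias" map from the apache.fmt example
ALIAS = {
    "remote_address": "remote_addr",
    "ip": "remote_addr",
    "file": "request_file",
    "size": "bytes_sent",
}


def with_aliases(row, alias):
    """Return a copy of the row with alias names added for present fields."""
    out = dict(row)
    for name, field_id in alias.items():
        if field_id in row and name not in out:
            out[name] = row[field_id]
    return out


row = with_aliases({"remote_addr": "10.0.0.1", "bytes_sent": "512"}, ALIAS)
```

Aliases whose target field is absent from the row (here `file`) are simply skipped.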


### container:

Used by logopen() to extract additional fields (lists even) from one
of the existing fields.  This is usually done at row traversal, and makes
sense for application-specific subformats in logs, such as any `key=value`
lists in the main message field.

| key      |  purpose                                               |
|----------|--------------------------------------------------------|
| `message` | **JSON key**: from which field to extract  |
| rx     | regex to detect and capture (key)=(value) fields         |
| id     | unpacked field name (usually just `$1` from the rx capture) |
| value  | value from capture (so `$2` typically)                   |
| class  | decorative description (no .fmt recursion supported in logfmt1) |


!!! Notes
    * The entries here might become lists, since commonly there's just one
      `message` field in logs, yet multiple key:value schemes might be
      utilized within.
    * Or the target field might become an `"extract_from":` property, and
      `container` a list itself.
    * Still not sure if automatic list conversion is a good idea.
      Standard fields get an enumeration suffix `(?<request_uri2>…)` if
      duplicated.
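
The extraction step might look roughly like the following, using the `container` rule from the apache.fmt example. The function names and the sample message are made up, and keeping only the first value for a repeated key is an assumption (the list question above is left open).

```python
import re

# "container" entry from the apache.fmt example (class omitted)
CONTAINER = {
    "message": {"id": "$1", "value": "$2", "rx": r'\[(\w+) "(.*?)"\]'},
}


def substitute(template, match):
    """Replace $1, $2, … in the template with the regex captures."""
    return re.sub(r"\$(\d)", lambda ref: match.group(int(ref.group(1))), template)


def unpack(row, container):
    """Return a copy of the row with fields extracted from container fields."""
    out = dict(row)
    for source, rule in container.items():
        for match in re.finditer(rule["rx"], row.get(source, "")):
            key = substitute(rule["id"], match)
            # first occurrence wins; existing fields are not overwritten
            out.setdefault(key, substitute(rule["value"], match))
    return out


sample = {"message": 'ModSecurity: Warning. [id "980130"] '
                     '[msg "Inbound Anomaly Score Exceeded"] [severity "CRITICAL"]'}
unpacked = unpack(sample, CONTAINER)
```

The original `message` field is kept as is; the extracted `id`, `msg` and `severity` entries are added alongside it.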


### glob:

Might be used by log processors to look up a log class, based on file names,
if no .log.fmt is declared.
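
A log processor could do that lookup along these lines. The function name `class_for_file` and the syslog glob are assumptions; only the Apache glob comes from the example above.

```python
from fnmatch import fnmatchcase

DEFINITIONS = [
    {"class": "apache generic", "glob": ["/var/log/apache*/*acc*.log"]},
    {"class": "syslog", "glob": ["/var/log/syslog*"]},   # hypothetical glob
]


def class_for_file(path, definitions):
    """Return the class of the first definition whose glob matches the path."""
    for fmt in definitions:
        if any(fnmatchcase(path, pattern) for pattern in fmt.get("glob", [])):
            return fmt["class"]
    return None


guessed = class_for_file("/var/log/apache2/access.log", DEFINITIONS)
```

Note that `fnmatchcase` lets `*` match across `/` as well, which is good enough for a guess like this.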


### #comment: fields

Documentation entries in the .fmt files have keys starting with `#`, for example
`"#license":` or `"#origin":`. This is simpler than using JSON with
comments (JSOL/JSON5).
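
Since such entries are ordinary JSON keys, a reader can drop them after parsing. A small sketch, where the function name, the sample values and the recursive stripping are all assumptions:

```python
import json


def strip_comments(node):
    """Remove all keys starting with '#' from a parsed .fmt structure."""
    if isinstance(node, dict):
        return {key: strip_comments(value)
                for key, value in node.items() if not key.startswith("#")}
    if isinstance(node, list):
        return [strip_comments(value) for value in node]
    return node


text = '''{
    "#license": "example",
    "class": "apache generic",
    "fields": {"#note": "example", "%h": {"id": "remote_host"}}
}'''
definition = strip_comments(json.loads(text))
```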

-----

### Other format files

!!! Note
    This section is about fictional features.


#### .grok definitions

> Not implemented yet.

The logfmt/ directory might also contain .grok files, which get transformed
into .fmt structures. (Probably with the grok: parameter for fields, and
a grok: pattern table alongside regular fields:).

There's already a pretransformed `grok.fmt`, which however requires
`%{GROK:%{PATTERN:id}}` references currently.


#### .lnav formats

> Not implemented yet.

Likewise, we could use lnav .json format definitions. Those are static
too, however.