Warning
❮❗❯ This is all very provisional. (First draft. Names might still change.)
Global .fmt database
While each log file should be accompanied by a .fmt descriptor,
the global database in /usr/share/logfmt/ contains a full .fmt field
definition for each class. Combining both makes it possible to construct
a regex.
Most notably the "fields": and "placeholder": entries are used to turn the
"record": string definition into a capture pattern.
.fmt Example
The Apache format definition (apache.fmt) contains:
{
    "class": "apache generic",
    "separator": " ",
    "rewrite": {
        "%[\\d!,+\\-]+": "%",
        "%%": "%"
    },
    "placeholder": "%[<>]?(?:\\w*\\{[^\\}]+\\})?\\^?\\w+",
    "fields": {
        "%a": { "id": "remote_addr", "rx": "[\\d.:a-f]+" },
        "%h": { "id": "remote_host", "rx": "[\\w\\-.:]+" },
        "%{c}h": { "id": "remote_host", "rx": "[\\w\\-.:]+" },
        "%A": { "id": "local_address", "rx": "[\\d.:a-f]+" },
        "%u": { "id": "remote_user", "rx": "[\\-\\w@.]+" },
        "%t": { "id": "request_time", "rx": "\\[?(\\d[\\d:\\w\\s:./\\-+,;]+)\\]?" },
        …
    },
    "alias": {
        "remote_address": "remote_addr",
        "ip": "remote_addr",
        "file": "request_file",
        "size": "bytes_sent",
        …
    },
    "expand": {
        "%\\{([^{}]+)\\}t": {
            "id": "request_time",
            "class": "strftime",
            "record": "$1"
        }
    },
    "container": {
        "message": {
            "id": "$1",
            "value": "$2",
            "rx": "\\[(\\w+) \"(.*?)\"\\]",
            "class": "apache mod_security"
        }
    },
    "glob": ["/var/log/apache*/*acc*.log"]
}
It usually does not describe a default "record" format (like the local .log.fmt descriptors do).
class:
The class in the global database is largely decorative. Instead, the
filenames define the inheritance of rules/fields. The "class" as declared by
a .log.fmt is mapped onto /usr/share/logfmt/application.variant.fmt.
- Usually there's just one variant level per log type. But the lookup is supposed to be mildly recursive.
- Essentially it should merge *.log.fmt with appclass.variant.fmt, and with appclass.fmt applied last, so the most specific definitions are retained.
- There's also a generic "grok" class. But the patterns therein are largely static (not built from variable format strings).
- Some special classes like "json" might exist. (Not supported by logfmt1)
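The lookup and merge order described above could be sketched roughly as follows. This is only an illustration, not the logfmt1 implementation: the function names, the in-memory `database` dict (standing in for /usr/share/logfmt/), and the assumption that a class string such as "apache combined" maps onto dot-joined filenames are all hypothetical.

```python
def candidate_files(log_class):
    """Assumed mapping: 'apache combined' -> apache.combined.fmt, apache.fmt
    (most specific file first)."""
    parts = log_class.split()
    return [".".join(parts[:n]) + ".fmt" for n in range(len(parts), 0, -1)]


def merge_fmt(specific, generic):
    """Merge two .fmt dicts. Keys from `specific` win; nested maps
    (fields, alias, ...) are merged recursively."""
    merged = dict(generic)
    for key, value in specific.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_fmt(value, merged[key])
        else:
            merged[key] = value
    return merged


def resolve(local_fmt, database):
    """Combine a local .log.fmt with the global definitions.
    `database` maps filenames to already parsed .fmt dicts (hypothetical)."""
    result = dict(local_fmt)
    for name in candidate_files(local_fmt.get("class", "")):
        if name in database:
            # The more generic file is applied later and only fills gaps.
            result = merge_fmt(result, database[name])
    return result
```

Because every merge step lets the already collected (more specific) values win, the generic appclass.fmt can only contribute entries that nothing more specific has defined.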
record:
The "record" entry is not usually present in the global .fmt definition. Some very specific variant definitions (for example apache.error.fmt) or static formats (syslog.fmt) might include one, however.
separator:
Most log formats use spaces for separating %placeholder fields. And simpler implementations might just split up the "record" declaration on this.
placeholder:
logfmt1 instead uses a regex definition of possible %placeholder
strings to map them onto fields. It should account for prefixes/suffixes, unless
those got cleared by the rewrite map.
Not all format strings use %\w+ to signal placeholders. In nginx, for instance,
the sigil $\w+ introduces placeholders (variable names, really).
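To illustrate how a placeholder regex picks tokens out of a format string, here is a small sketch. The Apache pattern is copied from the example above and the nginx pattern follows the `$\w+` description; the helper name `find_placeholders` and the sample format strings are assumptions for demonstration only.

```python
import re

# "placeholder" regex from the apache.fmt example
APACHE_PLACEHOLDER = r"%[<>]?(?:\w*\{[^\}]+\})?\^?\w+"
# nginx-style sigil, as described in the text
NGINX_PLACEHOLDER = r"\$\w+"


def find_placeholders(record, placeholder_rx):
    """Return all placeholder tokens found in a "record" format string."""
    return re.findall(placeholder_rx, record)


apache_tokens = find_placeholders('%h %l %u %t "%r" %>s %b', APACHE_PLACEHOLDER)
nginx_tokens = find_placeholders("$remote_addr - $remote_user [$time_local]",
                                 NGINX_PLACEHOLDER)
```

Each token found this way can then be looked up as a key in the "fields" map.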
rewrite:
A list/map of regexes to apply before any transformations or field lookups. It can be used to mask or simplify placeholder definitions (for instance clean up the Apache conditional prefixes) or regex meta characters.
- The record field starts as a static string, but is meant to be turned into a regex.
- Therefore meta characters (such as | or []) have to be taken care of. Which is what the rewrite map is lazily used for.
- Better implementations might look up the placeholders, and automatically escape the rest of the "record" format string.
fields:
The core of the global .fmt definitions are the field lists. Each defines a static %F placeholder and associates it with a default field name (id:) and regex (rx:) or even a grok definition (grok:).
key | purpose |
---|---|
%F | JSON key: static placeholder string (not a regex itself) |
id | field identifier, as specified by the application (internal name) |
rx | regex which %F placeholder gets replaced with |
grok | alternatively to regex, %F might be turned into %PATTERN:id |
type | "int" and "float" could designate strictly numeric fields |
Notes
- As part of the regex transformation, a %F could be turned into (?<id>\S+) for instance.
- If there's any unnamed capture group (…), it should be augmented into a named capture group, instead of the whole match. (To account for implicit wrapping.)
- The rx itself might however specify named subgroups (like request_line in Apache logs, itself comprised of _method, _path, _protocol, or the datetime made up of tm_wday, tm_year, tm_whatever).
- \S+ is also used as fallback for entirely undefined placeholders (no expand definition matched) in logfmt1.
- grok isn't currently used, but might allow for simpler transformations (indirectly into a grok pattern, and later a regex).
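The notes above could translate into something like the following sketch. It is not the logfmt1 code: the helper names are hypothetical, only three fields from the example are included, the group name chosen for undefined placeholders is an assumption, and Python needs the `(?P<id>…)` spelling for named groups. Literal text between placeholders is escaped automatically here, instead of relying on the rewrite map.

```python
import re

PLACEHOLDER = r"%[<>]?(?:\w*\{[^\}]+\})?\^?\w+"
FIELDS = {  # excerpt of the "fields" map from the apache.fmt example
    "%h": {"id": "remote_host", "rx": r"[\w\-.:]+"},
    "%u": {"id": "remote_user", "rx": r"[\-\w@.]+"},
    "%t": {"id": "request_time", "rx": r"\[?(\d[\d:\w\s:./\-+,;]+)\]?"},
}


def field_to_group(placeholder, fields):
    """Turn one placeholder into a named capture group."""
    field = fields.get(placeholder)
    if field is None:
        # Fallback \S+ for undefined placeholders; deriving the group
        # name from the placeholder text is an assumption of this sketch.
        name = re.sub(r"\W+", "", placeholder).lower()
        return f"(?P<{name}>\\S+)"
    rx, name = field["rx"], field["id"]
    # Look for the first unnamed group: "(" not escaped, not followed by "?".
    # (Simplification: a "(" inside a character class would be misdetected.)
    unnamed = re.search(r"(?<!\\)\((?!\?)", rx)
    if unnamed:
        pos = unnamed.end()
        return rx[:pos] + f"?P<{name}>" + rx[pos:]
    # No group present: wrap the whole regex.
    return f"(?P<{name}>{rx})"


def record_to_regex(record, fields, placeholder_rx=PLACEHOLDER):
    """Replace placeholders by capture groups, escape everything else."""
    out, last = [], 0
    for m in re.finditer(placeholder_rx, record):
        out.append(re.escape(record[last:m.start()]))
        out.append(field_to_group(m.group(0), fields))
        last = m.end()
    out.append(re.escape(record[last:]))
    return "".join(out)


line_rx = record_to_regex("%h %l %u %t", FIELDS)
row = re.match(line_rx, "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700]").groupdict()
```

Note how the %t field keeps its optional brackets outside the capture, because the existing unnamed group is renamed rather than wrapped.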
expand:
The expand declarations are used to construct unknown fields/placeholders.
Instead of static %placeholders, each entry describes a regex to detect
new/variant placeholders. Thus it can simply be applied before
separator/placeholder are looked at, to augment the known fields
list.
key | purpose |
---|---|
%\{(\w+)\}t | JSON key: a regex to detect mutable placeholders |
id | name for newly created fields entry, might use the capture's $1 |
rx | for static definitions (often just \S+) |
if_quoted | alternative regex, if placeholder was enclosed in "%\w+" quotes |
class | recurse into other .fmt types |
record | can be set to $2 if class: recursion is defined |
Notes
- Typically it suffices to specify the id and rx field.
- If no id is given, then the regex capture is normalized into an identifier (non-alphanumerics stripped, all lowercased).
- But the id or record value might be set with regex captures (e.g. $1 or $2) or compound values ("id": "newfield_$1").
- And logfmt1 allows recursing into other format types per class (which is used to expand the captured "record": "$1" into regex tokens).
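How expand rules might augment the fields list can be sketched as below. The `%\{([^{}]+)\}t` rule is taken from the apache.fmt example; the second rule (for `%{…}i` placeholders) and all function names are hypothetical, and the class: recursion is deliberately left out.

```python
import re

EXPAND = {
    # from the apache.fmt example
    r"%\{([^{}]+)\}t": {"id": "request_time", "class": "strftime", "record": "$1"},
    # hypothetical rule without an id, to show the normalization
    r"%\{([^{}]+)\}i": {"rx": r"\S+"},
}


def substitute_captures(template, match):
    """Replace $1, $2, ... in a template with the regex captures."""
    return re.sub(r"\$(\d)", lambda m: match.group(int(m.group(1))) or "", template)


def normalize_id(text):
    """Strip non-alphanumerics and lowercase, as described in the notes."""
    return re.sub(r"[^0-9A-Za-z]", "", text).lower()


def expand_fields(placeholders, expand, fields):
    """Return a copy of `fields`, extended by entries built from expand rules."""
    result = dict(fields)
    for ph in placeholders:
        if ph in result:
            continue  # static definition already known
        for pattern, rule in expand.items():
            m = re.fullmatch(pattern, ph)
            if not m:
                continue
            entry = dict(rule)
            for key in ("id", "record"):
                if key in entry:
                    entry[key] = substitute_captures(entry[key], m)
            if "id" not in entry:
                entry["id"] = normalize_id(m.group(1))
            result[ph] = entry
            break
    return result


new_fields = expand_fields(["%{%Y-%m-%d}t", "%{User-Agent}i", "%h"], EXPAND, {})
```

Placeholders that neither have a static definition nor match an expand rule (here `%h`, since the fields map was left empty) stay undefined and would fall back to \S+.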
alias:
Maps alternative/more common field names onto the declared field ids.
To get to some state of standardization, the field ids usually refer
to application-internal names (for instance log_pfn_register(…,…,cb_id)
names in Apache). And those aren't always the more commonly used identifiers.
Thus aliases make sense not just for convenience, but also to be compatible
with other common names (e.g. W3C extended log format names like cs-time).
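One possible way for a log reader to honour the alias map is to copy values under their alternative names once a row has been parsed. This is only a sketch with a made-up function name and sample row; the alias entries are the ones from the apache.fmt example.

```python
ALIAS = {
    "remote_address": "remote_addr",
    "ip": "remote_addr",
    "file": "request_file",
    "size": "bytes_sent",
}


def apply_aliases(row, alias):
    """Add alias names to a parsed row, without overwriting real fields."""
    out = dict(row)
    for alt_name, field_id in alias.items():
        if field_id in row and alt_name not in out:
            out[alt_name] = row[field_id]
    return out


row = apply_aliases({"remote_addr": "127.0.0.1", "bytes_sent": "512"}, ALIAS)
```

Aliases whose target field is absent from the row (like "file" here) are simply skipped.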
container:
This section is utilized by logopen() to extract additional fields (lists even) from one
of the existing fields. This is usually done at row traversal, and makes
sense for application-specific subformats in logs, such as any key=value
lists in the main message field.
key | purpose |
---|---|
message | JSON key: from which field to extract |
rx | regex to detect and capture (key)=(value) fields |
id | unpacked field name (usually just $1 from the rx capture) |
value | value from capture (so $2 typically) |
class | decorative description (no .fmt recursion supported in logfmt1) |
Notes
- The entries here might become lists, since commonly there's just one message field in logs, yet multiple key:value schemes might be utilized within.
- Or the target field might become an "extract_from": property, and container a list itself.
- Still not sure if automatic list conversion is a good idea.
- Standard fields get an enumeration suffix (?<request_uri2>…) if duplicated.
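The container extraction might look roughly like this at row traversal. The container rule is the one from the apache.fmt example; the function names and the sample message are invented for illustration, and letting a repeated key overwrite the earlier value is just one of the open choices mentioned above.

```python
import re

CONTAINER = {
    "message": {
        "id": "$1",
        "value": "$2",
        "rx": r'\[(\w+) "(.*?)"\]',
        "class": "apache mod_security",
    }
}


def capture_ref(template, match):
    """Resolve $1/$2 references against a regex match."""
    return re.sub(r"\$(\d)", lambda m: match.group(int(m.group(1))) or "", template)


def unpack_containers(row, container):
    """Extract additional key/value fields from existing fields of a row."""
    out = dict(row)
    for source_field, rule in container.items():
        text = row.get(source_field)
        if not text:
            continue
        for m in re.finditer(rule["rx"], text):
            # Assumption: a repeated key replaces the previous value.
            out[capture_ref(rule["id"], m)] = capture_ref(rule["value"], m)
    return out


sample = {"message": '[id "960015"] [msg "Missing Accept Header"]'}
unpacked = unpack_containers(sample, CONTAINER)
```

The original message field is kept as is; the unpacked keys are added next to it.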
glob:
Might be used by log processors to look up a log class, based on file names, if no .log.fmt is declared.
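A log processor could use the glob lists along these lines. The sketch assumes the global definitions were already loaded into a dict keyed by filename, which is a hypothetical structure; only the Apache glob from the example is used.

```python
from fnmatch import fnmatchcase

# hypothetical in-memory view of the global database
DATABASE = {
    "apache.fmt": {
        "class": "apache generic",
        "glob": ["/var/log/apache*/*acc*.log"],
    },
}


def class_for_file(path, database):
    """Guess the log class for a file name from the "glob" lists."""
    for fmt in database.values():
        if any(fnmatchcase(path, pattern) for pattern in fmt.get("glob", [])):
            return fmt.get("class")
    return None


guessed = class_for_file("/var/log/apache2/access.log", DATABASE)
```

`fnmatchcase` treats `*` as matching any characters, including `/`, so the pattern is compared against the full path string.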
#comment: fields
Documentation entries in the .fmt files have keys starting with #,
for example "#license": or "#origin":.
This is simpler than using JSON with comments (JSOL/JSON5).
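Since the comment entries are ordinary JSON keys, a reader can load a .fmt file with a standard JSON parser and drop them afterwards if they get in the way. A small sketch, with made-up function names and placeholder values:

```python
import json


def strip_comments(node):
    """Recursively remove all keys starting with '#'."""
    if isinstance(node, dict):
        return {k: strip_comments(v) for k, v in node.items()
                if not k.startswith("#")}
    if isinstance(node, list):
        return [strip_comments(v) for v in node]
    return node


def load_fmt(text):
    """Parse .fmt JSON text and discard the documentation entries."""
    return strip_comments(json.loads(text))


fmt = load_fmt('{"#license": "placeholder", "class": "apache generic", '
               '"fields": {"#note": "placeholder", "%h": {"id": "remote_host"}}}')
```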
Other format files
Note
This section is about fictional features.
.grok definitions
Not implemented yet.
The logfmt/ directory might also contain .grok files, which get transformed into .fmt structures. (Probably with the grok: parameter for fields, and a grok: pattern table alongside regular fields:).
There's already a pretransformed grok.fmt, which however currently requires
%{GROK:%{PATTERN:id}} references.
.lnav formats
Not implemented yet.
Likewise, we could use lnav .json format definitions. Those are static too, however.