ml2json mail archive processor

Usage

Help text

Skip to the Instructions section below for a step-by-step recipe.

$ ./ml2json --help
ml2json sourcedir-or-file(s) [--json-to targetfile]

  sourcedir should be a directory with *.mbox files directly in it
  (but see --recurse).  If a file is given instead, it is assumed to
  be an mbox file.  Paths to maildirs (with cur and new
  subdirectories) or ezmlm archive directories are auto-detected and
  treated accordingly.

  All options and actions except for --deidentify, --open-message-id
  and --open-identify (and --help and --config) can alternatively be
  configured from a config file, see --config. Command line options
  have precedence. Sourcepath(s) can be configured as a 'sourcepaths'
  config key holding an array. Some options can *only* be configured
  in the config file.

  Sourcepaths can be files ending in .gz or .bz2, in which case they
  are uncompressed on the fly. (Note that this leads to different
  identify values, though.)

  The --json-to, --html-to and --source-to actions can be combined in
  the same run.


  Actions:

    --json-to file
                where the JSON output should go (use '-' for stdout).

    --html-to dir
                create html files in the specified directory
                (currently meant for debugging only)

    --html-index file
                create html file with thread 'index'; --html-to should
                be given, too, so that the link targets (currently
                assumed to be in the same directory) are available

    --source-to dir
                create files <dir>/<md5hash>/<number>.txt (where
                <md5hash>/<number> is the same string that
                --deidentify takes) and link them in the html files
                produced with --html-to

    --open-message-id message-id
                run $BROWSER on the given message that was written
                before using --html-to. Needs html-to option again to
                know where the files are, but doesn't regenerate files
                nor does it run json generation.

    --open-identify identifystring
                run $BROWSER on the given message that was written
                before using --html-to. Needs html-to option again to
                know where the files are, but doesn't regenerate files
                nor does it run json generation.

    --deidentify string
                print message identified by the given string, which is
                output into the JSON by the json_identify method (as
                'identify' field by default) or printed along with
                WARN and ERROR messages.
                Only works if the generated files are still available.
                If --attachment-basedir was given before, it has to
                be specified for --deidentify as well.

    --show-mbox-path md5hexstring[/number]
                show path of original mbox file for a given mbox
                identifier (as they are used in identify strings).

    --cleanup   delete currently used temp directory at /tmp/ml2json*


  Options:

    --config myconfig.pl
                load given file containing perl code, which must end
                with a perl hashref; see default configuration in
                './default_config.pl'
                for the options that can be set that way.  Several
                --config options can be given, each subsequent one
                overrides options loaded from the previous ones (with
                regards to the keys of the top level hash ref). The
                previously loaded config can be accessed from
                $main::config.  Keys in the config file use
                underscore to separate words, not '-'.

    --verbose   show NOTEs in addition to WARNings on stderr

    --recurse   recurse into subdirectories of sourcedir-or-file

    --mbox-glob globstring
                glob to use to find files in directories that contain
                mboxes. Defaults to '*.mbox'.


    --attachment-basedir path
                use path instead of a random subdirectory below
                '/tmp/ml2json' for the output; the output currently
                contains serialized objects as well, but those don't
                hurt, do they?
                If path is absolute, then the url field for
                attachments is output as file:// URI, if it is
                relative, it is output as a relative URI (i.e. no
                file:// prefix).

    --max-thread-duration duration[1]
                When encountering emails with no or no known
                in-reply-to and references headers, group them into
                the same thread according to their subject line as
                long as the time span between the first mail of that
                subject and the last one doesn't exceed the given
                duration. Pass '0' to disable. Default: '1 month'.

    --max-date-deviation duration[1]
                When mbox separators (lines starting with 'From ')
                contain time stamps, and those deviate more than
                <duration> from the Date header contained in the mail
                (or if there is no Date header in the mail), use the
                mbox time stamp instead. Default: off.  Note: mbox
                time stamps do not necessarily represent the time
                when an email was received; they may instead record
                when it was copied around.

    --filter-max-age duration[1]
                Only output messages which have a date/time more
                recent than the current time minus the given duration.
                When using this option, the in-reply-to fields, or the
                links in the generated pages with --html-to, can point
                to emails that are known but not written to the json
                stream or html-to directory.

    --jobs n
                use n instead of the default 2[2] jobs in parallel

  [1] duration can be anything that Time::Duration::Parse supports,
      like '1 day' or '1d and 5h'; bare numbers are interpreted as
      seconds.

  [2] default number of jobs is derived from the number of cores on
      the machine the program is running on.

  (Option names can be shortened when given as command arguments (not
  when given in config files) as long as they are unambiguous and
  you accept the risk of future ambiguity.)

  (Christian Jaeger <ch@christianjaeger.ch>)
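
The subject-based grouping that --max-thread-duration controls can be illustrated with a minimal Python sketch. This is not ml2json's actual code: the function name group_by_subject, the (subject, date) tuple representation, the verbatim subject comparison and the approximation of '1 month' as 30 days are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def group_by_subject(mails, max_thread_duration=timedelta(days=30)):
    """Group mails that lack reply headers into threads by subject.

    mails is an iterable of (subject, date) tuples (hypothetical format).
    A mail joins the newest thread with the same subject as long as it is
    dated no later than max_thread_duration after that thread's first
    mail; otherwise it starts a new thread. A zero duration disables
    grouping, so every mail becomes its own thread.
    """
    threads = []       # each thread is a list of (subject, date) tuples
    newest = {}        # subject -> index of the newest thread for it
    for subject, date in sorted(mails, key=lambda m: m[1]):
        idx = newest.get(subject)
        if (max_thread_duration
                and idx is not None
                and date - threads[idx][0][1] <= max_thread_duration):
            threads[idx].append((subject, date))
        else:
            newest[subject] = len(threads)
            threads.append([(subject, date)])
    return threads


if __name__ == "__main__":
    start = datetime(2020, 1, 1)
    mails = [
        ("hello", start),
        ("hello", start + timedelta(days=10)),  # within 30 days: same thread
        ("hello", start + timedelta(days=40)),  # too late: new thread
    ]
    for thread in group_by_subject(mails):
        print([date.date().isoformat() for _, date in thread])
```

In this sketch the span is measured from the first mail of a thread, matching the description above; a real implementation would probably also normalize subjects (for example stripping reply prefixes), which the help text does not specify.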

Instructions

  1. collect all mbox files or Maildir/ezmlm directories your archive is comprised of into one or several directories (or directory tree, if you use the --recurse option), possibly using symlinks.

  2. decide upon a base directory where all the unpacked attachments (as well as serialized state; for details see phases) should be stored; if you don't care about accessing the attachments, some directory under /tmp will be chosen. ml2json will create a symlink at ~/.ml2json-tmp which points to that generated directory, so that subsequent runs of ml2json will find it again and can skip the part of the work that was already done. If you do care about the generated attachments, specify the --attachment-basedir option.

  3. it's possible to customize which fields are output in the JSON by using a config file. You can also put most command line parameters into a config file; consider doing that if you want to run a particular conversion repeatedly and consistently. The file default_config.pl has the defaults; this file is not meant to be edited. Instead, create a new file that shadows the config keys that you want to set, then pass the path to it to the --config option. An example can be found in ml2json-list-generate/config.pl, which is used to generate the list archive. Read about --config in ./ml2json --help.

  4. run ./ml2json sourcedir --json-to targetfile, perhaps with additional options of your choice (in particular, you need to use the --mbox-glob option if the files are not named according to the default mbox glob pattern, *.mbox; use * to make it look at all files).

  5. the temp / attachments dir is structured as follows:

    $attachment_basedir/$hash_of_mailbox_path/$i/<files>
    

    $hash_of_mailbox_path is the result of applying the mailbox_path_hash config setting to the path of the mailbox; this shortens the path to something that won't ever conflict. It does not necessarily hide the original path: if the default hash config setting isn't changed, the original path can be determined with MD5 hash crackers, and if the $attachment_basedir/$hash_of_mailbox_path/__meta file is still present, the path can be read from it.

    If the cache_dir option is set, then no __meta files will be created within $attachment_basedir (making it possible to use it cleanly for serving in public html archives).

    If you want to know which mailbox path a particular hash originated from, use the ml2json --show-mbox-path option.

    $i is a string indicating the position of the email in the mailbox, or $o-$p in case of ezmlm archive dirs, where $o is the ezmlm subdir and $p the message file name within the subdir (both being non-negative integers, $p possibly with a leading zero).

    You can run ml2json --deidentify "$hash_of_mailbox_path/$i" to make it print the original message string (as it was cut out of the mbox file, or copied from a Maildir file).

  6. optionally, to clean up the generated temporary / attachments files, run ml2json with the --cleanup option; if you gave the --attachment-basedir option before, it has to be given again, otherwise ml2json will just look at ~/.ml2json-tmp (or do nothing if not present).
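
The directory layout from step 5 and the URL rule described under --attachment-basedir can be summarized in a short Python sketch. It is only an illustration under stated assumptions: all function names are hypothetical, and mailbox_hash assumes that the default hash is the hex MD5 digest of the mailbox path string, which the text above suggests but does not spell out. For real lookups, use ml2json --show-mbox-path and --deidentify.

```python
import hashlib
from pathlib import PurePosixPath


def mailbox_hash(mailbox_path):
    # Assumption: the default mailbox_path_hash is the hex MD5 digest of
    # the path string; the real setting is configurable and may differ.
    return hashlib.md5(mailbox_path.encode("utf-8")).hexdigest()


def split_identify(identify):
    """Split an identify string 'hash/position' into its two parts."""
    mbox_hash, _, position = identify.partition("/")
    if not mbox_hash or not position:
        raise ValueError("expected '<hash>/<position>': %r" % identify)
    return mbox_hash, position


def message_dir(attachment_basedir, identify):
    """Return $attachment_basedir/$hash_of_mailbox_path/$i for a message."""
    mbox_hash, position = split_identify(identify)
    return PurePosixPath(attachment_basedir) / mbox_hash / position


def attachment_url(attachment_basedir, identify, filename):
    """Absolute base dir gives a file:// URI, relative one a relative URI."""
    path = message_dir(attachment_basedir, identify) / filename
    return path.as_uri() if path.is_absolute() else str(path)


if __name__ == "__main__":
    # Hypothetical mailbox path and message position, for illustration only.
    identify = mailbox_hash("/home/me/archive/list.mbox") + "/7"
    print(message_dir("attachments", identify))
    print(attachment_url("/srv/attachments", identify, "patch.diff"))
    print(attachment_url("attachments", identify, "patch.diff"))
```

The position part is kept as an opaque string, so ezmlm-style positions of the form $o-$p pass through unchanged.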