Contact
Offices

[Howto] Zarafa mail extraction

Damian Lukowski
zarafa_logo.jpgMail administrators which use spamassassin and its Bayesian filter need to train the classifier on a regular basis. Spamassassin's sa-learn utility needs to process full RFC-822 emails including mail headers and the the mail body. The following howto describes a method of how full emails can be extracted from the mailsystem Zarafa.
A Zarafa environment utilizes a MySQL database to maintain its users, mail stores, folders and other "mail objects". Two operating modes exist for Zarafa's attachment-storage:
  • In database operation mode, all objects and object types are stored within the database, especially emails, headers and attachments.
  • In files operation mode, however, potentially large objects, namely attachments and full emails (which include their attachments as multipart blocks), are stored as files on the server's file system.
The following points are particularly interesting:
  • The email header alone is found in the properties database table.
  • Data regarding user accounts are distributed amongst the users and stores table.
  • When in database operation mode, Zarafa stores full emails in the lob table or, if using files operation mode, it stores them on the server's filesystem, typically under /var/lib/zarafa/....
  • Relationships between related data objects of multiple tables are established by instance IDs (SQL key attributes).
In files mode, the complete path of a gzip compressed full email has the form {attachment_path}/{INT}/{INT}/{Instance_ID}.gz, where the attachment_path equals /var/lib/zarafa in Zarafa's standard configuration. Its subfolders derive from the instance ID, so a concrete sample path might be /var/lib/zarafa/4/6/43664.gz. The following script reads a string from standard input and searches for matching email headers in the properties database table. For every hit, the corresponding full email is extracted from the lob database table, so it is important that Zarafa runs with attachment_storage=database. One must consider that the given SQL queries might produce a high database load - depending on its size. The script has been tested with a standard installation of Zarafa 7.0.9. A sample run would be:
[root@zarafa ~]# cat needle.txt
Subject: test Mon, 04 Feb 2013 15:05:12 +0100
[root@zarafa ~]# mkdir hackstack; ./zarafa-extract.pl zarafauser haystack < needle.txt
 Writing haystack/1359986776.0000.txt ...
[root@zarafa ~]# cat haystack/1359986776.0000.txt
Date: Mon, 04 Feb 2013 15:05:12 +0100
To: zarafauser
From: zarafa.testsystem.intern
Subject: test Mon, 04 Feb 2013 15:05:12 +0100
X-Mailer: swaks v20100211.0 jetmore.org/john/code/swaks/
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----=_MIME_BOUNDARY_000_14806"
 
------=_MIME_BOUNDARY_000_14806
Content-Type: text/plain
 
This is a test mailing
------=_MIME_BOUNDARY_000_14806
Content-Type: application/octet-stream; name="test.xml"
Content-Description: test.xml
Content-Disposition: attachment; filename="test.xml"
Content-Transfer-Encoding: BASE64
 
SGVsbG8gV29ybGQK
The following Perl script uses both DBI and DBD::mysql modules, which are included in the libdbd-mysql-perl package under Debian, for example.
#!/usr/bin/perl -w
 
# Copyright (c) 2013, Damian Lukowski <damian.lukowski@credativ.de>
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#    * Redistributions of source code must retain the above copyright
#      notice, this list of conditions and the following disclaimer.
#    * Redistributions in binary form must reproduce the above copyright
#      notice, this list of conditions and the following disclaimer in the
#      documentation and/or other materials provided with the distribution.
#    * Neither the name of the <organization> nor the
#      names of its contributors may be used to endorse or promote products
#      derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL <COPYRIGHT HOLDER> BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
use strict;
use DBI;
use DBD::mysql;
 
my $DEBUG = 0;
 
# Fill in the MySQL server properties
my ($dbhost,$dbport,$dbname,$dbuser,$dbpass) = ('', 3306, 'zarafa', '', '');
 
if (@ARGV < 2)
{
print "Usage:$0 <zarafa-user> <outdir>\n
$0 reads a mail header from STDIN and searches
for zarafa-users's emails matching the given mail header.
Each result is stored as a separate file within <outdir>.
It is assumed that zarafa is configured with
attachment_storage = database!\n";
exit;
}
 
my ($username, $outdir) = @ARGV;
 
chomp(my $needle = join '', <STDIN>);
$needle = "%$needle%";
# RFC822 demands CR LF as a line separator.
$needle =~ s/(?!\r)\n/\r\n/g;
 
my $dbh = DBI->connect("dbi:mysql:$dbname:$dbhost:$dbport", $dbuser, $dbpass)
  or die $DBI::errstr;
 
my ($qry, $ref, $sth, $uid);
 
if ($username ne '.')
{
 
$qry = "SELECT user_id FROM stores WHERE user_name = '$username'";
$ref = $dbh->selectall_arrayref($qry);
 
if (@$ref > 1)
{
print "Ambiguous username. Matching userids: ",
join ', ', map {$_->[0]} @$ref;
exit;
};
 
die "Username not found\n" unless @$ref;
$uid = $$ref[0][0];
print "uid of $username: $uid\n" if $DEBUG;
};
 
if ($DEBUG && $username ne '.')
{
$qry = "SELECT COUNT(*) FROM properties";
printf "%d properties\n", ${$dbh->selectall_arrayref($qry)}[0][0];
 
$qry = "SELECT COUNT(*) FROM lob";
printf "%d large objects\n", ${$dbh->selectall_arrayref($qry)}[0][0];
 
$qry = "SELECT COUNT(*) FROM properties WHERE tag = 125";
printf "%d properties with tag 125 (headers)\n",
${$dbh->selectall_arrayref($qry)}[0][0];
 
$qry = "SELECT COUNT(*) FROM properties WHERE val_string LIKE ?";
$sth = $dbh->prepare($qry);
$sth->execute($needle);
printf "%d tag-125 properties matching needle\n",
${$sth->fetchrow_arrayref}[0];
 
        $qry = "SELECT h.owner, s.user_name, p.hierarchyid FROM "
             . "hierarchy h RIGHT JOIN properties p on h.id = p.hierarchyid "
             . "LEFT JOIN stores s ON s.user_id=h.owner WHERE p.tag = 125 "
             . "AND p.val_string LIKE ?";
 
$sth = $dbh->prepare($qry);
$sth->execute($needle);
while (my @row = $sth->fetchrow_array())
{
printf " uid=%s, username=%s, hierarchyid=%s\n", @row;
}
 
$qry = "SELECT count(*) FROM hierarchy h JOIN properties p on "
     . "h.id = p.hierarchyid LEFT JOIN singleinstances s on "
     . "s.hierarchyid = h.id WHERE p.tag = 125 AND h.owner = $uid";
printf "%d tag-125 properties of %s\n",
${$dbh->selectall_arrayref($qry)}[0][0], $username;
 
$qry = "SELECT h.owner, p.hierarchyid, s.instanceid FROM hierarchy h "
     . "JOIN properties p on h.id = p.hierarchyid LEFT JOIN "
     . "singleinstances s on s.hierarchyid = h.id WHERE "
     . "h.owner = $uid AND p.val_string LIKE ?";
$sth = $dbh->prepare($qry);
$sth->execute($needle);
$ref = $sth->fetchall_arrayref();
printf "%d tag-125 properties of %s matching needle\n",
(scalar @$ref), $username;
 
for my $row (@$ref)
{
printf " uid=%s, hierarchyid=%s, instanceid=%s\n", @$row;
}
}
 
# A tag value of 125 corresponds to the mapitag definition as found in zarafa's
# sources under mapi4linux/include/mapitags.h:
# define PR_TRANSPORT_MESSAGE_HEADERS          PROP_TAG(PT_TSTRING,    0x007D)
# define PR_TRANSPORT_MESSAGE_HEADERS_W        PROP_TAG(PT_UNICODE,    0x007D)
# define PR_TRANSPORT_MESSAGE_HEADERS_A        PROP_TAG(PT_STRING8,    0x007D)
 
if ($username ne '.')
{
$qry= '(SELECT s.instanceid FROM properties p JOIN hierarchy h '
. 'ON h.id = p.hierarchyid JOIN singleinstances s ON '
. "s.hierarchyid = h.id WHERE h.owner = $uid AND p.tag=125 AND "
. 'p.val_string LIKE ?)';
} else
{
$qry= '(SELECT s.instanceid FROM properties p '
. 'JOIN singleinstances s ON '
. "s.hierarchyid = p.hierarchyid WHERE p.tag=125 AND "
. 'p.val_string LIKE ?)';
}
 
$sth = $dbh->prepare($qry);
$sth->execute($needle);
 
my $iid = undef;
 
# Hint:
# For zarafa deployments with attachment_storage = files, the requested mails
# are not stored in the lob table, but within /var/lib/zarafa/. Let iid be
# 54321, then the mail is stored in /var/lib/zarafa/1/12/54321.gz
# The extraction logic is not covered here ...
 
my $written = 0;
 
while (my @row = $sth->fetchrow_array())
{
$iid = $row[0];
$qry = "SELECT val_binary FROM lob WHERE instanceid = $iid";
$ref = $dbh->selectall_arrayref($qry);
 
print "Looking for full mail of iid=$iid\n" if $DEBUG;
 
for my $row (@$ref)
{
print " Found full mail with iid=$iid\n" if $DEBUG;
my $tmp_file = sprintf '%s/%s.%04d.txt', $outdir,
time, $written++;
open my $fh, "> $tmp_file";
print " Writing $tmp_file ...\n";
print $fh $row->[0];
close $fh;
};
};
 
unless (defined $iid)
{
print "No iid found, probably because the header didn't match.\n";
print "If you are sure that the header should be found, try to use ",
      "'.' (dot) as the username.\nThe DB query will be slower but ",
      "might find the email.\n" unless $username eq '.';
} elsif ($written == 0)
{
print"Instanceid(s) found, but no files written? ",
"Probably attachment_storage != database\n";
}

Add new comment

Image CAPTCHA