painful mupdate syncs between front-ends and database server

Hello, list,

Today we're enjoying our first full work day of independence from the old 
monolithic cyrus server installed in 1999 (Sun 6800 -- it's had new CPU 
boards since then, but that's it), and our first on our shiny new cluster of 
T5220s, which are mostly happily operating as a murder.

I say mostly because while most of the time the thing handles our 80,000 
users and 14,000+ simultaneous connections like a champ, some of the time, 
we get some extreme pain, mostly due to syncs between the MUPDATE master 
and the front-end servers.

When we spec'ed out our servers, we didn't put much I/O capacity into the 
front-end servers -- just a pair of mirrored 10k disks doing the OS, the 
logging, the mailboxes.db, and all the webmail action going on in another 
solaris zone on the same hardware.  We thought this was sufficient given 
the fact that no real permanent data lives on these servers, but it turns 
out that while most of the time it's fine, if the mupdate processes ever 
decide they need to re-sync with the master, we've got 6 minutes of trouble 
ahead while it downloads and stores the 800k entries in the mailboxes.db.
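
For a sense of scale, here is the back-of-the-envelope arithmetic on those 
numbers as a quick Python sketch.  The function name is made up purely for 
illustration, and the inputs are the rough figures above (800k entries, 
about 6 minutes), not measured values.

```python
def sync_rate(entries: int, minutes: float) -> float:
    """Average mailboxes.db entries written per second during a full resync."""
    seconds = minutes * 60
    return entries / seconds


# Rough figures from the paragraph above; both are approximations.
rate = sync_rate(800_000, 6)
print(f"{rate:.0f} entries/sec, {1000 / rate:.2f} ms per entry")
```

That works out to a bit over 2,000 entry writes per second sustained for the 
whole resync, on the same mirrored pair that is also handling the OS, the 
logging, and the webmail zone.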

During these sync periods, we see two negative impacts.  The first is 
lockup on the mailboxes.db on the front-end servers, which slows down both 
accepting new IMAP/POP connections and the reception of incoming messages. 
(The front-ends also accept LMTP connections from a separate pair of 
queueing hosts, then proxy those to the back-ends.)  The second is that, 
because the front-ends go into a

It's awfully frustrating to have a system that, as my boss says, performs 
like a Camaro most of the time until you hit a little rock in the road, and 
then suddenly turns into a Pinto.  It's also frustrating that this seems like 
one of the less complicated aspects of the system -- publishing replicas of 
a read-only database to a few worker boxes.

I suppose this is why Fastmail and others ripped out the proxyd's and replaced 
them with nginx or perdition.  Currently we still support GSSAPI as an auth 
mechanism, which kept me from going that direction, but given the problems 
we're seeing, I'd be open to architectural suggestions on either how to tie 
perdition or nginx to the MUPDATE master (because we don't have the 
back-ends split along any discernible lines at this point), or suggestions 
on how to make the master-to-frontend propagation faster or less painful.

Sorry for the long message, but it's not a simple problem we're fighting.

Michael Bacon
UNC Chapel Hill 
