I have a FreeBSD 9 system with ZFS root. It is actually a VM under Xen on a beefy piece of hardware (4-core Sandy Bridge 3 GHz Xeon, 32GB total memory; the VM has 4 vCPUs and 6GB RAM). The root pool is on mirrored gpart partitions. I care more about data integrity than performance, as long as performance is reasonable (which it more than has been for the last 3 months).
The other "servers" (VMs) on the same hardware are set up the same way but don't have this problem. There are 4 other FreeBSD VMs: one runs email for a one-man company and a few of his friends, plus some static web pages for him; one runs a few low-use web apps for various customers; and one runs about 30 websites with apache and nginx, mostly just static sites. None are heavily used. There is also one Linux VM running a couple of low-use FrontBase databases.
The troublesome VM has been running fine for over 3 months since I installed it, and the level of use has been pretty much constant. It runs 4 jails, each dedicated to a different bit of email processing for a small number of users: one is a secondary DNS, one runs clamav and spamassassin, one runs exim for incoming and outgoing mail, and one runs dovecot for IMAP and POP. There is no web server, database, or anything else running.
Total number of mail users on the system is approximately 50. Total mail traffic is very low compared to "real" mail servers.
Earlier this week things started "freezing up". Processes become unresponsive for anywhere from a few minutes to half an hour or longer. It eventually resolves itself and things are good for another 10 minutes to 3 hours, until it happens again. When it happens, lots of processes are listed in "top" as
state. These processes only get listed in these states when there are problems. What are these states indicative of?
Eventually things get going again, these states drop off and the system hums along.
Based on something I found through Google (from a person who had a different but somewhat similar problem), I tried setting
zfs set primarycache=metadata zroot
zfs set primarycache=none zroot
but the problem still happened with approximately the same severity and frequency. (I wanted to see if the system was "churning" on cache upkeep.)
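For anyone trying the same experiment, here is a rough sketch of how the property can be checked and put back afterward. The pool name zroot is from my setup. It assumes the setting was at its default before the test, which for primarycache is "all".

```shell
# Show the current value and where it comes from (local, inherited, or default)
zfs get primarycache zroot

# Drop the local override so the dataset goes back to the default ("all")
zfs inherit primarycache zroot

# Or set the default value explicitly instead of inheriting it
zfs set primarycache=all zroot
```

Either of the last two commands should undo the test settings; only one of them is needed.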
What is strange is that this server ran fine for 3 months straight without interruption with the same level of work.
Thanks for any hints or clues.
Some data points below:
# uname -a
FreeBSD newbagend 9.0-STABLE FreeBSD 9.0-STABLE #1: Wed Mar 21 15:22:14 MDT 2012 chad@underhill:/usr/obj/usr/src/sys/UNDERHILL-XEN amd64
# zpool status
scan: scrub repaired 0 in 6h13m with 0 errors on Fri Aug 10 19:33:23 2012