On 30/04/2012 19:23, Eitan Adler wrote:
> On 30 April 2012 07:36, Robert Bonomi<bonomi@mail...> wrote:
>> A competennt, "not stupid", sysadmin would know these things. And not
>> 'remove all doubt' (in the words of Abraham Lincoln), by raising such
>> nonsense questions.
> A competent sysadmin would ask questions when they don't know the
> answer bringing up possibilities they thought about.
> A stupid sysadmin would yell at someone asking a question claiming
> they should have known the answer.
I must admit that Robert Bonomi's tone was highly insulting for this list,
and though I completely condemn the form of his post, I cannot say I
disagree with the content.
There are quite a lot of things that are wrong with Alejandro Imass'
post and analysis.
The first thing is that he did not give his setup in one go. It took quite
a while to figure out what happened, what system he was using and how he
was using it.
At first he had to hard reboot an unresponsive system, then at reboot he
found he had lost all of his jails.
Then it appeared that all the jails were inside another jail and that
the unresponsiveness came from MySQL.
Then we learn that all his daemons are inside jails.
Then we learn that ftp-proxy is not.
Then we learn that the jails are not handled manually but through EZJail.
Then we are told that the problem with MySQL is known and comes from a
client using TigerCRM with far too much data.
There are literally dozens of little pieces of important knowledge all
over the thread. And you have to read it all to make sure you have the
global view. Not really a good start.
It is OK to forget to mention a thing or two, discarding what you think
is irrelevant to the problem at hand, but it is not OK to force people
who are trying to help you to read 50+ posts to learn about the basics
of your installation.
What is even more irritating is the fact that Alejandro Imass ignores
pretty much anything that would point toward a human mistake. Most posts
implying a possible bad use of jails/nullfs/ezjail are ignored or
answered by a simple "I have done everything by the book". Now from my
experience, someone with 6 servers, each containing multiple jails, will
not do everything by the book every time. It might be that Alejandro is
exceptional, but it is more likely that at least one if not more of
these jails were not made "by the book". There is nothing to blame anyone
for here, we all get tired/bored/overconfident sometimes, but refusing to
admit the very possibility of a human mistake won't help at all in
finding a solution. Reading the thread I realized that my suggestion
that he might have over-used "ln" had been discarded as "stupid", but
the information came a lot later in answer to another post. Of course in
the meantime I learned that he was using ezjail, which, if I had known
earlier, would have made me wonder if he had not overused nullfs or ln.
He furthermore discarded the possibility saying that he did not think
that ezjail was using links, just nullfs. Well, too bad: ezjail uses
links heavily, at least for basejail, and sometimes for ports trees or
perl setups, depending on which guide you are using as your reference.
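For anyone who wants to see how many links a jail really contains, here
is a minimal Python sketch. It only walks a directory tree and reports
every symlink with its target. The function name list_symlinks is mine,
not part of ezjail, and any path you pass to it (for example a jail root
under /usr/jails) is an assumption about your own layout.

```python
import os


def list_symlinks(root):
    """Return sorted (link_path, target) pairs for every symlink under root.

    Symlinked directories are reported but not descended into, so the
    walk stays inside the tree it was given.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # os.walk lists symlinks to directories in dirnames and all other
        # symlinks (including broken ones) in filenames, so check both.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                found.append((path, os.readlink(path)))
    return sorted(found)
```

Calling list_symlinks() on a jail root and looking at the targets that
start with /basejail should show quickly whether links are in play.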
During the thread he pretty much bashed anyone who tried to tell him
that no amount of jail/ezjail/nullfs/journal screw-up could have
resulted in the entire content of the jails being moved into another
completely unrelated directory node. If one jail had moved it would
already have been extraordinary, with the odds of it happening so
cleanly that fsck would find nothing already orders of magnitude worse
than the odds of winning the national lottery. But all of them? Not a
chance. He finally admitted that he had very little knowledge about UFS
and fsck, but still managed to do it in a quite offensive way.
That was basically the point where I decided to stop trying to help him.
I think others felt the same. This problem is quite interesting in
itself, and I think a lot of the most talented people on this list would
have been on it but were repelled by the attitude.
On the other hand, Alejandro Imass pretty much jumped on anything that
pointed to a third party, from someone hacking into his box to a
potential nullfs bug that might result in a PR.
Now the thing is that EZJail makes use of the "system immutable flag"
quite a lot for its config files, resulting in quite a lot of files being
impossible to delete or move unless the box is running at
kern_securelevel 0. This renders the whole "jails moved on their own"
theory even more improbable.
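If you want to check whether a given file carries that flag, a small
Python sketch like the one below should do. It relies on the st_flags
field that BSD-like systems expose through stat, and on the SF_IMMUTABLE
constant from the standard stat module. The two function names are
hypothetical helpers of mine, and which files to test is up to you.

```python
import os
import stat


def has_system_immutable(flags):
    """True if the system immutable (schg) bit is set in a st_flags value."""
    return bool(flags & stat.SF_IMMUTABLE)


def path_is_system_immutable(path):
    """Check the schg bit on a path without following a final symlink."""
    st = os.lstat(path)
    # st_flags only exists on BSD-like platforms; elsewhere assume no flags.
    return has_system_immutable(getattr(st, "st_flags", 0))
```

On the box itself, "ls -lo" shows the same information as "schg" in the
flags column.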
After so much ranting, I would feel bad not to try to help a little.
Here are the facts:
- In a jail, MySQL was grabbing all the CPU and making the box
unresponsive. This is due to TigerCRM making requests to a database that
is far too large.
  -> The jail was working.
  -> Unless all the data were in memory at this time (improbable), it
     means that access path/nullfs/EZJail were OK at this time.
- After a forced reboot all the jails were gone, or more exactly moved
inside another jail. fsck saw no error on the disk.
  -> The disk was in a stable state at reboot, the directory and
     file structure was consistent.
- Jails contained in the apache jail were in an OK state and could be
archived and restored.
  -> The data structure of the hard drive was clean, and file
     contents were OK.
From all this, here is what we can safely assume:
a) The box was not hacked, or at least the hacker did not move the jails
around. This is confirmed by MySQL working and doing enough I/O to stall
the box from inside a jail that was later seen as moved.
b) The hard reboot did not cause the problem, it revealed it. Since
fsck ran fine and the data were preserved, we can pretty safely assume
that there was no data or system corruption caused by the hard reboot.
Things to investigate:
- When was the last time this box was rebooted normally? Did it go
fine? Were the jails created at this time?
- What happens if you deactivate the jail that "survived" and reboot
normally, would the other jails contained in it start? If you deactivate
the jail but leave the nullfs mapping on and try to restart EZJail, do
the other jails start?
- What is the content of the different fstab.* and of the EZJail conf?
Does any of it point inside the jail that survived the reboot?
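The last check can be done by eye, but with many jails a small script is
less error prone. Below is a minimal Python sketch that takes the text of
an fstab file and lists the entries whose source or mount point lies
inside a given directory. The function names are mine, the comparison is
purely lexical (no symlink resolution), and the jail paths you feed it
are assumptions about your own layout.

```python
import os


def parse_fstab(text):
    """Return (device, mountpoint, fstype) for each non-comment fstab line."""
    entries = []
    for line in text.splitlines():
        # Drop comments and surrounding whitespace, then split on blanks.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 3:
            entries.append((fields[0], fields[1], fields[2]))
    return entries


def is_inside(path, root):
    """True if path equals root or lies below it (lexical comparison only)."""
    path = os.path.normpath(path)
    root = os.path.normpath(root)
    return path == root or path.startswith(root.rstrip(os.sep) + os.sep)


def entries_inside(text, root):
    """Return the fstab entries whose device or mount point is under root."""
    return [
        entry
        for entry in parse_fstab(text)
        if is_inside(entry[0], root) or is_inside(entry[1], root)
    ]
```

Running entries_inside() over each fstab.* file with the root of the
surviving jail as second argument should return an empty list if nothing
was ever mapped into it.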
Unfortunately, since the server was "corrected", we probably won't
have a satisfying answer. But honestly the probability of a system bug
is really low. Very likely the "moved" jails were inside the surviving
jail from the beginning, and a mix of nullfs remapping and the lack of a
reboot masked this fact for a while.