Beancounter FAILCNT issue - server stops

My Virtualmin GPL stops randomly at various times of the day and night. I’m not seeing anything in the logs. Apache does NOT stop, but all other services do, including named, webmin, saslauth, etc. I’m running CentOS 6.6 and Virtualmin 5.03. I have 4GB RAM, but the only errors I find are failcnt for privvmpages in beancounters. From what I’ve read this is related to lack of RAM, but the numbers don’t support that. What am I missing?

Version: 2.5
      uid resource          held   maxheld             barrier               limit failcnt
10086860: kmemsize      43551524  45789054 9223372036854775807 9223372036854775807       0
          lockedpages          0         0 9223372036854775807 9223372036854775807       0
          privvmpages     535592    582425             2097152             2097152      61
          shmpages          1206      1206 9223372036854775807 9223372036854775807       0
          dummy                0         0 9223372036854775807 9223372036854775807       0
          numproc            134       138               32567               32567       0
          physpages       292724    334227 9223372036854775807 9223372036854775807       0
          vmguarpages          0         0             1048576 9223372036854775807       0
          oomguarpages    292725    334228 9223372036854775807 9223372036854775807       0
          numtcpsock          42        44 9223372036854775807 9223372036854775807       0
          numflock           242       245 9223372036854775807 9223372036854775807       0
          numpty               1         1                 255                 255       0
          numsiginfo           2         3                1024                1024       0
          tcpsndbuf      1844960   1880032 9223372036854775807 9223372036854775807       0
          tcprcvbuf       696896    729664 9223372036854775807 9223372036854775807       0
          othersockbuf    316096   1524640 9223372036854775807 9223372036854775807       0
          dgramrcvbuf          0     32096 9223372036854775807 9223372036854775807       0
          numothersock       193       222 9223372036854775807 9223372036854775807       0
          dcachesize     2841694   2972215 9223372036854775807 9223372036854775807       0
          numfile          13820     14459 9223372036854775807 9223372036854775807       0
          dummy                0         0                   0                   0       0
          dummy                0         0                   0                   0       0
          dummy                0         0                   0                   0       0
          numiptent           61        61 9223372036854775807 9223372036854775807       0
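For anyone reading the numbers above, here is a minimal sketch that parses beancounter rows and converts the page-based counters to MiB. It assumes the usual 4 KiB page size on x86_64; the sample rows are copied from the output above, and the function names are only illustrative.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86/x86_64

# A few rows copied from the beancounters output in this post.
SAMPLE = """\
Version: 2.5
uid resource held maxheld barrier limit failcnt
10086860: kmemsize 43551524 45789054 9223372036854775807 9223372036854775807 0
privvmpages 535592 582425 2097152 2097152 61
physpages 292724 334227 9223372036854775807 9223372036854775807 0
vmguarpages 0 0 1048576 9223372036854775807 0
"""


def parse_beancounters(text):
    """Return {resource: {held, maxheld, barrier, limit, failcnt}}."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Drop a leading "uid:" or "Version:" token if present.
        if fields and fields[0].endswith(":"):
            fields = fields[1:]
        # A data row is a name followed by exactly five integers.
        if len(fields) != 6 or not all(f.isdigit() for f in fields[1:]):
            continue
        held, maxheld, barrier, limit, failcnt = (int(f) for f in fields[1:])
        rows[fields[0]] = {
            "held": held, "maxheld": maxheld,
            "barrier": barrier, "limit": limit, "failcnt": failcnt,
        }
    return rows


def pages_to_mib(pages):
    """Convert a page count to MiB under the PAGE_SIZE assumption."""
    return pages * PAGE_SIZE / (1024 * 1024)


def failing(rows):
    """Names of resources whose failcnt is non-zero."""
    return sorted(name for name, r in rows.items() if r["failcnt"] > 0)


rows = parse_beancounters(SAMPLE)
for name in failing(rows):
    r = rows[name]
    print(name, "failcnt:", r["failcnt"],
          "held MiB:", round(pages_to_mib(r["held"])),
          "limit MiB:", round(pages_to_mib(r["limit"])))
```

By this arithmetic privvmpages held is about 2092 MiB against a limit of 8192 MiB, and the vmguarpages barrier works out to 4096 MiB, which lines up with a 4GB guaranteed plus 4GB burst plan.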

Howdy,

OpenVZ can be a bit rough when it comes to resource limits.

It does look like you’re running into memory limits there… with OpenVZ, if you are using what they call burstable (non-guaranteed) RAM, it’s possible that a process could be killed off if another Virtual Machine on the same host requires RAM.

To resolve that, you’d either need to increase how much RAM is allocated to your Virtual Machine, or free up some more available RAM. Or ask your provider for all guaranteed RAM.

-Eric

I have 4GB guaranteed with another 4GB burst. But I will check with the provider that I am getting 4/4.

             total       used       free     shared    buffers     cached
Mem:          4096       2119       1976          0          0          0
-/+ buffers/cache:       2119       1976
Swap:            0          0          0

If you don’t have 15+ WP websites, or one with code leaking everywhere, I would blame the host. Still, you didn’t provide much info, so it’s hard to say. Before anything else: if you have WP or any other CMS with nulled/hacked themes or plugins, you can stop right now, because you have found the reason for what is happening: you got hacked.

If this is not your case, then:

  1. First, check the Apache and MySQL logs. It may also be worth looking at the messages and secure logs. Use a free service to monitor your server at a 1-minute interval, e.g. uptimedoctor.com is good, but there are many others. Just make sure the interval is 1 minute. It will show you precisely when your server went down and make it easier to check the log files.

  2. Google mysqltuner, save it to your server (/root) and execute it. It will show you what state your DB is in, with suggestions on what to change. If you are not a sysadmin, I would strongly suggest googling the results before making any changes. If you are using InnoDB instead of MyISAM, then before any changes please read this: http://stackoverflow.com/questions/3927690/howto-clean-a-mysql-innodb-storage-engine/4056261#4056261

  3. If you have WP, pay special attention to the Apache log and see how many direct hits you got on the login page or xmlrpc.php (especially the second one). If your login page is directly accessible, you must block this with htaccess (less complicated) or by changing the name and location of your login file and admin folder (much more complicated and prone to errors). As for xmlrpc.php, if you don’t use Jetpack it’s pretty safe to block all access to this file.
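For step 3, counting the hits can be sketched roughly like this. It assumes the Apache log is in the common/combined format (client IP first, request line in the first pair of double quotes). The sample lines and IP addresses are made up for illustration, so point it at your own access log instead.

```python
from collections import Counter

# Paths that brute-force bots typically hammer on WordPress sites.
TARGETS = ("/wp-login.php", "/xmlrpc.php")


def count_hits(lines):
    """Count requests per (target path, client IP) in access-log lines."""
    hits = Counter()
    for line in lines:
        parts = line.split('"')
        if len(parts) < 2:
            continue  # no quoted request line
        request = parts[1].split()  # e.g. ['POST', '/xmlrpc.php', 'HTTP/1.1']
        if len(request) < 2:
            continue
        path = request[1].split("?", 1)[0]  # ignore the query string
        ip = line.split(" ", 1)[0]
        for target in TARGETS:
            if path.endswith(target):
                hits[(target, ip)] += 1
    return hits


# Hypothetical sample lines, not taken from the server in this thread.
SAMPLE = [
    '203.0.113.5 - - [22/Jun/2016:05:40:01 +0000] "POST /xmlrpc.php HTTP/1.1" 200 370',
    '203.0.113.5 - - [22/Jun/2016:05:40:02 +0000] "POST /xmlrpc.php HTTP/1.1" 200 370',
    '198.51.100.7 - - [22/Jun/2016:05:40:03 +0000] "POST /wp-login.php HTTP/1.1" 200 1200',
    '192.0.2.10 - - [22/Jun/2016:05:40:04 +0000] "GET /index.php?page=2 HTTP/1.1" 200 5120',
]

for (target, ip), n in count_hits(SAMPLE).most_common():
    print(n, target, ip)
```

A handful of IPs with hundreds of POSTs to either file is the pattern to look for.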

This is what I first thought after reading your post. Can you SSH to your server, check your memory, disk and CPU usage, and post it here? The more information you post, the easier it will be for someone to check and maybe give you some direction on what to do. Don’t forget to say what you have on that server because, as I said earlier, in the case of a CMS it could be some theme or plugin with bad code consuming all your memory.

For the sake of your health, before you change anything please make a local copy of your websites and all files you intend to modify.

Burst RAM is optional and should be used only for rare cases where you need a little more than your guaranteed RAM for a short amount of time. Counting on burstable RAM as something free and available 24/7 will not do any good. If you need more than 4GB then you should buy it.

Dedicated RAM: RAM you are guaranteed access to at all times.
Burst RAM: RAM you can access if no one else is using it.

To make it simple, burst RAM is actually RAM “stolen” or “borrowed” (whichever you like to call it) from other accounts. If your host says they will guarantee 4/4, you should immediately leave and find a better one, because I’m pretty sure they are overselling that server as hell. If the host can guarantee all 4GB of burst RAM all the time, then why not offer you 8GB without burst RAM for the same price?

I run about 80 virtual servers and I’d guess that 80% have WordPress or Joomla or an eCommerce CMS. System is CentOS Linux 6.6, Webmin 1.801, Virtualmin 5.03. 4GB guaranteed RAM.
I do monitor the server and know exactly what time the services stop, but there is nothing in the logs for messages, Webmin, mysql (and others) that gives me a clue. Here are some of the log items - you can see from the messages that the server stops at 5:46 and starts at 5:50:

Jun 22 05:42:14 ip-68-178-130-21 clamd[19509]: SelfCheck: Database status OK.
Jun 22 05:42:30 ip-68-178-130-21 named[9904]: client 12.161.74.244#42513: query (cache) '143.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:42:30 ip-68-178-130-21 named[9904]: client 12.161.74.244#42513: query (cache) '143.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:43:26 ip-68-178-130-21 named[9904]: client 12.161.74.244#35101: query (cache) '146.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:43:26 ip-68-178-130-21 named[9904]: client 12.161.74.244#35101: query (cache) '146.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:44:23 ip-68-178-130-21 named[9904]: client 12.161.74.244#22515: query (cache) '147.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:44:23 ip-68-178-130-21 named[9904]: client 12.161.74.244#22515: query (cache) '147.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:45:19 ip-68-178-130-21 named[9904]: client 12.161.74.244#5546: query (cache) '148.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:45:19 ip-68-178-130-21 named[9904]: client 12.161.74.244#5546: query (cache) '148.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:46:15 ip-68-178-130-21 named[9904]: client 12.161.74.244#46301: query (cache) '149.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:46:15 ip-68-178-130-21 named[9904]: client 12.161.74.244#46301: query (cache) '149.126.10.10.in-addr.arpa/PTR/IN' denied
Jun 22 05:50:03 ip-68-178-130-21 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Jun 22 05:50:03 ip-68-178-130-21 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="27953" x-info="http://www.rsyslog.com"] start
Jun 22 05:50:09 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 27860 due to rate-limiting
Jun 22 05:50:10 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 38 messages from pid 27860 due to rate-limiting
Jun 22 05:52:15 ip-68-178-130-21 clamd[28630]: clamd daemon 0.99.1 (OS: linux-gnu, ARCH: x86_64, CPU: x86_64)
Jun 22 05:52:15 ip-68-178-130-21 clamd[28630]: Running as user clam (UID 497, GID 498)
Jun 22 05:52:15 ip-68-178-130-21 clamd[28630]: Log file size limited to 4294967295 bytes.
Jun 22 05:52:15 ip-68-178-130-21 clamd[28630]: Reading databases from /var/lib/clamav
Jun 22 05:52:15 ip-68-178-130-21 clamd[28630]: Not loading PUA signatures.
Jun 22 05:52:15 ip-68-178-130-21 clamd[28630]: Bytecode: Security mode set to "TrustSigned".
Jun 22 05:52:25 ip-68-178-130-21 clamd[28630]: Loaded 4543230 signatures.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28630]: TCP: Bound to [127.0.0.1]:3310
Jun 22 05:52:26 ip-68-178-130-21 clamd[28630]: TCP: Setting connection queue length to 30
Jun 22 05:52:26 ip-68-178-130-21 clamd[28630]: LOCAL: Removing stale socket file /var/run/clamav/clamd.sock
Jun 22 05:52:26 ip-68-178-130-21 clamd[28630]: LOCAL: Unix socket file /var/run/clamav/clamd.sock
Jun 22 05:52:26 ip-68-178-130-21 clamd[28630]: LOCAL: Setting connection queue length to 30
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: Global size limit set to 104857600 bytes.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: File size limit set to 26214400 bytes.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: Recursion level limit set to 16.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: Files limit set to 10000.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxEmbeddedPE limit set to 10485760 bytes.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxHTMLNormalize limit set to 10485760 bytes.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxHTMLNoTags limit set to 2097152 bytes.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxScriptNormalize limit set to 5242880 bytes.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxZipTypeRcg limit set to 1048576 bytes.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxPartitions limit set to 50.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxIconsPE limit set to 100.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: MaxRecHWP3 limit set to 16.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: PCREMatchLimit limit set to 10000.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: PCRERecMatchLimit limit set to 5000.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Limits: PCREMaxFileSize limit set to 26214400.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Archive support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Algorithmic detection enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Portable Executable support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: ELF support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Detection of broken executables enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Mail files support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: OLE2 support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: PDF support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: SWF support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: HTML support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: XMLDOCS support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: HWP3 support enabled.
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Heuristic: precedence enabled
Jun 22 05:52:26 ip-68-178-130-21 clamd[28635]: Self checking every 600 seconds.
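To pin down the outage window in a log like the one above, a minimal sketch is to look for the largest gap between consecutive timestamps. Syslog lines carry no year, so the year below is an assumption; the sample lines are taken from the excerpt above.

```python
from datetime import datetime

ASSUMED_YEAR = 2016  # assumption: syslog timestamps do not include a year


def parse_ts(line):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp of a line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S")


def largest_gap(lines):
    """Return (before, after, gap) for the longest silence in the log."""
    times = [parse_ts(line) for line in lines if line.strip()]
    best = None
    for a, b in zip(times, times[1:]):
        if best is None or b - a > best[2]:
            best = (a, b, b - a)
    return best


SAMPLE = [
    "Jun 22 05:45:19 ip-68-178-130-21 named[9904]: client 12.161.74.244#5546: query (cache) '148.126.10.10.in-addr.arpa/PTR/IN' denied",
    "Jun 22 05:46:15 ip-68-178-130-21 named[9904]: client 12.161.74.244#46301: query (cache) '149.126.10.10.in-addr.arpa/PTR/IN' denied",
    "Jun 22 05:50:03 ip-68-178-130-21 kernel: imklog 5.8.10, log source = /proc/kmsg started.",
    "Jun 22 05:50:09 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 27860 due to rate-limiting",
]

before, after, gap = largest_gap(SAMPLE)
print("silent from", before.time(), "to", after.time(), "=", gap)
```

On these lines the longest silence runs from 05:46:15 to 05:50:03, when rsyslogd starts again.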

The other messages I’ve been seeing are:

Jun 20 06:04:32 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 15492 due to rate-limiting
Jun 20 06:04:35 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 62 messages from pid 15492 due to rate-limiting
Jun 20 09:59:56 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 15492 due to rate-limiting
Jun 20 10:02:37 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 40 messages from pid 15492 due to rate-limiting
Jun 20 12:47:36 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 23766 due to rate-limiting
Jun 20 12:47:39 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 23826 due to rate-limiting
Jun 20 12:47:45 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 410 messages from pid 23826 due to rate-limiting
Jun 20 12:51:56 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 24509 due to rate-limiting
Jun 20 12:52:00 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 24562 due to rate-limiting
Jun 20 12:52:06 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 437 messages from pid 24562 due to rate-limiting

Sorry, I mis-typed. I only have 4GB RAM guaranteed, so yes, I only have 4GB. There is also a burst of 4GB, not guaranteed. But I have not been hitting 4; average usage is 2-3GB, so I don’t think it is a memory issue.

Jun 22 05:46:15 ip-68-178-130-21 named[9904]: client 12.161.74.244#46301: query (cache) '149.126.10.10.in-addr.arpa/PTR/IN' denied

Check named.conf; this message usually appears when named is missing some directives. While you are there, check whether you have disabled recursion.

Jun 22 05:50:09 ip-68-178-130-21 rsyslogd-2177: imuxsock begins to drop messages from pid 27860 due to rate-limiting

If I’m not mistaken, this is rsyslog limiting messages in your log. That means something must be flooding your logs, as rate limiting will happen if you have more than 100-200 messages in just a few seconds. A good start would be to find the process behind each PID; then it should be much easier to find the problem. From the logs you posted here, in just 6 hours rsyslog dropped more than 1000 lines, and the lack of any useful information in the logs could be because rsyslog is dropping them.

Last but not least, 80 accounts of which 80% run a CMS is, I would say, overkill for the server. If everything is configured perfectly so you can fully utilize your server resources, and the majority of the CMS installs are optimized and low-traffic, then it could be possible to run all this without problems, but I’m not sure this is your case.
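As a rough sketch of that first step, the snippet below tallies the "lost N messages" lines per PID and tries to look up what a PID is via /proc. The lookup only works while the process is still running, so it has to be done close to the time of the flood. The sample lines are the Jun 20 entries posted above; the function names are only illustrative.

```python
import re
from collections import Counter

LOST_RE = re.compile(r"imuxsock lost (\d+) messages from pid (\d+)")


def lost_per_pid(lines):
    """Sum the message counts rsyslog reports as lost, keyed by PID."""
    totals = Counter()
    for line in lines:
        m = LOST_RE.search(line)
        if m:
            totals[int(m.group(2))] += int(m.group(1))
    return totals


def describe_pid(pid):
    """Return the command line of a running PID, or None if it is gone."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as fh:
            raw = fh.read()
    except OSError:
        return None
    return raw.replace(b"\0", b" ").decode(errors="replace").strip() or None


SAMPLE = [
    "Jun 20 06:04:35 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 62 messages from pid 15492 due to rate-limiting",
    "Jun 20 10:02:37 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 40 messages from pid 15492 due to rate-limiting",
    "Jun 20 12:47:45 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 410 messages from pid 23826 due to rate-limiting",
    "Jun 20 12:52:06 ip-68-178-130-21 rsyslogd-2177: imuxsock lost 437 messages from pid 24562 due to rate-limiting",
]

totals = lost_per_pid(SAMPLE)
for pid, lost in totals.most_common():
    print(pid, lost, describe_pid(pid) or "(process no longer running)")
print("total reported lost:", sum(totals.values()))
```

The reported counts in that excerpt add up to 949, and several "begins to drop" lines have no matching "lost" line, so the real number is likely higher.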

In the meantime, you can increase the number of messages allowed and the time interval before rate limiting kicks in in rsyslog. To do this, locate rsyslog.conf and/or rsyslog.early.conf (usually in /etc) and add the following lines:

$SystemLogRateLimitInterval 10
$SystemLogRateLimitBurst 500

This allows 500 messages in a 10-second span before rate limiting applies.

You may have been the victim of a brute-force attack, and with 80 virtual hosts, that is 80x the amount of logs.

The first thing you can do is move SSH to a port higher than 1500, since most bots don’t attempt to scan those ports. This will solve a lot of issues.
For BIND, disable recursion (you seem to have it enabled) or allow it only from a limited set of IP addresses/hosts.
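To get a feel for what those two numbers do, here is a simplified model of a fixed-window rate limiter. It is a sketch of the general idea, not rsyslog’s actual code: a window opens at the first message, up to `burst` messages are accepted, and the rest are dropped until `interval` seconds have passed.

```python
def simulate(timestamps, interval=10, burst=500):
    """Return (accepted, dropped) for message arrival times in seconds.

    Simplified fixed-window model: the window starts at the first message
    and resets once `interval` seconds have elapsed.
    """
    accepted = dropped = 0
    window_start = None
    count = 0
    for t in timestamps:
        if window_start is None or t - window_start >= interval:
            window_start = t  # open a new window
            count = 0
        count += 1
        if count <= burst:
            accepted += 1
        else:
            dropped += 1
    return accepted, dropped


# A flood of 600 messages spread over one second.
flood = [i / 600 for i in range(600)]

print("interval=10, burst=500:", simulate(flood, 10, 500))
# For comparison, 5 s / 200 messages, which I believe are the defaults.
print("interval=5,  burst=200:", simulate(flood, 5, 200))
```

In this model the same flood loses 100 messages with the settings above and 400 with the smaller window, so raising the limits reduces the loss but does not remove the cause of the flood.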

Thanks for the response. I’ve added the rate-limiting lines and I’ll see how that goes.
I’ll also move the SSH port - always meant to do that anyway.
I already had BIND set to “Allow recursive queries from localhost” . I’m wondering if I don’t have it set correctly?
I have in named:
options {
    allow-recursion {
        localnets;
        localhost;
    };

I tried to make the recursive queries more strict, but it stopped all mail out.
I have in named:
options {
    allow-recursion {
        localnets;
        localhost;
    };

If I set “Do full recursive lookups for clients?” to NO (BIND > Miscellaneous Options)
and set “Send outgoing mail via host” to [dedrelay.secureserver.net] (Postfix > General Options; my relay, required for all mail),

all outgoing mail stops. If I set “Do full recursive lookups for clients?” back to default, mail sends again.
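One possible explanation, stated as an assumption because nothing in this thread confirms it: if /etc/resolv.conf points the server at its own BIND, then switching recursion off entirely also stops the server from resolving the relay hostname, and Postfix cannot deliver. A minimal sketch to check which resolvers are configured (the sample contents are hypothetical):

```python
import ipaddress


def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def uses_local_resolver(resolv_conf_text):
    """True if any configured nameserver is a loopback address."""
    for server in nameservers(resolv_conf_text):
        try:
            if ipaddress.ip_address(server).is_loopback:
                return True
        except ValueError:
            continue  # not a plain IP address
    return False


# Hypothetical contents; on a real server read /etc/resolv.conf instead.
SAMPLE = """\
# generated file
nameserver 127.0.0.1
nameserver 8.8.8.8
"""

print(nameservers(SAMPLE))
print("depends on local BIND:", uses_local_resolver(SAMPLE))
```

If the local BIND turns out to be the resolver, keeping recursion enabled but restricted to localhost and localnets, as in the config above, would be the setting to keep.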

What are the setup options for “Map for allowed addresses for relaying” in Postfix SMTP Server Options?

And to top it all off, the server totally stopped at 6:03 AM today. The last log entry was at 12:23 AM.

For named.conf you can try this:

acl "trusted" {
    localhost;
    localnets;
};

options {

    … some stuff …

    version "unknown";
    allow-transfer { trusted; };
    allow-recursion { trusted; };
    allow-query-cache { trusted; };
    recursion no;
    additional-from-cache no;
    allow-query { any; };

    … some more stuff …

};

logging {
    channel default_debug {
        file "/var/named/data/named.run";
        severity dynamic;
        print-category yes;
        print-severity yes;
        print-time yes;
    };
};

It’s missing part of the code, but this should be enough to give you some idea of what to do. For the rest, it is best to read about named.conf and fine-tune depending on what you need, e.g. if you have a master/slave setup you will have to make some minor modifications.

Whatever you do, remember to disable recursion or limit it to specific IP(s), or your server could be used for DDoS amplification attacks, and that’s pretty bad.
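To verify the result from outside, a minimal sketch is to send a recursive query and look at the RA (recursion available) bit and the response code in the reply. It needs to be run from a host that is not in your trusted ACL; the server address and the queried name are placeholders you would replace.

```python
import socket
import struct


def build_query(name, qid=0x1234):
    """Build a DNS query for an A record with the RD (recursion desired) bit set."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def parse_flags(response):
    """Extract the RA bit and RCODE from a DNS response header."""
    _qid, flags = struct.unpack("!HH", response[:4])
    return {"ra": bool(flags & 0x0080), "rcode": flags & 0x000F}


def offers_recursion(server, name="example.com", timeout=3.0):
    """True if `server` answers a recursive query with RA set and no error."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        data, _addr = sock.recvfrom(512)
    flags = parse_flags(data)
    return flags["ra"] and flags["rcode"] == 0


# Usage (placeholder address), run from an outside host:
#   print(offers_recursion("192.0.2.53"))
```

A reply with RCODE 5 (REFUSED) or with RA cleared means the server is not acting as an open resolver for that client. As far as I can tell, the “query (cache) … denied” lines quoted earlier already show named refusing such requests from 12.161.74.244.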

Thanks. I’ll give this a go and see where I end.