Migration headaches - users on disk but not showing in Virtualmin

SYSTEM INFORMATION
OS type and version DEBIAN12
Webmin version 2.202
Virtualmin version 7.30.4
Webserver version APACHE
Related packages BACKUP

Hi everyone

I had a lot of headaches over the weekend, trying multiple times to migrate an old CentOS 7.9.2009 server with Webmin 2.111 and Virtualmin 7.20.1 to a new machine running Debian 12 with Webmin 2.202 and Virtualmin 7.30.4 (a completely fresh install/build).

Both of them are LEMP. Nothing fancy in setup

I used the migration scripts (those 2 commands) to write everything to an NFS partition mounted on both the old and the new server. The backup worked like a charm: ~110 GB in one hour. I do recommend disabling any kind of backup compression in Webmin; with compression on, the backup took 5-6 hours…

Restore on the other side proved to be tricky:

As stated before, the /backups directory was mounted directly on the new server via NFS. Good disk speed, around 90-100 Mbps…

I used the CLI command to restore both the settings and the domains. The settings restore went fine. The domains restore spent 15 minutes doing “something” and then proceeded to restore the domains to their homes. At some point, “Killed” appeared at the end of the output and the entire process stopped quietly and returned to the command prompt. No further error was shown. Only some of the domains were restored.

Tough life! I rolled back to a snapshot taken just before the restore (smart guys work less) and this time used the Virtualmin GUI (Restore Virtual Servers), not the CLI command.

Checked everything accordingly, then the same happened:

For 15 minutes nothing was written to the disk, but ps ax showed a tar command running for each domain. After that, it started to report that it was restoring everything, including the Virtualmin settings.

The restore process finished after all the domains were restored. No errors. No nothing. Just a damn good restore!

On disk, the home of every domain matched the size of its home on the old server.

Surprise, surprise, though.

One of the most important domains, even though it consumes 22 GB of disk/quota, has no users!! No users are showing in the email field. Nada, zip, nulla!

Some other domains do show their users. I did not check each and every one, as there are tens of them. P.S. Late edit: they are OK. Only the main/master domain, which also holds the big nameserver for the other domains, the SSL certificates, etc., is affected. 22 GB on disk but not in Virtualmin.

But those missing users make the entire migration untrustworthy. I restarted the old server and left it at that.

I tried to debug the new restore. The maildirs are in /home/xxx/homes/userxxx. User rights are OK. I tried to chown the directories again. PLENTY of space; in fact, more than 50% of the disk is still free. Nada!

Absolutely nothing!

Please, if you have the slightest idea… Might it be an incompatibility in migration between CentOS and Debian? Still… I have done this before, but with web hosting servers, not email ones.

Many thanks in advance!!!

This is almost certainly an OOM killer event, i.e. the system ran out of memory. That is a catastrophic event: something terrible has to happen for the system to survive it (one or more seemingly random processes are killed).

Check the kernel log for proof of that theory (dmesg or it’ll also probably appear in the journal).
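For anyone following along, here is a minimal sketch of that check. The `oom_lines` name is made up for illustration; it simply filters whatever kernel log text is piped into it for the message forms the OOM killer typically writes.

```shell
#!/bin/sh
# Hypothetical helper: print only the OOM-killer lines from kernel log text
# supplied on stdin. Matches the "oom-kill:" summary line, the
# "invoked oom-killer" line, and the "Out of memory:" line.
oom_lines() {
    grep -E 'oom-kill:|invoked oom-killer|Out of memory:'
}

# Typical use (usually needs root):
#   dmesg | oom_lines
#   journalctl -k --no-pager | oom_lines
```

If nothing is printed, the kernel log (as far back as it goes) holds no OOM kill, and the cause is probably elsewhere.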

If it is the OOM killer, start over and increase swap size on the system before doing whatever led to the OOM killer kicking in.

There will be minor issues, but this isn’t one of them. I generally recommend you stick to the same OS if you can. (In this case, you’d want to pick Rocky or Alma.)

The issues you’ll run into will be with differences between the OS, not really “incompatibilities”.

Your instinct is right. Something went horribly wrong, you shouldn’t expect it to be OK, even if you fix the obvious issues manually. I would recommend starting over (after figuring out why it failed so you can fix it). You should probably figure out what actually went wrong, though.

If you want the fewest possible problems, you’ll go to Rocky or Alma instead of Debian. The issues you run into will be nothing like this, but there will likely be some minor issues.

The symptoms don’t entirely match up with OOM killer being the root cause, but it’s the best guess I’ve got, and where you’d look first.

But, yeah, I think my primary advice is figure out the root cause. And, after that, unless you have a strong reason to switch to Debian, stay on the EL train and just upgrade to a current version of Rocky or Alma. Since you clearly like to keep a server for a long time (CentOS 7 is over a decade old!), you should pick an OS with a long lifecycle and Debian has the shortest lifecycle of all of our supported OSes. And, it’s probably going to have fewer problems.

The only way a move to Debian would be better is if you have CGI applications that you need to keep running in the same way. In that case, a move to Debian makes sense, as we don’t maintain custom SuExec packages anymore, like we did until Virtualmin 7. All prior Virtualmin repos provided a custom build so you could run CGI apps out of /home/domainname/cgi-bin with suexec, but Debian/Ubuntu don’t need that as they have a configurable suexec in the suexec-custom package.

Good day Joe!!!

Thank you for your exhaustive and kind answer! Appreciate as always! :slight_smile:

I will proceed with Rocky 9.5 instead of Debian 12. About Debian: I’m not sure why I chose it, other than that it seemed much more enterprise than ubuntu&shit, in the days when CentOS was on the verge of dying. It seemed reliable! I still find it reliable, even though I was in the RH boat for many years.

About the OOM killer: the new machine had 6 vCPUs and 8 GB of RAM, which seemed enough to me. But I will take this into account.

Two things I still need to clarify, if you can and if you know:

Why is there such a long wait in the restore process, during which Virtualmin is "cat"ing each domain in the background before it really starts to write the data to disk? For me, it was about 10-15 minutes each time for ~110 GB.

Why did I have all the data on disk, including the users’ homes and the e-mails, while Virtualmin showed me no users in the GUI? How does Virtualmin know which users belong to a domain? Is there any workaround to make Virtualmin aware of those users/emails?

Many thanks!

OK, it’s almost certainly not the OOM killer. But, now I don’t have any good guesses about why processes are being killed.

That’s a lot of data. :man_shrugging:

I noticed earlier you were using NFS for the backup. NFS can be very slow, depending on configuration and network speed. There are some locking options and limits that could make it take a huge amount of time to deal with lots of files, as you’d be dealing with in a backup or restore. If you’re trying to uncompress files directly from NFS, I’d recommend you copy the whole archive to a local disk first, and then restore from that, just to rule out NFS performance issues.
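A rough sketch of the copy-then-restore idea, in case it helps. The `copy_and_verify` helper and the example paths are placeholders, not anything Virtualmin provides; it copies the backup directory to local disk and checks that the copy matches before you restore from it.

```shell
#!/bin/sh
# Hypothetical helper: copy a backup directory from the NFS mount to local
# disk, then confirm every file arrived intact before restoring from the copy.
copy_and_verify() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    # "src/." copies the contents of src (including dotfiles) into dst.
    cp -a "$src"/. "$dst"/ || return 1
    # diff -r exits non-zero if any file is missing or differs.
    diff -r "$src" "$dst" >/dev/null
}

# Example (paths are placeholders):
#   copy_and_verify /backups /root/backups && echo "local copy matches"
```

The `diff -r` pass reads everything over NFS a second time, so it is slow on a large backup, but a mismatch there would point at the NFS side rather than at the restore.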

Because it failed before the restore completed. I don’t know why. You’ll need to dig in and figure out why. Now that I know you have a reasonable amount of memory, I think it’s probably not the OOM killer, but it’s obviously something, since it was killed before it completed.

There is (Add Servers->Import), but you don’t want to do that unless there’s no other option. What you want is a good and complete restore. You have no way of knowing what else is missing. An Import will involve Virtualmin guessing and applying default server templates and plans to the new account…whatever was in the old account will not be present, because the restore didn’t complete and the user for that backup didn’t make it. Virtualmin doesn’t have the data, so you’ll lose anything custom about the users you import and need to re-apply whatever changes.

I continue to recommend you figure out why it’s failing.

Try running the restore on the command line, if you aren’t already, and show us the exact error(s), not merely a description of it.
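One way to capture the exact output, sketched below. The `run_logged` wrapper is a hypothetical name; it runs any command, saves stdout and stderr to a file, and records the exit status. In most shells a process killed with SIGKILL (which is what the OOM killer sends) reports status 137, i.e. 128 + 9.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command, save stdout and stderr to a log file,
# append the exit status to the log, and return that status to the caller.
run_logged() {
    log=$1
    shift
    "$@" >"$log" 2>&1
    status=$?
    printf 'exit status: %s\n' "$status" >>"$log"
    return "$status"
}

# Example (log path is a placeholder; use your own restore command):
#   run_logged /root/restore.log <your restore command>
#   tail -n 40 /root/restore.log
```

Because the output goes to the file rather than the terminal, `tail -f` on the log from a second console shows progress while the restore runs.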

Speaking of NFS, if you’re restoring from an NFS volume, it may simply be timing out, again due to misconfiguration or network issues. That would explain things failing seemingly mysteriously, too.

Your NFS server would hopefully have a log of errors and events like that.

Hi Joe,

Thank you for your help.

I followed your advice.

Rocky Linux 9.5, 8 GB RAM, 8 vCPUs, fresh install.

I copied everything from the NFS backup directory to a local 120 GB partition (/dev/xvdb1, all disks SSD). From there I launched the Virtualmin settings restore, “virtualmin restore-domain --source /root/backups/virtualmin.tar.gz --all-virtualmin”, which went fine.

I then ran the CLI command “virtualmin restore-domain --source /root/backups/ --all-domains --all-features”, also from the SSD folder. It took about 5 minutes to index the backups.

Everything went fine until the last domain, the same problematic one, which hosts all the nameservers:

Re-creating virtual server xxxxxx.ro …
Re-allocating user and group IDs …
… allocated user ID is 1082 and group ID is 1018

Saving server details ..
.. done

Creating administration group xxxxxx ..
.. done

Creating administration user xxxxx ..
.. done

Creating aliases for administration user ..
.. done

Adding administration user to groups ..
.. done

Creating home directory ..
.. done

Creating mailbox for administration user ..
.. done

Adding new DNS zone ..
.. done

Adding to email domains list ..
.. done

Adding default mail aliases ..
.. done

Adding new virtual website ..
.. done

Adding webserver user apache to server's group ..
.. done

Killed

dmesg shows these last 2 lines:

[ 3520.230773] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-17.scope,task=/usr/libexec/we,pid=15491,uid=0
[ 3520.231294] Out of memory: Killed process 15491 (/usr/libexec/we) total-vm:14780468kB, anon-rss:7063296kB, file-rss:512kB, shmem-rss:0kB, UID:0 pgtables:28972kB oom_score_adj:0

What is even weirder is that I kept another console open with htop -d2. At no moment did the memory usage exceed 1-1.5 GB out of the 8 GB. The CPUs were indeed dancing, but the memory was low.

Beats me by far!

P.S. Looking at the XCP-ng console, at some point there was a very sharp memory spike up to 8 GB, which then fell back to about 1.5 GB. Could this be the problem? What should I do, besides giving it 32 GB of RAM? :slight_smile: Does that make sense? The swap file is also 8 GB, created by default by the Rocky installer.
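For what it’s worth, the dmesg line itself records how much resident memory the killed process was holding, in the anon-rss field. A small sketch (the `oom_rss_gib` name is hypothetical) that extracts that figure and converts it from kB to GiB:

```shell
#!/bin/sh
# Hypothetical helper: read "Out of memory: Killed process ..." lines on stdin
# and print the anon-rss value of each one, converted from kB to GiB.
oom_rss_gib() {
    sed -n 's/.*anon-rss:\([0-9][0-9]*\)kB.*/\1/p' |
        awk '{ printf "%.1f\n", $1 / 1048576 }'
}

# Example:
#   dmesg | oom_rss_gib
```

For the line above, 7063296 kB works out to about 6.7 GiB, which would be consistent with the short spike seen in the hypervisor console rather than with the 1-1.5 GB seen in htop.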

Also, I forgot: when the restore procedure starts I get this warning, which is ignored:

… WARNING - The following features were enabled for one or more
domains in the backup, but do not exist on this system. Some
functions of the restored domains may not work : Plugin virtualmin-google-analytics

Thank you!

Again: new server from a snapshot, with the backups already on an SSD partition. 32 GB RAM with 8 vCPUs.

I mounted the backups partition. The settings restore went fine, and then it proceeded with the domains.

This time, by chance, the problematic domain was chosen to be restored second:

Re-creating virtual server xxx.ro …
Re-allocating user and group IDs …
… allocated user ID is 1004 and group ID is 1003

Saving server details ..
.. done

Creating administration group xxx ..
.. done

Creating administration user xxx ..
.. done

Creating aliases for administration user ..
.. done

Adding administration user to groups ..
.. done

Creating home directory ..
.. done

Creating mailbox for administration user ..
.. done

Adding new DNS zone ..
.. done

Adding to email domains list ..
.. done

Adding default mail aliases ..
.. done

Adding new virtual website ..
.. done

Adding webserver user apache to server's group ..
.. done

Killed

Same OOM in dmesg:

[ 585.844743] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1.scope,task=/usr/libexec/we,pid=2861,uid=0
[ 585.845223] Out of memory: Killed process 2861 (/usr/libexec/we) total-vm:39549316kB, anon-rss:31725332kB, file-rss:256kB, shmem-rss:0kB, UID:0 pgtables:77448kB oom_score_adj:0

In the hypervisor, 16 GB out of 32 GB were eaten and the CPUs were dancing. No spike larger than 16 GB this time. It might be a Virtualmin limitation in the script. See the picture.

Beats me…

Thank you

More data: the issue is related only to this domain and to the restore-domain.pl script.
I tried to restore the domain separately, and every time it gets as far as:
" Adding webserver user apache to server’s group…Done"
and then you can see in htop that, in ~45 seconds, all the memory is eaten (32 GB), then the swap is eaten too (8 GB), and then everything falls over with an OOM.
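To put numbers on that ~45 second climb, a sketch like the following could log the memory of the restore process once per second. Both helper names are hypothetical, the PID has to be found first (for example with ps ax), and it assumes Linux, since it reads /proc.

```shell
#!/bin/sh
# Hypothetical helper: print the resident set size (VmRSS, in kB) of a PID,
# read from /proc (Linux only).
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Hypothetical helper: print a timestamp and the RSS once per second for as
# long as the process exists, so the growth is on record after the kill.
watch_rss() {
    while [ -r "/proc/$1/status" ]; do
        printf '%s %s kB\n' "$(date +%T)" "$(rss_kb "$1")"
        sleep 1
    done
}

# Example (PID and log path are placeholders):
#   watch_rss 18561 | tee /root/restore-rss.log
```

The last few lines of such a log, together with the final line of restore output, would show which restore step the growth starts in.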

[ 3308.549431] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-1.scope,task=/usr/libexec/we,pid=18561,uid=0
[ 3308.549988] Out of memory: Killed process 18561 (/usr/libexec/we) total-vm:38215544kB, anon-rss:31965272kB, file-rss:640kB, shmem-rss:0kB, UID:0 pgtables:74832kB oom_score_adj:0

Leaving it for the smartest guys of Virtualmin!

Thank you!

Have you tried to just do a backup of the problem domain and restore it separately?

So, it is the OOM killer! I’m shocked that 8GB is insufficient. Something pathological is happening, but I don’t know what, off-hand. We’ve had people restore bigger backups than this without incident.

And, 32GB proves something pathological is happening.

@Jamie @Ilia have y’all seen this? Something about restoring a backup is exploding memory usage. That seems like it has to be some kind of leak or pathologically recursive data structure or something. A 32GB system should never run out of memory restoring a backup.

Yeah, I think you want to isolate this one domain. Is it something we could look at (I mean Jamie mostly)? If so, send us a link via a PM to @staff (click that link and then click “Message”, don’t post it publicly, of course!).

Is there anything unusual about that one problem domain? How big is it? Any features being used that other domains aren’t using?

@Joe, Yes, and we should also look into the domain’s backup tar.gz file by listing its contents sorted by file size using the ls -lsaSh command—my off-hand suspicion is that the issue happens during the database dump restore.

@dragos, could you please untar your backup and list the files sorted by size? Also, try starting the restore process and right afterwards run systemctl status webmin to locate the restore process PID under the CGroup. Then, run strace -p <PID> to gather more details about what’s happening in the process.
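If extracting a 22 GB archive is inconvenient, a listing sorted by size can also be taken straight from the tarball. This is only a sketch: the `largest_in_archive` name is made up, and it assumes GNU tar, whose verbose listing puts the file size in the third column.

```shell
#!/bin/sh
# Hypothetical helper: list the N largest entries (default 20) in a tar
# archive without extracting it. Assumes GNU tar, where "tar -tv" prints
# permissions, owner/group, size, date, time, name.
largest_in_archive() {
    tar -tvf "$1" | sort -k3,3nr | head -n "${2:-20}"
}

# Example (archive path is a placeholder):
#   largest_in_archive /root/backups/example.com.tar.gz 10
```

An unexpectedly huge database dump or mail file near the top of that list would fit the suspicion above.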