Blocking access to directory contents

hello -

for years now i have been storing millions of files from clients and the clients' customers like so:

…/public_html/AccessableFiles/clientsEmail@email.com/customersEmail@email.com/valuableInformation.txt

recently, somehow Google got hold of the “AccessableFiles” directory even though it is not referenced anywhere on the primary website, and now all the email addresses have been exposed. :grimacing:

i went through and started putting in a redirect index.html file into every directory, and also added the following to httpd.conf:

<Directory /var/www/AccessableFiles>
     Options -Indexes
</Directory>

in addition, for every directory i am thinking of maybe "forcing" a 404 not-found error or something like that.

ideally, i would like the clients' customers' files to be available if AND ONLY IF they know the exact directory names and the exact file name, but with no other access.

It would be nice if, when somebody took the entire URL, removed the file name, and tried to access it, a 404 not-found error was thrown.
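i wondered whether something along these lines in httpd.conf would do it (an untested sketch, paths guessed from my setup above):

```apache
<Directory /var/www/AccessableFiles>
    Options -Indexes          # never emit an auto-generated listing
</Directory>

# any URL under the tree that ends in "/" is a directory request;
# answer it with 404 so even the directory's existence is not confirmed
RedirectMatch 404 ^/AccessableFiles/(.*/)?$
```

a request for an exact file like /AccessableFiles/clientsEmail@email.com/customersEmail@email.com/valuableInformation.txt would still be served, since it does not end in a slash.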

i realize this is probably easy, but i cannot seem to figure this out under pressure.

any other suggestions?

SYSTEM INFORMATION
OS type and version: rocky linux
Webmin version: 1.981
Virtualmin version: 6.16

You can block Google’s access to any directory with your robots.txt file or a firewall rule.
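For example (untested, and keep in mind robots.txt is public and only advisory, so the Disallow line itself advertises the directory name to anyone who reads the file):

```
# public_html/robots.txt
User-agent: *
Disallow: /AccessableFiles/
```

Well-behaved crawlers will stop crawling that tree; it does not actually block access.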

thanks gomez. i thought robots.txt only blocked the search engine indexing?

when you say firewall rule, do you mean putting something into the .htaccess file or doing something with csf? i see "View iptable Rules" on the csf page but i suspect this is not what you meant.

Google has a dozen bots that will search and can archive pretty much anything it is given access to. That’s how it got your email addresses. They’re in a public_html folder that you gave them access to. Anything it sees as a public file can be indexed and published.

Even if you have old html pages in there that are no longer used or linked to they can still be indexed and published by Google.

A simple entry blocking bots to the public_html/accessablefiles folder would block all their bots at that point, and any other bots as well.

As far as a firewall rule goes, I use Cloudflare for that. Just write a rule blocking anything with Google's name on it from that directory and boom. Done.
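As a sketch, a Cloudflare firewall-rule expression could look like this (field names from Cloudflare's rules language; set the action to Block):

```
(http.request.uri.path contains "/AccessableFiles/" and http.user_agent contains "Google")
```

That only catches bots that identify themselves as Google in the User-Agent header, so treat it as a second layer, not the whole fix.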

i created ONE of these index.html files:

<html>
<head>
<meta name="googlebot" content="noindex" />
<meta http-equiv="refresh" content="0; URL=https://www.google.com/" />
</head>
<body>&nbsp;
</body>
</html>

then created a hundred thousand symbolic links back to this one index.html file. i did something like:

# drop a symlink named index.html (taken from the target's basename) into every directory
find ./secretdirectory/ -type d -exec ln -s /location/to/index.html {} \;

that seems to have fixed it, for now. there probably is some way in .htaccess to create something there, but for now, this was the easiest solution.

@edwardsmarkf a simplified solution would be to lock the folder or folders behind a username and password with an .htaccess file, but i get it, everyone hates usernames and passwords.
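for example something like this in an .htaccess file inside the folder (the AuthUserFile path is just a guess for your box, and the directory needs AllowOverride AuthConfig):

```apache
AuthType Basic
AuthName "Client files"
AuthUserFile /etc/httpd/.htpasswd   # path is a guess, put it outside the web root
Require valid-user
```

you create the user with htpasswd -c /etc/httpd/.htpasswd someuser.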

you could keep the links as they are, with no changes to anything; just use an SSL client certificate. it acts as a username and password, but verification is done by the browser and happens in the blink of an eye. no bots, no search engines; even if anyone posted a link to the file on Facebook etc., they would not be able to see it without the certificate on that computer. it's very easy.
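a rough sketch of the Apache side (untested, the CA path is just an example):

```apache
# in the virtual host: the CA that signed your client certificates
SSLCACertificateFile /etc/pki/tls/certs/client-ca.crt

# require a certificate only for the sensitive tree
<Directory /var/www/AccessableFiles>
    SSLVerifyClient require
    SSLVerifyDepth  1
</Directory>
```

the browser presents the certificate automatically, so for the user it is invisible.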

thank you unborn -

i need to have unrestricted access to the underlying data, but BLOCK the directory names since they are email addresses.

so far my idea seems to be working, for now at least.

when i get back from traveling, i might play around with .htaccess since there are some great examples available involving wildcards, if one can understand them.
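one .htaccess sketch i want to try later (untested) that serves only real files and answers any directory URL with a 404:

```apache
RewriteEngine On
# if the request resolves to a directory, return 404 instead of a listing or redirect
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^ - [R=404,L]
```

dropped into AccessableFiles/.htaccess it would only affect that tree.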

Despite the tons of examples and docs, mod_rewrite is voodoo. Damned cool voodoo, but still voodoo.
– Brian Moore bem@news.cmc.net