TNG Community

How to use robots.txt


Katryne

Hello !

I read the wiki article Using tngrobots.php and tried to apply it, but I think I failed to understand what needs to be done.

I do not want my TNG site to be indexed at all. Generally, for my other websites, I modify the robots.txt file, and mine is made this way:

User-agent: *
Disallow: /

But robots keep coming, mainly from compute-1.amazonaws.com and googlebot.com.

Can you help me, please?

https://genealogie.revestou.fr/ is made with TNG 12.0.1 beta 2 with the GDPR patch. (I know it's a beta release, but I think this robots question is not specific to the beta.)

manofmull

Katryne

I found the best way to keep the spiderbots out is to require your site visitors to log in.

I don't have or need a robots.txt file, and I get no crawlers.

My site has 200 members who are happy to log in.

If you want to keep your site open (no login), then you could install the Bot-Trap mod:

https://tng.lythgoes.net/wiki/index.php?title=Bot-Trap_Mod

robots.txt:

User-agent: *
Disallow: /bot-trap

or you can add a little more

User-agent: *
Disallow: /bot-trap
Disallow: /headstones
Disallow: /photos
Disallow: /documents
Disallow: /gedcom
Disallow: /backups

Michael

Ken Roy

Katryne,

You can also use the .htaccess file to stop the bots from crawling your site. I have the following in mine:

# SetEnvIfNoCase User-Agent "Googlebot" badBot
SetEnvIfNoCase User-Agent "Yahoo" badBot
SetEnvIfNoCase User-Agent "bingbot" badBot
SetEnvIfNoCase User-Agent "MJ12bot" badBot
SetEnvIfNoCase User-Agent "Yandex" badBot
SetEnvIfNoCase User-Agent "BaiDuSpider" badBot
SetEnvIfNoCase User-Agent "AhrefsBot" badBot
SetEnvIfNoCase User-Agent "Mail.ru" badBot
Deny from env=badBot

Note that Googlebot is currently commented out. See Htaccess Deny.
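One thing worth noting: Deny from is the older Apache 2.2 access-control syntax, which only works on Apache 2.4 if mod_access_compat is loaded. A minimal sketch of the equivalent Apache 2.4 rule, assuming the same badBot variable set by the SetEnvIfNoCase lines above, would be:

<RequireAll>
Require all granted
Require not env badBot
</RequireAll>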

I also have a section in the .htaccess file that stops bots from following links to certain pages, since bots do not necessarily obey the robots.txt file or tngrobots.php:

# Stop bots from accessing certain pages - Ticket #890739   
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (bot|slurp|spider|crawler) [NC]
RewriteCond %{REQUEST_URI} ^/tng/calendar.php [OR]
RewriteCond %{REQUEST_URI} ^/tng/guestbook.php [OR]
RewriteCond %{REQUEST_URI} ^/tng/familychart.php [OR]
RewriteCond %{REQUEST_URI} ^/tng/cousin_marriages.php [OR]
RewriteCond %{REQUEST_URI} ^/tng/search.php [OR]
RewriteCond %{REQUEST_URI} ^/tng/searchform.php
RewriteRule ^.*$ - [F]
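A quick way to confirm the rule fires (just a sketch; example.com stands in for your own domain, and it assumes TNG lives under /tng/ as in the conditions above) is to request one of the blocked pages with a bot-like user agent and check that the response is 403 Forbidden:

curl -I -A "Googlebot" https://example.com/tng/calendar.php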

 

Katryne

Hello again !

I managed to customize tngrobots.php and I think it's OK, since every page now shows this in its source:


<meta name="robots" content="noindex,nofollow" />

Just what I wanted: no indexing at all.

But some robots keep coming, often. They are mostly robots of this kind: malta1844.startdedicated.de, with the number after "malta" changing from time to time.

So I tried to install the Bot-Trap mod, but I do not think it is installed correctly, since the bad robots haven't stopped coming. Maybe that is because there were already some directives in the .htaccess file.

What can I do, please?

manofmull

Katryne

I'm using a robots.txt file now, but with site login required, the bots don't get much.

I added Ken's badBot block posted above, and I also add IP addresses to the .htaccess file (see below).

I installed the Rip Prevention mod https://tng.lythgoes.net/wiki/index.php?title=Rip_Prevention_Mod

This allows you to see all access attempts in Admin. Any bot that gets a ban warning from the mod, I just add its IP address to .htaccess.
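For reference, blocking a single address in .htaccess with the same Apache 2.2-style syntax used in Ken's block above could look like the sketch below (203.0.113.45 is only a documentation placeholder, not a real offender):

Order Allow,Deny
Allow from all
Deny from 203.0.113.45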

 

Michael

Katryne

Thanks Michael, I will try the Rip Prevention mod in addition to the .htaccess ban and the Bot-Trap mod. I'll tell you about the result; maybe I will need to adjust the access delay.

Katryne

Well, up to now, it looks like it is working correctly: the two bad IPs keep knocking at the door and are not granted access. Thanks for the tips, everybody, since Ken gave the first clues.

cfj

I use * wildcards to block search engines from crawling specific pages.

Example:
If I want to block getperson.php:

Disallow: /genealogy/*getperson

If you have multiple trees in TNG and want to block just one tree:

Disallow: /genealogy/*tree=1

This will block all URLs that contain tree=1.

To block all trees:

Disallow: /genealogy/*tree=
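Putting those patterns together in one file might look like the sketch below (assuming, as in the examples, that TNG is installed under /genealogy/). One caveat: the * wildcard in Disallow lines is honored by major crawlers such as Googlebot and Bingbot, but it is not part of the original robots.txt standard, so other bots may ignore it.

User-agent: *
Disallow: /genealogy/*getperson
Disallow: /genealogy/*tree=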

