TNG Community

New robots/search engine strategy (tngrobots.php)


Darrin Lythgoe


Hi everyone,

Per my recent posting on the mailing list, I have created a new file called tngrobots.php that tells TNG which robots meta restrictions to include on which pages.

I have attempted to categorize each script, with content-rich pages getting full indexing, link-rich pages getting no indexing but yes to "following", and everything else getting "no index, no follow".

There are definitely some gray areas there, however, and I would appreciate any feedback you might have about what belongs where and why (don't forget the "why" if you want to convince me).

The beauty of this system will be that you can tweak this to your heart's content all from one page, but I'd like to optimize it before I release it to the public.

Anyway, here's the code:

<?php
if( !$cms['support'] )
    $tngscript = basename( $SCRIPT_NAME, ".php" );
else
    $tngscript = $file;

//No index only
$NOI = "<meta name=\"robots\" content=\"noindex\">\n";
//No follow only
$NOF = "<meta name=\"robots\" content=\"nofollow\">\n";
//No index AND no follow
$NOINOF = "<meta name=\"robots\" content=\"noindex,nofollow\">\n";

//each "case" is the name of the script file without the ".php" at the end
switch( $tngscript ) {
    //allow full indexing
    case "cemeteries":
    case "getperson":
    case "familygroup":
    case "headstones":
    case "showheadstone":
    case "showmap":
    case "showphoto":
    case "showrepo":
    case "showsource":
    case "showtree":
    case "surnames":
    case "surnames-all":
    case "surnames-oneletter":
        $flags['norobots'] = "";
        break;

    //no indexing, but allow link following
    case "browsedocs":
    case "browseheadstones":
    case "browsenotes":
    case "browsephotos":
    case "browserepos":
    case "browsesources":
    case "browsetrees-old":
    case "descend":
    case "extrastree":
    case "register":
    case "reports":
    case "search":
    case "showreport":
    case "ahnentafel":
    case "pedigree":
    case "pedigreetext":
    case "surnames100":
    case "ultraped":
        $flags['norobots'] = $NOI;
        break;

    //no index, no follow
    case "addnewacct":
    case "anniversaries":
    case "browsetrees":
    case "changelanguage":
    case "desctracker":
    case "gedform":
    case "login":
    case "newacctform":
    case "places-all":
    case "places-oneletter":
    case "places":
    case "placesearch":
    case "places100":
    case "relateform":
    case "relationship":
    case "searchform":
    case "sendlogin":
    case "showlog":
    case "suggest":
    case "timeline2":
    case "whatsnew":
    default:
        $flags['norobots'] = $NOINOF;
        break;
}
?>

Thanks!

Darrin

I'm wondering about the thinking behind having some pages that are "no index but allow following".

Wasn't the whole origin of the complaints that people are affected by the bandwidth used by the robots? Allowing them to follow the links is still going to use up bandwidth, even if the robot doesn't index the page it has just followed them from.

Roger

The robots don't seem to respect the meta tags - I have tried the noindex, nofollow stuff without success in the case of Google, MSN and Inktomi Slurp.

Does your robots.txt file pass a validation test - for example the one at

http://www.searchengineworld.com/cgi-bin/r.../robotcheck.cgi

I used this to discover that my very simple file that I'd copied directly from a site about robots.txt files didn't validate because I'd saved it on a Macintosh, and used the default of Macintosh line endings. Once I changed to Unix line endings it now validates.

Roger

I used this to discover that my very simple file that I'd copied directly from a site about robots.txt files didn't validate because I'd saved it on a Macintosh, and used the default of Macintosh line endings. Once I changed to Unix line endings it now validates.

Your original file was fine. It's the validator that's broken. The robots.txt standard explicitly allows any line endings:

The format and semantics of the "/robots.txt" file are as follows:

The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>". The field name is case insensitive.

Of course, some robots are probably similarly broken. I know some are incorrectly case-sensitive with regard to the field names.

Brad
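To illustrate the point about line endings, here is a minimal sketch of a parser that follows the quoted wording: it accepts CR, CR/LF or LF and compares field names case-insensitively. The function name `parseRobotsTxt()` and the sample path are made up for this example; it is not taken from any real robot or validator.

```php
<?php
// Sketch only: parseRobotsTxt() is a hypothetical helper, not part of TNG
// or of any real crawler. It returns a list of records, each a list of
// [field, value] pairs with the field name lower-cased.
function parseRobotsTxt(string $text): array
{
    $records = [];
    $current = [];
    // Split on any of the three line-ending styles (CR/LF is tried first).
    foreach (preg_split('/\r\n|\r|\n/', $text) as $rawLine) {
        if (trim($rawLine) === '') {
            // A blank line closes the current record.
            if ($current) {
                $records[] = $current;
                $current = [];
            }
            continue;
        }
        // Strip "#" comments; a comment-only line is simply skipped.
        $line = trim(preg_replace('/#.*$/', '', $rawLine));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = explode(':', $line, 2);
        $current[] = [strtolower(trim($field)), trim($value)];
    }
    if ($current) {
        $records[] = $current;
    }
    return $records;
}

// The same file saved with classic Macintosh (CR) and Unix (LF) endings,
// and with differently cased field names.
$mac  = "User-Agent: *\rDisallow: /genealogy/descend.php\r";
$unix = "user-agent: *\nDisallow: /genealogy/descend.php\n";
var_dump(parseRobotsTxt($mac) === parseRobotsTxt($unix)); // bool(true)
```

A robot written along these lines would read Roger's Macintosh-saved file the same way as the Unix one.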

  • 2 years later...
palmspringsbum

I just upgraded to 8.1.

msnbot-157-55-116-14.search.msn.com is still crawling my descend files. Nothing else appears to be crawling them now.

I think that is also the bot that nearly doubled my bandwidth over the past couple of weeks, adding 10 GB and putting me 6 GB over my limit.

That bot has got to go. It appears it was guzzling the bandwidth on descendant trees and on relationship trees.

Help me muzzle that bot.

  • 2 months later...
palmspringsbum

The robots don't seem to respect the meta tags - I have tried the noindex, nofollow stuff without success in the case of Google, MSN and Inktomi Slurp.

Chris

I'm having the same problem with the same bots. They are indexing and following the "descend" files, and it appears they are going through every single permutation, using twice as much of my bandwidth as "viewed files".

Last month the bots gobbled over 15 GB (that's right, GIGS) of my bandwidth, over 90% of it in the last few days of the month. "viewed traffic" is about 7.5GB.

Could the problem be that the "descend" files need to be named explicitly?

Here are my top ten URLs by kbytes:

#    Hits     Hits %   KBytes     KBytes %   URL
1    92121    9.04%    2153029    8.95%      /genealogy/getperson.php
2    103      0.01%    2015860    8.38%      /bin/brb060223.mp3
3    31865    3.13%    1782551    7.41%      /genealogy/desctracker.php
4    36302    3.56%    1433942    5.96%      /genealogy/pedigree.php
5    48502    4.76%    1065713    4.43%      /genealogy/descendtext.php
6    15485    1.52%    837550     3.48%      /genealogy/descend.php
7    1408     0.14%    749237     3.11%      /blog/about/
8    1425     0.14%    657778     2.73%      /blog/store/
9    20556    2.02%    580061     2.41%      /bbs/viewtopic.php
10   23127    2.27%    540937     2.25%      /genealogy/familygroup.php

These four should not be indexed or followed:

- desctracker.php
- descendtext.php
- descend.php
- pedigree.php

That's 5.2 GB of my bandwidth right there.
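For reference, in the tngrobots.php listing at the top of the thread, "descend" and "pedigree" sit in the "no indexing, but allow link following" group, "desctracker" is already "no index, no follow", and "descendtext" is not named at all, so it falls through to the default. The sketch below shows one way the four scripts could all end up in the most restrictive group. `robotsMetaFor()` is a hypothetical helper with deliberately shortened lists, not TNG code. Keep in mind that a robot has to download a page before it can read its meta tags, so this alone does not save the bandwidth of that request.

```php
<?php
// Sketch only: robotsMetaFor() is a hypothetical helper, not TNG code.
// It mirrors the three groups of the posted switch statement.
function robotsMetaFor(string $script): string
{
    // Abbreviated lists; the posted switch names many more scripts.
    $fullIndex   = ['getperson', 'familygroup', 'showphoto', 'showsource'];
    $noIndexOnly = ['browsephotos', 'browsesources', 'search', 'ahnentafel'];

    if (in_array($script, $fullIndex, true)) {
        return '';
    }
    if (in_array($script, $noIndexOnly, true)) {
        return "<meta name=\"robots\" content=\"noindex\">\n";
    }
    // descend, descendtext, desctracker, pedigree and any unlisted script
    // get the most restrictive tag.
    return "<meta name=\"robots\" content=\"noindex,nofollow\">\n";
}

foreach (['getperson', 'descend', 'descendtext', 'desctracker', 'pedigree'] as $script) {
    echo $script, ': ', trim(robotsMetaFor($script)) ?: '(no restriction)', "\n";
}
```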

  • 1 month later...
palmspringsbum

No replies?

Just as well. I just checked and my viewed-to-bots bandwidth ratio is currently about 2:1, about 8M to 4M.

That is a complete reversal.

After trying everything else, I spent some time on the robots.txt file, telling the bots to ignore just about everything but "getperson.php".

I had been going at this with the assumption that if bots were ignoring the "nofollow,noindex" they certainly wouldn't pay any attention to a robots.txt file. It seems I was wrong.

Oh, yeah, I put robots.txt files in the subdirectories/subdomains as well.

I also recall that in the process I discovered all my subdomains had somehow gotten screwed up and were pointing at the wrong place, if they were pointing anywhere at all, and so I fixed the subdomain redirects.
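A robots.txt along the lines described above might look like the fragment below. This is only a sketch, not the poster's actual file: the /genealogy/ prefix is taken from the URL table earlier in the thread, and the script names are the ones singled out there plus relationship.php. A well-behaved robot checks robots.txt before requesting a page, whereas it has to download the page to see a meta tag, which is why this approach can save bandwidth. Each subdomain is a separate host and needs its own robots.txt at its root.

```
# Sketch only; adjust the paths to wherever TNG is installed on your site.
User-agent: *
Disallow: /genealogy/descend.php
Disallow: /genealogy/descendtext.php
Disallow: /genealogy/desctracker.php
Disallow: /genealogy/pedigree.php
Disallow: /genealogy/relationship.php
```

Disallow values are matched as path prefixes, so each line also covers the same script called with any query string.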

Jay Wilpolt

Here is my robots.txt file.

It's quite restrictive, so you may want to remove some of the entries.

You need to change the paths to match the path from your hosting root folder.

Hope this helps.

Jay

robots.txt

Larry Harrell

Darrin,

Does this code go in index.php, or do we need tngrobots.php? If so, can you send tngrobots.php with instructions on how to connect it to the main TNG index.php page?

Larry
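The thread does not show how TNG itself wires tngrobots.php in, so the following is only a rough sketch of the general idea: some header code includes the file, which sets `$flags['norobots']`, and then prints that string inside the page's head section. `renderHead()` and the page title are hypothetical names for illustration, not TNG functions.

```php
<?php
// Hypothetical sketch; TNG's real header code is not shown in this thread.
function renderHead(string $title, array $flags): string
{
    // An empty or missing value means "no restriction": nothing is printed.
    $meta = $flags['norobots'] ?? '';
    return "<head>\n<title>" . htmlspecialchars($title) . "</title>\n" . $meta . "</head>\n";
}

// In a real page, $flags would be filled in by something like
//     include 'tngrobots.php';
// Here it is set by hand so the sketch runs on its own.
$flags = ['norobots' => "<meta name=\"robots\" content=\"noindex,nofollow\">\n"];
echo renderHead('Login', $flags);
```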

  • 2 months later...
Henrik Poulsen

Has anyone found out how tngrobots.php works?

Is it right to assume that no robots.txt is needed, and that edits are done in tngrobots.php?
