Hide nonexistent pages from Google

DamianYerrick Since: Jan, 2001
#1: Jul 6th 2012 at 3:08:23 PM

I find a bunch of "Click the edit button" pages in a lot of Google site searches, and they're confusing people. I'd recommend adding a robots control directive to make nonexistent pages invisible to Google.

Steps to reproduce: visit Red Link Not The One From Zelda and view the page source.

Expected result: the following markup inside the <head> element.

<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">

Actual result: none of the four <meta> elements is a robots control directive.

TwoGunAngel The Demon Slayer Since: Jul, 2010 Relationship Status: Singularity
#2: Jul 7th 2012 at 9:10:17 AM

Except that letting robots crawl the site is kinda necessary for things like the Wayback Machine to do their job. I like to use it from time to time to check out past versions of pages, cut or not, such as when I'm checking out stuff for the CV forums to review, or just to amuse myself. If we could find a way to keep nonexistent pages from showing up on Google while still allowing that kind of functionality, we'd be pretty much golden.

edited 7th Jul '12 9:12:45 AM by TwoGunAngel

Lock Space Wizard from Germany Since: Sep, 2010
#3: Jul 7th 2012 at 9:44:17 AM

[up] I don't follow.

Existing pages could still be indexed by Google, the Web Archive, etc.; any page that actually exists stays crawlable.

Non-existing pages would not get indexed and would consequently drop out of Google.

Where's the problem?

Programming and surgery have a lot of things in common: Don't start removing colons until you know what you're doing.
DamianYerrick Since: Jan, 2001
#4: Jul 7th 2012 at 10:17:00 AM

The problem might be that a cut page would get removed from the Wayback Machine if this were implemented. But then, this page only claims that /robots.txt exclusions result in removal from the Wayback Machine; it doesn't mention what happens with the <meta> tag.

Probably a better way to do it is to serve nonexistent pages with a 404 status code instead of the ordinary 200.
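
A sketch of what that would mean on the wire (the headers below are illustrative, not TV Tropes' actual output): only the status line changes, and the body can stay exactly as it is now, so humans still see the edit prompt while crawlers drop the page.

HTTP/1.0 404 Not Found
Content-Type: text/html

<html>... the usual "Click the edit button to start this new page" markup ...</html>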

edited 7th Jul '12 10:19:59 AM by DamianYerrick

Lock Space Wizard from Germany Since: Sep, 2010
#5: Jul 7th 2012 at 10:31:12 AM

Well, then simply do this:
<meta name="googlebot" content="noindex, nofollow">

edited 7th Jul '12 10:36:27 AM by Lock

Programming and surgery have a lot of things in common: Don't start removing colons until you know what you're doing.
Xtifr World's Toughest Milkman Since: Jan, 2001 Relationship Status: Having tea with Cthulhu
#6: Jul 8th 2012 at 12:19:17 AM

Then all the other search engines will still be collecting the page. If we knew the archive's bot's name, we could maybe forbid all spiders except that one, but I suspect this is all very low priority.

Speaking words of fandom: let it squee, let it squee.
Fighteer Lost in Space from The Time Vortex (Time Abyss) Relationship Status: TV Tropes ruined my love life
#7: Jul 8th 2012 at 6:59:46 AM

Non-existing pages should return a 404. That would solve the problem entirely as I understand it.

"It's Occam's Shuriken! If the answer is elusive, never rule out ninjas!"
SeptimusHeap from Switzerland (Edited uphill both ways) Relationship Status: Mu
#8: Jul 8th 2012 at 7:02:47 AM

[up] Non-existing URLs (which search engines don't show) already return a 404. Non-existing wiki pages (which search engines do show) return an ordinary 200 with "Click the edit button to start this new page".

"For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled." - Richard Feynman
Fighteer Lost in Space from The Time Vortex (Time Abyss) Relationship Status: TV Tropes ruined my love life
#9: Jul 8th 2012 at 7:07:23 AM

Perhaps I'm misunderstanding, but I'm almost certain that you can provide a custom 404 page and retain the underlying URL.

"It's Occam's Shuriken! If the answer is elusive, never rule out ninjas!"
SeptimusHeap from Switzerland (Edited uphill both ways) Relationship Status: Mu
#10: Jul 8th 2012 at 7:09:19 AM

The problem is not nonexistent URLs (which Google doesn't turn up), but nonexistent wiki pages, which Google does show. The latter can't be 404'ed.

"For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled." - Richard Feynman
jkbeta from right behind you Since: Dec, 2010 Relationship Status: You cannot grasp the true form
#11: Jul 8th 2012 at 7:27:38 AM

Theoretically, you can 404 such a page and still serve the regular page HTML in the body, but it's questionable whether this will work correctly for all users (because of browser and proxy compatibility issues).

Lock Space Wizard from Germany Since: Sep, 2010
#12: Jul 8th 2012 at 8:06:13 AM

"Then all the other search engines will still be collecting the page."
Most of this problem relates to the site's "in-house" search, which is powered by Google, and even when somebody uses a search engine directly, most of the time it's Google anyway. What you're referring to covers maybe 1-5% of those affected.

"If we knew the archive's bot's name, we could maybe forbid all spiders except that one, but I suspect this is all very low priority."
The Internet Archive's crawler is "ia_archiver", but the problem here is that with robots <meta> tags you can't do whitelisting; you can only block all crawlers or block specific ones.

What you mean is this:
<meta name="robots" content="noindex, nofollow">
<meta name="ia_archiver" content="index, follow">

But the blanket "robots" rule applies to ia_archiver too, so that combination still excludes it.

One can do whitelisting, but only with robots.txt, which works on URL paths and directories and thus doesn't help much here, where we need control on a page-by-page basis.
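
Just to show the syntax, a robots.txt whitelist would look something like this (an empty Disallow permits everything for that crawler, while the catch-all record blocks the rest; note it is still path-based, so it can't single out red links):

User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /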

If you really think it's that important, we could blacklist the crawlers of the most important search engines (I still think just excluding Google does the job fine).

Programming and surgery have a lot of things in common: Don't start removing colons until you know what you're doing.
HiddenWindshield King of Crayons from somewhere else Since: Aug, 2010
#13: Jul 8th 2012 at 1:37:55 PM

[up][up][up] Yes, you can 404 a Non Existing Page. It really should be as simple as adding:

if (pageIsRedLink()) { header("HTTP/1.0 404 Not Found"); } // pageIsRedLink() is a placeholder existence check

to pmwiki.php, before any other output is generated. It would keep non-existent pages off of Google and other search engines, but you could still get to those pages to create them if need be.

(Note: if TV Tropes is using FastCGI, the header line would be header("Status: 404 Not Found"); instead.)
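
If one wanted to cover both SAPIs in a single snippet, something like the sketch below could work; php_sapi_name() is standard PHP, while pageIsRedLink() remains the hypothetical check from above:

if (pageIsRedLink()) {
    // CGI/FastCGI SAPIs (e.g. "cgi-fcgi", "fpm-fcgi") expect the "Status:" form.
    if (strpos(php_sapi_name(), 'cgi') !== false) {
        header("Status: 404 Not Found");
    } else {
        header("HTTP/1.0 404 Not Found");
    }
}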

I teleported home last night / With Ron and Sid and Meg / Ron stole Meggy's heart away / And I got Sidney's leg - Douglas Adams
DamianYerrick Since: Jan, 2001
#14: Jul 11th 2013 at 8:57:11 AM

#11: Internet Explorer has a quirk called "friendly error messages" where it will replace any error page shorter than 512 bytes with its own error page, but we don't need to worry about that too much because our page for a nonexistent article is far longer than 512 bytes.
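
For the record, the usual workaround where that quirk does bite is simply to pad the body past the threshold, along these lines (renderRedLinkPage() is a made-up stand-in for the normal render step):

$body = renderRedLinkPage(); // hypothetical: builds the "Click the edit button" page
if (strlen($body) < 512) {
    // Pad with an HTML comment so IE shows our page instead of its own.
    $body .= '<!-- ' . str_repeat('padding ', 64) . ' -->';
}
echo $body;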

edited 11th Jul '13 8:57:43 AM by DamianYerrick

Total posts: 14