Except that the robots thing is kinda necessary for things like the Wayback Machine to do its job. I like to use it from time to time to check out past versions of pages, cut or not, such as when I'm checking out stuff for the CV forums to review, or just to amuse myself. If we could find a way to prevent links to nonexistent pages from showing up on Google while still allowing that kind of functionality, we'd be pretty much golden.
edited 7th Jul '12 9:12:45 AM by TwoGunAngel
I don't follow.
Existing pages would still be indexable by Google, the Web Archive, etc., for as long as they exist.
Non-existing pages would not get indexed, and consequently would be removed from Google.
What's the problem?
The problem might be that a cut page would get removed from the Wayback Machine if this were implemented. But then, this page only claims that /robots.txt results in removal from Wayback; it doesn't mention what happens with the meta tag.
Probably a better way to do it is send out nonexistent pages with a 404 status code instead of the ordinary 200.
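The decision is just a status-code switch at response time. A minimal sketch in Python (the function name and the page store are hypothetical — the real site is PHP/pmwiki — but the logic is the same):

```python
# Choose the HTTP status for a wiki page request.
# A 404 tells crawlers the page does not exist, so they drop it from
# their indexes, while a human visitor still receives the usual
# "click the edit button to start this new page" body alongside it.
def status_for(page: str, existing_pages: set) -> int:
    return 200 if page in existing_pages else 404

print(status_for("Main/HomePage", {"Main/HomePage"}))  # existing page -> 200
print(status_for("Main/RedLink", {"Main/HomePage"}))   # red link -> 404
```

Crucially, a 404 response can still carry any HTML body you like, so nothing visible changes for readers.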
edited 7th Jul '12 10:19:59 AM by DamianYerrick
- Well, then simply do it
- <meta name="googlebot" content="noindex, nofollow">
edited 7th Jul '12 10:36:27 AM by Lock
Then all the other search engines will still be collecting the page. If we knew the name of the archive's bot, we could maybe forbid all spiders except that one, but I suspect this is all very low priority.
Non-existing pages should return a 404. That would solve the problem entirely, as I understand it.
Non-existing URLs (which search engines don't show) will return a 404. Non-existing pages (which they do show) give "Click the edit button to start this new page".
Perhaps I'm misunderstanding, but I'm almost certain that you can serve a custom 404 page and retain the underlying URL.
Theoretically, you can 404 such a page and still send the regular page HTML, but it's questionable whether this will work correctly for all users, because of browser and proxy compatibility issues.
What you mean is
<meta name="robots" content="noindex, nofollow">
<meta name="ia_archiver" content="index, follow">
But that also excludes ia_archiver.
One can do whitelisting, but that is only possible with robots.txt, which works on directories and thus doesn't help much here, where we need it on a page-by-page basis.
If you really think it's that important, we could blacklist the crawlers of the most important search engines (I still think just excluding Google does the job fine).
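For the record, per-crawler blacklisting with meta tags would look something like this (the googlebot name is documented by Google; bingbot and slurp are the usual names for Bing's and Yahoo's crawlers, but check each engine's documentation before relying on them):

```html
<!-- Keep the page out of the major search engines' indexes,
     while leaving other bots (e.g. ia_archiver) untouched. -->
<meta name="googlebot" content="noindex, nofollow">
<meta name="bingbot" content="noindex, nofollow">
<meta name="slurp" content="noindex, nofollow">
```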
Yes, you can 404 a Non Existing Page. It really should be as simple as adding:
if (pageIsRedLink()) { header("HTTP/1.0 404 Not Found"); }
to pmwiki.php, before any other output is generated. It would keep non-existent pages off of Google and other search engines, but you could still get to those pages to create them if need be.
(Note: if TV Tropes is using FastCGI, the header line would instead be header("Status: 404 Not Found");)
#11: Internet Explorer has a quirk called "friendly error messages" where it replaces any error page shorter than 512 bytes with its own error page, but we don't need to worry about that too much, because our page for a nonexistent article is far longer than 512 bytes.
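If it ever did matter, padding the body past the threshold is trivial. A sketch in Python (the 512-byte threshold is the one from IE's "friendly error messages" quirk; the function is hypothetical):

```python
# Internet Explorer's "friendly error messages" quirk replaces error
# bodies shorter than 512 bytes with IE's own page, so pad short 404
# bodies with an HTML comment until they clear the threshold.
IE_THRESHOLD = 512

def pad_for_ie(body: bytes) -> bytes:
    padding = b"<!-- padding to defeat IE friendly error messages -->\n"
    while len(body) < IE_THRESHOLD:
        body += padding
    return body

print(len(pad_for_ie(b"<html><body>Not found</body></html>")))  # >= 512
```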
edited 11th Jul '13 8:57:43 AM by DamianYerrick
I find a bunch of "Click the edit button" pages in a lot of Google site searches. They're confusing somebody else. I'd recommend adding a robot control directive to make nonexistent pages invisible to Google.
Steps to reproduce: Visit Red Link Not The One From Zelda and view source.
Expected: a robot control directive such as <meta name="robots" content="noindex, nofollow"> inside the <head> element.
Actual result: None of the four <meta> elements is a robot control directive.