Googlebot doesn't seem to be bothering to follow the 301s for these files though.
The redirects themselves are working properly, so there is nothing wrong with the mod. Testing with wget, spoofing the Googlebot user agent, gives this output:
wget 'http://www.lucy-pinder.tv/forum/viewforum.php?f=6&sid=fd52e349e3fec18b2eb9352c126bb97e' -O /dev/null --user-agent 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
--18:25:04-- http://www.lucy-pinder.tv/forum/viewforum.php?f=6&sid=fd52e349e3fec18b2eb9352c126bb97e
=> `/dev/null'
Resolving www.lucy-pinder.tv... 213.162.113.18
Connecting to www.lucy-pinder.tv|213.162.113.18|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.lucy-pinder.tv/forum/what-s-this-all-about-then-f6.html [following]
--18:25:04-- http://www.lucy-pinder.tv/forum/what-s-this-all-about-then-f6.html
=> `/dev/null'
Reusing existing connection to www.lucy-pinder.tv:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 19,439 --.--K/s
18:25:05 (674.61 KB/s) - `/dev/null' saved [19439]
and the following entries in the server logs:
192.168.1.2 - - [19/Jan/2009:18:25:04 +0000] "GET /forum/viewforum.php?f=6&sid=fd52e349e3fec18b2eb9352c126bb97e HTTP/1.0" 301 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.168.1.2 - - [19/Jan/2009:18:25:05 +0000] "GET /forum/what-s-this-all-about-then-f6.html HTTP/1.0" 200 19439 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
(There is a reverse proxy between the server and the outside world, so every access shows up as coming from 192.168.1.2.)
When the real Googlebot requests one of the old URLs, though, it gets the 301 and never follows it up with a request for the new URL.
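To see how often that happens, the access log can be filtered for Googlebot hits and split into redirected requests and successful requests for rewritten URLs. The following is only a rough sketch: it assumes the log is in the combined format shown in the excerpt above, the function names `googlebot_hits` and `summarize` are made up for illustration, and it matches on the user-agent string alone, so spoofed requests (like the wget test) are counted too.

```python
import re

# Sample lines copied from the log excerpt above (combined log format).
LOG_LINES = [
    '192.168.1.2 - - [19/Jan/2009:18:25:04 +0000] "GET /forum/viewforum.php?f=6&sid=fd52e349e3fec18b2eb9352c126bb97e HTTP/1.0" 301 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.168.1.2 - - [19/Jan/2009:18:25:05 +0000] "GET /forum/what-s-this-all-about-then-f6.html HTTP/1.0" 200 19439 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

# One combined-format entry: ip, two ignored fields, [time], "request",
# status, size, "referer", "user agent".
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def googlebot_hits(lines):
    """Return (time, status, path) for every line whose user agent mentions Googlebot."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and "Googlebot" in m.group("agent"):
            hits.append((m.group("time"), int(m.group("status")), m.group("path")))
    return hits


def summarize(hits):
    """Split hits into 301-redirected paths and rewritten .html paths served with 200."""
    return {
        "redirected": [path for _, status, path in hits if status == 301],
        "rewritten_ok": [
            path for _, status, path in hits
            if status == 200 and path.endswith(".html")
        ],
    }


if __name__ == "__main__":
    for key, paths in summarize(googlebot_hits(LOG_LINES)).items():
        print(key, len(paths))
```

Run against the real log, a long "redirected" list with few or no matching "rewritten_ok" entries would back up the impression that the 301s are not being followed. Because the reverse proxy hides the client address, this cannot tell the genuine crawler apart from anything else sending the same user agent.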
I have therefore added viewforum.php and viewtopic.php to my robots.txt, because something (I don't know what) seems to be confusing Googlebot.
The forum had only been up for a week or so before I installed advanced mod rewrite and advanced zero duplicate, and I only installed them last night, so it seems Google is still requesting the old URLs it had already indexed.
Googlebot also doesn't seem to be respecting robots.txt properly. I have just noticed a Googlebot hit on groupcp.php. I have "Disallow: /forum/groupcp.php" in my robots.txt, and Google has read robots.txt since I added that entry, but that doesn't seem to have stopped it.
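One way to rule out a mistake in the file itself is to run the rules through a parser and ask whether a given URL is blocked for Googlebot. This is a minimal sketch using Python's standard `urllib.robotparser`. Only the groupcp.php line is quoted above; the `User-agent: *` line and the exact form of the other two `Disallow` lines are assumptions, and `is_allowed` is just an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents. Only the groupcp.php rule is quoted in the post;
# the User-agent line and the other two Disallow lines are guesses.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/groupcp.php
Disallow: /forum/viewforum.php
Disallow: /forum/viewtopic.php
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "http://www.lucy-pinder.tv/forum/"
    for page in ("groupcp.php", "viewforum.php?f=6",
                 "what-s-this-all-about-then-f6.html"):
        print(page, is_allowed(ROBOTS_TXT, "Googlebot", base + page))
```

This only shows what the rules say according to Python's parser, which may not match Googlebot's own matching in every detail. If the rules check out here, the hit in the log is more likely down to a cached copy of robots.txt or to a request that only claims to be Googlebot.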