A bug in google_xml.php file

The GYM Sitemaps & RSS module for phpBB: sitemaps and RSS feeds for Google, Yahoo! and MSN Live, plus support, add-ons, etc.


Postby wwwmaster » Sun Jan 04, 2009 4:08 am

Hello,

First of all, I want to thank you for the truly wonderful package of code you have implemented in this GYM Sitemaps package. And the other phpBB SEO packages aren't bad either.

Anyway, I found two bugs in google_xml.php in version 2.0RC3.

On line 88 you test whether the sitemap to be loaded is a local file or a URL, and set the timeout accordingly. The test is:

if (strpos('http://', $xml_file) !== false)

That test always fails, because the haystack and needle arguments are swapped; it should be:

if (strpos($xml_file, 'http://') !== false)
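
For illustration, here is how the corrected check could look in context. This is only a sketch: the $timeout variable and the timeout values are made up, not the module's actual code:

Code: Select all
// strpos($haystack, $needle): search the sitemap location for the scheme,
// not the other way around. A stricter test would be
// strpos($xml_file, 'http://') === 0, since the scheme should be at the start.
if (strpos($xml_file, 'http://') !== false) {
    $timeout = 30; // remote sitemap: allow for network latency
} else {
    $timeout = 5;  // local file: a short timeout is enough
}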


Then there is another bug later in the same file, on line 113.

If full XML parsing is not enabled for external sitemaps (Last modification: no in the XML Sitemaps config), the data is simply copied for use.

However, the link count for the sitemap is set to 'n/a', which causes the actual sitemap page shown by sitemap.php to give the error message "This page does not contain enough item to be displayed."

A quick and somewhat dirty fix would be to remove this line and to add the following line above the unset($xml_data) line:

str_replace('<url>', '', $xml_data, $this->outputs['urls_sofar']);

This counts the number of <url> tags in the XML data and puts the number in the urls_sofar entry of the array.
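
For illustration, a standalone sketch of how the fourth argument works (note that it requires PHP 5):

Code: Select all
$xml_data = '<url><loc>a</loc></url><url><loc>b</loc></url>';
// str_replace() writes the number of replacements it made
// into its fourth, by-reference argument.
str_replace('<url>', '', $xml_data, $count);
echo $count; // prints 2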


The other way to resolve this would be to change the comparison on this urls_sofar variable, so that it accepts 'n/a' as a positive match and the sitemap can be shown.

Postby wwwmaster » Sun Jan 04, 2009 5:52 am

I also tried to use it with the Last modification setting turned on, but then the code failed to parse the XML document.

The included xml2array() function doesn't return anything for my sitemap, even though the sitemap itself is perfectly valid according to Google.

Perhaps you should use PHP's built-in XML parser and parse the XML into an array?
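
For example, a rough sketch with SimpleXML (PHP 5 only, and assuming $xml_data holds the raw, well-formed sitemap) could look like this:

Code: Select all
// Parse the sitemap (simplexml_load_string() returns false on failure)
$xml = simplexml_load_string($xml_data);
$entries = array();
if ($xml !== false) {
    // Collect each <url> entry; SimpleXML follows the document's default namespace
    foreach ($xml->url as $url) {
        $entries[] = array(
            'loc'     => (string) $url->loc,
            'lastmod' => isset($url->lastmod) ? (string) $url->lastmod : '',
        );
    }
}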

I think I will make that code later.

Postby SeO » Sun Jan 04, 2009 9:00 am

Good work :D

I'll commit the strpos('http://', $xml_file) fix right now to make sure we won't miss it in the next version.

The n/a bug is a leftover from back when sitemaps had no thresholds.
To stay compatible with PHP 4, I'd replace:
Code: Select all
               // free memory
               unset($xml_data);
               // No link count here
               $this->outputs['url_sofar'] = 'n/a';

with:
Code: Select all
               // Link count: explode() returns one more piece than there are <url> tags
               $this->outputs['url_sofar'] = count(explode('<url>', $xml_data)) - 1;
               // free memory
               unset($xml_data);


About the xml2array function, well, I need to investigate a bit more, but as a dumb check, did you use a sitemapindex?

I know that PHP usually comes with XML functions, but we thought that a simple function like this was enough, and it has the great advantage of working in all PHP versions (at least from PHP 4).

I'm actually interested in your views about it, but I think that keeping it is the way to go. By the way, it should work; I'll report here as soon as I get time to test more ;)

Postby wwwmaster » Sun Jan 04, 2009 4:51 pm

Yes, I am using a real sitemap, at http://www.pokeri.com/sitemap.xml

I have tested the regexp, and if it is like this:

$reg_exp = '#<(\w+)[^>]*>(.*?)</\1>#s';

It does not work; it matches zero times. However, if I replace the backreference with a literal closing tag:

$reg_exp = '#<(\w+)[^>]*>(.*?)</url>#s';

It matches the data correctly.

When looking at the matched items array with the latter regexp, I can see that the starting tag <url> is matched correctly, but the backreference doesn't work.
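
For reference, my standalone test looks roughly like this (the two-entry sitemap here is invented):

Code: Select all
$text = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n"
      . '<url><loc>http://example.com/a</loc></url>' . "\n"
      . '<url><loc>http://example.com/b</loc></url>' . "\n"
      . '</urlset>';
$reg_exp = '#<(\w+)[^>]*>(.*?)</\1>#s';
preg_match_all($reg_exp, $text, $match);
print_r($match); // inspect which tags, if any, were matched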

Postby wwwmaster » Sun Jan 04, 2009 5:01 pm

And even with \\1 as the backreference, it fails.

Postby SeO » Sun Jan 04, 2009 5:05 pm

Interesting; it could be due to your PHP settings.

Anyway, the regex could be simplified to:

Code: Select all
$reg_exp = '#<url>(.*)</url>#Ui';


For the usage we have, it should be enough, and it works in more cases.

Postby wwwmaster » Sun Jan 04, 2009 6:26 pm

I found the problem.

My sitemap contains line feeds, and for some reason the original regexp doesn't work with it.

However, if I use this regexp, the backreferences and the code work:

$reg_exp = '#<(\w+)>(.*?)</\\1>#s';

That is, I removed the part designed to match tag attributes. In this case it should be OK, since sitemaps don't contain attributes?

Postby wwwmaster » Sun Jan 04, 2009 6:36 pm

Actually, the problem isn't solved yet.

The urlset tag contains attributes, and now the xml2array() function doesn't parse the opening urlset tag.

I'll keep digging :)

Postby wwwmaster » Sun Jan 04, 2009 8:17 pm

This is a replacement for the xml2array() function which works correctly for me:

Code: Select all
function xml2array($text, $inner=0) {
        // Lazy match for nested content, greedy match for the outermost element
        if ($inner) {
                $reg_exp = '`<(\w+)[^>]*>(.*?)</\1>`s';
        } else {
                $reg_exp = '`<(\w+)[^>]*>(.*)</\1>`s';
        }
        // Initialise so the function returns an array even when nothing matches
        $array = array();
        preg_match_all($reg_exp, $text, $match);
        foreach ($match[1] as $key => $val) {
                if ( preg_match($reg_exp, $match[2][$key]) ) {
                        // The content contains nested tags: recurse with the lazy pattern
                        $array[$val][] = $this->xml2array($match[2][$key], 1);
                } else {
                        $array[$val] = $match[2][$key];
                }
        }
        return $array;
}


The problem was that the regexp with lazy matching (.*?) didn't work with the outer <urlset> tag in the sitemap file. Therefore I use a different, greedy regexp on the whole sitemap and the lazy one only on inner content.
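
For a two-entry sitemap, it returns a nested array shaped roughly like this (a sketch of the structure, with invented values, trimmed for readability):

Code: Select all
$map = $this->xml2array($xml_data);
// Roughly:
// array('urlset' => array(0 => array('url' => array(
//     0 => array('loc' => 'http://example.com/a', 'lastmod' => '...'),
//     1 => array('loc' => 'http://example.com/b', 'lastmod' => '...'),
// ))))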

Postby SeO » Mon Jan 05, 2009 9:26 am

Oh, if so, I think that you can just use one regex:
Code: Select all
$reg_exp = '`<(\w+)[^>]*>(.*)</\1>`s';


The question mark after the * just makes the match lazy, and the U pattern modifier is a cleaner way to say the same thing.

Thinking about it a bit more, I think that the correct regex should rather be:
Code: Select all
$reg_exp = '`<(\w+)[^>]*>(.*)</\1>`Ui';


"U" stands for ungreedy, meaning that the first matching closing tag will be used, it's actually better than the "s" option, which is only asking for a single line of string (url tag could after all take two or three lines depending on the way it was done).

This function was originally taken from php.net, but there is always a way to do better. I kind of like its simplicity though ;)

Postby wwwmaster » Mon Jan 05, 2009 11:56 pm

SeO wrote:Thinking about it a bit more, I think that the correct regex should rather be:
Code: Select all
$reg_exp = '`<(\w+)[^>]*>(.*)</\1>`Ui';


"U" stands for ungreedy, meaning that the first matching closing tag will be used, it's actually better than the "s" option, which is only asking for a single line of string (url tag could after all take two or three lines depending on the way it was done).


s stands for the PCRE_DOTALL option, that is, the dot matches newlines too. Without the s option, the regexp matches only tags whose opening and closing tags are on the same line. The outermost <urlset> tag practically always has linefeeds in its contents, so the regexp cannot match it. (I tested this regexp with my sitemap and it didn't work. :))

So, building a regexp to handle recursive parsing of XML tags is not that simple.


Anyway, another problem with the sitemap parsing is that in my case the script exceeded the allowed memory limit at my web host. The sitemap had 885 URLs. So, the code might also need other changes to reduce memory consumption.
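
To pin down where the memory actually peaks, something like this could help (memory_get_peak_usage() requires PHP 5.2):

Code: Select all
// Print the peak memory used so far, in MB
echo memory_get_peak_usage(true) / (1024 * 1024), " MB\n";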

Postby SeO » Tue Jan 06, 2009 1:06 pm

You're right about the s option, so:
Code: Select all
$reg_exp = '`<(\w+)[^>]*>(.*)</\1>`Usi';


should be our final answer ;)

About performance, well, parsing big files requires some memory and there is not much to do about it. You can try the PHP XML functions, but I'm not sure you'll save that much RAM.

A road to faster output when parsing XML could be to do the job with preg_replace directly, to perform some basic tasks such as updating the lastmod tags, but no shuffle or robots.txt checking could be done without parsing all tags.
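
Something along these lines, for instance; just a sketch of the idea, the date handling would need more care:

Code: Select all
// Rewrite every <lastmod> value in one pass, without building an array of tags
$xml_data = preg_replace(
    '#<lastmod>[^<]*</lastmod>#i',
    '<lastmod>' . gmdate('Y-m-d') . '</lastmod>',
    $xml_data
);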

So I guess that you can still use external sitemaps without parsing them at all; if the original sitemap does not include lastmod tags, that could be enough.

Do you have any data about the memory limit you hit?
Usually phpBB3 needs about 16M to run well, and it's even often advised to set it to 32M. Very huge dynamic sitemaps have been built with a lot less RAM than that; have you seen the record thread?

All this to say that if 8M was your RAM limit, chances are high that you'll run into memory limit trouble one day or another with phpBB3 alone.

Postby wwwmaster » Sat Jan 10, 2009 8:40 am

SeO wrote:You're right about the s option, so:
Code: Select all
$reg_exp = '`<(\w+)[^>]*>(.*)</\1>`Usi';


should be our final answer ;)


That doesn't work either. The only one that works for me so far is the solution I presented above.

SeO wrote:Do you have any data about the memory limit you hit?
Usually phpBB3 needs about 16M to run well, and it's even often advised to set it to 32M. Very huge dynamic sitemaps have been built with a lot less RAM than that; have you seen the record thread?


The memory limit is 128 MB on the hosting service. Maybe there was some other problem with the code; I'll have to check once I have time.

However, memory consumption depends quite a lot on how a particular feature is implemented. I haven't studied the sitemap generation code, but I assume that there are no arrays with several hundred entries.

Actually the code to count the number of links:

Code: Select all
               // Link count: explode() returns one more piece than there are <url> tags
               $this->outputs['url_sofar'] = count(explode('<url>', $xml_data)) - 1;
               // free memory
               unset($xml_data);


uses much more memory than the code I used, because it generates a temporary array of all the pieces. With large sitemaps that becomes a problem too.
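
A lighter option for both would be substr_count(), which has been around since PHP 4 and counts occurrences without building a temporary array or a replaced copy of the string. A sketch:

Code: Select all
               // Link count without a temporary array or string copy
               $this->outputs['url_sofar'] = substr_count($xml_data, '<url>');
               // free memory
               unset($xml_data);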

Postby dcz » Sat Jan 10, 2009 10:12 am

Parsing a big flat file will always require some memory; the Google XML module could certainly do better, but as long as you parse the entire file, it will be pretty heavy.

Now, it's not necessarily an issue, since the result can (and should) be cached, and would thus later be sent to the client with a simple readfile on the cached flat file (hard to use less memory than that). If the cache refresh is too heavy, you can still deactivate cache auto refresh for the module and then update it by hand once in a while, or just not parse the source file and use it pretty much as is in GYM (all options of the first group set to no).
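
The serving side is then trivial, something like this sketch (the $cache_file path here is hypothetical):

Code: Select all
// $cache_file is a hypothetical path to the cached sitemap
header('Content-Type: text/xml; charset=UTF-8');
// readfile() streams the file to the client without loading it all into memory
readfile($cache_file);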

About the 128M limit, I'm not sure the XML module never went over it, since that mainly depends on the size of the source. But what is for sure is that you should be able to dynamically (with the regular forum module) generate and cache over 10,000 URLs with less than 10 MB; at least that is what the benchmark shows.

++

