I'm getting the following error when trying to use Simple HTML Dom inside a web crawler class. The class seems to be working well but I get many errors in my error_log file.

[01-Apr-2016 23:16:51 UTC] PHP Warning:  Invalid argument supplied for foreach() in /home/scrybs/public_html/order/uploader/php/simple_html_dom.php on line 357

If I check Simple HTML Dom, the error comes from here:

function innertext()
    {
        if (isset($this->_[HDOM_INFO_INNER])) return $this->_[HDOM_INFO_INNER];
        if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);

        $ret = '';
        foreach ($this->nodes as $n)
            $ret .= $n->outertext();
        return $ret;
    }

The crawler class in question is as follows:

class crawler
{
    protected $_url;
    protected $_depth;
    protected $_host;
    protected $_useHttpAuth = false;
    protected $_user;
    protected $_pass;
    protected $_seen = array();
    protected $_filter = array();
    public $contenu = array();

    public function __construct($url, $depth = 5)
    {
        $this->_url = $url;
        $this->_depth = $depth;
        $parse = parse_url($url);
        $this->_host = $parse['host'];
        $this->html = new simple_html_dom();
    }

    protected function _processAnchors($content, $url, $depth)
    {
        //$dom = new DOMDocument('1.0');
        //@$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        //$dom->formatOutput = true;
        $this->html->load($content);

        $metatitle = $this->html->find('title',0)->innertext;
        foreach($this->html->find("meta[name='description']") as $element){
            $metadescription = $element->content;
        }
        foreach($this->html->find("meta[name='keywords']") as $element){
            $metakeywords = $element->content;
        }

        if(!empty($metatitle)){                         
            $this->contenu['meta_titles'][] = $metatitle;
        }
        if(!empty($metadescription)){
            $this->contenu['meta_titles'][] = $metadescription;
        }
        if(!empty($metakeywords)){
            $this->contenu['meta_titles'][] = $metakeywords;
        }

        // IMAGE ALTS
        foreach($this->html->find('img') as $e){
            if(!empty($e->alt)){
                if(!$this->search_array($e->alt, $this->contenu)){
                    $this->contenu['alt_images'][] = $e->alt;
                }
            }
        }

        // LINKS
        $links = $this->html->find('a');
        foreach($links as $element){ 
            // GET LINK TEXTS
            $a = $element->innertext;
            $a = preg_replace("/<a.*?>(.*?)<\/a>/", '\1', $a);
            $a = preg_replace("/<p.*?>.*?<\/p>/", "{{P}}", $a);
            $a = preg_replace("/<img.*?>/", "{{IMG}}", $a);
            $a = preg_replace('#(<br */?>\s*)+#i', "{{BR}}", $a);
            $a = preg_replace('#<button.*?>.*?</button>#i', '{{BUTTON}}', $a);
            $a = preg_replace('#<time.*?>(.*?)</time>#i', '{{TIME}}', $a);
            $a = preg_replace('#<span.*?>(.*?)</span>#i', '{{SPAN}}\1{{/SPAN}}', $a);
            $a = preg_replace('#<strong.*?>(.*?)</strong>#i', '{{STRONG}}\1{{/STRONG}}', $a);
            $a = preg_replace('#<b.*?>(.*?)</b>#i', '{{B}}\1{{/B}}', $a);
            $a = preg_replace('#<i.*?>(.*?)</i>#i', '{{I}}\1{{/I}}', $a);
            $a = preg_replace('#<small.*?>(.*?)</small>#i', '{{SMALL}}\1{{/SMALL}}', $a);
            $a = preg_replace('#<abbr.*?>(.*?)</abbr>#i', '{{ABBR}}\1{{/ABBR}}', $a);
            $a = trim(strip_tags($a));
            $a = preg_replace('/\s+/', ' ', $a);
                // CHECK IF NOT ONLY VARIABLES AND SPACES
                $atmp = strip_tags($a);
                $atmp = preg_replace("/{{.*?}}/", '', $atmp);
                $atmp = preg_replace('/\s+/', '', $atmp);
            if(!empty($a) && $a != '' && $atmp != ''){
                if(!$this->search_array($a, $this->contenu)){
                    $this->contenu['link_texts'][] = $a;
                }
            }

            // GET LINK TITLES
            $title = $element->title;
            if(!empty($title)){
                if(!$this->search_array($title, $this->contenu)){
                    $this->contenu['link_titles'][] = $title;
                }
            }

            $href = $element->href;
                if (0 !== strpos($href, 'http')) {
                    $path = '/' . ltrim($href, '/');
                    if (extension_loaded('http')) {
                        $href = http_build_url($url, array('path' => $path));
                    } else {
                        $parts = parse_url($url);
                        $href = $parts['scheme'] . '://';
                        if (isset($parts['user']) && isset($parts['pass'])) {
                            $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                        }
                        $href .= $parts['host'];
                        if (isset($parts['port'])) {
                            $href .= ':' . $parts['port'];
                        }
                        $href .= $path;
                    }
                }
            // Crawl only link that belongs to the start domain
            $this->crawl_page($href, $depth - 1);
        }
        return $this->contenu;
    }

    protected function _getContent($url)
    {
        $handle = curl_init($url);
        if ($this->_useHttpAuth) {
            curl_setopt($handle, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
            curl_setopt($handle, CURLOPT_USERPWD, $this->_user . ":" . $this->_pass);
        }
        // follows 302 redirects, creates problems with authentication
//        curl_setopt($handle, CURLOPT_FOLLOWLOCATION, TRUE);
        // return the content
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);

        /* Get the HTML or whatever is linked in $url. */
        $response = curl_exec($handle);
        // response total time
        $time = curl_getinfo($handle, CURLINFO_TOTAL_TIME);
        /* Check for 404 (file not found). */
        $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);

        curl_close($handle);
        return array($response, $httpCode, $time);
    }

    protected function _printResult($url, $depth, $httpcode, $time)
    {
        ob_end_flush();
        $currentDepth = $this->_depth - $depth;
        $count = count($this->_seen);
        //echo "N::$count,CODE::$httpcode,TIME::$time,DEPTH::$currentDepth URL::$url <br>";
        ob_start();
        flush();
    }

    protected function isValid($url, $depth)
    {
        if (strpos($url, $this->_host) === false
            || $depth === 0
            || isset($this->_seen[$url])
            || preg_match("/#/i", $url)
            || preg_match("/.png/i", $url)
            || preg_match("/.jpg/i", $url)
            || preg_match("/.jpeg/i", $url)
            || preg_match("/.gif/i", $url)
            || preg_match("/.pdf/i", $url)
            || preg_match("/javascript/i", $url)
            || preg_match("/twitter.com/i", $url)
            || preg_match("/google.com/i", $url)
            || preg_match("/facebook.com/i", $url)
            || preg_match("/youtube.com/i", $url)
            || preg_match("/instagram.com/i", $url)
            || preg_match("/wp-login.php/i", $url)
        ) {
            return false;
        }
        foreach ($this->_filter as $excludePath) {
            if (strpos($url, $excludePath) !== false) {
                return false;
            }
        }
        return true;
    }

    public function search_array($needle, $haystack) {
         if(in_array($needle, $haystack)) {
              return true;
         }
         foreach($haystack as $element) {
              if(is_array($element) && $this->search_array($needle, $element))
                   return true;
         }
       return false;
    }

    public function crawl_page($url, $depth)
    {
        if (!$this->isValid($url, $depth)) {
            return;
        }
        // add to the seen URL
        $this->_seen[$url] = true;
        // get Content and Return Code
        list($content, $httpcode, $time) = $this->_getContent($url);
        // print Result for current Page
        //$this->_printResult($url, $depth, $httpcode, $time);
        // process subPages
        $this->_processAnchors($content, $url, $depth, $contenu = array());
    }

    public function addFilterPath($path)
    {
        $this->_filter[] = $path;
    }

    public function run()
    {
        $this->crawl_page($this->_url, $this->_depth);
    }
}

The error seems to come from this line, which reads the innertext property:

// GET LINK TEXTS
$a = $element->innertext;

I don't get any error when I use:

 $a = $element->plaintext;

But this is not ideal, as I would like to keep the HTML tags. I don't get any error when I use Simple HTML Dom outside the class, so does it have something to do with the fact that Simple HTML Dom is used inside a class? Does anybody have an idea?

Thanks for your help!

  • Can you provide the URL being processed? Also, I see that the DOMDocument lines are commented out: have you also tried with DOM? Commented Apr 1, 2016 at 23:39
  • @fusion3k I tried with DOMDocument but I find it way more complicated to get the result I want, as it needs some tricks to get the inner text of HTML tags... Commented Apr 2, 2016 at 8:23

1 Answer

I have found the bug.

In my (limited) tests, the problem happens when you set depth > 1, that is, given your code, when more than one page URL is loaded. One of the many problems with Simple HTML DOM is that the ->load() method doesn't work correctly when it is called repeatedly on the same object.

If you re-instantiate the html object for each page, the script seems to work:

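protected function _processAnchors( $content, $url, $depth )
{
    // Create a fresh parser for every page instead of reusing the one from the constructor.
    $this->html = new simple_html_dom();                                    # <-----
    $this->html->load( $content );

    // ... rest of the method unchanged ...
}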

I also tested $this->html = str_get_html($content); but it worked only on a limited number of sites.

Additional note: in HTML the <title> tag is mandatory, but not all sites have well-formed HTML. Consider checking that the <title> tag (and every other tag you read) exists before using it, to avoid additional errors.
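
The existence check could look something like the sketch below. The helper name nodeProperty() is hypothetical (it is not part of Simple HTML DOM), and the stdClass object only stands in for a matched node so the snippet runs without the library. It relies on find($selector, 0) returning null when nothing matches.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): read a property from a
// node that may be missing, falling back to a default value.
function nodeProperty($node, $property, $default = '')
{
    // find($selector, 0) gives null when the tag is absent from the page.
    if (!is_object($node)) {
        return $default;
    }
    $value = $node->$property;
    // Treat null/false (e.g. a missing attribute) as "not present".
    return ($value === null || $value === false) ? $default : $value;
}

// Stand-in for a matched <title> node, used only to keep the sketch self-contained.
$fakeTitle = new stdClass();
$fakeTitle->innertext = 'Example page';

echo nodeProperty($fakeTitle, 'innertext'), "\n";          // prints "Example page"
echo nodeProperty(null, 'innertext', '(no title)'), "\n";  // prints "(no title)"
```

In the crawler this would presumably replace the direct property access, for example $metatitle = nodeProperty($this->html->find('title', 0), 'innertext'); so that a page without a <title> yields an empty string rather than an error.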

1 Comment

Working like a charm!! Hats off! Thanks a lot.
