Parsing HTML Table Data from XML with PHP

Question

I am fairly new to PHP and can't work out what I am doing wrong here.

Problem: I am trying to get the href of a specific HTML element inside a string of HTML that is embedded in an XML element of Reddit's feed. If you visit the page, it is the actual link to the video: the external YouTube (or other site) link, not the Reddit link, and nothing else.

Here is my code so far (code updated):

Update: Loop-mania! Got all of the hrefs, but am now trying to store them inside a global array to access a random one outside of this function.

function getXMLFeed() {
    echo "<h2>Reddit Items</h2><hr><br><br>";
    $feedURL = 'https://www.reddit.com/r/videos/.xml?limit=200';
    $xml = simplexml_load_file($feedURL);
    // Treat each XML entry from Reddit as an item
    foreach ($xml->entry as $item) {
        foreach ($item->content as $content) {
            // str_get_html() comes from the Simple HTML DOM library
            $html = str_get_html((string) $content);

            foreach ($html->find('table') as $table) {
                // First <span> in the table; find() returns null when there is none
                $links = $table->find('span', 0);
                if ($links === null) {
                    continue;
                }
                foreach ($links->find('a') as $link) {
                    echo $link->href;
                }
            }
        }
    }
}

XML Code: http://pasted.co/0bcf49e8

I've also included JSON if it can be done this way; I just preferred XML: http://pasted.co/f02180db

That is pretty much all of the code. Though, here is another piece I tried to use with DOMDocument (scrapped it).

    foreach ($item -> content as $content) {
       $dom = new DOMDocument();
       $dom -> loadHTML($content);
       $xpath = new DOMXPath($dom);
       $classname = "/html/body/table[1]/tbody/tr/td[2]/span[1]/a";

       foreach ($dom->getElementsByTagName('table') as $node) {
          echo $dom->saveHtml($node), PHP_EOL;
          //$originalURL = $node->getAttribute('href');
       }

       //$html = $dom->saveHTML();

    }

I can parse the table fine, but when it comes to getting a specific element's value (nothing has an ID or class), I can only seem to get ALL anchor tags or ALL table rows, etc.

Can anyone point me in the right direction? Let me know if there is anything else I can add here. Thanks!

Added HTML: I am specifically trying to extract <span><a href="https://www.youtube.com/watch?v=nZC4mXaosxM">[link]</a></span> from each table/item. http://pastebin.com/QXa2i6qz

Post your whole code as well as the XML file, and make your question clearer.
– Wolverine, Commented Dec 8, 2016 at 17:35
So, you want to extract the href from <link rel="self" href="https://www.reddit.com/r/videos/.xml?limit=500&after=t3_10omtd%2F" type="application/atom+xml" />?
– Wolverine, Commented Dec 8, 2016 at 17:49
Negative sir. I am trying to obtain the external link to the actual video (which is part of an HTML string under the <content> tag). There are 2-3 links in there: 2 of which I don't need and link to reddit itself; the other is the comments. Note: If you can render the rss/xml feed or set of tables in HTML, it would be anything that is <a href="youtube.com or some other video link">[link]</a>
– cookie401, Commented Dec 8, 2016 at 17:52
I've added an answer with code that only works for YouTube. If you want to support other video sharing websites, you can add those site names to the regex in the code.
– Wolverine, Commented Dec 8, 2016 at 18:44

Wolverine · Accepted Answer · 2016-12-08 21:07:05Z

2

The following code extracts the video link from each entry's content.

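function extract_youtube_link($xml) {
    $entries = $xml['entry'];
    $videos = [];
    foreach($entries as $entry) {
        // The feed stores the HTML entity-encoded, so decode it first
        $content = html_entity_decode($entry['content']);
        // Capture the href of the anchor labelled [link]; [^"]+ stops at the
        // closing quote, so the match cannot run past the attribute value
        preg_match_all('/<span><a href="([^"]+)">\[link\]/', $content, $matches);
        if(!empty($matches[1][0])) {
            $videos[] = array(
                'entry_title' => $entry['title'],
                // Strip the leading "/u/" from the author name
                'author' => preg_replace('/\/(.*)\//', '', $entry['author']['name']),
                'author_reddit_url' => $entry['author']['uri'],
                'video_url' => $matches[1][0]
            );
        }
    }

    return $videos;
}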

$xml = simplexml_load_file('reddit.xml');
$xml = json_decode(json_encode($xml), true);
$videos = extract_youtube_link($xml);

foreach($videos as $video) {
    echo "<p>Entry Title: {$video['entry_title']}</p>";
    echo "<p>Author: {$video['author']}</p>";
    echo "<p>Author URL: {$video['author_reddit_url']}</p>";
    echo "<p>Video URL: {$video['video_url']}</p>";
    echo "<br><br>";
}

The function returns an array of arrays; each element has the keys entry_title, author, author_reddit_url and video_url. Hope it helps you!
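
The question's update also asks how to keep the links around and pick a random one later. Returning the array from the function, as above, avoids the need for a global. The following is only a minimal sketch of the random pick: the helper name pick_random_video is hypothetical, and the sample data is made up to match the shape that extract_youtube_link() returns.

```php
<?php
// Hypothetical helper (not part of the original answer): return one random
// entry from the list of videos, or null when the list is empty, because
// array_rand() cannot be called on an empty array.
function pick_random_video(array $videos): ?array
{
    if ($videos === []) {
        return null;
    }
    return $videos[array_rand($videos)];
}

// Made-up sample data in the same shape extract_youtube_link() returns.
$videos = [
    ['entry_title' => 'First', 'video_url' => 'https://example.com/a'],
    ['entry_title' => 'Second', 'video_url' => 'https://example.com/b'],
];

$random = pick_random_video($videos);
if ($random !== null) {
    // Escape the value before printing it into an HTML page.
    echo htmlspecialchars($random['video_url']), PHP_EOL;
}
```

In real use, $videos would be the result of extract_youtube_link($xml) rather than the sample array.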

edited Dec 8, 2016 at 21:07

answered Dec 8, 2016 at 18:42


9 Comments

cookie401 Over a year ago

This is awesome, but unfortunately not all links are from youtube, just most of them. Was just referencing youtube as an example that I want those links, not the reddit ones.

Wolverine Over a year ago

You want all the video links, including ones other than YouTube? Can you be a bit more specific?

cookie401 Over a year ago

I've achieved what I am trying to do with the links now. Though, if I could make it a bit cleaner, I'd be up for seeing it. Basically, the feed gives a list of table blocks with a few links in them. Two of them link to reddit itself, the other is the link to the actual video on some external site (be it youtube, vimeo, etc.) - I want the video link/hrefs only.

Wolverine Over a year ago

So, you want the external video link from every content tag? It can be YouTube, Vimeo, or any other site? And do all the content tags have video links, or are some missing?

Wolverine Over a year ago

Yeah. That's right. Never use regex to parse HTML. I used regex in the code because there isn't much HTML in the feed's content tag and it is easy to parse. Never opt for regex on large HTML documents.


Pancho Berrizbeitia · Accepted Answer · 2016-12-08 17:39:18Z

0

If you're looking for a specific element, you don't need to walk the whole document. One way is to use the DOMXPath class and query the markup directly. The documentation should guide you through it:

http://php.net/manual/es/class.domxpath.php
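
As a rough illustration of the DOMXPath approach, here is a minimal sketch. It assumes the content of each entry looks like the HTML the asker posted, with the video link inside a span and labelled [link]. The helper name extract_link_hrefs and the sample markup are hypothetical. The query selects by the anchor's text rather than by position, and it avoids tbody: as far as I know, browsers insert tbody into tables but PHP's HTML parser does not, so a path copied from browser dev tools may match nothing.

```php
<?php
// Hypothetical helper: return the href of every <a> inside a <span>
// whose visible text is exactly "[link]".
function extract_link_hrefs(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }

    $dom = new DOMDocument();
    // The content is an HTML fragment, so collect parser warnings
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $hrefs = [];
    foreach ($xpath->query('//span/a[normalize-space(.) = "[link]"]/@href') as $attr) {
        $hrefs[] = $attr->nodeValue;
    }
    return $hrefs;
}

// Made-up sample resembling one decoded <content> value from the feed.
$sample = '<table><tr><td><a href="https://www.reddit.com/user/someone">/u/someone</a><br/>'
    . '<span><a href="https://www.youtube.com/watch?v=nZC4mXaosxM">[link]</a></span> '
    . '<span><a href="https://www.reddit.com/r/videos/comments/abc/">[comments]</a></span>'
    . '</td></tr></table>';

print_r(extract_link_hrefs($sample));
```

With the feed itself, the string passed in would be (string) $content from each entry, as in the question's loop.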

answered Dec 8, 2016 at 17:39

