
I am fairly new to PHP and can't work out what I am doing wrong here.

Problem: I am trying to get the href of a specific HTML element from an HTML string stored inside an XML element of a Reddit feed. If you visit this page, it would be the actual link of the video: the external YouTube link (or similar), not the Reddit link, and nothing else.

Here is my code so far (code updated):

Update: Loop-mania! Got all of the hrefs, but am now trying to store them inside a global array to access a random one outside of this function.

// Requires the Simple HTML DOM library (simple_html_dom.php) for str_get_html().
function getXMLFeed() {
    echo "<h2>Reddit Items</h2><hr><br><br>";
    $feedURL = 'https://www.reddit.com/r/videos/.xml?limit=200';
    $xml = simplexml_load_file($feedURL);
    // Treat each XML entry from Reddit as an item
    foreach ($xml->entry as $item) {
        foreach ($item->content as $content) {
            // The content element holds an HTML string; parse it separately
            $html = str_get_html((string) $content);
            if (!$html) {
                continue; // skip entries whose content could not be parsed
            }

            foreach ($html->find('table') as $table) {
                // The first span in each table holds the [link] anchor
                $links = $table->find('span', 0);
                if (!$links) {
                    continue;
                }
                foreach ($links->find('a') as $link) {
                    echo $link->href;
                }
            }
        }
    }
}

XML Code: http://pasted.co/0bcf49e8

I've also included JSON if it can be done this way; I just preferred XML: http://pasted.co/f02180db

That is pretty much all of the code. Though, here is another piece I tried to use with DOMDocument (scrapped it).

    foreach ($item -> content as $content) {
       $dom = new DOMDocument();
       $dom -> loadHTML($content);
       $xpath = new DOMXPath($dom);
       $classname = "/html/body/table[1]/tbody/tr/td[2]/span[1]/a";



       foreach ($dom->getElementsByTagName('table') as $node) {
          echo $dom->saveHtml($node), PHP_EOL;
          //$originalURL = $node->getAttribute('href');
       }

       //$html = $dom->saveHTML();

    }

I can parse the table fine, but when it comes to getting a specific element's value (nothing has an ID or class), I can only seem to get ALL anchor tags or ALL table rows, etc.

Can anyone point me in the right direction? Let me know if there is anything else I can add here. Thanks!

Added HTML: I am specifically trying to extract <span><a href="https://www.youtube.com/watch?v=nZC4mXaosxM">[link]</a></span> from each table/item. http://pastebin.com/QXa2i6qz

  • Please post your whole code as well as the XML file, and make your question clearer. Commented Dec 8, 2016 at 17:35
  • Updated as per your requirements. Commented Dec 8, 2016 at 17:44
  • So, you want to extract href link from <link rel="self" href="https://www.reddit.com/r/videos/.xml?limit=500&amp;after=t3_10omtd%2F" type="application/atom+xml" />? Commented Dec 8, 2016 at 17:49
  • Negative sir. I am trying to obtain the external link to the actual video (which is part of an HTML string under the <content> tag). There are 2-3 links in there, 2 of which I don't need and link to reddit itself; the other is the comments. Note: If you can render the rss/xml feed or set of tables in HTML, it would be anything that is <a href="youtube.com or some other video link">[link]</a> Commented Dec 8, 2016 at 17:52
  • I've added an answer with code that only works for YouTube. If you want to support other video sharing websites, you can add those site names inside the regex in the code. Commented Dec 8, 2016 at 18:44

2 Answers


The following code extracts the [link] URL (the YouTube link, for example) from the content of each entry.

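// Expects the feed as a nested array (SimpleXML converted through JSON).
function extract_youtube_link($xml) {
    $entries = $xml['entry'];
    $videos = [];
    foreach($entries as $entry) {
        // The content is entity-encoded HTML, so decode it before matching
        $content = html_entity_decode($entry['content']);
        // [^"]* stops at the closing quote, so the match cannot run past the href
        preg_match_all('/<span><a href="([^"]*)">\[link\]/', $content, $matches);
        if(!empty($matches[1][0])) {
            $videos[] = array(
                'entry_title' => $entry['title'],
                // Strip the leading "/u/" style prefix from the author name
                'author' => preg_replace('/\/(.*)\//', '', $entry['author']['name']),
                'author_reddit_url' => $entry['author']['uri'],
                'video_url' => $matches[1][0]
            );
        }
    }

    return $videos;
}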

$xml = simplexml_load_file('reddit.xml');
$xml = json_decode(json_encode($xml), true);
$videos = extract_youtube_link($xml);

foreach($videos as $video) {
    echo "<p>Entry Title: {$video['entry_title']}</p>";
    echo "<p>Author: {$video['author']}</p>";
    echo "<p>Author URL: {$video['author_reddit_url']}</p>";
    echo "<p>Video URL: {$video['video_url']}</p>";
    echo "<br><br>";
}

The function returns an array of arrays, one per entry, each with the keys entry_title, author, author_reddit_url and video_url. Hope it helps you!
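Because the external link can point to any site, one alternative to listing site names is to drop the reddit.com links and keep whatever remains. The sketch below is only an illustration under assumptions: the `$hrefs` list and the `isExternalLink` helper are hypothetical, and it assumes every non-reddit.com href in an entry is the video link. It also shows picking a random link from a returned array rather than from a global.

```php
<?php
// Hypothetical list of hrefs, as they might be collected from the feed entries.
$hrefs = [
    'https://www.reddit.com/user/someone',
    'https://www.youtube.com/watch?v=nZC4mXaosxM',
    'https://www.reddit.com/r/videos/comments/abc/',
    'https://vimeo.com/123456',
];

// Hypothetical helper: true when the URL is not hosted on reddit.com.
// Assumption: anything outside reddit.com is the external video link.
function isExternalLink(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // relative or malformed URL
    }
    $host = strtolower($host);

    return $host !== 'reddit.com' && substr($host, -11) !== '.reddit.com';
}

// Keep only the external links and reindex the array from 0.
$videoLinks = array_values(array_filter($hrefs, 'isExternalLink'));

// Pick one link at random; null when nothing was found.
$randomLink = $videoLinks ? $videoLinks[array_rand($videoLinks)] : null;

echo $randomLink, PHP_EOL;
```

Returning the array from the function and calling `array_rand` on the result avoids the need for a global variable.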


9 Comments

This is awesome, but unfortunately not all links are from youtube, just most of them. Was just referencing youtube as an example that I want those links, not the reddit ones.
You want all the video links apart from YouTube? Can you be a bit more specific?
I've achieved what I am trying to do with the links now. Though, if I could make it a bit cleaner, I'd be up for seeing it. Basically, the feed gives a list of table blocks with a few links in them. Two of them link to reddit itself, the other is the link to the actual video on some external site (be it youtube, vimeo, etc.) - I want the video link/hrefs only.
So, you want this external video link from all the content tags? It can be YouTube, Vimeo or any other site? And do all those content tags have video links, or are some missing?
Yeah. That's right. Never use regex to parse through HTML. I used regex in the code since there is not much HTML in the content tag in the feed and it is easy to parse. Never opt for regex for big HTML code.
0

If you're looking for a specific element, you don't need to parse the whole thing. One way to do it is to use the DOMXPath class and query the XML directly. The documentation should guide you through:

http://php.net/manual/es/class.domxpath.php
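A minimal sketch of the DOMXPath idea, under some assumptions: the inline feed is a hypothetical stand-in for the Reddit Atom feed, and `extractVideoLinks` is an illustrative name, not something from the question. Since the HTML sits as an escaped string inside each content element, the sketch loads that string into its own DOMDocument before querying it, and selects the anchor by its "[link]" text because nothing has an ID or class.

```php
<?php
// Hypothetical sample standing in for the Reddit Atom feed.
$feed = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>First video</title>
    <content type="html">&lt;table&gt;&lt;tr&gt;&lt;td&gt; submitted by &lt;a href="https://www.reddit.com/user/someone"&gt; /u/someone &lt;/a&gt; &lt;br/&gt; &lt;span&gt;&lt;a href="https://www.youtube.com/watch?v=nZC4mXaosxM"&gt;[link]&lt;/a&gt;&lt;/span&gt; &lt;span&gt;&lt;a href="https://www.reddit.com/r/videos/comments/abc/"&gt;[comments]&lt;/a&gt;&lt;/span&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</content>
  </entry>
  <entry>
    <title>Second video</title>
    <content type="html">&lt;table&gt;&lt;tr&gt;&lt;td&gt; &lt;span&gt;&lt;a href="https://vimeo.com/123456"&gt;[link]&lt;/a&gt;&lt;/span&gt; &lt;span&gt;&lt;a href="https://www.reddit.com/r/videos/comments/def/"&gt;[comments]&lt;/a&gt;&lt;/span&gt; &lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</content>
  </entry>
</feed>
XML;

// Illustrative helper: collects the href of every anchor labelled "[link]".
function extractVideoLinks(SimpleXMLElement $xml): array
{
    $links = [];
    foreach ($xml->entry as $entry) {
        // SimpleXML has already unescaped the content into an HTML string.
        $html = (string) $entry->content;
        if ($html === '') {
            continue;
        }

        // Feed fragments are not full HTML documents, so collect parser
        // warnings internally instead of printing them.
        $previous = libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        // Select the href attribute of the anchor whose text is "[link]".
        $xpath = new DOMXPath($dom);
        foreach ($xpath->query('//span/a[normalize-space(.) = "[link]"]/@href') as $attr) {
            $links[] = $attr->nodeValue;
        }
    }

    return $links;
}

$links = extractVideoLinks(simplexml_load_string($feed));
foreach ($links as $link) {
    echo $link, PHP_EOL;
}
```

Matching on the anchor text keeps the query independent of the exact table layout, which avoids fixed paths such as `/html/body/table[1]/tbody/...` that break when the markup shifts.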
