I am trying to scrape a webpage in JavaScript which looks as follows:

[Screenshot of the repository list page; image not available]

The code shown is part of a larger loop that goes through each repo and scrapes its contents. I've confirmed that I can capture the first element of every repo item on the page (so the javascript of "33-js-concepts", the react of "playground", the react of "react-google-static", etc.) and can scrape all the items in the first repo (so javascript, concepts, nodejs, react, angular, etc.), but I keep getting this error on subsequent loops. Here is my code:

r.topic = []; // topics used in the repo:
var topics = $('.topics-row-container > a', parent);
if (topics && topics.length > 0) {
  for (var i in topics) {
    r.topic.push(topics[i].children[0].data.replace(/^\s+|\s+$/g, ''));
  }
  console.log(r.topic);
}

The first loop produces the expected result, with console.log(r.topic) printing:

[
  'javascript',
  'concepts',
  'nodejs',
  'react',
  'angular',
  'programming',
  'javascript-programming'
]

But subsequent loops produce the following error:

r.topic.push(topics[i].children[0].data.replace(/^\s+|\s+$/g, ''));
                                       ^
TypeError: Cannot read property '0' of undefined

I'm new to JavaScript, so I'm probably missing something obvious, but I can't understand why children would be throwing this error. I even tried incrementing the children index by one with each loop, but I still saw the same error.

I would really appreciate any help!

UPDATE: topics printed to the console looks as follows:

children: [ [Node] ],
    parent: Node {
      type: 'tag',
      name: 'div',
      namespace: 'http://www.w3.org/1999/xhtml',
      attribs: [Object: null prototype],
      'x-attribsNamespace': [Object: null prototype],
      'x-attribsPrefix': [Object: null prototype],
      children: [Array],
      parent: [Node],
      prev: [Node],
      next: [Node]
    },
    prev: Node {
      type: 'text',
      data: '\n          ',
      parent: [Node],
      prev: [Node],
      next: [Circular *7]
    },
    next: Node {
      type: 'text',
      data: '\n      ',
      parent: [Node],
      prev: [Circular *7],
      next: null
    }
  },
  options: { xml: false, decodeEntities: true },
  _root: <ref *8> initialize {
    '0': Node {
      type: 'root',
      name: 'root',
      parent: null,
      prev: null,
      next: null,
      children: [Array],
      'x-mode': 'no-quirks'
    },
5 Comments

  • But in this particular case, there is no need to scrape anything. GitHub has a public API, with cross-origin requests enabled, so you can use fetch and a few lines to obtain the result you want. Commented Mar 20, 2021 at 3:26
  • Yo @Erin, were you able to resolve this? Commented Mar 21, 2021 at 3:50
  • @fortuneee No, I'm getting a reference error stating: "ReferenceError: document is not defined" and I can't figure out how to define the document. The parent node is defined like this: var parent = '.source:nth-child(' + i +') '; Commented Mar 21, 2021 at 14:11
  • @Erin what's the link to the page you want to scrape? Commented Mar 29, 2021 at 12:28
  • @John this is one such page: github.com/leonardomso?tab=repositories . Any repo page with the "?tab=repositories" extension is what I'm looking at Commented Mar 29, 2021 at 13:13

4 Answers 4

1
+50

If you just need the info now and this isn't part of a larger site that will do this routinely, you could just:

if (topics[i] && topics[i].children && 
    topics[i].children[0] && topics[i].children[0].data)
    r.topic.push(topics[i].children[0].data.replace(/^\s+|\s+$/g, ''));

Some element is not being found. If you want to find out what is actually happening, so that this works in all cases, you could:

r.topic = []; // topics used in the repo:
var topics = $('.topics-row-container > a', parent);
try {
    if(topics && topics.length > 0) {
        for (var i in topics) {
            r.topic.push(topics[i].children[0].data.replace(/^\s+|\s+$/g, ''));
        }
        console.log(r.topic);
    }
} catch(error) {
    console.log(error, topics);
}  

Then when it fails, you can check the structure of topics and see where it failed, so you can extend your loop to handle that specific case. I could make a working example if you can provide the site you're running this against, or the contents of the topics variable both when it succeeds and when it fails.

If you decide to share this info with us, please don't post it on the question. Use pastebin.com or something.


1 Comment

@Nelson thank you so much! That was exactly it.
0

This $('.topics-row-container > a', parent); most likely doesn't return a plain array of those elements, so for/in ends up iterating over the keys of an object rather than the items of an array.

You need a way to return an array of all these 👉 '.topics-row-container > a' elements.
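To illustrate the for/in point, here is a minimal sketch that runs without any scraping library. fakeSelection is a hypothetical plain object standing in for what $(...) returns; its extra keys (length, options) are an assumption modelled on the console output in the question, not the library's exact layout.

```javascript
// Hypothetical stand-in for the selection object returned by $(...).
// The non-numeric keys are assumptions based on the question's console output.
const fakeSelection = {
  0: { children: [{ data: ' javascript ' }] },
  1: { children: [{ data: ' react ' }] },
  length: 2,
  options: { xml: false },
};

// for...in visits every enumerable key, not only the numeric indexes.
const visitedKeys = [];
for (const key in fakeSelection) {
  visitedKeys.push(key);
}
console.log(visitedKeys); // [ '0', '1', 'length', 'options' ]

// fakeSelection['length'] is the number 2, which has no .children,
// so reading .children[0] on it would throw the TypeError from the question.

// An index-based loop bounded by length only touches the element nodes.
const topicNames = [];
for (let i = 0; i < fakeSelection.length; i++) {
  topicNames.push(fakeSelection[i].children[0].data.trim());
}
console.log(topicNames); // [ 'javascript', 'react' ]
```

If the real object behaves the same way, swapping for/in for an index-based loop should be enough to avoid the non-element keys.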

You can use document.querySelectorAll() for that.

So, technically, this line:

var topics = $('.topics-row-container > a', parent);

could look something like:

var topics = parent.querySelectorAll('.topics-row-container > a');

2 Comments

Right! $('.topics-row-container > a', parent); is not returning an array, that's why I have: topics[i].children[0].data.replace(/^\s+|\s+$/g, ''). <- this does return an array on the first loop, but not subsequent loops. I tried your suggested answer and got "TypeError: parent.querySelectorAll is not a function" :( Thank you so much for your input though! @fortunee
Try document.querySelectorAll()
0

Basic example: querying the GitHub API for repositories matching the keyword javascript.

fetch('https://api.github.com/search/repositories?q=javascript')
  .then(v => v.json()).then((v) => {
     console.log(v)
  }
)

1 Comment

I didn't want to use the public api as it has a few limitations I was hoping to bypass. But thank you so much for your input! @NVRM
0

I don't have the code or a link to the GitHub page needed to reproduce the error. Judging by the error message alone, it seems that .children is undefined (because the node has no children?).

How about just skipping these nodes?

r.topic = []; // topics used in the repo:
const topics = $('.topics-row-container > a', parent);
if(topics && topics.length > 0) {
  for (const topic of topics) {
    if(!topic.children) {
      // you could `console.log(topic)` here to debug why `.children` is undefined
      continue
    }
    const [firstChild] = topic.children
    r.topic.push(firstChild.data.replace(/^\s+|\s+$/g, ''));
  }
  console.log(r.topic);
}
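The same skip-if-missing idea can be written with optional chaining. This is only a sketch: firstChildText and collectTopics are hypothetical helper names, the sample nodes are made up to mimic the children[0].data shape used above, and optional chaining assumes a reasonably recent runtime (Node 14 or later).

```javascript
// Hypothetical helper: trimmed text of a node's first child,
// or null when the node has no usable text child.
function firstChildText(node) {
  const data = node?.children?.[0]?.data;
  return typeof data === 'string' ? data.trim() : null;
}

// Hypothetical helper: collects non-empty topic names from an array of nodes.
function collectTopics(nodes) {
  return nodes
    .map(firstChildText)
    .filter((text) => text !== null && text !== '');
}

// Made-up sample data: two valid nodes, one with no children, one empty object.
const sampleNodes = [
  { children: [{ data: '\n  nodejs\n ' }] },
  { children: [] },
  {},
  { children: [{ data: ' angular ' }] },
];
console.log(collectTopics(sampleNodes)); // [ 'nodejs', 'angular' ]
```

collectTopics expects a real array, so an array-like selection would first need converting, for example with Array.from.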
