4

I am trying to parse a web page that contains some JavaScript. So far I have been using Jsoup to parse the HTML in Java, which works as expected, but I am unable to parse the JavaScript. Below is a snippet of the HTML page:

    <script type="text/javascript">
    var element = document.createElement("input");
    element.setAttribute("type", "hidden");
    element.setAttribute("value", "");
    element.setAttribute("name", "AzPwXPs");
    element.setAttribute("id", "AzPwXPs");
    var foo = document.getElementById("dnipb");
    foo.appendChild(element);
    var element1 = document.createElement("input");
    element1.setAttribute("type", "hidden");
    element1.setAttribute("value", "6D6AB8AECC9B28235F1DE39D879537E1");
    element1.setAttribute("name", "ZLZWNK");
    element1.setAttribute("id", "ZLZWNK");
    foo.appendChild(element1);
    </script>

I want to read both values along with their name/id, so that after parsing I get the following results:

AzPwXPs=
ZLZWNK=6D6AB8AECC9B28235F1DE39D879537E1

How can I parse the script in this situation?

3
  • Jsoup only parses HTML. It cannot parse or run JS. Commented May 1, 2013 at 10:47
  • @nhahtdh: Yeah, I know that. That is why I am stuck... :( But there must be some other way around. Commented May 1, 2013 at 10:49
  • Run it through a JS parser? Or get a JS engine? (I actually have the same problem on a side project, but I never got my head around it...) Commented May 1, 2013 at 10:52

5 Answers

6

I have stumbled upon this question a few times while searching for a way to parse pages that use JavaScript, but the solutions provided are not perfect. I found a pure Java solution: use JBrowserDriver to load the page and JSoup to parse the JavaScript-manipulated result.

Simple example:

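    import com.machinepublishers.jbrowserdriver.JBrowserDriver;
    import com.machinepublishers.jbrowserdriver.Settings;
    import com.machinepublishers.jbrowserdriver.Timezone;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    // JBrowserDriver part: load the page and let its JavaScript run
    JBrowserDriver driver = new JBrowserDriver(
            Settings.builder().timezone(Timezone.EUROPE_ATHENS).build());
    try {
        driver.get(FETCH_URL); // FETCH_URL: the address of the page to fetch
        String loadedPage = driver.getPageSource();

        // JSoup parsing part: query the rendered HTML
        Document document = Jsoup.parse(loadedPage);
        Elements elements = document.select("#nav-console span.data");

        log.info("Found element count: {}", elements.size());
    } finally {
        driver.quit(); // always release the browser process
    }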

1 Comment

Works perfectly
2

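I once had a similar situation, where I needed to find URLs in CSS files.

Put the JavaScript in a string and apply regular expressions:

    // Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
    // Matches url(...) with optional single or double quotes around the URL
    Pattern p = Pattern.compile("url\\(\\s*(['\"]?+)(.*?)\\1\\s*\\)");
    Matcher m = p.matcher(content); // content holds the text to search
    while (m.find()) {
        String urlFound = m.group(); // whole match; m.group(2) is the URL alone
    }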

Regards, Hugo Pedrosa

1 Comment

I understand the logic, but how can this be modified to serve my purpose?
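
One way the regular-expression approach might be adapted to the script in the question is sketched below. It is only a sketch under stated assumptions: the class name `HiddenInputExtractor` and the method `extract` are hypothetical, the script text is assumed to have already been obtained as a string, and every `setAttribute` call is assumed to take two plain string literals, as in the snippet above. Nothing is executed; the code only pattern-matches the source text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class HiddenInputExtractor {

    // Matches calls such as: element.setAttribute("name", "AzPwXPs");
    // group 1 = JS variable, group 3 = attribute name, group 5 = attribute value
    private static final Pattern SET_ATTRIBUTE = Pattern.compile(
            "(\\w+)\\.setAttribute\\(\\s*([\"'])(.*?)\\2\\s*,\\s*([\"'])(.*?)\\4\\s*\\)");

    static Map<String, String> extract(String script) {
        // Collect the attributes of each JavaScript variable, in source order.
        Map<String, Map<String, String>> attributesByVariable = new LinkedHashMap<>();
        Matcher m = SET_ATTRIBUTE.matcher(script);
        while (m.find()) {
            attributesByVariable
                    .computeIfAbsent(m.group(1), k -> new LinkedHashMap<>())
                    .put(m.group(3), m.group(5));
        }

        // Reduce each variable's attributes to a single name -> value entry.
        Map<String, String> result = new LinkedHashMap<>();
        for (Map<String, String> attributes : attributesByVariable.values()) {
            String name = attributes.get("name");
            if (name != null) {
                result.put(name, attributes.getOrDefault("value", ""));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Script text taken from the question.
        String script = String.join("\n",
                "var element = document.createElement(\"input\");",
                "element.setAttribute(\"type\", \"hidden\");",
                "element.setAttribute(\"value\", \"\");",
                "element.setAttribute(\"name\", \"AzPwXPs\");",
                "element.setAttribute(\"id\", \"AzPwXPs\");",
                "var element1 = document.createElement(\"input\");",
                "element1.setAttribute(\"type\", \"hidden\");",
                "element1.setAttribute(\"value\", \"6D6AB8AECC9B28235F1DE39D879537E1\");",
                "element1.setAttribute(\"name\", \"ZLZWNK\");",
                "element1.setAttribute(\"id\", \"ZLZWNK\");");

        // Prints one name=value line per created input element.
        extract(script).forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

Because this matches text rather than running the script, it would miss values that are computed at runtime (string concatenation, function calls, variables passed as arguments). For pages like that, a real browser engine, as in the other answers, is probably the safer route.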
1

Selenium's WebDriver is fantastic: http://docs.seleniumhq.org/docs/03_webdriver.jsp

See this answer for an example of what you are trying to do: Using Selenium Web Driver to retrieve value of a HTML input

4 Comments

What do you mean by "jQuery is being used"? Can you explain a little more, please?
The JavaScript return $('#AzPwXPs')[0] uses a jQuery selector ($('#AzPwXPs')) to find the element.
So my Java parser project will end up using Selenium's WebDriver with jQuery. Is that right?
No, you don't need to use jQuery. I'm just going to link to an answer that shows you a better example.
1

You can try using a query library. It's much easier with one.

1 Comment

Can you please suggest any one of them? Is it available in Java?
1

Once you've got the text content of the <script> element from JSoup, you can parse the JS using the Caja JS parser and then walk the parse tree to find what you're looking for.

2 Comments

How do I get the <script> element from JSoup?
@Ravi, if you're doing something like Document doc = Jsoup.parse(...), then doc.getElementsByTag("script").first() should get you the first script in the page.
