
I need to extract some data from a website and then save some values in variables.

Here is the code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class Principal {

    public static void main(String[] args) throws IOException {

        URL url = new URL("http://www.numbeo.com/cost-of-living/country_result.jsp?country=Turkey");
        URLConnection yc = url.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
        String inputLine;
        String valor;
        String str = null;

        while ((inputLine = in.readLine()) != null) {
            if (inputLine.contains("Milk")) {
                // "Encontrei!" means "Found it!"
                System.out.println("Encontrei! " + inputLine);
                // Keep everything after the "priceValue"> marker and the space that follows it
                valor = inputLine.substring(inputLine.lastIndexOf("\"priceValue\">") + 14);
                System.out.println("valor:" + valor);
            }
        }
        in.close();
    }
}

The first println prints this: <tr class="tr_standard"><td>Milk (regular), (1 liter) </td> <td style="text-align: right" class="priceValue"> 2.45&nbsp;TL</td>

Now I need to extract just the "2.45". How can I do that? I have already tried some regexes but couldn't make them work. Sorry for my English. Thanks in advance.

  • What regex have you tried? Commented Nov 17, 2015 at 22:15
  • The best I've got was with ("\\D+", ""), but it removes the dot. Commented Nov 17, 2015 at 22:19
  • I know this is not what you are asking, but it seems your application could benefit a lot from using an actual XML parser. Commented Nov 17, 2015 at 22:26

2 Answers

You can try the following regex:

(?:class="priceValue">\s*)(\d*\.\d+)

It looks for the string class="priceValue"> followed by optional whitespace and a price, and captures the price in group 1.

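As a minimal sketch of how this regex might be applied from Java, the snippet below uses `Pattern` and `Matcher` and reads capture group 1. The class name `PriceRegexSketch` and the helper `extractPrice` are hypothetical names introduced for illustration, and the input is the sample line quoted in the question rather than a live page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PriceRegexSketch {
    // The regex from this answer, escaped for a Java string literal.
    private static final Pattern PRICE =
            Pattern.compile("(?:class=\"priceValue\">\\s*)(\\d*\\.\\d+)");

    /** Hypothetical helper: returns the captured price text, or null if nothing matches. */
    static String extractPrice(String line) {
        Matcher m = PRICE.matcher(line);
        // find() searches inside the line; matches() would require the whole line to match.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample line copied from the question.
        String line = "<tr class=\"tr_standard\"><td>Milk (regular), (1 liter) </td> "
                + "<td style=\"text-align: right\" class=\"priceValue\"> 2.45&nbsp;TL</td>";
        System.out.println("valor: " + extractPrice(line)); // prints "valor: 2.45"
    }
}
```

`replaceAll` is not a good fit here because it deletes the matched text and keeps the rest, which is the opposite of what is wanted; `String.matches` only succeeds when the entire string matches the pattern.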


4 Comments

Hi, thanks! I tried it like this: `str = valor.replaceAll("(?:class=\"priceValue\">\\s+)([\\d.]+)",""); System.out.println("valor:" +str);` But the println shows: valor:2.45&nbsp;TL</td>
You should use a Matcher.
Like this? `valor.matches("(?:class=\"priceValue\">\\s+)([\\d.]+)");`
I would discourage using regex for HTML parsing, even in such a simple case. It's usually much more complicated than it looks, and just an anti-pattern. What happens if the class attribute is not the last one in an element? It would still be valid, but this solution would not work - see a test regex with this condition, and no matching result. For a good (and humorous) reference, please see [this question about using regex to parse [X]HTML](stackoverflow.com/a/1732454/1663942). I would recommend @JockX's answer.
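The attribute-order concern can be illustrated with a small sketch. It runs the regex from the answer against two hand-written `<td>` fragments that differ only in attribute order; both fragments, the class name `AttributeOrderSketch` and the helper `hasPrice` are assumptions made for this example.

```java
import java.util.regex.Pattern;

class AttributeOrderSketch {
    // Same regex as in the answer; it expects '>' directly after the class attribute.
    private static final Pattern PRICE =
            Pattern.compile("(?:class=\"priceValue\">\\s*)(\\d*\\.\\d+)");

    /** Hypothetical helper: true if the regex finds a price in the fragment. */
    static boolean hasPrice(String html) {
        return PRICE.matcher(html).find();
    }

    public static void main(String[] args) {
        // class is the last attribute, as in the question's sample line.
        String classLast = "<td style=\"text-align: right\" class=\"priceValue\"> 2.45&nbsp;TL</td>";
        // Equally valid HTML, but class now comes before style.
        String classFirst = "<td class=\"priceValue\" style=\"text-align: right\"> 2.45&nbsp;TL</td>";

        System.out.println("class last:  " + hasPrice(classLast));  // true
        System.out.println("class first: " + hasPrice(classFirst)); // false
    }
}
```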

I know you are asking for a regex, but consider making your life easier by parsing the HTML as the structured document it is, rather than as a plain string. There are libraries that handle this for you, so you don't have to worry about text formatting, legal line breaks and other such details:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.1</version>
</dependency>

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class HtmlParser {
    public static void main(String[] args) {

        Document doc;
        try {
            doc = Jsoup.connect("http://www.numbeo.com/cost-of-living/country_result.jsp?country=Turkey").get();
            Elements rows = doc.select("table.data_wide_table tr.tr_standard"); // CSS selector to find all table rows
            for (Element row : rows) {
                System.out.println("Item name: " + row.child(0).text()); // Milk will be here somewhere
                System.out.println("  Item price by column number: " + row.child(1).text());
                System.out.println("  Item price by column class:  " + row.getElementsByAttributeValue("class", "priceValue").get(0).text());
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

/**
 Output:
 Item name: Meal, Inexpensive Restaurant
   Item price by column number: 15.00 TL
   Item price by column class: 15.00 TL
 Item name: McMeal at McDonalds (or Equivalent Combo Meal)
   Item price by column number: 15.00 TL
   Item price by column class: 15.00 TL
...
*/
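The question asks for the bare number, so one more step is needed after reading the cell text. The sketch below is one possible way to do it with only the standard library: it pulls the first decimal number out of a string such as "2.45 TL". The class name `PriceTextSketch` and the helper `parsePrice` are hypothetical, and the code assumes prices use a dot as the decimal separator with no thousands separators.

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PriceTextSketch {
    // One or more digits, optionally followed by a dot and more digits.
    private static final Pattern NUMBER = Pattern.compile("\\d+(?:\\.\\d+)?");

    /** Hypothetical helper: returns the first number in the text, or null if there is none. */
    static BigDecimal parsePrice(String text) {
        Matcher m = NUMBER.matcher(text);
        return m.find() ? new BigDecimal(m.group()) : null;
    }

    public static void main(String[] args) {
        // Works whether the separator before "TL" is a normal or a non-breaking space.
        System.out.println(parsePrice("2.45 TL"));       // 2.45
        System.out.println(parsePrice("15.00\u00a0TL")); // 15.00
    }
}
```

In the loop above, the argument would be the text of the price cell, for example `row.child(1).text()`.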
