I need to get the value between the double quotes (") of an href attribute when it matches a specific pattern. I tried the pattern below, but I can't figure out what's wrong: when the pattern occurs multiple times on the same line, I get one huge group containing information that I don't want:

href="(/namehere/nane2here/(option1|option2).*)"

I need the group between the outer parentheses. The pattern repeats many times in the string, and all occurrences are on the same line.

Example of a string I'm trying to get the values from:

<div>adasdsda<div>...lots of tags here... <a ... href="/name/name/option1/data1/data2"...anything here ...">src</a>...others HTML text here...<a ... href="/name/name/option2/data1"...

5 Comments

  • [..] is a character set, which matches only a single character from those listed inside it. For instance, [abc] matches a single a, b, or c, not abc. So instead of the character set [option1|option2], you were probably looking for a group like (option1|option2). Commented Jun 18, 2020 at 22:01
  • @Pshemo, I tried it, but it didn't solve my problem. When the pattern occurs multiple times on the same line, I get a huge group with information that I don't want. Commented Jun 18, 2020 at 22:03
  • Change .* to [^\"]*. Commented Jun 18, 2020 at 22:14
  • Isn't it sufficient to capture every href (i.e. href=".+?", maybe even capturing the URL in a group) and then filter for the ones you're looking for? That's three steps: pluck the URLs, filter the URLs, then do your thing. Commented Jun 18, 2020 at 22:16
  • Thanks, @saka1029, it seems to have solved my issue. Commented Jun 18, 2020 at 22:18
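
The change suggested in the comments can be sketched in Java as follows. This is a minimal illustration, not the asker's actual code: the class name HrefExtractor, the method extract, and the sample HTML are made up for the example, and the path prefix /name/name/ is taken from the example string in the question rather than from the pattern. The greedy .* runs to the last double quote on the line, while [^"]* stops at the first one, so each href is captured separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // [^"]* cannot cross a double quote, so every match ends at the
    // closing quote of its own href instead of the last quote on the line.
    static final Pattern HREF = Pattern.compile(
            "href=\"(/name/name/(?:option1|option2)[^\"]*)\"");

    // Collects group 1 (the href value) of every match in the input.
    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical single-line input with two matching links.
        String html = "<div><a class=\"x\" href=\"/name/name/option1/data1/data2\" title=\"t\">src</a>"
                + "<a href=\"/name/name/option2/data1\">other</a></div>";
        // prints [/name/name/option1/data1/data2, /name/name/option2/data1]
        System.out.println(extract(html));
    }
}
```

Using a non-capturing group (?:...) for the alternation keeps the href value as the only capturing group, so group(1) is all that needs to be read.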

3 Answers

1

First of all, don't use regex on an entire HTML structure. To learn why, visit:

Instead, try parsing the HTML into an object representing the DOM, which lets us easily traverse all elements and find the ones we are interested in.

One of the (IMO) easiest-to-use HTML parsers is jsoup (https://jsoup.org/). A big plus is its support for CSS selector syntax for finding elements. It is described at https://jsoup.org/cookbook/extracting-data/selector-syntax where we can find

[attr~=regex]: elements with attribute values that match the regular expression; e.g.
img[src~=(?i)\.(png|jpe?g)]

In short, [attr~=regex] lets us find any element whose value for the specified attribute is matched, even partially, by the regex.

With this your code can look something like:

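import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String yourHTML =
        "<div>" +
        "   <a href='abc/def/1'>foo</a>" +
        "   <a href='abc/fed/2'>bar</a>" +
        "   <a href='abc/ghi/3'>bam</a>" +
        "</div>";
// Parse the HTML string into a DOM
Document doc = Jsoup.parse(yourHTML);
// Select <a> elements whose href contains a match for the regex
Elements elementsWithHref = doc.select("a[href~=^abc/(def|fed)]");
for (Element element : elementsWithHref){
    String href = element.attr("href");
    System.out.println(href);
}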

Output:

abc/def/1
abc/fed/2

(notice that there is no abc/ghi/3 since ^abc/(def|fed) can't be found in it)

2 Comments

The problem is that I can't use an HTML parser, eheheheh :). I believe I could build my own, but I can't do it in the time I have. Thank you very much for the justification.
@Wally Well, to be honest, regex and HTML can work OK for simple HTML documents whose structure is always the same (or at least you know it very well and can handle its traps). But generally it is a safer option to use an HTML parser.
0

Try "(?si)<[\\w:]+(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\\s)href\\s*=\\s*(?:(['\"])\\s*((?:(?!\\1).)*?/namehere/nane2here/(?:option1|option2)(?:(?!\\1).)*)\\s*\\1))\\s+(?:\".*?\"|'.*?'|[^>]*?)+>"

demo

Features:

  • finds specific href value contained in any tag
  • group 1 contains delimiter
  • group 2 contains the href value

1 Comment

This just uses regex to operate on tags in general, matching any that have href="value" inside. The regex is proven effective and is built from the tag definitions in standard HTML.
0

\b matches a word boundary, so option1 will not match inside a longer word such as option10:

href="(/namehere/nane2here/\\b(option1|option2)\\b[^"]*)"
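
To see what the word boundary adds, here is a small sketch using java.util.regex. The class name WordBoundaryDemo, the helper found, and the sample paths are invented for illustration; only the option1/option2 alternation comes from the question.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Without \b the alternation also matches the start of "option10".
    static final Pattern WITHOUT_BOUNDARY = Pattern.compile("/(?:option1|option2)");
    // With \b the option name must be followed by a non-word character
    // (such as '/' or '"') or by the end of the input.
    static final Pattern WITH_BOUNDARY = Pattern.compile("/(?:option1|option2)\\b");

    // True if the pattern matches anywhere in the input.
    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String exact = "/name/name/option1/data1";   // hypothetical path
        String longer = "/name/name/option10/data1"; // hypothetical path
        System.out.println(found(WITHOUT_BOUNDARY, longer)); // true
        System.out.println(found(WITH_BOUNDARY, longer));    // false
        System.out.println(found(WITH_BOUNDARY, exact));     // true
    }
}
```

Note that \b by itself does not address the greedy .* problem from the question; it only prevents partial matches on the option name.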
