2

I want to parse a URL's domain (without 'www') with a regex and return it. There are many examples of this on Stack Overflow, but they either do not handle all of the cases below or include unnecessary features. My cases are:

http://www.google.co.uk      pass
http://www.google.co.uk      pass
http://google.com.co.uk      pass
same for https               pass
google.co.uk                 pass
www.google.co.uk             pass

All of them must return only the domain part, google.co.uk. There is no need to handle links like 101.34.24.. or ones starting with ftp, etc. The only allowed input formats are those above. I validate the URL with the regex ^(https?:\/\/)?(www\.)?([\w]+\.)+[\w]{2,63}\/?$ and it works well, but I do not know how to extract the domain with it.

Note: please do not recommend the URI or URL classes and their methods for extracting the domain automatically, such as:

private String parseUrl(String url) throws URISyntaxException {
        if (url.startsWith("http:/")) {
            if (!url.contains("http://")) {
                url = url.replaceAll("http:/", "http://");
            }
        } else if (url.startsWith("https:/")) {
            url = url.replaceAll("https:/", "http:/");
        } else {
            url = "http://" + url;
        }
        URI uri = new URI(url);
        String domain = uri.getHost();
        return domain.startsWith("www.") ? domain.substring(4) : domain;
    }

This code works perfectly as well, but I need a regex, not this approach.

2
  • 2
    Just curious: what is the use case where a regex is better for you than the correct way, using a URI/URL parser? Commented Dec 14, 2018 at 17:36
  • 3
    Just a brainless boss. Is that a valid reason? :) Commented Dec 14, 2018 at 17:45

2 Answers

3

Your regex,

^(https?:\/\/)?(www\.)?([\w]+\.)+[\w]{2,63}\/?$

matches the input but does not capture the intended domain in a single group: the repeated group ([\w]+\.)+ only keeps its last iteration, and the final label is not inside any group. You can simplify it like this,

^(?:https?:\/\/)?(?:www\.)?((?:[\w]+\.)+\w+)

which captures the intended domain in group 1.


Here is sample Java code that extracts and prints the domain name,

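import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // Scheme and "www." are non-capturing; group 1 holds the domain
        Pattern p = Pattern.compile("^(?:https?:\\/\\/)?(?:www\\.)?((?:[\\w]+\\.)+\\w+)");
        List<String> list = Arrays.asList("http://www.google.co.uk", "http://www.google.co.uk",
                "http://google.com.co.uk", "https://www.google.co.uk", "https://www.google.co.uk",
                "https://google.com.co.uk");

        list.forEach(x -> {
            Matcher m = p.matcher(x);
            // group(1) is only readable after matches() has returned true
            if (m.matches()) {
                System.out.println(x + " --> " + m.group(1));
            }
        });
    }
}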

Prints,

http://www.google.co.uk --> google.co.uk
http://www.google.co.uk --> google.co.uk
http://google.com.co.uk --> google.com.co.uk
https://www.google.co.uk --> google.co.uk
https://www.google.co.uk --> google.co.uk
https://google.com.co.uk --> google.com.co.uk
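
The sample list does not include the bare and www-only inputs from the question, and the simplified pattern drops the optional trailing slash, so matches() would reject an input ending in /. The following is only a sketch of one way to cover those cases; the class name DomainExtractor, the method extractDomain, and the choice to return an Optional are assumptions for illustration, not part of the original answer.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DomainExtractor {
    // Same idea as the pattern above, with the optional trailing slash
    // and the end anchor from the question's regex added back.
    private static final Pattern DOMAIN =
            Pattern.compile("^(?:https?://)?(?:www\\.)?((?:\\w+\\.)+\\w+)/?$");

    // Hypothetical helper: returns the captured domain, or empty if the input does not match.
    static Optional<String> extractDomain(String url) {
        Matcher m = DOMAIN.matcher(url);
        // Calling group(1) without a successful matches()/find() throws
        // IllegalStateException ("No match found"), so check first.
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] inputs = {"google.co.uk", "www.google.co.uk", "https://www.google.co.uk/"};
        for (String s : inputs) {
            System.out.println(s + " --> " + extractDomain(s).orElse("(no match)"));
        }
    }
}
```

All three inputs in main yield google.co.uk. Returning an Optional keeps the "no match" case explicit instead of relying on the caller to remember the matches() check.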

5 Comments

Thanks, but when I try it no error is shown; when I debug, it goes to 'InvocationTargetException' and then 'java.lang.IllegalStateException: No match found', and it does not work :(
I have tried it as you wrote (with println) and it works and parses, but when I use it in my code m.group does not work and throws an exception.
Ah, it works now. I restarted the IDE and it started to work. IntelliJ bug.
Hey, do you have any idea about stackoverflow.com/questions/54330887/… ? I just need to parse links that include / and further parts as well.
@abidinberkay: I was just reading your new post :) Let me reply to you there.
2

The solution is to add a capturing group around that section of the URL: ^(https?://)?(www\.)?(([\w]+\.)+[\w]{2,63})/?$ would work here.

Beyond that, you just need to use a Matcher to grab the correct group (group 3 here):

private static final Pattern URL_PATTERN =
        Pattern.compile("^(https?://)?(www\\.)?(([\\w]+\\.)+[\\w]{2,63})/?$");

public static String minifyUrl(final String url) {
    final Matcher matcher = URL_PATTERN.matcher(url);
    if (matcher.find()) return matcher.group(3);
    else return url;
}
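
The group index depends on counting opening parentheses, which is easy to get wrong if the expression changes later. As a hedged alternative, the sketch below uses a named group (supported by java.util.regex since Java 7) so the domain is read by name. The class name NamedGroupExample and the group name "domain" are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupExample {
    // Same expression as above; only the domain group is given a name.
    private static final Pattern URL_PATTERN =
            Pattern.compile("^(https?://)?(www\\.)?(?<domain>(\\w+\\.)+\\w{2,63})/?$");

    static String minifyUrl(final String url) {
        final Matcher matcher = URL_PATTERN.matcher(url);
        // Fall back to the unchanged input when the pattern does not match.
        return matcher.find() ? matcher.group("domain") : url;
    }

    public static void main(String[] args) {
        System.out.println(minifyUrl("http://www.google.co.uk")); // google.co.uk
        System.out.println(minifyUrl("not a url"));               // not a url
    }
}
```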

However, I still think you would be better served by using Java's URL class :p
