Java url domain parsing with regex

Question

I want to parse a URL's domain (without 'www') with a regex and return it. There are many examples of this on Stack Overflow, but they either do not handle all of the cases below or include unnecessary features. My cases are:

http://www.google.co.uk      pass
http://www.google.co.uk      pass
http://google.com.co.uk      pass
same for https               pass
google.co.uk                 pass
www.google.co.uk             pass

All of them must return only the domain part, such as google.co.uk. There is no need to handle links like 101.34.24.. or ones starting with ftp, etc. The only allowed input formats are those above. I validate the URL with the regex ^(https?:\/\/)?(www\.)?([\w]+\.)+[\w]{2,63}\/?$ and it works well, but I do not know how to extract the domain with it.

Note: I would be happy if you do not recommend URI or URL classes and their methods for parsing domain automatically like:

private String parseUrl(String url) throws URISyntaxException {
        if (url.startsWith("http:/")) {
            if (!url.contains("http://")) {
                url = url.replaceAll("http:/", "http://");
            }
        } else if (url.startsWith("https:/")) {
            url = url.replaceAll("https:/", "http:/");
        } else {
            url = "http://" + url;
        }
        URI uri = new URI(url);
        String domain = uri.getHost();
        return domain.startsWith("www.") ? domain.substring(4) : domain;
    }

This code works perfectly as well, but I need a regex solution, not this one.

Just curious: what use case makes a regex a better fit for you than the correct approach of using a URI/URL parser?
– Garis M Suero, Commented Dec 14, 2018 at 17:36

Pushpesh Kumar Rajwanshi · Accepted Answer · 2018-12-14 17:22:32Z

3

Your regex,

^(https?:\/\/)?(www\.)?([\w]+\.)+[\w]{2,63}\/?$

matches the input but doesn't capture the intended domain in a single group: the repeated group ([\w]+\.)+ only retains its last iteration, and the final label sits outside any group. You can modify and simplify it like this,

^(?:https?:\/\/)?(?:www\.)?((?:[\w]+\.)+\w+)

which captures the intended domain in group 1.


Here is sample Java code that extracts and prints the domain name,

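import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // Scheme and "www." are matched by non-capturing groups; group 1 holds the domain.
        Pattern p = Pattern.compile("^(?:https?:\\/\\/)?(?:www\\.)?((?:[\\w]+\\.)+\\w+)");
        List<String> list = Arrays.asList("http://www.google.co.uk", "http://www.google.co.uk",
                "http://google.com.co.uk", "https://www.google.co.uk", "https://www.google.co.uk",
                "https://google.com.co.uk");

        list.forEach(x -> {
            Matcher m = p.matcher(x);
            // matches() must succeed before group(1) can be read.
            if (m.matches()) {
                System.out.println(x + " --> " + m.group(1));
            }
        });
    }
}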

Prints,

http://www.google.co.uk --> google.co.uk
http://www.google.co.uk --> google.co.uk
http://google.com.co.uk --> google.com.co.uk
https://www.google.co.uk --> google.co.uk
https://www.google.co.uk --> google.co.uk
https://google.com.co.uk --> google.com.co.uk
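
The sample above only exercises inputs that carry a scheme. As a minimal sketch (the class name DomainExtractor and the method extractDomain are hypothetical, not part of the answer), the following variant also runs the scheme-less forms from the question and tolerates one trailing slash, which matches() would otherwise reject because the answer's pattern does not end in /?$. The {2,63} bound on the last label is borrowed from the asker's validation regex.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DomainExtractor {
    // Optional scheme, optional "www.", the domain in group 1, then an optional trailing slash.
    private static final Pattern DOMAIN =
            Pattern.compile("^(?:https?://)?(?:www\\.)?((?:\\w+\\.)+\\w{2,63})/?$");

    // Returns the captured domain, or an empty Optional when the input is not in an allowed format.
    static Optional<String> extractDomain(String url) {
        Matcher m = DOMAIN.matcher(url);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "http://www.google.co.uk", "http://google.com.co.uk", "https://www.google.co.uk",
            "google.co.uk", "www.google.co.uk", "https://google.co.uk/", "ftp://google.co.uk"
        };
        for (String input : inputs) {
            System.out.println(input + " --> " + extractDomain(input).orElse("(no match)"));
        }
    }
}
```

Returning an Optional instead of throwing keeps the "not an allowed format" case explicit for the caller; the ftp:// input, for example, yields no match.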


5 Comments

abidinberkay Over a year ago

Thanks, but when I try it no error is shown; when I debug, it goes to 'InvocationTargetException' and then 'java.lang.IllegalStateException: No match found', and it does not work :(

abidinberkay Over a year ago

I tried it as you wrote it (with println) and it works and parses, but when I use it in my own code m.group does not work and throws an exception.

abidinberkay Over a year ago

Ah, it works now. I restarted the IDE and it started to work. IntelliJ bug.

abidinberkay Over a year ago

Hey, do you have any idea about stackoverflow.com/questions/54330887/… ? I just need to parse links that include / and further parts as well.

Pushpesh Kumar Rajwanshi Over a year ago

@abidinberkay: I was just reading your new post :) Let me reply to you there.
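
On the 'IllegalStateException: No match found' mentioned in the comments: whatever caused it in that particular case, a common way to hit this exception is calling group() on a Matcher before matches() or find() has succeeded. The sketch below illustrates that behaviour with the pattern from the answer; the class and method names (GroupGuardDemo, groupWithoutMatchThrows, domainOrNull) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupGuardDemo {
    private static final Pattern DOMAIN =
            Pattern.compile("^(?:https?://)?(?:www\\.)?((?:\\w+\\.)+\\w+)");

    // Reads group(1) without attempting a match first and reports whether that throws.
    static boolean groupWithoutMatchThrows(String url) {
        Matcher m = DOMAIN.matcher(url);
        try {
            m.group(1); // no matches()/find() call has been made yet
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Guarded version: group(1) is only read after matches() returned true.
    static String domainOrNull(String url) {
        Matcher m = DOMAIN.matcher(url);
        return m.matches() ? m.group(1) : null;
    }
}
```

The same exception is raised when the most recent match attempt failed, so the result of matches() or find() should always be checked before reading a group.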

jamierocks · Answer · 2018-12-14 17:22:22Z

2

The solution is to add a capturing group that covers that section of the URL; ^(https?://)?(www\.)?(([\w]+\.)+[\w]{2,63})/?$ would work here.

Beyond that, you just need to use a Matcher to grab the correct group (group 3 here):

private static final Pattern URL_PATTERN =
        Pattern.compile("^(https?://)?(www\\.)?(([\\w]+\\.)+[\\w]{2,63})/?$");

public static String minifyUrl(final String url) {
    final Matcher matcher = URL_PATTERN.matcher(url);
    // Group 3 is the domain; fall back to the input when it does not match.
    if (matcher.find()) return matcher.group(3);
    else return url;
}

However, I still think you would be better served by using Java's URL class :p
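
Counting parentheses to find "group 3" is easy to get wrong when the pattern changes. As a hedged alternative sketch (not part of the answer; the class name NamedGroupDomain and the group name domain are assumptions made for illustration), the scheme and www groups can be made non-capturing and the domain given a named group, which Java supports via (?<name>...) and Matcher.group(String):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDomain {
    // Same shape as the answer's pattern, but only the domain is captured, under the name "domain".
    private static final Pattern URL_PATTERN =
            Pattern.compile("^(?:https?://)?(?:www\\.)?(?<domain>(?:\\w+\\.)+\\w{2,63})/?$");

    // Returns the domain, or the unchanged input when it does not match (as minifyUrl does above).
    static String minifyUrl(String url) {
        Matcher matcher = URL_PATTERN.matcher(url);
        return matcher.matches() ? matcher.group("domain") : url;
    }
}
```

With a named group, adding or removing other groups later does not shift the index used to read the domain.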
