I have already looked at and tried several other threads, and none of them work for me. I need a regex-based solution, not Java code that does this without a regex.

Some of the threads I have already checked: Get domain name from given url, Extract host name/domain name from URL string, and Java regex to extract domain name? None of them work for me: either the regex doesn't work or the solution is Java code without a regex.

What am I trying to do?

Case 1:
Input: https://api.twitter.com/blog/category/2?user=42&status=enabled
Output: api.twitter.com

Input: abc.xyz.com/blog/category/2?user=42&status=enabled
Output: abc.xyz.com

Case 2:
Input: https://abc.xyz.com/blog/category/2?user=42&status=enabled
Output: xyz.com

Input: abc.xyz.com/blog/category/2?user=42&status=enabled
Output: xyz.com

I need two regexes, one for each case above. If it can be done with a single regex, that works too.

I tried the below regex from the first post:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

This one works when there is https:// or any other scheme, but fails when the URL has no scheme.
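To make the failure concrete, here is a small sketch of how that regex behaves on both kinds of input. The class and method names are made up for illustration; the pattern is the one quoted above, which is also the generic URI regex from RFC 3986, Appendix B. In that pattern group 4 is the authority (host) and group 5 is the path. Without a scheme there is no leading "//", so the host ends up inside the path group instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UriRegexDemo {
    // The regex quoted in the question (same as RFC 3986, Appendix B).
    static final Pattern URI = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Group 4 is the authority (host); null when the "//" part did not match.
    static String authority(String url) {
        Matcher m = URI.matcher(url);
        return m.find() ? m.group(4) : null;
    }

    // Group 5 is the path.
    static String path(String url) {
        Matcher m = URI.matcher(url);
        return m.find() ? m.group(5) : null;
    }

    public static void main(String[] args) {
        String withScheme = "https://api.twitter.com/blog/category/2?user=42&status=enabled";
        String noScheme = "abc.xyz.com/blog/category/2?user=42&status=enabled";

        System.out.println(authority(withScheme)); // api.twitter.com
        System.out.println(authority(noScheme));   // null
        System.out.println(path(noScheme));        // abc.xyz.com/blog/category/2
    }
}
```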

So far I am solving the first case with a two-step solution.

Step 1: Remove the scheme
(.*://)(.*) -> $2
This removes everything up to and including the string "://".

Step 2: Extract the host name
([^/]*)(.*) -> $1
The first group captures everything before the first "/", that is, every non-slash character up to the first slash.
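In Java, the two steps above could be sketched roughly like this with String.replaceFirst. The class and method names are only placeholders for illustration.

```java
class TwoStepHost {
    // Applies the two replacements from the question in order.
    static String hostName(String url) {
        // Step 1: drop everything up to and including "://", if present.
        String withoutScheme = url.replaceFirst("(.*://)(.*)", "$2");
        // Step 2: keep only the characters before the first "/".
        return withoutScheme.replaceFirst("([^/]*)(.*)", "$1");
    }

    public static void main(String[] args) {
        System.out.println(hostName("https://api.twitter.com/blog/category/2?user=42&status=enabled")); // api.twitter.com
        System.out.println(hostName("abc.xyz.com/blog/category/2?user=42&status=enabled"));             // abc.xyz.com
    }
}
```

One thing to watch for: because (.*://) is greedy and unanchored, a "://" appearing later in the URL (for example inside a query parameter) would also be swallowed by step 1.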
  • Why is the abc not part of the output in case 2 and how would a regex know when to use which? Commented Jun 8, 2021 at 5:36
  • Do you understand the regexp you quoted? (As in, enough to debug and maintain it) Commented Jun 8, 2021 at 5:38
  • For case 2, I can use a two-step solution if case 1 works for me: first extract the host name, then extract the domain name from it. Commented Jun 8, 2021 at 5:39
  • @BikasKatwal: Why does output #2 show abc.xyz.com but output #3 and #4 show only xyz.com? Commented Jun 8, 2021 at 6:30
  • @anubhava thanks! That works :) Could you add this as the answer? I will use .* instead of https, as the scheme can be anything; I forgot to mention that in the question. Commented Jun 8, 2021 at 7:02

1 Answer

You may use this regex with optional matches and capture groups:

^(?:\w+://)?((?:[^./?#]+\.)?([^/?#]+))

RegEx Details:

  • ^: Start
  • (?:\w+://)?: Optionally match scheme names followed by ://
  • (: Start capture group #1
    • (?:[^./?#]+\.)?: Optionally match first part of domain name using a non-capture group
    • ([^/?#]+): Match 1+ of any character that is not /, ?, # in capture group #2
  • ): End capture group #1
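As a rough sketch of how this could be used from Java (the class and method names below are just for illustration), capture group #1 gives the host name for case 1 and capture group #2 gives the domain name for case 2:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostExtractor {
    // The regex from this answer, with backslashes escaped for a Java string literal.
    private static final Pattern HOST =
        Pattern.compile("^(?:\\w+://)?((?:[^./?#]+\\.)?([^/?#]+))");

    // Case 1: full host name (capture group #1), or null if nothing matches.
    static String hostName(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    // Case 2: host name without its first label (capture group #2).
    static String domainName(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String a = "https://api.twitter.com/blog/category/2?user=42&status=enabled";
        String b = "abc.xyz.com/blog/category/2?user=42&status=enabled";

        System.out.println(hostName(a));   // api.twitter.com
        System.out.println(hostName(b));   // abc.xyz.com
        System.out.println(domainName(a)); // twitter.com
        System.out.println(domainName(b)); // xyz.com
    }
}
```

Note that group #2 simply drops the first dot-separated label, so it matches the examples in the question but would return com for an input such as xyz.com/blog.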