I need to create a tool that tokenizes an input list. The input is a text file containing a list of subdomains, like this:

abc.xyc.kkk.com
hjk.pol.lll.kkk.com
...

And the output needs to be:

abc
xyz
kkk
hjk
...

The delimiter is '.'

I have tried out the following code, which does not seem to be working:

#!/bin/bash

STRING= cat $1
IFS='.' read -ra VALUES <<< "$STRING"

## To print all values
for i in "${VALUES[@]}"; do
    echo $i
done
  • Can you precisely specify the rules? Why is .com omitted: is it because com is in a blacklist, or simply because com is the last token in the current whitespace-free block? Also, I assume ….xyc.…… xyz … is a typo; you may want to fix that. Please edit your question to clarify. Commented Nov 28, 2019 at 16:36
  • Check your code at shellcheck.net. Good luck. Commented Nov 28, 2019 at 16:36
  • STRING= cat $1 is a duplicate of Why does a space in a variable assignment give an error in bash?, and How do I set a variable to the output of a command in bash?. Commented Nov 28, 2019 at 17:57
  • ...afaict, the rest of what you're referring to as "tokenization" is just splitting on a delimiter, which we also have duplicates already covering. Commented Nov 28, 2019 at 18:01
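
To make the two points from the comments above concrete, here is a minimal sketch of reading the file line by line and splitting each line on '.'. The function name tokenize_file is hypothetical, and because the rule for omitting com has not been specified, this version keeps every label.

```bash
#!/bin/bash

# tokenize_file (hypothetical name): print every dot-separated label
# of every line in the file given as $1, one label per line.
tokenize_file() {
    local line
    local -a labels
    # "|| [[ -n $line ]]" also handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        # Skip empty lines.
        [[ -z $line ]] && continue
        # IFS='.' applies only to this read, so the global IFS is untouched.
        IFS='.' read -r -a labels <<< "$line"
        printf '%s\n' "${labels[@]}"
    done < "$1"
}
```

Called as tokenize_file subdomains.txt (the file name is only an example), it prints abc, xyc, kkk, com, hjk, ... on separate lines for the sample input in the question.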

2 Answers

1

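Your script had two mistakes: STRING= cat $1 runs cat with an empty STRING instead of capturing its output, and read stops at the first newline unless it is given -d ''. Here is a corrected version that uses a hard-coded sample string: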

#!/bin/bash

STRING='
abc.xyc.kkk.com hjk.pol.lll.kkk.com
'

IFS=' ' read -d '' -a VALUES <<< "$STRING"

for i in ${VALUES[@]}; do
    echo "$i" | sed 's/\./ /g' 
done

Output

abc xyc kkk com
hjk pol lll kkk com

By the way, if you want each token as its own array element, instead of each entire hostname, you can split on '.' instead:

#!/bin/bash

STRING='
abc.xyc.kkk.com hjk.pol.lll.kkk.com
'

IFS='.' read -d '' -a VALUES <<< "$STRING"

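# This relies on the unquoted expansion to re-split "com hjk" on whitespace
# and to drop the newlines; no dots remain, so sed is not needed here.
for i in ${VALUES[@]}; do
    echo "$i"
done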

Output

abc
xyc
kkk
com
hjk
pol
lll
kkk
com

Let me know if it works!


3 Comments

"${VALUES[@]}", or else it has all the bugs of ${VALUES[*]}.
And it's much more efficient to use ${i//./ } than sed.
More generally, though, a question that doesn't specify one single error with the shortest code that reproduces it should be closed as too-broad, not answered (or closed as duplicate, with a separate duplicate for each specific/narrower question hidden within the larger/broader question).
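
As a rough sketch of the first two comments applied to the answer's first script: the array expansion is quoted and the dots are replaced with parameter expansion instead of sed. The function name print_hosts is hypothetical. Because the quoted expansion no longer re-splits on whitespace, this version assumes the hosts are separated by spaces or newlines and puts both in IFS for the read.

```bash
#!/bin/bash

# Same sample input as in the answer above.
STRING='
abc.xyc.kkk.com hjk.pol.lll.kkk.com
'

# print_hosts (hypothetical name): print each host in $1 with its dots
# replaced by spaces, one host per line.
print_hosts() {
    local -a hosts
    local host
    # Split on spaces and newlines; -r keeps backslashes literal.
    # With -d '' read returns non-zero at end of input, hence "|| true".
    IFS=$' \n' read -r -d '' -a hosts <<< "$1" || true
    for host in "${hosts[@]}"; do
        # ${host//./ } replaces every '.' with a space without spawning sed.
        printf '%s\n' "${host//./ }"
    done
}

print_hosts "$STRING"
```

For the sample string this prints "abc xyc kkk com" and "hjk pol lll kkk com" on two lines, the same as the first output shown in the answer.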
1

Something like this in awk can do the job:

awk 'BEGIN {FS="."; RS=" "} {$NF=""} 1'

Here is the test:

$ echo abc.xyc.kkk.com hjk.pol.lll.kkk.com |awk 'BEGIN {FS="."; RS=" "} {$NF=""} 1'
abc xyc kkk
hjk pol lll kkk
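
The question's input has one host per line and asks for one token per line, so a line-oriented variant of the same idea might look like the sketch below. The wrapper name labels_without_last is hypothetical, and dropping the last label purely by position is an assumption, since the question does not say why com is omitted.

```bash
#!/bin/bash

# labels_without_last (hypothetical name): print every dot-separated label
# except the last one, one label per line. Reads the files given as
# arguments, or standard input when there are none.
labels_without_last() {
    # -F. sets the field separator to '.'; the default record separator
    # keeps one host per record. The loop stops before field NF.
    awk -F. '{ for (i = 1; i < NF; i++) print $i }' "$@"
}
```

It can be fed a file name, or a pipe such as printf 'abc.xyc.kkk.com\n' | labels_without_last.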
