How to create a tokenization tool in bash script? [duplicate]

Question

I need to create a tool to tokenize an input list. I would be providing a text file containing a list of subdomains in the following manner:

abc.xyc.kkk.com
hjk.pol.lll.kkk.com
...

And the output needs to be:

abc
xyz
kkk
hjk
...

The delimiter is '.'

I have tried out the following code, which does not seem to be working:

#!/bin/bash

STRING= cat $1
IFS='.' read -ra VALUES <<< "$STRING"

## To print all values
for i in "${VALUES[@]}"; do
    echo $i
done

Can you specify the rules precisely? Why is .com omitted – is it because com is on a blacklist, or simply because com is the last token in the current whitespace-free block? Also, I assume ….xyc.… → … xyz … is a typo – you may want to fix that. Please edit your question using the grey edit button below your question.
– Socowi, Commented Nov 28, 2019 at 16:36
STRING= cat $1 is a duplicate of "Why does a space in a variable assignment give an error in bash?" and "How do I set a variable to the output of a command in bash?".
– Charles Duffy, Commented Nov 28, 2019 at 17:57
...afaict, the rest of what you're referring to as "tokenization" is just splitting on a delimiter, which we also have duplicates already covering.
– Charles Duffy, Commented Nov 28, 2019 at 18:01
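
The two points raised in the comments above (no space after `=`, and command substitution to capture output) can be illustrated with a minimal sketch. It assumes the input file holds one subdomain per line and that every token, including the last one, should be printed, which the question leaves open. The function name `tokenize_file` is made up for this example.

```bash
#!/bin/bash

# Hypothetical helper: print every dot-separated token of each line in a file.
tokenize_file() {
    local line token
    local -a values
    # Read line by line; "|| [[ -n $line ]]" also handles a last line
    # that has no trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        # Split this one line on '.' into an array.
        IFS='.' read -ra values <<< "$line"
        for token in "${values[@]}"; do
            printf '%s\n' "$token"
        done
    done < "$1"
}

# To capture a file's content in a variable instead, there must be no space
# after '=' and the command goes inside $(...):
#   STRING=$(cat "$1")

# Run only when a file argument is given.
if [[ $# -ge 1 ]]; then
    tokenize_file "$1"
fi
```

Reading line by line avoids holding the whole file in one variable, and `IFS='.'` is set only for the single `read` call, so the rest of the script keeps the default splitting rules.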

Matias Barrios · Accepted Answer · 2019-11-28 16:55:28Z

1

Your script had some mistakes, which I corrected. It works now:

#!/bin/bash

STRING='
abc.xyc.kkk.com hjk.pol.lll.kkk.com
'

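# -d '' makes read consume the whole string; IFS=' ' splits it on spaces only.
IFS=' ' read -d '' -a VALUES <<< "$STRING"

# The unquoted expansion is word-split on whitespace, which drops the leftover newlines.
for i in ${VALUES[@]}; do
    echo "$i" | sed 's/\./ /g'   # replace every dot with a space
done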

Output

abc xyc kkk com
hjk pol lll kkk com

By the way, if you want each token in the array as its own element, instead of the entire subdomain, you can do this:

#!/bin/bash

STRING='
abc.xyc.kkk.com hjk.pol.lll.kkk.com
'

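# Splitting on '.' puts each token in its own element, except "com hjk", which contains no dot.
IFS='.' read -d '' -a VALUES <<< "$STRING"

# The unquoted expansion is word-split on whitespace, which separates "com hjk" into two tokens.
for i in ${VALUES[@]}; do
    echo "$i" | sed 's/\./ /g'   # no dots remain, so sed leaves the token unchanged
done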

Output

abc
xyc
kkk
com
hjk
pol
lll
kkk
com

Let me know if it works!

answered Nov 28, 2019 at 16:55

Matias Barrios


3 Comments

Charles Duffy Over a year ago

"${VALUES[@]}", or else it has all the bugs of ${VALUES[*]}.

Charles Duffy Over a year ago

And it's much more efficient to use ${i//./ } than sed.

Charles Duffy Over a year ago

More generally, though, a question that doesn't specify one single error with the shortest code that reproduces it should be closed as too-broad, not answered (or closed as duplicate, with a separate duplicate for each specific/narrower question hidden within the larger/broader question).
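
The first two comments can be combined into a short sketch. It assumes the same sample string as the answer and splits on any whitespace, so the quoted expansion does not carry stray newlines into the output. The function name `dots_to_spaces` is hypothetical.

```bash
#!/bin/bash

# Hypothetical helper: print each whitespace-separated subdomain
# with its dots replaced by spaces.
dots_to_spaces() {
    local -a values
    local i
    # Split on spaces, tabs and newlines; -d '' makes read consume the whole
    # string. read returns non-zero at end of input here, which is expected.
    IFS=$' \t\n' read -r -d '' -a values <<< "$1" || true
    # Quoted expansion: each array element stays exactly one word.
    for i in "${values[@]}"; do
        # ${i//./ } replaces every '.' with a space without starting sed.
        printf '%s\n' "${i//./ }"
    done
}

dots_to_spaces '
abc.xyc.kkk.com hjk.pol.lll.kkk.com
'
```

With the sample string this prints `abc xyc kkk com` and `hjk pol lll kkk com` on separate lines. The parameter expansion runs inside the shell, so no extra process is started per element.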

Romeo Ninov · Accepted Answer · 2019-11-28 16:47:46Z

1

Something like this in awk can do the job:

awk 'BEGIN {FS="."; RS=" "} {$NF=""} 1'

Here is the test:

$ echo abc.xyc.kkk.com hjk.pol.lll.kkk.com |awk 'BEGIN {FS="."; RS=" "} {$NF=""} 1'
abc xyc kkk
hjk pol lll kkk
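
If the input has one subdomain per line, as in the question, and each token should go on its own line, a variant along the following lines might work. It assumes the last label (such as com) should be dropped, which the question does not confirm. The function name `tokens_without_last` is hypothetical.

```bash
#!/bin/bash

# Hypothetical helper: read subdomains from stdin, one per line, and print
# every dot-separated token except the last, one token per line.
tokens_without_last() {
    awk -F. '{ for (i = 1; i < NF; i++) print $i }'
}

printf 'abc.xyc.kkk.com\nhjk.pol.lll.kkk.com\n' | tokens_without_last
```

With the two sample lines this prints abc, xyc, kkk, hjk, pol, lll and kkk, each on its own line. To read from a file instead, redirect it into the function, for example `tokens_without_last < "$1"`.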

answered Nov 28, 2019 at 16:47

Romeo Ninov
