I've got a file that looks like this:

a    12345
b    3456
c    45678

and I've got a bash array of strings:

mylist=("a" "b")

What I want to do is sum the numbers in the second column, but only for rows whose first-column value (here "a" or "b") is present in mylist.

My non-working code:

cat myfile.txt | awk -F'\t' '{BEGIN{sum=0} {if ($1 in ${mylist[@]}) sum+=$2} END{print sum}}'

The expected result is 12345+3456=15801. I understand that the problem is in the if statement, but I can't figure out how to rearrange this code to make it work.

  • awk can't see bash variables; they're two different interpreters in two different processes. It's not clear how you'd expect this to work -- and you don't need awk for the job you're doing anyhow; native bash can do it just fine. Commented Feb 3, 2023 at 14:09
  • Or if you want something faster than native bash when operating on very large input files, the standard UNIX toolkit has join, perfectly well-suited to extracting only the lines you care about. Commented Feb 3, 2023 at 14:11
  • Thanks, Shawn, yes, it was my typo; I didn't use them in the original code. Edited it. Commented Feb 3, 2023 at 14:14
  • (And if you want to quickly check whether a bash array contains a string, you should make it an associative array with that string as the key instead of the value; that way it's an O(1) lookup instead of an O(n) one.) Commented Feb 3, 2023 at 14:14
  • You actually think an approach involving unnecessary pre-sorting is a good solution to big-data joining? Ha. Commented Feb 3, 2023 at 15:50
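
As the first comment notes, awk never sees `${mylist[@]}` as an array; the two programs only share what is passed explicitly. One way to hand the values over is awk's `-v` option. This is a minimal sketch, assuming the keys contain no whitespace or backslashes; the names `input`, `list`, `wanted`, and `total` are made up for illustration, and the temporary file only mirrors the sample data from the question:

```shell
#!/usr/bin/env bash

mylist=(a b)

# Hypothetical sample input, mirroring the file in the question
input=$(mktemp)
printf '%s\t%s\n' a 12345 b 3456 c 45678 > "$input"

# "${mylist[*]}" joins the elements with spaces (default IFS);
# awk receives that single string as its variable `list`.
total=$(awk -v list="${mylist[*]}" '
    BEGIN {
        # Split the string back into items and use them as array keys
        n = split(list, items, " ")
        for (i = 1; i <= n; i++) wanted[items[i]]
    }
    $1 in wanted { sum += $2 }
    END { print sum + 0 }   # + 0 prints 0 instead of an empty line when nothing matches
' "$input")

echo "$total"
rm -f "$input"
```

For the sample data this prints 15801.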

3 Answers


Doing it in pure bash by making the elements of the original array keys in an associative one:

#!/usr/bin/env bash

mylist=(a b)

# Use the elements of the array as the keys in an associative array
declare -A keys
for elem in "${mylist[@]}"; do
    keys[$elem]=1
done


declare -i sum=0
# Read the lines on standard input
# For example, ./sum.sh < input.txt
while read -r name num; do
    # If the name is a key in the associative array, add to the sum
    if [[ -v keys[$name] ]]; then
        sum+=$num
    fi
done

printf "%d\n" "$sum"

1 Comment

Maybe also show the alternative declare -A keys=([a]=1 [b]=1) defining an associative array up-front, vs starting from an indexed array and transforming?
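
The alternative this comment suggests could look like the sketch below, which declares the associative array directly instead of converting an indexed one. It assumes bash 4 or newer, and it feeds the question's sample data through a here-document purely for illustration:

```shell
#!/usr/bin/env bash

# Keys defined up front as an associative array (bash 4+)
declare -A keys=([a]=1 [b]=1)

declare -i sum=0
# Sample data from the question, fed via a here-document
while read -r name num; do
    # Every key was stored with the value 1, so a non-empty
    # lookup result means the name is one of the wanted keys
    if [[ -n ${keys[$name]:-} ]]; then
        sum+=$num
    fi
done <<'EOF'
a 12345
b 3456
c 45678
EOF

printf '%d\n' "$sum"
```

With the sample rows this prints 15801.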

One method would be:

#!/bin/bash

mylist=(a b)

awk '
    FNR==NR { a[$1]; next }
    $1 in a { sum += $2 }
        END { print sum }
' <(printf '%s\n' "${mylist[@]}") file

Note that, when initializing an array in bash, array elements are separated by whitespace, not commas.


There's no good reason to make awk read the array in the first place. Let join quickly pick out the matching lines -- that's what it's specialized to do.

And if in real life your array and input file keys are guaranteed to be sorted, as they are in the example, you can drop the sort calls from the code below.

# Cautious code that doesn't assume input sort order
LC_ALL=C join -1 1 -2 1 -o1.2 \
  <(LC_ALL=C sort <myfile.txt) \
  <(printf '%s\n' "${mylist[@]}" | LC_ALL=C sort) \
  | awk '{ sum += $1 } END { print sum }'

...or...

# Fast code that requires both the array and the file to be pre-sorted
join -1 1 -2 1 -o1.2 myfile.txt <(printf '%s\n' "${mylist[@]}") \
  | awk '{ sum += $1 } END { print sum }'
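
If it is unclear whether the inputs are already sorted, `sort -c` can check without rewriting anything: it exits with status 0 only when its input is in order. The sketch below is one possible way to decide between the two variants above; the variable names `input` and `presorted` and the temporary file are assumptions for illustration only.

```shell
#!/usr/bin/env bash

mylist=(a b)

# Hypothetical sample input, mirroring the file in the question
input=$(mktemp)
printf '%s\t%s\n' a 12345 b 3456 c 45678 > "$input"

# sort -c only checks the order; exit status 0 means "already sorted".
# Both the file and the array contents have to pass the check.
if LC_ALL=C sort -c "$input" 2>/dev/null &&
   printf '%s\n' "${mylist[@]}" | LC_ALL=C sort -c 2>/dev/null; then
    presorted=yes
else
    presorted=no
fi

echo "$presorted"
rm -f "$input"
```

For the sample data this prints yes, so the faster variant would be safe; a result of no means the cautious variant with explicit sorting is the one to use.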
