--pipe is inefficient (though not at the scale you are measuring; something is very wrong on your system). It can deliver on the order of 1 GB/s (total).
--pipepart is, on the contrary, highly efficient. It can deliver on the order of 1 GB/s per core, provided your disk is fast enough. This should be the most efficient way of processing data.txt1. It will split data.txt1 into one block per CPU core and feed those blocks into a wc -l running on each core:
parallel --block -1 --pipepart -a data.txt1 wc -l
You need version 20161222 or later for --block -1 to work.
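You can check your installed version with parallel --version; the first line of its output contains the release date. Also note that the command above prints one line count per block, so to get a single total you need to sum the partial counts, just as the examples below do with awk. A minimal sketch, assuming data.txt1 exists as in the question:

$ parallel --block -1 --pipepart -a data.txt1 wc -l | awk '{s+=$1} END {print s}'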
These are timings from my old dual-core laptop. seq 200000000 generates 1.8 GB of data.
$ time seq 200000000 | LANG=C wc -c
1888888898
real 0m7.072s
user 0m3.612s
sys 0m2.444s
$ time seq 200000000 | parallel --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 1m28.101s
user 0m25.892s
sys 0m40.672s
The time here is mostly due to GNU Parallel spawning a new wc -c for each 1 MB block: 1.8 GB in 1 MB blocks means roughly 1800 invocations. Increasing the block size makes it faster:
$ time seq 200000000 | parallel --block 10m --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 0m26.269s
user 0m8.988s
sys 0m11.920s
$ time seq 200000000 | parallel --block 30m --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 0m21.628s
user 0m7.636s
sys 0m9.516s
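Following the same pattern you can keep increasing the block size until the per-block spawn overhead is amortized. A sketch only; the timings will depend on your machine:

$ time seq 200000000 | parallel --block 100m --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'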
As mentioned, --pipepart is much faster if you have the data in a file:
$ seq 200000000 > data.txt1
$ time parallel --block -1 --pipepart -a data.txt1 LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 0m2.242s
user 0m0.424s
sys 0m2.880s
So on my old laptop I can process 1.8 GB in 2.2 seconds.
If you have only one core and your work is CPU-bound, then parallelizing will not help you. Parallelizing on a single-core machine can make sense if most of the time is spent waiting (e.g. waiting for the network).
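For instance, a hedged sketch of network-bound parallelization (urls.txt is a hypothetical file with one URL per line):

$ parallel -j 8 wget -q {} < urls.txt

Here the 8 wget processes spend most of their time waiting for the network, so they can overlap even on a single core.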
However, the timings from your computer tell me something is very wrong with it. I recommend you test your program on another computer.