Thursday, November 15, 2012

Multitask file downloader in Bash in 2 minutes

Suppose you have a file with 5,000 URLs and you want to download them in parallel, using at most 50 concurrent connections. You can do it in Bash in 2 minutes. Here is how:
MAXJOBS=50
for url in $(cat "$UrlsFile"); do
    CurJobs=$(jobs -r | wc -l)   # -r counts only still-running jobs
    while (( CurJobs >= MAXJOBS )); do
        sleep 0.2
        CurJobs=$(jobs -r | wc -l)
    done
    curl .... & # download command goes here. NOTE THE "&" sign
done
The idea is quite simple:
  • Use the shell's job control to run jobs in the background
  • Monitor the number of current jobs by simply counting the lines in the jobs command's output
  • If 50 jobs are already running in parallel, wait until one of them completes (a complete, runnable sketch follows this list)
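
Here is a minimal end-to-end sketch of the same idea. The file name urls.txt and the curl -sSO invocation are assumptions for illustration; substitute your own URL list and download command:
#!/usr/bin/env bash
# Download every URL listed in urls.txt (one URL per line, no spaces),
# keeping at most MAXJOBS curl processes alive at any moment.
MAXJOBS=50
UrlsFile="urls.txt"   # assumed input file, one URL per line

while read -r url; do
    # Poll until a download slot frees up
    while (( $(jobs -r | wc -l) >= MAXJOBS )); do
        sleep 0.2
    done
    curl -sSO "$url" &   # -O saves each file under its remote name
done < "$UrlsFile"

wait   # block until the last downloads finish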

Update

Until Bash 4.3 you could wait either for a specific job/PID or for all background jobs. That's why I've used sleep in the example above. Since Bash 4.3 you can use wait -n to wait for any single job to terminate, so the code can be rewritten more efficiently as follows:
MAXJOBS=50
for url in $(cat "$UrlsFile"); do
    if (( $(jobs -r | wc -l) >= MAXJOBS )); then
        wait -n   # block until any one background job terminates
    fi
    curl .... & # download command goes here. NOTE THE "&" sign
done
wait # wait for the downloads that are still in flight
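
Note that wait -n produces an error on shells older than 4.3, so if the script may run on older systems it is worth checking the version up front. A small sketch:
# Abort early if this Bash lacks 'wait -n' (added in 4.3)
if (( BASH_VERSINFO[0] < 4 || (BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] < 3) )); then
    echo "Bash >= 4.3 required for 'wait -n'; use the sleep-based loop instead" >&2
    exit 1
fi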
