pv – Pipe Viewer – My New Favourite Command Line Tool

I’ve got a rather large dataset that I need to do a lot of processing on, over several iterations. It’s a 20GB gzipped file of flat text, and I’m impatient and don’t like not knowing things!

My new favourite Linux command line tool, pv (pipe viewer), is totally awesome. Check this out:


pv -cN source < urls.gz | zcat | pv -cN zcat | perl -lne '($a,$b,$c,$d) = split /\||\t/; print $b unless $b =~ /ac\.uk/; print $c unless $c =~ /ac\.uk/' | pv -cN perl | gzip | pv -cN gzip > hosts.gz
zcat: 93.4GiB 1:33:18 [26.6MiB/s] [ <=> ]
perl: 85.7GiB 1:33:18 [25.3MiB/s] [ <=> ]
source: 13.2GiB 1:33:17 [3.57MiB/s] [===============================================> ] 67% ETA 0:44:41
gzip: 12.7GiB 1:33:18 [3.51MiB/s] [ <=> ]

Underneath, I’m basically splitting some text and removing the stuff I don’t want. Without pv, the pipeline is just:

zcat urls.gz | perl -lne '($a,$b,$c,$d) = split /\||\t/; print $b unless $b =~ /ac\.uk/; print $c unless $c =~ /ac\.uk/' | gzip > hosts.gz

But at appropriate points in the pipeline I’ve piped the output into pv to report on some metrics. FYI the -N flag lets me set a name for each pv instance, and the -c flag turns on cursor positioning so that multiple instances of pv can each keep their own line on screen!

The reason pipe viewer is totally cool is the extra sneaky data we get!

Pipe Viewer Is Magic

Because the first instance of pv is reading our urls.gz file in itself, it knows the file's total size, so it can display how much of the file it’s processed and roughly how long it will take to complete. MOST USEFUL THING EVER! Also, I had no idea how large the uncompressed dataset was, and was hesitant to uncompress the data as I wasn’t sure how big it would be. We can see from the pv instance named zcat that zcat has so far spat out 93.4GiB of data; at 67% through, we can predict this file is probably around 140GiB if we extract it. How cool is that? We can also tell from the pv named perl that after splitting and removing the data we don’t want, we’ve so far shaved off about 7.7GiB (93.4GiB in, 85.7GiB out), which is kinda interesting to mull over for a bit. Lastly, with the pv instance named gzip, pipe viewer is telling us the size of the output file we’ve generated so far.

This is totally rad.
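The back-of-the-envelope sums can be done straight from the numbers pv printed. This is a rough sketch only: the variable names are mine, the figures are copied from the snapshot above, and it assumes the compression ratio stays roughly constant through the rest of the file.

```shell
# Figures read off the pv output above.
zcat_gib=93.4     # uncompressed data seen so far (pv named zcat)
perl_gib=85.7     # data left after the Perl filter (pv named perl)
source_pct=67     # how far through urls.gz we are (pv named source)

# Projected uncompressed size: what we've seen divided by the fraction read.
estimate_gib=$(awk -v out="$zcat_gib" -v pct="$source_pct" 'BEGIN { printf "%.1f", out / (pct / 100) }')

# Data dropped by the filter so far: what went in minus what came out.
removed_gib=$(awk -v before="$zcat_gib" -v after="$perl_gib" 'BEGIN { printf "%.1f", before - after }')

echo "estimated uncompressed size: ${estimate_gib}GiB"   # 139.4GiB
echo "removed by the filter so far: ${removed_gib}GiB"   # 7.7GiB
```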

Note. Many thanks to Norway for forcing me to rewrite my initial one liner of

zcat urls.gz | sed 's/|/ /g' | while read a b c d ; do echo $b ; echo $c ; done | grep -v ac.uk$ | gzip > hosts.gz

by glaring at me.

About rus

Arrogant, narcissistic and imperatively logical. I first started coding in the mid 80s on an Amstrad 6128, entering games found in the back of Amstrad Action. After watching Hackers and falling in love with Angelina Jolie I installed Slackware 2.0 on a P200 in 1997 and spent the next 6-7 years studying computery things at various colleges and universities. Several years later I can now be found in an office premises by day sat in front of a Macbook, using a Windows VM to manage Linux servers, or in a field by night, fire dancing and holding pyrotechnics casually in my hands whilst they explode.