bug#33281: head does not consume input after '-c' is satisfied
Luiz Angelo Daros de Luca
2018-11-05 20:30:17 UTC
Hello,

Once head has read enough bytes to satisfy the -c option, it stops reading
input and quits.
This differs from what -n does, and it also differs from both the
FreeBSD and busybox head implementations.

With GNU Coreutils head:

$ echo -e "123\n456\n789" | { head -n 1; while read a; do echo "-$a-"; done; }
123
$ echo -e "123\n456\n789" | { head -c 4; while read a; do echo "-$a-"; done; }
123
-456-
-789-
$

With all other head implementations I tested:

$ echo -e "123\n456\n789" | { head -c 4 ; while read a ; do echo "-$a-" ; done ; }
123
$

It would make sense for both -n and -c to behave the same way, differing only
in whether they count bytes or lines.

Regards,
--
Luiz Angelo Daros de Luca
***@gmail.com
Philip Rowlands
2018-11-05 21:17:49 UTC
Post by Luiz Angelo Daros de Luca
Once head has read enough bytes to satisfy the -c option, it stops reading
input and quits.
This differs from what -n does, and it also differs from both the
FreeBSD and busybox head implementations.
$ echo -e "123\n456\n789" | { head -n 1; while read a; do echo "-$a-"; done; }
123
That picture is incomplete: head doesn't read everything, but it does read more than one line. On my (rather aged) Linux system:
$ head --version
head (GNU coreutils) 8.25

$ seq 1864 | { head -n 1; while read a; do echo "-$a-"; done; }
1
--
-1861-
-1862-
-1863-
-1864-

What's special about 1860 lines of output? It's just over the amount of data which head reads from the pipe, 8192 bytes.

$ seq 1860 | wc -c
8193
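
For anyone who wants to check the 8193 figure without running seq, the arithmetic can be sketched in shell as below. The variable names are only illustrative; each line of seq output costs its digits plus one newline.

```shell
# Bytes produced by `seq 1860`: digits per number plus one newline each.
one_digit=$(( 9 * 2 ))      # 1..9
two_digit=$(( 90 * 3 ))     # 10..99
three_digit=$(( 900 * 4 ))  # 100..999
four_digit=$(( 861 * 5 ))   # 1000..1860
total=$(( one_digit + two_digit + three_digit + four_digit ))
echo "$total"               # prints 8193
```

Since a single 8192-byte read takes all but one of those bytes, the only byte of the first 1860 lines left in the pipe is the newline after 1860. With seq 1864, that leftover newline is what the shell loop sees first, hence the empty "--" line before -1861-.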
Post by Luiz Angelo Daros de Luca
$ echo -e "123\n456\n789" | { head -c 4; while read a; do echo "-$a-"; done; }
123
-456-
-789-
In this case head knows it only needs 4 bytes, so only reads 4 bytes.
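
If a script depends on exactly N bytes being consumed, regardless of which head is installed, one option is dd with a one-byte block size. This is only a sketch: take_bytes is a made-up helper name, and bs=1 trades speed for the guarantee that no more than the requested bytes are read from the pipe.

```shell
# take_bytes is a hypothetical helper, not part of coreutils.
# With bs=1, dd issues one-byte reads, so it consumes at most "$1" bytes of stdin.
take_bytes() {
    dd bs=1 count="$1" 2>/dev/null
}

printf '123\n456\n789\n' | { take_bytes 4; while read -r a; do echo "-$a-"; done; }
# prints:
# 123
# -456-
# -789-
```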
Post by Luiz Angelo Daros de Luca
$ echo -e "123\n456\n789" | { head -c 4 ; while read a ; do echo "-$a-" ; done ; }
123
$
It would make sense for both -n and -c to behave the same way, differing only
in whether they count bytes or lines.
Consistency would be good, but consider that in the case of lines, head doesn't know up front how much data to read. The only way to read exactly the right amount, not a byte more, would be to read one byte at a time, which is something of a performance killer. It's not possible to "un-read" data you've collected via the read syscall.
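
The shell's own read builtin illustrates that trade-off: it cannot seek back on a pipe either, so common shells such as bash and dash read a byte at a time and stop exactly at the newline. A minimal sketch, assuming a POSIX shell; first_line is a hypothetical helper used only for illustration, not a proposed change to head:

```shell
# first_line is a hypothetical stand-in for `head -n 1` that never over-reads:
# the read builtin consumes stdin only up to and including the first newline.
first_line() {
    IFS= read -r line && printf '%s\n' "$line"
}

printf '123\n456\n789\n' | { first_line; while read -r a; do echo "-$a-"; done; }
# prints:
# 123
# -456-
# -789-
```

This leaves the rest of the pipe intact for the loop, at the cost of one read call per byte of the first line.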

To achieve consistency in the other direction, head could ignore the optimization to reduce the number of bytes read, and always read 8192 bytes, knowing that some would be discarded. This seems to be more in line with the other implementations you've tried.

For consistency's sake, what would these do? For widely differing values, the only way to produce the same residual output would be to consume all input data.
$ cat file.txt | { head -n 100; wc -c; }
$ cat file.txt | { head -c 100KB; wc -c; }


Cheers,
Phil
Bernhard Voelker
2018-11-06 07:06:38 UTC
Post by Philip Rowlands
Post by Luiz Angelo Daros de Luca
Once head has read enough bytes to satisfy the -c option, it stops reading
input and quits.
This differs from what -n does, and it also differs from both the
FreeBSD and busybox head implementations.
$ echo -e "123\n456\n789" | { head -n 1; while read a; do echo "-$a-"; done; }
123
$ head --version
head (GNU coreutils) 8.25
$ seq 1864 | { head -n 1; while read a; do echo "-$a-"; done; }
1
--
-1861-
-1862-
-1863-
-1864-
What's special about 1860 lines of output? It's just over the amount of data which head reads from the pipe, 8192 bytes.
Indeed, running 'head' under 'strace' confirms that:

read(0, "1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n11\n12\n13\n14"..., 8192) = 8192

... and 'head' then tries to "undo" the over-read by calling lseek(),
which fails here because stdin is a pipe:

lseek(0, -8190, SEEK_CUR) = -1 ESPIPE (Illegal seek)

That said, if your input were a regular file, then seeking back to just
after the first newline "\n" would succeed:

$ file=$(mktemp) \
&& seq 4 > "$file" \
&& { strace -ve read,lseek head -n 1; while read a; do echo "-$a-"; done; } < "$file" \
; rm -f "$file"
...
read(0, "1\n2\n3\n4\n", 8192) = 8
lseek(0, -6, SEEK_CUR) = 2
1
+++ exited with 0 +++
-2-
-3-
-4-

Have a nice day,
Berny
Paul Eggert
2018-11-06 19:52:25 UTC
Post by Philip Rowlands
To achieve consistency in the other direction, head could ignore the optimization to reduce the number of bytes read, and always read 8192 bytes, knowing that some would be discarded.
Let's not do that. It's less efficient and less useful than what GNU
'head -c4' is doing now.
Post by Philip Rowlands
For widely differing values, the only way to produce the same residual output would be to consume all input data.
Eeuuww. Let's *especially* not do that.
