Discussion:
bug#7489: [coreutils] over aggressive threads in sort
DJ Lucas
2010-11-26 18:01:55 UTC
Permalink
Sent to bug-coreutils too (no bug id currently AFAICT).

Bug only affects multi-byte locales. Take the following samples:



bash-4.1# zcat cracklib-words-20080507.gz | sort -u --debug > file &&
echo $?
sort: using `en_US.UTF-8' sorting rules
Segmentation fault
bash-4.1# echo $?
139
bash-4.1#


bash-4.1# zcat cracklib-words-20080507.gz | sort -u --parallel=1
--debug > file && echo $?
sort: using `en_US.UTF-8' sorting rules
0
bash-4.1#


bash-4.1# zcat cracklib-words-20080507.gz | LANG=C sort -u --debug >
file && echo $?
sort: using simple byte comparison
0
bash-4.1#


bash-4.1# gzip -d cracklib-words-20080507.gz
bash-4.1# sort -u --debug cracklib-words-20080507 > file && echo $?
sort: using `en_US.UTF-8' sorting rules
0
bash-4.1#


In the interim, for a quick and dirty hack, I've added an LC_COLLATE
comparison and set nthreads to 1 in multibyte locales.

Probably well known, but the test file that I used is available from:
http://downloads.sourceforge.net/cracklib/cracklib-words-20080507.gz

-- DJ Lucas
Paul Eggert
2010-11-26 23:24:05 UTC
Permalink
Thanks for the bug report. Unfortunately,
I cannot reproduce the problem with coreutils 8.7, either on
RHEL 5.5 x86-64 or on Ubuntu 10.10 x86.

Which version of coreutils are you running? And on what
platform? How did you build it?

Can you reproduce it with --parallel=2? If not, which value
of --parallel is needed?

If this is coreutils 8.7, can you get a core dump and a GDB
backtrace?
DJ Lucas
2010-11-27 03:00:01 UTC
Permalink
Post by Paul Eggert
Thanks for the bug report. Unfortunately,
I cannot reproduce the problem with coreutils 8.7, either on
RHEL 5.5 x86-64 or on Ubuntu 10.10 x86.
Which version of coreutils are you running?
8.7. Haven't tested on 8.6 or 8.5. 8.4 worked correctly, but that was
expected since I can get around it if not running in multiple threads.
Post by Paul Eggert
And on what
platform?
LFS 32bit on Athlon II 250 (64bit dual core CPU/hardware). gcc-4.5.1,
glibc-2.12.1, binutils-2.20.1.
Post by Paul Eggert
How did you build it?
Basic CMMI --prefix=/usr. Built both with and without i18n RH patch and
Linux only uname patches.
Post by Paul Eggert
Can you reproduce it with --parallel=2? If not, which value
of --parallel is needed?
Yes, it is reproducible with --parallel=2. Seems this is a difficult
one to track down. So far we've only confirmed that it happens in
multibyte locales, with a 32bit build on 64bit multi-core hardware, and
only from stdin. Works fine if the file is passed as an argument to
sort. Haven't confirmed whether the 64bit build was using a multibyte
locale yet, but the error did not occur in that user's report. Setting
LANG/LC_ALL to C or POSIX works around it, as does --parallel=1. I've
added a quick and really ugly workaround in the interim so that I can
continue and get some work done (using getenv("LC_COLLATE"), and
testing only against != POSIX and != C, setting nthreads=1). Confirmed
that the correct encoding is displayed with --debug with the workaround
in place.
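
For illustration, a minimal sketch of what that interim workaround might
look like; the function name, its signature, and where it would be called
from are assumptions for this sketch, and it is not the fix that was
eventually committed:

#include <stdlib.h>
#include <string.h>

/* Interim hack: force a single sort thread unless the collation locale
   is plain "C" or "POSIX", so the parallel code path is never taken.  */
static void
limit_threads_for_multibyte_locale (size_t *nthreads)
{
  char const *collate = getenv ("LC_COLLATE");
  if (collate && strcmp (collate, "C") != 0 && strcmp (collate, "POSIX") != 0)
    *nthreads = 1;
}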

If you have a similar setup lying around (32bit installation on 64bit
multi-core hardware), try
cat bigfile.txt | LANG=en_US.UTF-8 sort -u (or similar)
Post by Paul Eggert
If this is coreutils 8.7, can you get a core dump and a GDB
backtrace?
Sure, give me a bit, but looks like Pádraig Brady beat me to it.

-- DJ Lucas
Pádraig Brady
2010-11-27 03:32:37 UTC
Permalink
Post by DJ Lucas
Setting LANG/LC_ALL to C or
POSIX works around it, as well as --parallel=1. I've added a quick and
really ugly workaround in the interim so that I can continue and get
some work done (using getenv("LC_COLLATE"), and testing only against !=
POSIX and != C, setting nthreads=1). Confirmed that the correct encoding
is displayed with --debug with the workaround in place.
Best workaround I think is to leave the code alone,
and set OMP_NUM_THREADS=1 in your env
as otherwise you may get invalid results
in any locale.

cheers,
Pádraig.
Pádraig Brady
2010-11-27 02:52:56 UTC
Permalink
Post by DJ Lucas
Sent too bug-coreutils too (no bug id currently AFAICT).
bash-4.1# zcat cracklib-words-20080507.gz | sort -u --debug > file &&
echo $?
sort: using `en_US.UTF-8' sorting rules
Segmentation fault
bash-4.1# echo $?
139
bash-4.1#
bash-4.1# zcat cracklib-words-20080507.gz | sort -u --parallel=1
--debug > file && echo $?
sort: using `en_US.UTF-8' sorting rules
0
bash-4.1#
bash-4.1# zcat cracklib-words-20080507.gz | LANG=C sort -u --debug >
file && echo $?
sort: using simple byte comparison
0
bash-4.1#
bash-4.1# gzip -d cracklib-words-20080507.gz
bash-4.1# sort -u --debug cracklib-words-20080507 > file && echo $?
sort: using `en_US.UTF-8' sorting rules
0
bash-4.1#
In the interim, for a quick and dirty hack, I've added an LC_COLLATE
comparison and set nthreads to 1 in multibyte locales.
http://downloads.sourceforge.net/cracklib/cracklib-words-20080507.gz
-- DJ Lucas
100% reproducible on a 32-bit F12 box (the first one I tried)

# zcat cracklib-words-20080507.gz | ./coreutils-8.7/src/sort -u --parallel=1 > /dev/null
# zcat cracklib-words-20080507.gz | ./coreutils-8.7/src/sort -u --parallel=2 > /dev/null
Segmentation fault (core dumped)

#0 0x00c4aecf in __strlen_ia32 () from /lib/libc.so.6
#1 0x00c4ecf1 in strcoll_l () from /lib/libc.so.6
#2 0x00c4a8b1 in strcoll () from /lib/libc.so.6
#3 0x0805682d in strcoll_loop (s1=<value optimized out>, s1size=<value optimized out>, s2=<value optimized out>, s2size=1852142453) at memcoll.c:39
#4 memcoll0 (s1=<value optimized out>, s1size=<value optimized out>, s2=<value optimized out>, s2size=1852142453) at memcoll.c:110
#5 0x08054263 in xmemcoll0 (s1=0xb5457008 "lauters", s1size=8, s2=0x6f626f79 <Address 0x6f626f79 out of bounds>, s2size=1852142453) at xmemcoll.c:71
#6 0x0804e091 in compare (a=0xb68cb558, b=0xb5c3e9b8) at sort.c:2653
#7 0x0804f320 in write_unique (line=0xb68cb558, tfp=0x9f2ebe0, temp_output=0x9f2e9c0 "/tmp/sortqd5BvL") at sort.c:3233
#8 0x0804f5b0 in mergelines_node (node=0xbfd06c0c, total_lines=822189, tfp=0x9f2ebe0, temp_output=0x9f2e9c0 "/tmp/sortqd5BvL") at sort.c:3279
#9 0x0804f8e4 in merge_loop (queue=0xbfd06ca4, total_lines=822189, tfp=0x9f2ebe0, temp_output=0x9f2e9c0 "/tmp/sortqd5BvL") at sort.c:3367
#10 0x0804fc52 in sortlines (lines=0xb7557038, dest=0xb68cb568, nthreads=1, total_lines=822189, parent=0xbfd06c0c, lo_child=true, merge_queue=0xbfd06ca4, tfp=0x9f2ebe0, temp_output=0x9f2e9c0 "/tmp/sortqd5BvL")
at sort.c:3481
#11 0x0804f9a4 in sortlines_thread (data=0xbfd06be4) at sort.c:3404
#12 0x00d58ab5 in start_thread () from /lib/libpthread.so.0
#13 0x00caef1e in clone () from /lib/libc.so.6
(gdb) quit

Looks like ASCII data was passed on the stack for the pointer and size
(0x6f626f79 is "oboy", and 1852142453 is 0x6e657375, "nesu")
Hmm, seems like multiple threads are racing to update the
static "saved" variable in write_unique() ?
Paul Eggert
2010-11-27 08:29:16 UTC
Permalink
Post by Pádraig Brady
Hmm, seems like multiple threads are racing to update the
static "saved" variable in write_unique() ?
I don't think it's as simple as that. write_unique
is generating output, and when it is run it is supposed
to have exclusive access to the output file, which means
it also has exclusive access to the "saved" variable.

I suspect the problem is that the line structure is
messed up when you have multiple threads and sufficiently
large input, in that when one merge node is done and
the next one is run, the "saved" variable (which points
to storage managed by the earlier node) is pointing to
storage that is no longer valid. I haven't had time to
look into the details, though.

Chen, do you have some insight into this? Here's the thread:

http://lists.gnu.org/archive/html/bug-coreutils/2010-11/msg00209.html

I have now reproduced the bug on RHEL 5.5 x86-64, using
coreutils 8.7 built with "configure CC='cc -m32'". The bug
is also present in the savannah git master.
Paul Eggert
2010-11-27 09:39:10 UTC
Permalink
Following up on my previous email, it appears to me that
the following line in mergelines_node is weird:

node->dest -= lo_orig - node->lo + hi_orig - node->hi;

Surely there should be a "*" in front of that line?

(This does not fix the bug; perhaps it is a different bug?)
Paul Eggert
2010-11-28 00:57:53 UTC
Permalink
Could you please try this little patch? It should fix your
problem. I came up with this fix in my sleep (literally!
I woke up this morning and the patch was in my head), but
haven't had time to look at the code in this area to see
if it's the best fix.

Clearly there's at least one more bug as noted in my previous email
<http://lists.gnu.org/archive/html/bug-coreutils/2010-11/msg00216.html>
but I expect it's less likely to fire.

diff --git a/src/sort.c b/src/sort.c
index 7e25f6a..1aa1eb4 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -3226,13 +3226,13 @@ queue_pop (struct merge_node_queue *queue)
static void
write_unique (struct line const *line, FILE *tfp, char const *temp_output)
{
- static struct line const *saved = NULL;
+ static struct line saved;

if (!unique)
write_line (line, tfp, temp_output);
- else if (!saved || compare (line, saved))
+ else if (!saved.text || compare (line, &saved))
{
- saved = line;
+ saved = *line;
write_line (line, tfp, temp_output);
}
}
Pádraig Brady
2010-11-28 01:34:36 UTC
Permalink
Post by Paul Eggert
Could you please try this little patch? It should fix your
problem. I came up with this fix in my sleep (literally!
I woke up this morning and the patch was in my head), but
haven't had time to look at the code in this area to see
if it's the best fix.
Clearly there's at least one more bug as noted in my previous email
<http://lists.gnu.org/archive/html/bug-coreutils/2010-11/msg00216.html>
but I expect it's less likely to fire.
diff --git a/src/sort.c b/src/sort.c
index 7e25f6a..1aa1eb4 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -3226,13 +3226,13 @@ queue_pop (struct merge_node_queue *queue)
static void
write_unique (struct line const *line, FILE *tfp, char const *temp_output)
{
- static struct line const *saved = NULL;
+ static struct line saved;
if (!unique)
write_line (line, tfp, temp_output);
- else if (!saved || compare (line, saved))
+ else if (!saved.text || compare (line, &saved))
{
- saved = line;
+ saved = *line;
write_line (line, tfp, temp_output);
}
}
It passes a fleeting test between kids and sleep...

$ zcat cracklib-words-20080507.gz | ./coreutils-8.7/src/sort -u > /dev/null
Segmentation fault (core dumped)
$ zcat cracklib-words-20080507.gz | ./coreutils-8.7/src/sort -u | wc -l
0
$ zcat cracklib-words-20080507.gz | ./coreutils-8.7/src/sort-paul -u | wc -l
1671704

cheers,
Pádraig.
DJ Lucas
2010-11-28 02:18:29 UTC
Permalink
Post by Paul Eggert
Could you please try this little patch? It should fix your
problem. I came up with this fix in my sleep (literally!
I woke up this morning and the patch was in my head), but
haven't had time to look at the code in this area to see
if it's the best fix.
lfs [ /lfs-source-archive/coreutils-8.7-new/src ]$ cat
/lfs-source-archive/cracklib-words-20080507 | sort -u > /dev/null; echo $?
0
lfs [ /lfs-source-archive/coreutils-8.7-new/src ]$

Appears to work as expected. Thanks for jumping on this so quickly.

-- DJ Lucas
DJ Lucas
2010-11-29 07:14:21 UTC
Permalink
Post by DJ Lucas
lfs [ /lfs-source-archive/coreutils-8.7-new/src ]$ cat
/lfs-source-archive/cracklib-words-20080507 | sort -u > /dev/null; echo $?
0
lfs [ /lfs-source-archive/coreutils-8.7-new/src ]$
Appears to work as expected. Thanks for jumping on this so quickly.
Okay, so that fixes the segfault for both the original example from the
Red Hat bug and the example I submitted. However, CPU still shows
100% utilization in the original test (from the Red Hat bugzilla), and
the test that I believe (not sure of the quoting) was submitted by Pádraig
Brady still fails. I don't have a link to the original (maybe a private
message), but it is quoted here:
http://lists.gnu.org/archive/html/coreutils/2010-11/msg00124.html. I ran
only the quoted test. Unrelated bug? (Again, not entirely sure of the
context.)

lfs [ lfs-source-archive ]$ seq 100000 > in
lfs [ lfs-source-archive ]$ mkfifo fifo
lfs [ lfs-source-archive ]$ (for i in $(seq 12); do read line; echo $i;
sleep .1; done
cat > /dev/null) < fifo &
[1] 16123
lfs [ lfs-source-archive ]$ (ulimit -t 1; sort in > fifo \
|| echo killed via $(env kill -l $(expr $? - 128)))
1
2
3
4
5
6
7
8
9
killed via KILL
lfs [ lfs-source-archive ]$ 10
11
12
^C
[1]+ Done ( for i in $(seq 12);
do
read line; echo $i; sleep .1;
done; cat > /dev/null ) < fifo

-- DJ Lucas
Pádraig Brady
2010-11-29 11:29:20 UTC
Permalink
Post by DJ Lucas
Post by DJ Lucas
lfs [ /lfs-source-archive/coreutils-8.7-new/src ]$ cat
/lfs-source-archive/cracklib-words-20080507 | sort -u > /dev/null; echo $?
0
lfs [ /lfs-source-archive/coreutils-8.7-new/src ]$
Appears to work as expected. Thanks for jumping on this so quickly.
Okay, so that fixes the segfault for both the original example from the
RedHat bug, and the example I submitted. However, CPU is still showing
100% utilization in the original test (from RedHat bz), and the test
that, I believe (not sure of the quoting), was submitted by Pádraig
Brady still fails. I don't have a link to the original (maybe private
http://lists.gnu.org/archive/html/coreutils/2010-11/msg00124.html. I ran
only the quoted test. Unrelated bug? (again, not entirely sure of the
context)
There was a suggested workaround for that test here:
https://bugzilla.redhat.com/show_bug.cgi?id=655096#c5
I've no time to try that at present though.

sorry,
Pádraig.
Paul Eggert
2010-11-29 19:16:24 UTC
Permalink
Post by DJ Lucas
http://lists.gnu.org/archive/html/coreutils/2010-11/msg00124.html
Ah, sorry, I didn't understand that message and thought Pádraig
had handled it. On an 8-core RHEL 5.5 x86-64 host I reproduced
the problem with the stated test case:

(for i in $(seq 12); do read line; echo $i; sleep .1; done
cat > /dev/null) < fifo &
(ulimit -t 1; ./sort in > fifo \
|| echo killed via $(env kill -l $(expr $? - 128)))

If I understand this other bug correctly, the problem is that one thread
is using a spin lock to wait for another thread, which in turn is waiting
to write to the pipe. Surely the simplest fix is to drop spin locks
entirely and use mutexes instead. Perhaps a better fix would be to
use mutexes at the top level (where threads can write to a file and
therefore can wait) and to use spin locks at lower levels (where
threads are merely storing into memory and thus can't wait).
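
As a concrete sketch of the first option, here are sort.c-style
lock_node/unlock_node helpers rewritten around a mutex; the struct below
is pared down to the lock member only and is purely illustrative, the
real change being the patch posted later in this thread:

#include <pthread.h>

struct merge_node_sketch
{
  pthread_mutex_t lock;     /* was: pthread_spinlock_t *lock */
  /* ... other merge-tree members ... */
};

static inline void
lock_node (struct merge_node_sketch *node)
{
  pthread_mutex_lock (&node->lock);    /* blocks (sleeps) instead of spinning */
}

static inline void
unlock_node (struct merge_node_sketch *node)
{
  pthread_mutex_unlock (&node->lock);
}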

Chen, do you have any advice on this as well? I can look into
either fix but am mildly inclined to take the simpler (albeit slower)
approach, at least at first. Here's that thread again:

http://lists.gnu.org/archive/html/bug-coreutils/2010-11/msg00209.html
Chen Guo
2010-11-30 00:34:13 UTC
Permalink
Hi all,
Post by Paul Eggert
Surely the simplest fix is to drop spin locks
entirely and use mutexes instead.  Perhaps a better fix would be to
use mutexes at the top level (where threads can write to a file and
therefore can wait) and to use spin locks at lower levels (where
threads are merely storing into memory and thus can't wait).
Chen, do you have any advice on this as well?  I can look into
either fix but am mildly inclined to take the simpler (albeit slower)
approach, at least at first.  Here's that thread again:
http://lists.gnu.org/archive/html/bug-coreutils/2010-11/msg00209.html
I haven't looked at the code in a couple of months, so I won't have a
definite answer until tonight, but off the top of my head I'm not sure
mixing spinlocks and mutexes will work, since they exist orthogonally.
That is, a thread can lock the mutex of a structure, but the spinlock
of the structure is still free to be acquired. The only way to ensure
the struct is locked down is to lock both the mutex and the spinlock,
in which case, what's the point?

The only way this would work is if, when a struct is locked via mutex,
the only threads trying to acquire the struct are trying to do so via
mutex, and no threads are looking to lock via spinlock. I'll take a
look when I get home tonight to see if this condition always holds.
Paul Eggert
2010-11-30 00:39:44 UTC
Permalink
Post by Chen Guo
The only way this would work is if, when a struct is locked via mutex the only
threads trying to acquire the struct are trying to do so via mutex,
and no threads
are looking to lock via spinlock.
Yes, that's definitely the idea. Under either of my
proposals, a particular object would be protected either
by a spinlock, or by a mutex, but never by both; and the
decision as to whether to use a spinlock or a mutex would
be made by object-creation time at the latest.

Thanks for looking into it.
Paul Eggert
2010-11-30 07:38:25 UTC
Permalink
Post by Chen Guo
Hi guys,
Is something up with Savannah? I just tried a git clone and got
connection time out; I can't even reach git.sv.gnu.org via ping.
There was a break-in, which led to the leaking of encrypted account
passwords, some of them discovered via a brute-force attack.
It's pretty bad. They're reinstalling and restoring the data
from a November 24 backup. No firm ETA yet. Expect an
official announcement soon.

You can find the current status at <http://savannah.gnu.org/>.
Jim Meyering
2010-12-01 07:37:14 UTC
Permalink
Post by Paul Eggert
Hi guys,
Is something up with Savannah? I just tried a git clone and got
connection time out; I cant even reach git.sv.gnu.org via ping.
There was a breakin, which led to leaking of encrypted account
passwords, some of them discovered via a brute-force attack.
Note that it was web passwords.
Post by Paul Eggert
It's pretty bad. They're reinstalling and restoring the data
from a November 24 backup. No firm ETA yet. Expect an
official announcement soon.
You can find the current status at <http://savannah.gnu.org/>.
Here's the chronology:

http://www.fsf.org/blogs/sysadmin/savannah-and-www.gnu.org-downtime
Chen Guo
2010-11-30 04:32:36 UTC
Permalink
Hi guys,
Is something up with Savannah? I just tried a git clone and got
connection time out; I can't even reach git.sv.gnu.org via ping.
I'll try again at work tomorrow.
Chen Guo
2010-12-02 10:22:58 UTC
Permalink
Hi Professor Eggert,
 (for i in $(seq 12); do read line; echo $i; sleep .1; done
 cat > /dev/null) < fifo &
 (ulimit -t 1; ./sort in > fifo \
 || echo killed via $(env kill -l $(expr $? - 128)))
I ran this 10 times or so on an i7 and couldn't trigger anything. Is
seq 12 supposed to vary depending on the number of cores?
Paul Eggert
2010-12-02 17:48:56 UTC
Permalink
Post by Chen Guo
Post by Paul Eggert
(for i in $(seq 12); do read line; echo $i; sleep .1; done
cat > /dev/null) < fifo &
(ulimit -t 1; ./sort in > fifo \
|| echo killed via $(env kill -l $(expr $? - 128)))
I ran this 10 times or so on an i7 and couldn't trigger anything. Is
seq 12 supposed to vary depending on the number of cores?
I imagine it does. It's timing-dependent; I can't always reproduce
it on my test host (Intel Xeon E5620 2.4 GHz).

I just now realized that the above doesn't say what "in" is; it's
the output of "seq 100000".

Also, it may help to use an explicit --parallel=2 or whatever.

What happens if you do the following with the latest git version
(savannah's back up, by the way)?

cd tests && make check TESTS=misc/sort-spinlock-abuse

This gets an XFAIL fairly reliably on my test host.
Jim Meyering
2010-12-02 21:42:57 UTC
Permalink
Post by Chen Guo
Hi Professor Eggert,
 (for i in $(seq 12); do read line; echo $i; sleep .1; done
 cat > /dev/null) < fifo &
 (ulimit -t 1; ./sort in > fifo \
 || echo killed via $(env kill -l $(expr $? - 128)))
I ran this 10 times or so on an i7 and couldn't trigger anything. Is
seq 12 supposed to vary depending on the number of cores?
Hi Chen,

Here's a stand-alone command-line test using the sort in your PATH:

rm -f in fifo
seq 100000 > in
mkfifo fifo
(for i in $(seq 12); do read line; echo $i; sleep .1; done
cat > /dev/null) < fifo &
(ulimit -t 1; sort in > fifo \
|| echo killed via $(env kill -l $(expr $? - 128)))

The 12 x 0.1-second sleeps are to ensure that the busy-waiting
sort will accumulate more than 1.0 seconds of CPU time, and thus
surpass the CPU time limit imposed by "ulimit -t 1".

With more processors, you may use a number smaller than 12.
Running on a 6-core i7, I see the "killed via XCPU" message
anywhere between the "1" and "4" output lines. E.g.,

1
2
killed via XCPU
$ 3 :
4
5
6
7
8
9
10
11
12
Chen Guo
2010-12-03 20:18:46 UTC
Permalink
Thanks Jim, that helped a lot.

I'll try out Professor Eggert's suggestion of switching to mutexes
only at the top-level merge. Of the following approaches, which would
you guys consider better practice?

1) void pointer, cast as either mutex or spinlock in lock function
2) union of mutex and spinlock, chosen in lock function

The idea is basically that, in the lock function, if we see the node is
at MERGE_LEVEL_ROOT we'd lock the mutex (either via the cast or the
union member), and likewise use the spinlock for the other merge levels.
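
For concreteness, option 2 might look roughly like the sketch below;
the type names and the MERGE_ROOT value are assumptions, and this
hybrid design is only being weighed here, not something that ended up
in sort.c:

#include <pthread.h>

enum { MERGE_ROOT = 1 };             /* assumed value, for illustration */

union node_lock
{
  pthread_mutex_t mutex;             /* top level: may block on output */
  pthread_spinlock_t spin;           /* lower levels: memory-only merges */
};

struct hybrid_node
{
  unsigned int level;
  union node_lock lock;
};

static void
hybrid_lock (struct hybrid_node *node)
{
  if (node->level <= MERGE_ROOT)
    pthread_mutex_lock (&node->lock.mutex);
  else
    pthread_spin_lock (&node->lock.spin);
}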
Paul Eggert
2010-12-03 21:10:58 UTC
Permalink
Post by Chen Guo
I'll try out Professor Eggert's suggestion, of switching to mutexes
only at the top level merge.
I'm having second thoughts about that. Yes, that'll prevent the
top-level merge (which is generating the actual output) from chewing
up CPU time. But it already has that property, since it's outputting
to stdout. And if second-level merges use mutexes, then the third-level
merges will spin. We'll have to use mutexes at all levels, unless
I'm missing something.

How about this idea instead. Keep using spin locks everywhere, but
have the top-level merge output to memory, as the lower merges already
do. The main thread can wait for the top level merge to finish and
then generate output. That way, none of the merges will have to wait
on an output pipe (or a slow output file).

Either option (either switch to mutexes everywhere, or have the top-level
merge go to memory) should work. Perhaps we should try both and benchmark
them.
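
A minimal, self-contained illustration of the second option (the
top-level merge fills an in-memory buffer, and only the main thread,
after joining, touches the output); no sort.c names are used here:

#include <pthread.h>
#include <stdio.h>

enum { N = 4 };
static int merged[N];               /* stands in for the merged lines */

static void *
top_level_merge (void *arg)
{
  for (int i = 0; i < N; i++)       /* "merge" into memory only */
    merged[i] = i;
  (void) arg;
  return NULL;
}

int
main (void)
{
  pthread_t t;
  pthread_create (&t, NULL, top_level_merge, NULL);
  pthread_join (&t, NULL);          /* workers never touch the output pipe */
  for (int i = 0; i < N; i++)
    printf ("%d\n", merged[i]);     /* output happens only here */
  return 0;
}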
Chen Guo
2010-12-06 05:16:20 UTC
Permalink
Hi Professor Eggert,
Post by Paul Eggert
Either option (either switch to mutexes everywhere, or have the top-level
merge go to memory) should work.  Perhaps we should try both and benchmark
them.
Test machine is a 4-core i7. The numbers I'm giving are averaged
over 20 runs, given in seconds, and are of the form
elapsed / (user + system).

spinlock:
1 thread: 3.354 / 3.349
2 threads: 1.960 / 3.812
4 threads: 1.366 / 5.085

mutex:
1 thread: 3.354 / 3.350
2 threads: 2.062 / 3.628
4 threads: 1.497 / 4.172

spinlock, output after sort:
1 thread: 3.519 / 3.517
2 threads: 2.098 / 3.996
4 threads: 1.488 / 5.347

It seems if we have to choose between mutex and output post-sort,
mutex is the way to go. Mutex is faster in the single-threaded case,
and while the multithreaded elapsed times are negligibly different,
the user time of the output-post-sort variant is much greater. With
spinlocks only, the greater system time was justified (though some
might disagree) by the lower elapsed time. With spinlocks outputting
post-sort, there is no longer any justification for the higher user
time.

Before saying anything else, I should note that for mutexes, on 4
threads, 20% of the time there's a segfault on a seemingly innocuous
line in queue_insert ():
node->queued = true

GDB shows that the pointers all look normal, and I could not trigger
this over 10 runs with valgrind (it seems valgrind effectively
serializes the threads). If we do decide to go back to mutexes, I'll
look into this issue more.
Paul Eggert
2010-12-06 07:01:27 UTC
Permalink
Post by Chen Guo
Before saying anything else, I should note that for mutexes, on 4
threads 20% of the time there's a segfault on a seemingly innocuous
node->queued = true
It does sound like mutexes are the way to go, and that this bug
needs to be fixed. I assume that this is the call to queue_insert
from queue_check_insert_parent? What's the backtrace? (And
what patch are you using, to get mutexes?)
Chen Guo
2010-12-06 08:25:26 UTC
Permalink
Hi Professor Eggert,
Post by Paul Eggert
Post by Chen Guo
Before saying anything else, I should note that for mutexes, on 4
threads 20% of the time there's a segfault on a seemingly innocuous
  node->queued = true
It does sound like mutexes are the way to go, and that this bug
needs to be fixed.  I assume that this is the call to queue_insert
from queue_check_insert_parent?  What's the backtrace?  (And
what patch are you using, to get mutexes?)
I've attached the patch (inlined at the bottom). Here's the GDB
crash, with backtrace. I also printed node->queued in GDB, so it's
evident that it's accessible.

(gdb) run --parallel 2 rec_1M > /dev/null
Starting program: /data/chen/Coding/Coreutils/test/sort-mutex
--parallel 2 rec_1M > /dev/null
[Thread debugging using libthread_db enabled]
[New Thread 0x7fffcbb95710 (LWP 19850)]

Program received signal SIGSEGV, Segmentation fault.
queue_insert (queue=0x7fffffffe0c0, node=0x7ffff7ddc560) at sort.c:3202
3202 node->queued = true;
(gdb) bt
#0 queue_insert (queue=0x7fffffffe0c0, node=0x7ffff7ddc560) at sort.c:3202
#1 0x0000000000405844 in queue_check_insert_parent (lines=<value
optimized out>, dest=<value optimized out>, nthreads=<value optimized
out>, total_lines=<value optimized out>, parent=<value optimized out>,
lo_child=<value optimized out>, queue=0x7fffffffe0c0,
tfp=0x7ffff7bb9780, temp_output=0x0) at sort.c:3340
#2 merge_loop (lines=<value optimized out>, dest=<value optimized
out>, nthreads=<value optimized out>, total_lines=<value optimized
out>, parent=<value optimized out>, lo_child=<value optimized out>,
queue=0x7fffffffe0c0,
tfp=0x7ffff7bb9780, temp_output=0x0) at sort.c:3374
#3 sortlines (lines=<value optimized out>, dest=<value optimized
out>, nthreads=<value optimized out>, total_lines=<value optimized
out>, parent=<value optimized out>, lo_child=<value optimized out>,
queue=0x7fffffffe0c0,
tfp=0x7ffff7bb9780, temp_output=0x0) at sort.c:3515
#4 0x0000000000405c2b in sortlines (lines=0x7ffff7344030, dest=<value
optimized out>, nthreads=<value optimized out>, total_lines=<value
optimized out>, parent=<value optimized out>, lo_child=<value
optimized out>,
queue=0x7fffffffe0c0, tfp=0x7ffff7bb9780, temp_output=0x0) at sort.c:3494
#5 0x00000000004095e8 in sort (argc=<value optimized out>,
argv=<value optimized out>) at sort.c:3804
#6 main (argc=<value optimized out>, argv=<value optimized out>) at sort.c:4576
(gdb) print node
$1 = (struct merge_node *) 0x7ffff7ddc560
(gdb) print node->queued
$2 = false
Jim Meyering
2010-12-07 10:06:53 UTC
Permalink
Chen Guo wrote:
...
Post by Chen Guo
I've attached the patch (inlined at the bottom). Here's the GDB
crash, with backtrace. I also printed node->queued in GDB, so it's
evident that its accessible.
(gdb) run --parallel 2 rec_1M > /dev/null
Starting program: /data/chen/Coding/Coreutils/test/sort-mutex
--parallel 2 rec_1M > /dev/null
[Thread debugging using libthread_db enabled]
[New Thread 0x7fffcbb95710 (LWP 19850)]
Program received signal SIGSEGV, Segmentation fault.
queue_insert (queue=0x7fffffffe0c0, node=0x7ffff7ddc560) at sort.c:3202
3202 node->queued = true;
(gdb) bt
...
Post by Chen Guo
(gdb) print node
$1 = (struct merge_node *) 0x7ffff7ddc560
(gdb) print node->queued
$2 = false
Could it be that you're looking at one thread,
while the segfault happened in another?
In my core dump, the offending "node" pointer had a value of 5.

...
Post by Chen Guo
Subject: [PATCH] sort: change spinlocks to mutexes.
* src/sort.c: (struct merge_node) Change member lock to mutex. All
uses changed.
Hi Chen,

Thanks for the patch.
Since this fixes a bug, I've added a NEWS entry.
Also, since the sort-spinlock-abuse test now passes with your patch,
I've adjusted tests/Makefile.am to reflect that.
The segfault may be easier to reproduce with mutexes,
but I've seen the same one also with spinlocks (albeit rarely).

Here's the amended patch:
Jim Meyering
2010-12-07 11:24:57 UTC
Permalink
Post by Chen Guo
Hi Professor Eggert,
Post by Paul Eggert
Post by Chen Guo
Before saying anything else, I should note that for mutexes, on 4
threads 20% of the time there's a segfault on a seemingly innocuous
  node->queued = true
It does sound like mutexes are the way to go, and that this bug
needs to be fixed.  I assume that this is the call to queue_insert
from queue_check_insert_parent?  What's the backtrace?  (And
what patch are you using, to get mutexes?)
I've attached the patch (inlined at the bottom). Here's the GDB
crash, with backtrace. I also printed node->queued in GDB, so it's
evident that its accessible.
(gdb) run --parallel 2 rec_1M > /dev/null
Starting program: /data/chen/Coding/Coreutils/test/sort-mutex
--parallel 2 rec_1M > /dev/null
Hi Chen,

Thanks. What does your input file look like?
I've been unable to reproduce the failure using the output of
seq 1000000. I've tried a few different -S ... options, in case
the amount of available memory is a factor:

seq 1000000 > in-1M
for i in $(seq 1000); do \
printf '%03d\r' $i; src/sort --parallel=2 -S 1M in-1M > /dev/null; done
Chen Guo
2010-12-07 21:01:56 UTC
Permalink
Hi Jim,
Post by Jim Meyering
Hi Chen,
Thanks.  What does your input file look like?
I've been unable to reproduce the failure using the output of
seq 1000000.  I've tried a few different -S ... options, in case
 seq 1000000 > in-1M
 for i in $(seq 1000); do \
   printf '%03d\r' $i; src/sort --parallel=2 -S 1M in-1M > /dev/null; done
The input file I used was generated with gensort
(http://www.ordinal.com/gensort.html).
I used -a 1000000 to generate a 1-million-line ASCII file. Running
sort 10 times on 2 or 4 threads almost always triggered at least 1
segfault.
Jim Meyering
2010-12-09 11:31:14 UTC
Permalink
[I've retitled and Cc'd bug-coreutils so this gets its own bug number]
Post by Chen Guo
Post by Jim Meyering
Thanks.  What does your input file look like?
I've been unable to reproduce the failure using the output of
seq 1000000.  I've tried a few different -S ... options, in case
 seq 1000000 > in-1M
 for i in $(seq 1000); do \
   printf '%03d\r' $i; src/sort --parallel=2 -S 1M in-1M > /dev/null; done
The input file I used was generated with gensort
(http://www.ordinal.com/gensort.html)
I used the -a 1000000 to generate a 1 million line ASCII file. Running
sort 10 times on 2 or 4 threads almost always triggered at least 1
segfault.
Thank you.
With that, I've solved at least part of the problem.
The segfault (and other strangeness we've witnessed)
arises because each "node" struct is stored on the stack,
and its address ends up being used by another thread after
the thread that owns the stack in question has been "joined".

My solution is to use the heap instead of the stack.
However, for today I'm out of time and I have not yet found a
way to free these newly-malloc'd "node" buffers.
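
A generic illustration of that lifetime bug, with made-up names rather
than sort.c's: a node on a worker thread's stack stays reachable through
a saved pointer after the thread has been joined, so later dereferences
touch dead stack memory.

#include <pthread.h>

struct node { int queued; };

static struct node *saved_node;     /* outlives the worker's stack */

static void *
worker (void *arg)
{
  struct node n = { 0 };            /* BUG: lives on this thread's stack */
  saved_node = &n;
  (void) arg;
  return NULL;                      /* n's storage dies with the thread */
}

int
main (void)
{
  pthread_t t;
  pthread_create (&t, NULL, worker, NULL);
  pthread_join (&t, NULL);
  saved_node->queued = 1;           /* use after the stack is gone: undefined */
  return 0;
}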

To test this, I've done the following:

gensort -a 10000 > gensort-10k
for i in $(seq 2000); do printf '% 4d\n' $i; valgrind --quiet src/sort -S 100K \
--parallel=2 gensort-10k > k; test $(wc -c < k) = 1000000 || break; done
for i in $(seq 2000); do printf '% 4d\n' $i; src/sort -S 100K \
--parallel=2 gensort-10k > j; test $(wc -c < j) = 1000000 || break; done

Without the patch, the first would show errors for more than 50% of
the runs and the second would rarely get to i=100 without generating
a core file. With the patch, both complete error-free (not counting
leaks).

I'll add tests and a NEWS item, eventually.
Jim Meyering
2010-12-09 21:33:42 UTC
Permalink
Jim Meyering wrote:
...
Post by Jim Meyering
With that, I've solved at least part of the problem.
The segfault (and other strangeness we've witnessed)
arises because each "node" struct is stored on the stack,
and its address ends up being used by another thread after
the thread that owns the stack in question has been "joined".
My solution is to use the heap instead of the stack.
However, for today I'm out of time and I have not yet found a
way to free these newly-malloc'd "node" buffers.
gensort -a 10000 > gensort-10k
for i in $(seq 2000); do printf '% 4d\n' $i; valgrind --quiet src/sort -S 100K \
--parallel=2 gensort-10k > k; test $(wc -c < k) = 1000000 || break; done
for i in $(seq 2000); do printf '% 4d\n' $i; src/sort -S 100K \
--parallel=2 gensort-10k > j; test $(wc -c < j) = 1000000 || break; done
Without the patch, the first would show errors for more than 50% of
the runs and the second would rarely get to i=100 without generating
a core file. With the patch, both complete error-free (not counting
leaks).
FYI, while preparing a test, I've found that the latter test
(without valgrind) passes 2000/2000 tests when compiled with -g -O2,
yet fails in at least 10 of the 2000 when compiled with -ggdb3.
Paul Eggert
2010-12-09 18:26:28 UTC
Permalink
Post by Jim Meyering
The segfault (and other strangeness we've witnessed)
arises because each "node" struct is stored on the stack,
and its address ends up being used by another thread after
the thread that owns the stack in question has been "joined".
Ah, of *course*!
Post by Jim Meyering
My solution is to use the heap instead of the stack.
This can be simplified by allocating all the nodes at once,
since the number of nodes is bounded above by the number
of threads. I'll take a look at this, if nobody else gets
to it first.
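
A rough sketch of that idea, using illustrative names rather than the
ones eventually committed; per the bound stated above, one node per
thread suffices, so a single allocation replaces the per-thread stack
objects:

#include <stdlib.h>

struct tree_node
{
  struct tree_node *parent;
  unsigned int level;
};

static struct tree_node *
alloc_merge_tree (size_t nthreads)
{
  /* A real implementation would use xcalloc and check for overflow.  */
  return calloc (nthreads, sizeof (struct tree_node));
}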

I am also still looking at the problem with the compression/decompression
hang. The code to do subprocesses is busted in more than
one way: it does not always catch failures (nonzero exit status)
in decompression, and it can falsely report
failure even when compression and decompression are
both perfectly OK. However, I still don't have a handle
on the actual bug that causes the hang.
Chen Guo
2010-12-10 21:32:26 UTC
Permalink
+  size_t nlines = (lo_child)? node->parent->nlo : node->parent->nhi;
Small change... remove an unnecessary branch and indirection in sortlines:

size_t nlines = node->nlo + node->nhi;
Paul Eggert
2010-12-11 08:35:20 UTC
Permalink
Thanks, Chen, that was much nicer than what I was writing.
I did some minor cleanups, mostly to do with catching an
unlikely integer overflow that would have made 'sort' crash badly,
and pushed the following set of patches.
Jim Meyering
2010-12-11 11:09:17 UTC
Permalink
Post by Paul Eggert
Thanks, Chen, that was much nicer than what I was writing.
I did some minor cleanups, mostly to do with catching an
unlikely integer overflow that would have made 'sort' crash badly,
and pushed the following set of patches.
Thanks for helping, but...

Chen's log message would have been appropriate if that patch had been
rebased to apply after my stack->heap bug fix. However, as a replacement
for my fix, the description is lacking, since it says nothing about fixing
the hard-to-reproduce (and harder-to-diagnose) segfault-inducing bug.
Plus, the change set includes no NEWS or test suite addition.

With each bug fix it is best to describe
or at least mention the bug in the commit log.
Also, there should be a NEWS entry and, preferably,
a test or two to exercise the bug.

Here at least is the NEWS addition and log from what I posted Thursday.
I've also fixed a few syntax nits separately:
Paul Eggert
2010-12-11 20:00:48 UTC
Permalink
Sorry for botching the NEWS and the change log. To help
make amends, how about if I add a test case for that?
I'm thinking of the 2nd test case in
<http://lists.gnu.org/archive/html/bug-coreutils/2010-12/msg00043.html>,
namely this one:

gensort -a 10000 > gensort-10k
for i in $(seq 2000); do printf '% 4d\n' $i; src/sort -S 100K \
--parallel=2 gensort-10k > j; test $(wc -c < j) = 1000000 || break; done

Or if you have a better one in mind, please let me know.

There are also some test cases I need to add for the
(unrelated) sort-compression bug, which is next on my
list of coreutils bugs to look at.
Jim Meyering
2010-12-12 15:41:07 UTC
Permalink
Post by Paul Eggert
Sorry for botching the NEWS and the change log. To help
make amends, how about if I add a test case for that?
That would be welcome. Thanks.
Post by Paul Eggert
I'm thinking of the 2nd test case in
<http://lists.gnu.org/archive/html/bug-coreutils/2010-12/msg00043.html>,
gensort -a 10000 > gensort-10k
for i in $(seq 2000); do printf '% 4d\n' $i; src/sort -S 100K \
--parallel=2 gensort-10k > j; test $(wc -c < j) = 1000000 || break; done
That sounds good, assuming it triggers the bug reliably for you.
I was hoping to find a way to reproduce it without relying on gensort,
but won't object if you want to do that.

I ran the above a few times on a 6-core i7 970 and it would usually
fail only 2 or 3 times out of 2000. For one trial, it failed
only once, and that was on the 1932nd iteration.

Interestingly, when I run enough of those 2-process jobs in parallel to keep
all nominal 12 processors busy, I get 80-100 failures in 10,000 trials.

cat <<\EOF > sort-test
#!/bin/bash
exp_output=$1
t=$(mktemp) || exit 2
src/sort --parallel=2 -S 100k in > $t || exit 3
cmp $t $exp_output || exit 4
rm -f $t
exit 0
EOF
chmod a+x sort-test

i.e., by running that little script via GNU parallel:

export TMPDIR=/dev/shm/z; mkdir $TMPDIR
gensort -a 1000 in; sort in > exp; seq 10000 \
| parallel --eta -j $(($(nproc)/2)) ./sort-test exp

This shows how many failed of the 10,000:

ls -1 $TMPDIR/tmp.*|wc -l

I'm not suggesting to use GNU parallel in the test,
but if you have a few spare cores or systems, it does
make it easier to run large parallel tests in an attempt
to find something more reliable.

I noticed that decreasing the input size to 1000 lines
had little effect on the number of failures, but obviously
makes the test less expensive.
Post by Paul Eggert
Or if you have a better one in mind, please let me know.
There are also some test cases I need to add for the
(unrelated) sort-compression bug, which is next on my
list of coreutils bugs to look at.
It would be great to fix that for 8.8, too,
but don't worry if you don't get to it soon.
I can always make an 8.9 not long afterward.
Perhaps very few people use --compress-program=PROG,
or this is due to a latent problem that is showing up only now.
Paul Eggert
2010-12-12 21:42:31 UTC
Permalink
Post by Jim Meyering
That sounds good, assuming it triggers the bug reliably for you.
I was hoping to find a way to reproduce it without relying on gensort,
but won't object if you want to do that.
In my attempts to reproduce the problem, it's pretty flaky.
I think it depends on how busy the operating system is.
Sometimes I'd get failures all the time; sometimes, almost
never. (This was with valgrind; I had much less luck without
valgrind.)

Anyway, I pushed this, which seemed to work well enough
on my host. It prefers gensort if available, but falls
back on seq+shuf if not.
Jim Meyering
2010-12-13 07:30:44 UTC
Permalink
Post by Paul Eggert
Post by Jim Meyering
That sounds good, assuming it triggers the bug reliably for you.
I was hoping to find a way to reproduce it without relying on gensort,
but won't object if you want to do that.
In my attempts to reproduce the problem, it's pretty flaky.
I think it depends on how busy the operating system is.
Sometimes I'd get failures all the time; sometimes, almost
never. (This was with valgrind; I had much less luck without
valgrind.)
Anyway, I pushed this, which seemed to work well enough
on my host. It prefers gensort if available, but falls
back on seq+shuf if not.
Subject: [PATCH] tests: test for access to stale thread memory
* tests/misc/sort-stale-thread-mem: New tests.
* tests/Makefile.am (TESTS): Add it.
Thank you!
I did this, too:
Paul Eggert
2010-12-13 18:04:25 UTC
Permalink
My recent patch had a typo in a comment, which I fixed as follows:
Paul Eggert
2010-12-16 22:06:01 UTC
Permalink
Post by Jim Meyering
Post by Paul Eggert
There are also some test cases I need to add for the
(unrelated) sort-compression bug, which is next on my
list of coreutils bugs to look at.
It would be great to fix that for 8.8, too,
but don't worry if you don't get to it soon.
I've fixed that now, with the patch enclosed at the end of this
message. I was trying just to fix bug #2 mentioned in
<http://lists.gnu.org/archive/html/bug-coreutils/2010-12/msg00078.html>.
But I discovered that the patch also fixes bug #3 listed there, as
that bug was also present in the old reference-count implementation of
waiting for subprocesses. Bug #1 was fixed a day or so ago, so, as
far as I know, all the 'sort' bugs we've discussed in the last month
or so are fixed now.

The new test takes up to 28 minutes when it fails, so I marked it as
very expensive (I don't know what the boundary between "expensive"
and "very expensive" is, but I'm kind of an impatient guy....).
Pádraig Brady
2010-12-16 22:35:03 UTC
Permalink
Post by Paul Eggert
Post by Jim Meyering
Post by Paul Eggert
There are also some test cases I need to add for the
(unrelated) sort-compression bug, which is next on my
list of coreutils bugs to look at.
It would be great to fix that for 8.8, too,
but don't worry if you don't get to it soon.
I've fixed that now, with the patch enclosed
Chen Guo
2010-12-10 21:23:10 UTC
Permalink
Hi Professor Eggert,
Post by Paul Eggert
This can be simplified by allocating all the nodes at once,
since the number of nodes is bounded above by the number
of threads.  I'll take a look at this, if nobody else gets
to it first.
Got to it first :-)
This approach has the side benefit of letting me refactor that whole
node creation block to its own function, and thus cleaning up
sortlines quite a bit. Over 60 runs each on 2 and 4 threads, I am no
longer seeing any segfaults.
Assaf Gordon
2018-10-30 07:47:25 UTC
Permalink
(triaging old bugs)
Hello,

This long thread ( http://bugs.gnu.org/7489 )
deals with multiple parallel-sort bugs, resulting in many commits:
====
1d0a12037 Paul Eggert    2010-12-22 sort: minor performance tweak with num_processors
41159f960 Pádraig Brady  2010-12-20 maint: fix a typo in sort --parallel help message
0e181024c Pádraig Brady  2010-12-18 sort: use at most 8 threads by default
8e81a99c2 Paul Eggert    2010-12-16 sort: do not generate thousands of subprocesses for 16-way merge
1b31ce698 Paul Eggert    2010-12-16 sort: fix hang with sort --compress
f3c584d1e Paul Eggert    2010-12-16 sort: don't dump core when merging from input twice
14ad7a255 Paul Eggert    2010-12-14 sort: fix very-unlikely buffer overrun when merging to input file
8f40ed634 Paul Eggert    2010-12-14 sort: document --compress reaper fixes
0da4d8430 Paul Eggert    2010-12-13 sort: fix some --compress reaper bugs
ad61335bf Jim Meyering   2010-12-11 sort: avoid segfault when using two or more threads
9a9d69e9e Jim Meyering   2010-12-11 sort: syntax cleanup
27e997d0e Paul Eggert    2010-12-11 sort: integer overflow checks in thread counts, etc.
c9db0ac6d Chen Guo       2010-12-10 sort: preallocate merge tree nodes to heap.
d1f700355 Paul Eggert    2010-12-10 sort: comment fix
621876ff4 Chen Guo       2010-12-06 sort: use mutexes, not spinlocks (avoid busy loop on blocked output)
a1629ba1e Jim Meyering   2010-12-05 tests: remove useless definition of $SORT in sort-compress
cd00fa6ee Paul Eggert    2010-12-03 sort: merge_queue -> queue
fb282c7b3 Paul Eggert    2010-12-03 sort: clarify queue_check_insert
fb41e82c7 Paul Eggert    2010-12-03 sort: fix problems with merge node dest pointer
f2d977aff Paul Eggert    2010-12-03 sort: simplify write_unique
6f4279421 Paul Eggert    2010-12-03 sort: put queue arg first
f35f4b339 Paul Eggert    2010-12-03 sort: tune struct_merge_node slightly
eb989f4b7 Paul Eggert    2010-12-03 sort: Clarify comments
0ec869e8b Paul Eggert    2010-12-01 sort: fix bug on 64-bit hosts with at least 32768 processors
===

All the above were included in coreutils 8.8.

I assume the sort bugs are fixed, despite no 'final' message in the thread.

If there are no objections, I'll close this bug in a couple of days.

-assaf

Jim Meyering
2010-11-29 20:09:30 UTC
Permalink
Post by Paul Eggert
Could you please try this little patch? It should fix your
problem. I came up with this fix in my sleep (literally!
I woke up this morning and the patch was in my head), but
haven't had time to look at the code in this area to see
if it's the best fix.
Clearly there's at least one more bug as noted in my previous email
<http://lists.gnu.org/archive/html/bug-coreutils/2010-11/msg00216.html>
but I expect it's less likely to fire.
diff --git a/src/sort.c b/src/sort.c
index 7e25f6a..1aa1eb4 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -3226,13 +3226,13 @@ queue_pop (struct merge_node_queue *queue)
static void
write_unique (struct line const *line, FILE *tfp, char const *temp_output)
{
- static struct line const *saved = NULL;
+ static struct line saved;
if (!unique)
write_line (line, tfp, temp_output);
- else if (!saved || compare (line, saved))
+ else if (!saved.text || compare (line, &saved))
{
- saved = line;
+ saved = *line;
write_line (line, tfp, temp_output);
}
}
Nice.
That looks right to me.

FYI, what happens is the first fillbuf call gets a block
of lines, and "saved" ends up pointing at one of them.
The second fillbuf's fread races to overwrite a byte or two of the
saved->text pointer that is being dereferenced by the other thread.

The patch below adds a small test case to exercise it.
It demonstrates the failure in all but ~20 trials out of 1000 for me.
So far, I've reproduced this only on multi-core systems, with --parallel=2.
Chen Guo
2010-12-01 00:19:44 UTC
Permalink
On Tue, Nov 30, 2010 at 2:22 PM, Paul Eggert <***@cs.ucla.edu> wrote:
Hi Professor Eggert,
Anyway, perhaps Chen can review them (I don't have time
to test them right now).
I'll look at it as soon as Savannah's back up; I never actually pulled
from coreutils after the original patch was submitted. Professor
Eggert, could you detail how you can trigger the divide-by-zero bug?
Paul Eggert
2010-12-01 06:16:05 UTC
Permalink
Post by Chen Guo
could you detail how you can trigger the divide-by-zero bug?
Invoke MAX_MERGE(total, level) with level == 15.
2 << level yields 65536, and 65536 * 65536 overflows to zero.
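
A small stand-alone demonstration of that overflow, with the pre-fix
MAX_MERGE macro copied from sort.c and assuming 32-bit int as on the
hosts discussed in this thread:

#include <stdio.h>

#define MAX_MERGE(total, level) ((total) / ((2 << level) * (2 << level)) + 1)

int
main (void)
{
  int level = 15;
  /* 2 << 15 == 65536; in sort.c the product is computed in int, so
     65536 * 65536 overflows and the denominator becomes 0 (shown here
     with unsigned arithmetic to keep the demonstration well defined).  */
  unsigned int denom = (unsigned int) (2 << level) * (unsigned int) (2 << level);
  printf ("denominator for level 15: %u\n", denom);   /* prints 0 */
  /* Evaluating MAX_MERGE (total, 15) would therefore divide by zero.  */
  return 0;
}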
Paul Eggert
2010-12-02 05:57:11 UTC
Permalink
Post by Paul Eggert
Invoke MAX_MERGE(total, level) with level == 15.
2 << level yields 65536, and 65536 * 65536 overflows to zero.
I managed to reproduce this bug on a (faked) host with
32768 processors, using a command like this:

seq 1000000000 | sort --parallel=32768 -S 10G

The result was a floating point exception (actually, a division
by zero) and 'sort' crashed.

However, the bug is timing dependent and is very hard to
reproduce. I tried many more times to reproduce it, and
they all failed.

This proved to my satisfaction that it is a real bug, though,
so I pushed the following patch.
Paul Eggert
2010-11-30 22:22:14 UTC
Permalink
Post by Jim Meyering
Is there anything you'd like to add?
No, thanks, that looks good. I have some other patches
to clean things up in this area, but they can wait.
I hate to tease, so here is a draft of the cleanup patches.
Most of this stuff is cleanup, but the first line of the change
does fix a bug with very large sorts: when level > 14 on a
typical 64-bit host, there is an unwanted integer overflow
and 'sort' will divide by zero. Admittedly this bug is
unlikely (doesn't this mean you need thousands of threads?).
Anyway, perhaps Chen can review them (I don't have time
to test them right now).

Plus we have to fix the other bug. I wrote these cleanup patches
while trying to understand the code well enough to fix the other bug.

diff --git a/src/sort.c b/src/sort.c
index 1aa1eb4..708cc3c 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -107,7 +107,7 @@ struct rlimit { size_t rlim_cur; };
/* Maximum number of lines to merge every time a NODE is taken from
the MERGE_QUEUE. Node is at LEVEL in the binary merge tree,
and is responsible for merging TOTAL lines. */
-#define MAX_MERGE(total, level) ((total) / ((2 << level) * (2 << level)) + 1)
+#define MAX_MERGE(total, level) (((total) >> (((level) << 1) + 2)) + 1)

/* Heuristic value for the number of lines for which it is worth
creating a subthread, during an internal merge sort, on a machine
@@ -237,10 +237,10 @@ struct merge_node
struct line **dest; /* Pointer to destination of merge. */
size_t nlo; /* Total Lines remaining from LO. */
size_t nhi; /* Total lines remaining from HI. */
- size_t level; /* Level in merge tree. */
struct merge_node *parent; /* Parent node. */
+ unsigned int level; /* Level in merge tree. */
bool queued; /* Node is already in heap. */
- pthread_spinlock_t *lock; /* Lock for node operations. */
+ pthread_spinlock_t lock; /* Lock for node operations. */
};

/* Priority queue of merge nodes. */
@@ -3156,7 +3156,7 @@ compare_nodes (void const *a, void const *b)
static inline void
lock_node (struct merge_node *node)
{
- pthread_spin_lock (node->lock);
+ pthread_spin_lock (&node->lock);
}

/* Unlock a merge tree NODE. */
@@ -3164,7 +3164,7 @@ lock_node (struct merge_node *node)
static inline void
unlock_node (struct merge_node *node)
{
- pthread_spin_unlock (node->lock);
+ pthread_spin_unlock (&node->lock);
}

/* Destroy merge QUEUE. */
@@ -3240,69 +3240,38 @@ write_unique (struct line const *line, FILE *tfp, char const *temp_output)
/* Merge the lines currently available to a NODE in the binary
merge tree, up to a maximum specified by MAX_MERGE. */

-static size_t
+static void
mergelines_node (struct merge_node *restrict node, size_t total_lines,
FILE *tfp, char const *temp_output)
{
- struct line *lo_orig = node->lo;
- struct line *hi_orig = node->hi;
+ bool merge_to_dest = (MERGE_ROOT < node->level);
+ struct line const *lo_orig = node->lo;
+ struct line const *hi_orig = node->hi;
+ struct line *dest = *node->dest;
size_t to_merge = MAX_MERGE (total_lines, node->level);
- size_t merged_lo;
- size_t merged_hi;

- if (node->level > MERGE_ROOT)
+ while (to_merge--)
{
- /* Merge to destination buffer. */
- struct line *dest = *node->dest;
- while (node->lo != node->end_lo && node->hi != node->end_hi && to_merge--)
- if (compare (node->lo - 1, node->hi - 1) <= 0)
- *--dest = *--node->lo;
- else
- *--dest = *--node->hi;
-
- merged_lo = lo_orig - node->lo;
- merged_hi = hi_orig - node->hi;
-
- if (node->nhi == merged_hi)
- while (node->lo != node->end_lo && to_merge--)
- *--dest = *--node->lo;
- else if (node->nlo == merged_lo)
- while (node->hi != node->end_hi && to_merge--)
- *--dest = *--node->hi;
- }
- else
- {
- /* Merge directly to output. */
- while (node->lo != node->end_lo && node->hi != node->end_hi && to_merge--)
+ bool lo_exhausted = (node->lo == node->end_lo);
+ bool hi_exhausted = (node->hi == node->end_hi);
+ int cmp = lo_exhausted - hi_exhausted;
+ if (! cmp)
{
- if (compare (node->lo - 1, node->hi - 1) <= 0)
- write_unique (--node->lo, tfp, temp_output);
- else
- write_unique (--node->hi, tfp, temp_output);
+ if (lo_exhausted)
+ break;
+ cmp = compare (node->lo - 1, node->hi - 1);
}

- merged_lo = lo_orig - node->lo;
- merged_hi = hi_orig - node->hi;
-
- if (node->nhi == merged_hi)
- {
- while (node->lo != node->end_lo && to_merge--)
- write_unique (--node->lo, tfp, temp_output);
- }
- else if (node->nlo == merged_lo)
- {
- while (node->hi != node->end_hi && to_merge--)
- write_unique (--node->hi, tfp, temp_output);
- }
- node->dest -= lo_orig - node->lo + hi_orig - node->hi;
+ struct line const *lesser = (cmp <= 0 ? --node->lo : --node->hi);
+ if (merge_to_dest)
+ *--dest = *lesser;
+ else
+ write_unique (lesser, tfp, temp_output);
}

- /* Update NODE. */
- merged_lo = lo_orig - node->lo;
- merged_hi = hi_orig - node->hi;
- node->nlo -= merged_lo;
- node->nhi -= merged_hi;
- return merged_lo + merged_hi;
+ *node->dest = dest;
+ node->nlo -= lo_orig - node->lo;
+ node->nhi -= hi_orig - node->hi;
}

/* Insert NODE into QUEUE if it passes insertion checks. */
@@ -3328,13 +3297,11 @@ check_insert (struct merge_node *node, struct merge_node_queue *queue)
/* Update parent merge tree NODE. */

static void
-update_parent (struct merge_node *node, size_t merged,
- struct merge_node_queue *queue)
+update_parent (struct merge_node *node, struct merge_node_queue *queue)
{
if (node->level > MERGE_ROOT)
{
lock_node (node->parent);
- *node->dest -= merged;
check_insert (node->parent, queue);
unlock_node (node->parent);
}
@@ -3364,10 +3331,9 @@ merge_loop (struct merge_node_queue *queue,
queue_insert (queue, node);
break;
}
- size_t merged_lines = mergelines_node (node, total_lines, tfp,
- temp_output);
+ mergelines_node (node, total_lines, tfp, temp_output);
check_insert (node, queue);
- update_parent (node, merged_lines, queue);
+ update_parent (node, queue);

unlock_node (node);
}
@@ -3442,10 +3408,9 @@ sortlines (struct line *restrict lines, struct line *restrict dest,
struct line *lo = dest - total_lines;
struct line *hi = lo - nlo;
struct line **parent_end = (lo_child)? &parent->end_lo : &parent->end_hi;
- pthread_spinlock_t lock;
- pthread_spin_init (&lock, PTHREAD_PROCESS_PRIVATE);
struct merge_node node = {lo, hi, lo, hi, parent_end, nlo, nhi,
- parent->level + 1, parent, false, &lock};
+ parent, parent->level + 1, false, };
+ pthread_spin_init (&node.lock, PTHREAD_PROCESS_PRIVATE);

/* Calculate thread arguments. */
unsigned long int lo_threads = nthreads / 2;
@@ -3481,7 +3446,7 @@ sortlines (struct line *restrict lines, struct line *restrict dest,
merge_loop (merge_queue, total_lines, tfp, temp_output);
}

- pthread_spin_destroy (&lock);
+ pthread_spin_destroy (&node.lock);
}

/* Scan through FILES[NTEMPS .. NFILES-1] looking for a file that is
@@ -3758,16 +3723,15 @@ sort (char * const *files, size_t nfiles, char const *output_file,
struct merge_node_queue merge_queue;
queue_init (&merge_queue, 2 * nthreads);

- pthread_spinlock_t lock;
- pthread_spin_init (&lock, PTHREAD_PROCESS_PRIVATE);
struct merge_node node =
{NULL, NULL, NULL, NULL, NULL, buf.nlines,
- buf.nlines, MERGE_END, NULL, false, &lock};
+ buf.nlines, NULL, MERGE_END, false, };
+ pthread_spin_init (&node.lock, PTHREAD_PROCESS_PRIVATE);

sortlines (line, line, nthreads, buf.nlines, &node, true,
&merge_queue, tfp, temp_output);
queue_destroy (&merge_queue);
- pthread_spin_destroy (&lock);
+ pthread_spin_destroy (&node.lock);
}
else
write_unique (line - 1, tfp, temp_output);
Jim Meyering
2010-11-30 21:41:11 UTC
Permalink
Post by Jim Meyering
Post by Paul Eggert
Could you please try this little patch? It should fix your
problem. I came up with this fix in my sleep (literally!
I woke up this morning and the patch was in my head), but
haven't had time to look at the code in this area to see
if it's the best fix.
Clearly there's at least one more bug as noted in my previous email
<http://lists.gnu.org/archive/html/bug-coreutils/2010-11/msg00216.html>
but I expect it's less likely to fire.
diff --git a/src/sort.c b/src/sort.c
index 7e25f6a..1aa1eb4 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -3226,13 +3226,13 @@ queue_pop (struct merge_node_queue *queue)
static void
write_unique (struct line const *line, FILE *tfp, char const *temp_output)
{
- static struct line const *saved = NULL;
+ static struct line saved;
if (!unique)
write_line (line, tfp, temp_output);
- else if (!saved || compare (line, saved))
+ else if (!saved.text || compare (line, &saved))
{
- saved = line;
+ saved = *line;
write_line (line, tfp, temp_output);
}
}
Nice.
That looks right to me.
FYI, what happens is the first fillbuf call gets a block
of lines, and "saved" ends up pointing at one of them.
The second fillbuf's fread races to overwrite a byte or two of the
saved->text pointer that is being dereferenced by the other thread.
...

Hi Paul,

I'm getting ready to push something like the following
(I'll add a NEWS entry first, though)
Is there anything you'd like to add?
Jim Meyering
2010-11-29 20:14:03 UTC
Permalink
Post by Paul Eggert
Could you please try this little patch? It should fix your
problem. I came up with this fix in my sleep (literally!
I woke up this morning and the patch was in my head), but
haven't had time to look at the code in this area to see
if it's the best fix.
Clearly there's at least one more bug as noted in my previous email
<http://lists.gnu.org/archive/html/bug-coreutils/2010-11/msg00216.html>
but I expect it's less likely to fire.
I haven't tried to trigger that one yet.
Have you?
Paul Eggert
2010-11-29 22:46:35 UTC
Permalink
Post by Jim Meyering
I haven't tried to trigger that one yet.
Have you?
No, sorry, haven't had time. My current guess, by the way,
is that it's not a bug that can be triggered: it's merely
useless code that is harmless and can safely be removed.
(This is a guess that I also came up with in my sleep, so
take it for what it's worth. :-)
Paul Eggert
2010-12-04 08:43:18 UTC
Permalink
Post by Paul Eggert
My current guess, by the way,
is that it's not a bug that can be triggered: it's merely
useless code that is harmless and can safely be removed.
I removed it as part of the following series of cleanup
patches. These are intended merely to refactor the code
and simplify it a bit, to make it easier to fix the CPU
spinlock bug. Please feel free to undo anything that
looks at all questionable.

While we're on that subject, I assume that
if we want to fix it by doing the last pass in memory rather
than to stdout, we need to allocate a bit more memory at
the start, basically N * (1 + log2(N)) cells instead of
N * log2(N) cells. This is enough to hold the extra pass
of output. If this isn't right, Chen, please let me know.
Jim Meyering
2010-12-05 11:21:01 UTC
Permalink
Post by Paul Eggert
Post by Paul Eggert
My current guess, by the way,
is that it's not a bug that can be triggered: it's merely
useless code that is harmless and can safely be removed.
I removed it as part of the following series of cleanup
patches. These are intended merely to refactor the code
and simplify it a bit, to make it easier to fix the CPU
spinlock bug. Please feel free to undo anything that
looks at all questionable.
Hi Paul,

Thanks for all the clean-up.

I have no idea if the following is as a result of your changes,
since the segfault failure has been hard to reproduce.
It is from the sort-compress test, and has happened so far
only twice during "make -j9 check" on a quad-core F14 system:

Core was generated by `sort --compress-program=./dzip -S 1k in'.
Program terminated with signal 11, Segmentation fault.
#0 queue_check_insert (queue=0x7fffdbdc5620, node=0x5) at sort.c:3322
3322 if (! node->queued)
(gdb) p node
$1 = (struct merge_node *) 0x5
(gdb) bt
#0 queue_check_insert (queue=0x7fffdbdc5620, node=0x5) at sort.c:3322
#1 0x00000000004055a9 in queue_check_insert_parent (
lines=<value optimized out>, dest=<value optimized out>,
nthreads=140173261458952, total_lines=10, parent=<value optimized out>,
lo_child=<value optimized out>, queue=0x7fffdbdc5620, tfp=0x1c2fb90,
temp_output=0x1c2f72c "./sortpns55x") at sort.c:3340
#2 merge_loop (lines=<value optimized out>, dest=<value optimized out>,
nthreads=140173261458952, total_lines=10, parent=<value optimized out>,
lo_child=<value optimized out>, queue=0x7fffdbdc5620, tfp=0x1c2fb90,
temp_output=0x1c2f72c "./sortpns55x") at sort.c:3374
#3 sortlines (lines=<value optimized out>, dest=<value optimized out>,
nthreads=140173261458952, total_lines=10, parent=<value optimized out>,
lo_child=<value optimized out>, queue=0x7fffdbdc5620, tfp=0x1c2fb90,
temp_output=0x1c2f72c "./sortpns55x") at sort.c:3515
#4 0x00000000004059cb in sortlines_thread (data=<value optimized out>)
at sort.c:3428
#5 0x0000003f49806d5b in start_thread () from /lib64/libpthread-2.12.90.so
#6 0x0000003f48ce4aad in clone () from /lib64/libc-2.12.90.so

However, there is another failure that makes me suspicious
(also based on sort-compress):

seq -w 200000 > exp && tac exp > in
PATH=.:$PATH ./sort --compress-program=dzip -S 1k in > out

That gets stuck in waitpid (from sort.c's reap), waiting for a
dzip invocation that appears will never terminate. This is also
on that same 4-core system, and is relatively easy to reproduce,
so it should be easy to identify the offending change, but I'm
out of time, now.

The hang is also reproducible with just 2000 input lines,
but then it doesn't arise as consistently.

I'll note in passing that the spinlock CPU utilization problem
is particularly noticeable when using --compress-program= because
there is a lot more waiting.
Paul Eggert
2010-12-07 02:23:38 UTC
Permalink
Post by Jim Meyering
seq -w 200000 > exp && tac exp > in
PATH=.:$PATH ./sort --compress-program=dzip -S 1k in > out
That gets stuck in waitpid (from sort.c's reap), waiting for a
dzip invocation that appears will never terminate. This is also
on that same 4-core system, and is relatively easy to reproduce,
so it should be easy to identify the offending change, but I'm
out of time, now.
I've verified this bug. It occurs regardless of whether the
December 3 changes are applied, so it looks like it was an
earlier change. I've also verified that it occurs with
--parallel=1 so it seems unlikely it would be thread-related.

What happens is that 'sort' creates a pipe to a compressor
but then never closes the pipe. When it waits for the compressor
to finish there is an obvious deadlock.
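
For anyone unfamiliar with the pattern, a generic illustration of that
kind of deadlock (not sort.c's actual subprocess code): the parent
writes into a pipe read by a child, forgets to close its write end,
and then waits for the child, which never sees EOF.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main (void)
{
  int fd[2];
  if (pipe (fd) != 0)
    return 1;
  pid_t pid = fork ();
  if (pid == 0)
    {
      /* Child ("the compressor"): drain its input until EOF. */
      close (fd[1]);
      char buf[4096];
      while (0 < read (fd[0], buf, sizeof buf))
        continue;
      _exit (0);
    }
  write (fd[1], "data\n", 5);
  /* BUG: the parent should close (fd[1]) before waiting; without that,
     the child never sees EOF, so this waitpid blocks forever.  */
  waitpid (pid, NULL, 0);
  return 0;
}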

I'll see if I can track it down, but my guess it's somewhere
in the compressor/decompressor code. It's doing some lazy
evaluation and I wouldn't be surprised if that goes haywire
in some cases.

I'd guess there's also a bug in the thread code,
having to do with your core dump (and Chen's; they could be
two separate bugs).