Collectd Segmentation Fault


(Rafael) #1

Hi!

I was looking into building the OpenNebula frontend on Alpine Linux, first on x86_64, then on armhf.

Everything went pretty much OK in terms of building/installing.

However, I am dealing with a complex bug in execution.

I get the following segmentation fault:

Thread 6 "collectd" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 19301]
0x00005555555589fa in ListenerThread::monitor_loop (
    this=<error reading variable: Cannot access memory at address 0x7ffff7f81338>)
    at src/im_mad/collectd/ListenerThread.cc:55

This is basically a thread pool; each thread is run by a ListenerThread object. However, for some reason the “this” pointer becomes corrupt.

The thread spawning code in the ListenerPool seems OK, but something goes wrong between creating the thread and calling ListenerThread::monitor_loop.
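For readers unfamiliar with the pattern, here is a minimal sketch of a pool in which each thread is entered through a plain function that receives the object as a void pointer and then calls its member function. This is not the OpenNebula code: Worker, worker_main and run_pool are hypothetical names, and the loop body is a placeholder. The key assumption it illustrates is that the pointer handed to pthread_create must stay valid for as long as the thread runs.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>
#include <vector>

// Hypothetical stand-in for ListenerThread.
class Worker
{
public:
    Worker() : runs(0) {}

    // Placeholder body; a real loop would block reading from a socket.
    void monitor_loop() { ++runs; }

    int runs;
};

// Entry point handed to pthread_create: recover the object from the
// void* argument and call its member function.
void* worker_main(void* arg)
{
    Worker* w = static_cast<Worker*>(arg);
    w->monitor_loop();
    return 0;
}

// Starts one thread per worker, joins them all, and returns how many
// workers actually ran their loop.
std::size_t run_pool(std::size_t num)
{
    std::vector<Worker> workers(num);
    std::vector<pthread_t> ids(num);
    std::size_t started = 0;

    for (; started < num; ++started)
    {
        // &workers[started] stays valid because the vector is not
        // resized before the threads are joined.
        if (pthread_create(&ids[started], 0, worker_main,
                           &workers[started]) != 0)
        {
            break;
        }
    }

    for (std::size_t i = 0; i < started; ++i)
    {
        pthread_join(ids[i], 0);
    }

    std::size_t ran = 0;
    for (std::size_t i = 0; i < workers.size(); ++i)
    {
        ran += (workers[i].runs == 1) ? 1 : 0;
    }

    assert(ran <= num);
    return ran;
}
```

Calling run_pool(5) starts and joins five threads; if every thread reaches monitor_loop with a valid object, it returns 5.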

I have been debugging the code and noticed something strange: despite setting the thread pool size to 50, the ListenerThread constructor is called only once.

I find the following code as the place where ListenerThreads are instantiated:

    ListenerPool(int fd, int sock, size_t num)
        :listeners(num, ListenerThread(sock)), out_fd(fd), socket(sock){};

I am starting to wonder whether this initialization results in a vector with num pointers to the same instance or in a vector of num ListenerThread instances.

Can anyone familiar with the code point me in the right direction?

@ruben , do you know what the problem could be?


(Ruben S. Montero) #2

Could you please send the output of (in the gdb prompt):

thread apply all bt

(Rafael) #3

@ruben, here you have the output for 5 threads:

(gdb) run
Starting program: /srv/one/src/im_mad/collectd/collectd -t 5
[New LWP 883]
[New LWP 884]
[New LWP 885]
[New LWP 886]
[New LWP 887]
[New LWP 888]
[New LWP 889]

Thread 8 "collectd" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 889]
0x000055555555895e in ListenerThread::monitor_loop (this=<error reading variable: Cannot access memory at address 0x7ffff7f53338>) at src/im_mad/collectd/ListenerThread.cc:54
54	{
(gdb) thread apply all bt

Thread 8 (LWP 889):
#0  0x000055555555895e in ListenerThread::monitor_loop (this=<error reading variable: Cannot access memory at address 0x7ffff7f53338>)
    at src/im_mad/collectd/ListenerThread.cc:54
#1  0x0000555555558abd in listener_main (arg=0x7ffff7ffef40) at src/im_mad/collectd/ListenerThread.cc:85
#2  0x00007ffff7dc3a87 in ?? () from /lib/ld-musl-x86_64.so.1
#3  0x0000000000000000 in ?? ()

Thread 7 (LWP 888):
#0  0x00007ffff7dc5980 in __clone () from /lib/ld-musl-x86_64.so.1
#1  0x00007ffff7dc30f3 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007ffff7f82b30 in ?? ()
#3  0x00007ffff7f6a344 in ?? ()
#4  0x0000000000000000 in ?? ()

Thread 6 (LWP 887):
#0  0x00007ffff7dc5980 in __clone () from /lib/ld-musl-x86_64.so.1
#1  0x00007ffff7dc30f3 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007ffff7f99b30 in ?? ()
#3  0x00007ffff7f81344 in ?? ()
#4  0x0000000000000000 in ?? ()

Thread 5 (LWP 886):
#0  0x00007ffff7dc5980 in __clone () from /lib/ld-musl-x86_64.so.1
#1  0x00007ffff7dc30f3 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007ffff7fb0b30 in ?? ()
#3  0x00007ffff7f98344 in ?? ()
#4  0x0000000000000000 in ?? ()

Thread 4 (LWP 885):
#0  0x00007ffff7dc5980 in __clone () from /lib/ld-musl-x86_64.so.1
#1  0x00007ffff7dc30f3 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007ffff7fc7b30 in ?? ()
#3  0x00007ffff7faf344 in ?? ()
#4  0x0000000000000000 in ?? ()

Thread 3 (LWP 884):
#0  0x00007ffff7dc5980 in __clone () from /lib/ld-musl-x86_64.so.1
#1  0x00007ffff7dc30f3 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007ffff7fdeb30 in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 2 (LWP 883):
#0  0x00007ffff7dc5980 in __clone () from /lib/ld-musl-x86_64.so.1
#1  0x00007ffff7dc30f3 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007ffff7ff5b30 in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 1 (LWP 879):
#0  0x00007ffff7dc5980 in __clone () from /lib/ld-musl-x86_64.so.1
#1  0x00007ffff7dc30f3 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007ffff7ffdb90 in ?? () from /lib/ld-musl-x86_64.so.1
#3  0x0000000000000000 in ?? ()
(gdb) 

If you want to reproduce the problem yourself, try this Docker image; the sources are in /srv/one.

docker run -it privazio/on_one-x86_64

I just tried to run the image with http://labs.play-with-docker.com/ and it also happens there.

Or if you want to build the container yourself, this is the Dockerfile:

FROM privazio/alpine-x86_64

RUN apk update \
    && apk add alpine-sdk ruby ruby-dev ruby-irb ruby-rdoc sqlite sqlite-dev \
           openssh-client scons flex bison libxml2-dev curl-dev \
           xmlrpc-c xmlrpc-c-dev openssl-dev bash\
    && gem install --no-ri --no-rdoc sqlite3 bundler rake xmlrpc

ENV CXXFLAGS="-g3 -fno-inline -O0"

RUN cd /srv \
    && git clone https://github.com/privazio/one.git \
    && cd /srv/one/share/install_gems \
    && git checkout alpine \
    && bundler install --system

RUN cd /srv/one \
    && scons

RUN adduser -S oneadmin oneadmin -h /var/lib/one \
    && cd /srv/one \
    && ./install.sh -u oneadmin \
    && echo "oneadmin ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers

RUN apk add gdb

USER oneadmin

RUN mkdir /var/lib/one/.one \
    && echo "oneadmin:onepass" > /var/lib/one/.one/one_auth

I had to disable the mysql gem, because it doesn’t like some ruby API change…


(Rafael) #4

@ruben did you have a chance to look into this?

Any idea what the problem could be?


(Ruben S. Montero) #5

Sorry, I did not have time to check this. My best guess is that it is something related to the use of pthread in OpenNebula and the ARM implementation, or the behavior of libc++.

As for the vector of threads, it is initialized by copying one element to the vector, thus the default copy constructor should be invoked. I only see a potential problem with the mutex attribute…
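A small experiment can confirm this behavior. The sketch below is not OpenNebula code: Probe and distinct_elements are hypothetical names, and Probe only counts how it is built. It constructs a vector the same way ListenerPool does, with the (count, value) form of the std::vector constructor, and checks that the elements are separate objects.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for ListenerThread: it only records how it was built.
struct Probe
{
    static int constructed; // calls to Probe(int)
    static int copied;      // calls to Probe(const Probe&)

    int sock;

    explicit Probe(int s) : sock(s) { ++constructed; }
    Probe(const Probe& o) : sock(o.sock) { ++copied; }
};

int Probe::constructed = 0;
int Probe::copied = 0;

// Builds the vector with the (count, value) constructor and returns the
// number of distinct element addresses it holds.
std::size_t distinct_elements(std::size_t num, int sock)
{
    std::vector<Probe> listeners(num, Probe(sock));
    assert(listeners.size() == num);

    std::set<const Probe*> addresses;
    for (std::size_t i = 0; i < listeners.size(); ++i)
    {
        addresses.insert(&listeners[i]);
    }

    return addresses.size();
}
```

With num set to 5, the ordinary constructor runs once (for the temporary) and the copy constructor runs five times, so the vector holds five distinct objects rather than five pointers to one instance. That would explain why the ListenerThread constructor is seen only once in the debugger.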


(Rafael) #6

@ruben, for now I am testing on Alpine x86_64. This problem also happens on ARM, but let's first make it work on x86_64, as that should be easier.

I did not notice the copy constructor…

“this” not being defined definitely sounds like some kind of issue with different library implementations.

I will see if I can find out something… but if you can, please have a look. Note that with:

docker run -it privazio/on_one-x86_64

you get an environment where you can reproduce the problem; the sources are in /srv/one


(Ruben S. Montero) #7

I’ve tried the Docker image, but I was not able to debug the code; it complains about not being able to start the program and the like.

I’ve also added the copy constructor just in case, but got the same issue. Maybe you can try to add this in ListenerThread.h within class ListenerThread, as I’m not really sure if the environment I was using is reliable…

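ListenerThread(const ListenerThread& o)
{
    // Give each copy its own freshly initialized mutex instead of a
    // member-wise copy of the original's. Note that no other member
    // is copied from o by this constructor.
    pthread_mutex_init(&mutex,0);

    monitor_data.clear();

    monitor_data.reserve(BUFFER_SIZE);
}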
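To try the idea in isolation, here is a self-contained sketch of a class that owns a pthread mutex and gives every copy its own mutex. It is an illustration under assumptions, not the actual ListenerThread: the Listener class, the sock_fd member, the type of monitor_data, the BUFFER_SIZE value and the all_copies_usable helper are all hypothetical. Unlike the snippet above, it also copies the socket descriptor in the initializer list.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>
#include <string>
#include <vector>

// Hypothetical stand-in for ListenerThread.
class Listener
{
public:
    explicit Listener(int s) : sock_fd(s) { init(); }

    // Copy the plain data, but initialize a fresh mutex for the copy.
    Listener(const Listener& o) : sock_fd(o.sock_fd) { init(); }

    ~Listener() { pthread_mutex_destroy(&mutex); }

    int get_socket() const { return sock_fd; }

    // Returns true if this object's mutex can be locked and unlocked.
    bool lock_works()
    {
        if (pthread_mutex_lock(&mutex) != 0)
        {
            return false;
        }
        return pthread_mutex_unlock(&mutex) == 0;
    }

private:
    static const std::size_t BUFFER_SIZE = 100; // assumed value

    void init()
    {
        pthread_mutex_init(&mutex, 0);
        monitor_data.reserve(BUFFER_SIZE);
    }

    // Assignment would overwrite a live mutex, so it is declared but
    // never defined.
    Listener& operator=(const Listener&);

    int sock_fd;
    pthread_mutex_t mutex;
    std::vector<std::string> monitor_data; // assumed type
};

// Builds a pool with the (count, value) vector constructor and checks
// that every copy kept the socket and has a working mutex.
bool all_copies_usable(std::size_t num, int sock)
{
    std::vector<Listener> pool(num, Listener(sock));
    assert(pool.size() == num);

    for (std::size_t i = 0; i < pool.size(); ++i)
    {
        if (pool[i].get_socket() != sock || !pool[i].lock_works())
        {
            return false;
        }
    }
    return true;
}
```

If all_copies_usable(50, 7) returns true, each of the 50 copies holds the original socket value and a mutex of its own. This only exercises the copy step; it says nothing about whether the copy constructor is the cause of the segmentation fault.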

Cheers


(Ranjith Kumar) #8

Hi. How does collectd know the ID and name of the VM? Where is this stored in the individual VMs?