Re: [GENERAL] openvz and shared memory trouble

On 03/31/2014 04:12 AM, Willy-Bas Loos wrote:

On Sat, Mar 29, 2014 at 6:17 PM, Adrian Klaver
<adrian.klaver@xxxxxxxxxxx <mailto:adrian.klaver@xxxxxxxxxxx>> wrote:

    On 03/29/2014 08:19 AM, Willy-Bas Loos wrote:

        The error that shows up is a Bus error.
        That's on the replication slave.
        Here's the log about it:
        2014-03-29 12:41:33 CET db: ip: us: FATAL:  could not receive
        data from
        WAL stream: server closed the connection unexpectedly
                  This probably means the server terminated abnormally
                  before or while processing the request.

        cp: cannot stat
        `/data/postgresql/9.1/main/__wal_archive/__00000001000000720000000A':
        No
        such file or directory
        2014-03-29 12:41:33 CET db: ip: us: LOG:  unexpected pageaddr
        71/E9DA0000 in log file 114, segment 10, offset 14286848
        cp: cannot stat
        `/data/postgresql/9.1/main/__wal_archive/__00000001000000720000000A':
        No
        such file or directory
        2014-03-29 12:41:33 CET db: ip: us: LOG:  streaming replication
        successfully connected to primary
        2014-03-29 12:41:48 CET db: ip: us: LOG:  startup process (PID
        17452)
        was terminated by signal 7: Bus error
        2014-03-29 12:41:48 CET db: ip: us: LOG:  terminating any other
        active
        server processes
        2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos WARNING:
        terminating connection because of crash of another server process
        2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos DETAIL:  The
        postmaster has commanded this server process to roll back the
        current
        transaction and exit, because another server process exited
        abnormally
        and possibly corrupted shared memory.
        2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos HINT:  In a
        moment you should be able to reconnect to the database and
        repeat your
        command.


    Well what I am seeing are WAL log errors. One saying no file is
    present, the other pointing at a possible file corruption.

Those are normal notices, nothing to worry about.

Well, other than that they cause the standby to reconnect to the primary, during which a crash occurs.


    Shared memory problems are offered as a possible cause only. Right
    now I would say we are seeing only half the picture. The Postgres
    logs from the same time period for the primary server, as well as
    the system logs for the openvz container would help fill in the
    other half of the picture.


Here's the log from the primary postgres server:
2014-03-29 12:41:29 CET db:wbloos ip:[local] us:wbloos NOTICE:  ALTER
TABLE will create implicit sequence "test_x_seq" for serial column "test.x"
2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication
LOG:  SSL renegotiation failure
2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication
LOG:  SSL error: unexpected record
2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication
LOG:  could not send data to client: Connection reset by peer
2014-03-29 12:41:48 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication
LOG:  could not receive data from client: Connection reset by peer
2014-03-29 12:41:48 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication
LOG:  unexpected EOF on standby connection

(the SSL renegotiation failure happens all the time, without the crash)

And here's the syslog from the container:
Mar 29 12:41:01 mycontainer snmpd[8819]: Connection from UDP:
[xxx.xxx.xxx.xxx]:59090->[xxx.xxx.xxx.xxx]
Mar 29 12:42:30 mycontainer snmpd[8819]: Connection from UDP:
[xxx.xxx.xxx.xxx]:35949->[xxx.xxx.xxx.xxx]

The log on the host doesn't say anything interesting either.

    A cursory look at memory management in openvz shows it is different
    from other virtualization software and physical machines. Whether
    that is a problem would seem to be dependent on where you are on the
    learning curve:)

That sounds like "there is a solution to the problem, all you have to do
is find out what it is". There doesn't seem to be a variable in the
beancounters or anywhere else that can prevent the bus error from happening.
There seems to be no separate way of guaranteeing shared memory.
There's no OOM killer active either, nor is host or server running short
of memory.

At this point I am not sure it is even clear what is causing the error, so finding a solution would be a hit-or-miss affair at best.


I'm still worried that it's like Tom Lane said in another discussion: "So
basically, you've got a broken kernel here: it claimed to give PG circa
(135MB) of memory, but what's actually there is only about (128MB). I
don't see any connection between those numbers and the shmmax/shmall
settings, either --- so I think this must be some busted implementation
of a VM-level limitation."
(here:
http://www.postgresql.org/message-id/CAK3UJREBcyVBtr8D7vMfU=uDdkjXkrPnGcuy8erYB0tMfKe1LA@xxxxxxxxxxxxxx)

And it makes me wonder what other issues may arise from that, and
especially what I can do about it.

I do not use openvz so I do not have a test bed to try out, but this page seems to be related to your problem:

http://openvz.org/Resource_shortage

or if you want more detail and a link to what looks to be a replacement for beancounters:

http://openvz.org/Setting_UBC_parameters
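
As a rough illustration of the check those pages describe: OpenVZ exposes its per-container counters in /proc/user_beancounters, and a non-zero failcnt on a resource (shmpages, for example) means an allocation was refused at some point. The sketch below is untested on a real OpenVZ host. It assumes the usual column layout (uid, resource, held, maxheld, barrier, limit, failcnt), and the function names are made up for this example. Reading the file normally requires root.

```python
from typing import Dict, List, Tuple

# Column names as they appear in the header of /proc/user_beancounters.
FIELDS = ("held", "maxheld", "barrier", "limit", "failcnt")


def parse_beancounters(text: str) -> Dict[str, Dict[str, Dict[str, int]]]:
    """Parse beancounter text into {uid: {resource: {field: value}}}.

    Hypothetical helper; assumes the first row of each container
    starts with "<uid>:" and later rows omit the uid.
    """
    result: Dict[str, Dict[str, Dict[str, int]]] = {}
    uid = None
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the "Version:" line and the column header.
        if not parts or parts[0] in ("Version:", "uid"):
            continue
        if parts[0].endswith(":"):
            uid = parts[0].rstrip(":")
            parts = parts[1:]
        # Expect a resource name followed by five numeric columns.
        if uid is None or len(parts) != 6:
            continue
        name, values = parts[0], parts[1:]
        result.setdefault(uid, {})[name] = dict(zip(FIELDS, map(int, values)))
    return result


def failed_resources(text: str) -> List[Tuple[str, str, int]]:
    """Return (uid, resource, failcnt) for every resource with failcnt > 0."""
    return [
        (uid, name, counters["failcnt"])
        for uid, resources in parse_beancounters(text).items()
        for name, counters in resources.items()
        if counters["failcnt"] > 0
    ]


if __name__ == "__main__":
    with open("/proc/user_beancounters") as fh:
        for uid, name, failcnt in failed_resources(fh.read()):
            print(f"container {uid}: {name} failcnt={failcnt}")
```

If this prints nothing, no beancounter limit has been hit since the counters were last reset, which would point away from the UBC limits as the cause of the bus error.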


Cheers,

WBL

--
"Quality comes from focus and clarity of purpose" -- Mark Shuttleworth


--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx


--
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin



