Re: MultiXactId Error in Autovacuum

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi all, 

sorry for not getting back earlier. I wanted to let you know that we found the reason for our error (and a possible solution for our problem) in the meantime: 

The starting point of our worries was a log file error message: "ERROR: MultiXactId 2683601542 does no longer exist -- apparent wraparound". 
Autovacuum never finished on one of our tables because of this.

So we started to investigate and found out:  

We had a few rows in our table that suffered from some inconsistencies in its rowlock status, probably caused by a disc crash a few month ago.
Statements like "SELECT * FROM mytable" or autovacuum run into an error and exit when reading one of these inconstistent rows.

We digged deeper: Those rows were marked already DEAD but still pointed to MultiXactIds that were not listed in
pg_multixact anymore (what probably messed up the lock information)

The interesting point is: Different statements read those information in a different order.

  (a) Statements that select a single row like "SELECT * FROM mytable WHERE id = 'xxxx'" would recognize the status DEAD 
       BEFORE checking the LOCK status. Those statements just return 0 rows.

  (b) Statements that read every row like "SELECT id FROM mytable" process the LOCK information BEFORE checking if a row is already marked DEAD.
      If there is something wrong (like in our case due to inconsistent lock information) the statement quits with an ERROR. 
      In our case autovacuum always died because of this.

The trick was patching postgres for case (b) in a way that it would just print a WARNING and the row ID for these defect rows and continue.

Finally we knew the IDs of the defect rows (we had to install the contrib module pgrowlocks to find the locked rows) using the query
"SELECT id FROM mytable AS m, pgrowlocks('mytable') AS p WHERE p.locked_row = m.ctid".
Just deleting these rows using "DELETE FROM mytable WHERE id = 'xxxx'" did not work because of case (a). It returned 0 rows.
So we copied all but the defect rows into a new table, dropped the old table, and renamed the new one - quite an effort for finally deleting two rows :-)

We are wondering it there is a deeper reason behind treating case (a) and (b) in a different way or if it would be better to get rid of these inconsistencies in future postgres versions.

Thanks for all your input and help.
BR, 
Karl






On Wed, Nov 20, 2013 at 3:20 PM, Karl Hafner <karl@xxxxxxxxxxxxx> wrote:

Well, probably you are right, because PostgreSQL would shutdown
before a wraparound happens, and you would have noticed that.
 
I don't know how it happened, but your database has suffered
corruption.  Any chance you can restore from a backup?
Can you still "SELECT *" from the table that has the problem?
 
I cannot run a full SELECT * on that table. It stops after a few minutes with the same error.
I am currently testing different things on a DB created from a backup like exporting per month/per day … 
but I am running into this error again and again. 
Running a VACUUM --FULL gives me:  WARNING:  concurrent delete in progress within table "the_bad_table" … then it also breaks.


BR, 
Karl



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux