Hi,
I updated our staging environment to 5.4.1 last week, and since then all 3 nodes of the cluster run out of memory within one day:
The nodes have 4 GB of memory each and basically no load at all: there is one offline KVM host connected (only started when needed for a test), and that’s it. No deployments or even user actions happened during the weekend.
All the memory is allocated by oned, but unfortunately I did not have the debug log level active, so the logs don’t show anything except “mark” and “Purging obsolete LogDB records” until the first node runs out of memory. At that point the log on the current leader shows segfaults and core dumps from the Ruby processes, but IMO this is just the final symptom of running out of memory and swap.
Running onedb fsck on one node, I ran into this:
pool_control for table image_pool has last_oid 470, but it is 477
Mysql2::Error: Unknown column 'tablename' in 'where clause'
/usr/lib/ruby/gems/2.4.0/gems/mysql2-0.4.9/lib/mysql2/client.rb:120:in `_query'
/usr/lib/ruby/gems/2.4.0/gems/mysql2-0.4.9/lib/mysql2/client.rb:120:in `block in query'
/usr/lib/ruby/gems/2.4.0/gems/mysql2-0.4.9/lib/mysql2/client.rb:119:in `handle_interrupt'
/usr/lib/ruby/gems/2.4.0/gems/mysql2-0.4.9/lib/mysql2/client.rb:119:in `query'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/mysql2.rb:137:in `block in _execute'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/logging.rb:38:in `log_connection_yield'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/mysql2.rb:132:in `_execute'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/utils/mysql_mysql2.rb:38:in `block in execute'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/connecting.rb:264:in `block in synchronize'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/connection_pool/threaded.rb:87:in `hold'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/connecting.rb:264:in `synchronize'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/utils/mysql_mysql2.rb:38:in `execute'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/mysql2.rb:68:in `execute_dui'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/dataset/actions.rb:1092:in `execute_dui'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/dataset/actions.rb:863:in `update'
/usr/lib/one/ruby/onedb/fsck/pool_control.rb:57:in `block (2 levels) in fix_pool_control'
/usr/lib/one/ruby/onedb/fsck/pool_control.rb:55:in `each'
/usr/lib/one/ruby/onedb/fsck/pool_control.rb:55:in `block in fix_pool_control'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/transactions.rb:204:in `_transaction'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/transactions.rb:179:in `block in transaction'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/connecting.rb:264:in `block in synchronize'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/connection_pool/threaded.rb:91:in `hold'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/connecting.rb:264:in `synchronize'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/transactions.rb:145:in `transaction'
/usr/lib/one/ruby/onedb/fsck/pool_control.rb:54:in `fix_pool_control'
/usr/lib/one/ruby/onedb/fsck.rb:236:in `fsck'
/usr/lib/one/ruby/onedb/onedb.rb:280:in `fsck'
/usr/bin/onedb:397:in `block (2 levels) in <main>'
/usr/lib/one/ruby/cli/command_parser.rb:449:in `run'
/usr/lib/one/ruby/cli/command_parser.rb:76:in `initialize'
/usr/bin/onedb:286:in `new'
/usr/bin/onedb:286:in `<main>'
Error running fsck version 5.4.1
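In case it helps with debugging, this is roughly how I compared the counter with the actual pool contents (just a sketch: I’m assuming image_pool keeps its numeric id in an oid column, and the SHOW COLUMNS is there because fsck apparently expects a tablename column that my schema doesn’t seem to have):

MariaDB [opennebula]> SHOW COLUMNS FROM pool_control;
MariaDB [opennebula]> SELECT * FROM pool_control;
MariaDB [opennebula]> SELECT MAX(oid) FROM image_pool;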
MySQL itself reports all tables as OK, with the largest being the logdb:
+--------------------+----------------------------------------------------+------------+
| Database           | Table                                              | Size in MB |
+--------------------+----------------------------------------------------+------------+
| opennebula         | logdb                                              |     939.03 |
| opennebula         | host_monitoring                                    |       1.52 |
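For reference, the sizes above come from a query along these lines against information_schema (the schema name and the rounding are just how I ran it, nothing OpenNebula-specific):

SELECT table_schema AS 'Database',
       table_name AS 'Table',
       ROUND((data_length + index_length) / 1024 / 1024, 2) AS 'Size in MB'
  FROM information_schema.tables
 WHERE table_schema = 'opennebula'
 ORDER BY (data_length + index_length) DESC;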
Logdb indices:
MariaDB [opennebula]> show index from logdb;
+-------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name      | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| logdb |          0 | PRIMARY       |            1 | log_index   | A         |      442740 | NULL     | NULL   |      | BTREE      |         |               |
| logdb |          1 | timestamp_idx |            1 | timestamp   | A         |      442740 | NULL     | NULL   | YES  | BTREE      |         |               |
| logdb |          1 | fed_index_idx |            1 | fed_index   | A         |           2 | NULL     | NULL   | YES  | BTREE      |         |               |
+-------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
3 rows in set (0.00 sec)
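To see whether the purge keeps up with the growth, I am now checking the record count and index range periodically (column names taken from the index listing above; the query itself is just a diagnostic sketch):

SELECT COUNT(*) AS records,
       MIN(log_index) AS oldest,
       MAX(log_index) AS newest
  FROM logdb;

If the record count keeps climbing while “Purging obsolete LogDB records” is being logged, the purge is apparently not deleting anything.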
Production, which uses the same setup but is still running 5.4.0, does not show any issue like this at all.
I have enabled debug logging on all nodes, so I may be able to add more information soon. Has anybody already observed anything like this, or does anyone have an idea?