5.4.1 memory issue in oned [closed]

sulaweyo · September 25, 2017, 6:03am

Hi,

i have updated our staging environment to 5.4.1 last week and i currently have the behavior that all 3 nodes of the cluster run out of memory within one day:

The nodes have 4G mem each and basically no load at all, there is one offline kvm host connected (only started when needed for a test) and that’s it. No deployments or even user actions during the weekend.

All memory is allocated by oned but unfortunately i did not have debug log level active so the logs don’t show anything except “mark” and “Purging obsolete LogDB records” until it reaches the point where the first node runs out of memory. At this point on the current leader i get Segfaults and core dumps from the ruby processes in the log but imo this is just the final symptom of running out of memory and swap.

By running onedb fsck on one node i ran into this:
pool_control for table image_pool has last_oid 470, but it is 477

Mysql2::Error: Unknown column 'tablename' in 'where clause'
/usr/lib/ruby/gems/2.4.0/gems/mysql2-0.4.9/lib/mysql2/client.rb:120:in `_query'
/usr/lib/ruby/gems/2.4.0/gems/mysql2-0.4.9/lib/mysql2/client.rb:120:in `block in query'
/usr/lib/ruby/gems/2.4.0/gems/mysql2-0.4.9/lib/mysql2/client.rb:119:in `handle_interrupt'
/usr/lib/ruby/gems/2.4.0/gems/mysql2-0.4.9/lib/mysql2/client.rb:119:in `query'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/mysql2.rb:137:in `block in _execute'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/logging.rb:38:in `log_connection_yield'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/mysql2.rb:132:in `_execute'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/utils/mysql_mysql2.rb:38:in `block in execute'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/connecting.rb:264:in `block in synchronize'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/connection_pool/threaded.rb:87:in `hold'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/connecting.rb:264:in `synchronize'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/utils/mysql_mysql2.rb:38:in `execute'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/adapters/mysql2.rb:68:in `execute_dui'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/dataset/actions.rb:1092:in `execute_dui'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/dataset/actions.rb:863:in `update'
/usr/lib/one/ruby/onedb/fsck/pool_control.rb:57:in `block (2 levels) in fix_pool_control'
/usr/lib/one/ruby/onedb/fsck/pool_control.rb:55:in `each'
/usr/lib/one/ruby/onedb/fsck/pool_control.rb:55:in `block in fix_pool_control'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/transactions.rb:204:in `_transaction'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/transactions.rb:179:in `block in transaction'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/connecting.rb:264:in `block in synchronize'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/connection_pool/threaded.rb:91:in `hold'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/connecting.rb:264:in `synchronize'
/usr/lib/ruby/gems/2.4.0/gems/sequel-5.0.0/lib/sequel/database/transactions.rb:145:in `transaction'
/usr/lib/one/ruby/onedb/fsck/pool_control.rb:54:in `fix_pool_control'
/usr/lib/one/ruby/onedb/fsck.rb:236:in `fsck'
/usr/lib/one/ruby/onedb/onedb.rb:280:in `fsck'
/usr/bin/onedb:397:in `block (2 levels) in <main>'
/usr/lib/one/ruby/cli/command_parser.rb:449:in `run'
/usr/lib/one/ruby/cli/command_parser.rb:76:in `initialize'
/usr/bin/onedb:286:in `new'
/usr/bin/onedb:286:in `<main>'

Error running fsck version 5.4.1

Production which uses the same setup but is running on 5.4.0 does not show any issue like that at all.

I have enabled debug logging on all nodes so maybe i can soon add some more information. Did anybody already observe anything like that or any idea?

ruben · September 25, 2017, 10:23am

Hi

In our long running tests the virt/res size for oned is stable
~3500M/~250M. Could you check the res size of oned in your setup? for just
oned 4G is more than enough…

sulaweyo · September 25, 2017, 10:33am

Hi ruben,

in production where we really have load and days of uptime we have 1862M/217M

In staging with uptime of 3.5h the current leader has 1373M/635M
A node i restarted an hour later is already at 1274M/250M

There is no load on these nodes and the failure of onedb fsck makes me suspicious that something is wrong in the db and the memory fills up with some kind backlog or transaction log or something like that. I did not yet find anything usefull so far in the log though

2h later:
leader: 1393M/954M
I think it’s db related but until i can get this fixed or at least know what is the root cause it blocks our production update to 5.4.1

sulaweyo · September 26, 2017, 5:18am

I have now recreated the staging cluster with the current production db (except vms and hosts) and fsck runs through now. I have to wait for a bit longer to see if the memory behaves normal now.

Edit: does not look like anything has changed fundamentally with the fresh cluster. The staging environment on 5.4.1 already uses more memory than production on 5.4.0. Both environments are based on the same data, staging is stripped from vms and hosts though.

After roughly 20h oned now uses 4439M/2729M

sulaweyo · September 29, 2017, 5:18am

The behavior is the same after replacing all nodes as well. After roughly one day all memory is locked up in oned and all swap is used as well. I don’t see anything in the log that would indicate an error. I’m open for any ideas as this is blocking our production upgrade to 5.4.1.

ruben · September 29, 2017, 8:52am

Hi,

OK, I’m trying to reproduce this, could you send:

1.- The output of ps aux --sort -rss for the hosts in the cluster (please
specify the role of the hosts leader/follower)

2.- How many hosts, vms are running in the system

3.- What kind of operations are you performing in the cluster

4.- The output of onezone show <zone_id>

5.- oned.log for the nodes of the cluster

If some if these contains private date feel free to PM me

Cheers

sulaweyo · September 29, 2017, 10:03am

Thank you very much - pm sent

sulaweyo · October 12, 2017, 11:17am

My issue with oned memory is not caused by OpenNebula but actually some OS libs. Thx Ruben for your help in tracking it down.

ruben · October 13, 2017, 8:40am

Thanks for reporting back!!. Just to point out that there have been a
couple of memory leaks solved thanks to the feedback provided by sulaweyo
Anybody interested in those need to update to branch one-5.4 upstream

Cheers

Topic		Replies	Views
Upgrade from ONE 4.14.2 to 5.(0.2) strongly increases amount of MySQL operations Community Support	10	1129	September 8, 2016
One error (during test upgrade from 5.4.0 to 5.4.1) Development	2	858	September 21, 2017
Address family mismatch - onedb fsck / upgrade Community Support	2	1070	July 29, 2016
One 5.12.01(ce): [SQL_Error]Unknown column 'last_poll' in 'field list' Community Support	2	343	July 22, 2021
VM not being monitored - DB issue Community Support	1	488	January 26, 2016

5.4.1 memory issue in oned [closed]

Related Topics