Hi guys,
we run an 8-node Raft cluster with OpenNebula 5.6.
Yesterday we hit a situation where the oned process on the leader node kept consuming more and more memory in the system slice until the OOM killer stepped in and killed it. A leader election followed, and the same thing happened again on the next leader. The oned.log from around that time looks like this:
[Z0][InM][D]: Monitoring host BQB974790004-22 (0)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790004-20 (1)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790004-26 (2)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790003-26 (3)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790004-24 (4)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790003-20 (5)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790003-24 (6)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790003-22 (7)
[Z0][InM][E]: Could not find information driver kvm
[Z0][AuM][E]: Auth Error: Could not find Authorization driver
[Z0][SQL][W]: Slow query (0.66s) detected: SELECT body FROM history WHERE vid = 137 AND seq = 11962
[Z0][SQL][W]: Slow query (0.65s) detected: SELECT body FROM history WHERE vid = 118 AND seq = 12118
[Z0][SQL][W]: Slow query (0.62s) detected: SELECT body FROM history WHERE vid = 137 AND seq = 11002
[Z0][SQL][W]: Slow query (0.67s) detected: SELECT body FROM history WHERE vid = 118 AND seq = 11350
[Z0][SQL][W]: Slow query (0.66s) detected: SELECT body FROM history WHERE vid = 118 AND seq = 10582
[Z0][SQL][W]: Slow query (0.62s) detected: SELECT body FROM history WHERE vid = 137 AND seq = 10042
[Z0][SQL][W]: Slow query (0.56s) detected: INSERT INTO logdb (log_index, term, sqlcmd, timestamp, fed_index) VALUES (43496486,9888,'eJyFVWtvozgU/SveT22lPIwxBCSElhJPgxpIFpNsR6MRyjTeSVQCGUI67f76tc3DJNPVqCo+557rY+Pca2KynHs+AUGULMCuOFXpsSgycFvstwOQbw5sAL4V2/cBOFWbipNsw1MORZ5We6GdRdp38Sh+5qxMz5yVxfkoQFHt6sjzfnsH1t58RSi4hYOb+7/u7Qme2BBCPEToZnDjzBY0cZ1g6kJnzJ9O5IXEvcpzxjLq0MRLiMtpDZwgTEOPz/njiz/1Eu/Ly+vh61eX+zTx9f/obXzu0SQNF1GaBNxeMwzdsrCm8wUuFcefr2hC4rTeZo+1ins4vQxPu7fh9lunu/LdUjrzYm4xDehjuqLeAxEWPeaEJGyghgwL2RpfX8Ucf7lqoAnF4h11kkXizVOe6hrIgpaGsOWMVbDR+QTXssRUxZ3Qe0rFHviSmoEMvmAbkJL0hHwrBka4FqWjANIPC7+WOZ9iQho7aEKb719FalHMxraJNN0yzEaWjhJJS1Os1FFnRcm08TSwrjtjFag1MV3XLHOi22ajSkeJhIU2aePSMF5FURA9pOuQihLqU0dUB00WMeF43CdLP0inZB34Urlk/V9XuvB60DStLmOBcYPHUk1IyBtOlK0X+zNVlG+WmZpY1mUt8N3SJSG9ukUT25YJSvpEIp+kwVIlaXCk8X/dGjVuKkXstO6gNvmqwWS+ypp9XpJ4HdBF/Gvv9DWFf92JMYIjC11NkWm/69pH3p6iysPFlMypysOWCY4sr/bnQzuiFugtOJYFeC5Ktj0XIEcTKAkS7Ac7nHUE+Gr8+Xw8m3hY7lhm9LApVBPLVD74RV4WDCxZXr7nIGK7TcYO4G92qg6sZIBu8u37fbnffmcgeG3RbHP6ybJsmBcJfWoZuC+LzbYX7zigL+/Z5oUN/WzP36CjlJWvrASbapcVOTjuWF4cwOJYsZLfwA+agkhBXUGsoAHI8rMvT/j6ZB/lRejPgoj0zvn4PNxjDP95k2cyGeERBMdncB2GPCwPbaR3CHcIykmXM7QPYqiO/dCNJqDMjGtJuHPWyGaXqHUIfbACN+xeXr2sPILLngjyimW38R14YkUuRn5UgBhDZNo2eMXgT4BGCD7M/pV2PYOIJPGT8qmXa4J8SD7SRJDfMLwv2vupK3NYJ12qHZN3XHcxGJrJPxn4coJMWZOYBotI5Rr8xOp+7KTffR/H6s6St4N7M0AD9ZkcQP6nySe8+w8TvWgZ',0,-1)
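All of the slow queries hit the history and logdb tables, so our next step is to check how big those tables actually are. A quick sketch of what we plan to run, assuming the MySQL backend (the schema name opennebula is an assumption, adjust it to your DB_NAME):

-- Row counts for the two tables showing up in the slow-query warnings
SELECT COUNT(*) AS history_rows FROM history;
SELECT COUNT(*) AS logdb_rows FROM logdb;

-- On-disk size per table, to see whether the DB has outgrown the buffer pool
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
  FROM information_schema.tables
 WHERE table_schema = 'opennebula'
   AND table_name IN ('history', 'logdb');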
After we reduced the Raft cluster to 3 members, the situation did not change. Here is our Raft configuration:
RAFT = [
    LIMIT_PURGE          = 100000,
    LOG_RETENTION        = 1000,
    LOG_PURGE_TIMEOUT    = 600,
    ELECTION_TIMEOUT_MS  = 10000,
    BROADCAST_TIMEOUT_MS = 500,
    XMLRPC_TIMEOUT_MS    = 1500
]
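If we understand LOG_RETENTION correctly, with LOG_RETENTION = 1000 the logdb table should be purged down to roughly the last 1000 records every LOG_PURGE_TIMEOUT (600 s). The log_index 43496486 in the INSERT above makes us wonder whether the purge is keeping up. A sketch to check that on the MySQL backend:

-- If purging keeps up, the record count should stay near LOG_RETENTION (1000)
SELECT MIN(log_index) AS oldest,
       MAX(log_index) AS newest,
       COUNT(*)       AS records
  FROM logdb;

A record count far above 1000 would mean the purge is lagging, which might at least explain the slow INSERTs into logdb.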
Any advice would be much appreciated!