Oned 5.6.2-2 memory consumption problem

(Leonid) #1

Hi guys,

we run an 8-node Raft cluster with OpenNebula 5.6. Yesterday the oned process on the leader node started consuming more and more memory in the system slice, until the OOM killer came and did its dirty work. After that a new election started, and the same thing happened on the next leader. The oned.log was full of entries like these:

[Z0][InM][D]: Monitoring host BQB974790004-22 (0)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790004-20 (1)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790004-26 (2)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790003-26 (3)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790004-24 (4)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790003-20 (5)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790003-24 (6)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790003-22 (7)
[Z0][InM][E]: Could not find information driver kvm
[Z0][AuM][E]: Auth Error: Could not find Authorization driver

[Z0][SQL][W]: Slow query (0.66s) detected: SELECT body FROM history WHERE vid = 137 AND seq = 11962
[Z0][SQL][W]: Slow query (0.65s) detected: SELECT body FROM history WHERE vid = 118 AND seq = 12118
[Z0][SQL][W]: Slow query (0.62s) detected: SELECT body FROM history WHERE vid = 137 AND seq = 11002
[Z0][SQL][W]: Slow query (0.67s) detected: SELECT body FROM history WHERE vid = 118 AND seq = 11350
[Z0][SQL][W]: Slow query (0.66s) detected: SELECT body FROM history WHERE vid = 118 AND seq = 10582
[Z0][SQL][W]: Slow query (0.62s) detected: SELECT body FROM history WHERE vid = 137 AND seq = 10042

[Z0][SQL][W]: Slow query (0.56s) detected: INSERT INTO logdb (log_index, term, sqlcmd, timestamp, fed_index) VALUES (43496486,9888,'eJyFVWtvozgU/SveT22lPIwxBCSElhJPgxpIFpNsR6MRyjTeSVQCGUI67f76tc3DJNPVqCo+557rY+Pca2KynHs+AUGULMCuOFXpsSgycFvstwOQbw5sAL4V2/cBOFWbipNsw1MORZ5We6GdRdp38Sh+5qxMz5yVxfkoQFHt6sjzfnsH1t58RSi4hYOb+7/u7Qme2BBCPEToZnDjzBY0cZ1g6kJnzJ9O5IXEvcpzxjLq0MRLiMtpDZwgTEOPz/njiz/1Eu/Ly+vh61eX+zTx9f/obXzu0SQNF1GaBNxeMwzdsrCm8wUuFcefr2hC4rTeZo+1ins4vQxPu7fh9lunu/LdUjrzYm4xDehjuqLeAxEWPeaEJGyghgwL2RpfX8Ucf7lqoAnF4h11kkXizVOe6hrIgpaGsOWMVbDR+QTXssRUxZ3Qe0rFHviSmoEMvmAbkJL0hHwrBka4FqWjANIPC7+WOZ9iQho7aEKb719FalHMxraJNN0yzEaWjhJJS1Os1FFnRcm08TSwrjtjFag1MV3XLHOi22ajSkeJhIU2aePSMF5FURA9pOuQihLqU0dUB00WMeF43CdLP0inZB34Urlk/V9XuvB60DStLmOBcYPHUk1IyBtOlK0X+zNVlG+WmZpY1mUt8N3SJSG9ukUT25YJSvpEIp+kwVIlaXCk8X/dGjVuKkXstO6gNvmqwWS+ypp9XpJ4HdBF/Gvv9DWFf92JMYIjC11NkWm/69pH3p6iysPFlMypysOWCY4sr/bnQzuiFugtOJYFeC5Ktj0XIEcTKAkS7Ac7nHUE+Gr8+Xw8m3hY7lhm9LApVBPLVD74RV4WDCxZXr7nIGK7TcYO4G92qg6sZIBu8u37fbnffmcgeG3RbHP6ybJsmBcJfWoZuC+LzbYX7zigL+/Z5oUN/WzP36CjlJWvrASbapcVOTjuWF4cwOJYsZLfwA+agkhBXUGsoAHI8rMvT/j6ZB/lRejPgoj0zvn4PNxjDP95k2cyGeERBMdncB2GPCwPbaR3CHcIykmXM7QPYqiO/dCNJqDMjGtJuHPWyGaXqHUIfbACN+xeXr2sPILLngjyimW38R14YkUuRn5UgBhDZNo2eMXgT4BGCD7M/pV2PYOIJPGT8qmXa4J8SD7SRJDfMLwv2vupK3NYJ12qHZN3XHcxGJrJPxn4coJMWZOYBotI5Rr8xOp+7KTffR/H6s6St4N7M0AD9ZkcQP6nySe8+w8TvWgZ',0,-1)

After we reduced the Raft cluster to 3 members, the situation did not change. Here is our Raft config:

RAFT = [
    LIMIT_PURGE          = 100000,
    LOG_RETENTION        = 1000,
    LOG_PURGE_TIMEOUT    = 600,
    ELECTION_TIMEOUT_MS  = 10000,
    BROADCAST_TIMEOUT_MS = 500,
    XMLRPC_TIMEOUT_MS    = 1500
]

Please advise!


(Leonid) #2

Hi everyone,

we found out that the problem is a huge number of records in the history table. For some reason, 2 VMs have accumulated 13k+ monitoring records each:

| vid | seq   | stime      | etime      |
| 118 | 13581 | 1553885703 | 1553885765 |
| 118 | 13582 | 1553885886 | 1553885947 |
| 118 | 13583 | 1553886069 | 1553886132 |
| 118 | 13584 | 1553886250 | 1553886311 |
| 118 | 13585 | 1553886651 | 1553886728 |
| 118 | 13586 | 1553886870 | 1553886909 |
| 118 | 13587 | 1553887050 | 1553887092 |
| 118 | 13588 | 1553887227 | 1553887277 |
| 118 | 13589 | 1553887410 | 1553887599 |
| 118 | 13590 | 1553887700 | 1553887726 |
| 118 | 13591 | 1553887820 | 1553887843 |
| 118 | 13592 | 1553888096 | 1553888143 |
| 118 | 13593 | 1553888400 | 1553888462 |
| 118 | 13594 | 1553888583 | 1553888818 |
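This is how we spotted the runaway VMs: count history rows per VM. The sketch below mocks the table with in-memory SQLite and made-up row counts; in the real deployment the equivalent `GROUP BY` would run against the oned database's `history` table (the `vid`/`seq` column names are taken from the slow-query log above, everything else is illustrative):

```python
import sqlite3

# In-memory mock of the oned "history" table (vid/seq columns assumed
# from the slow-query log; row counts here are illustrative).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE history (vid INTEGER, seq INTEGER)")
con.executemany(
    "INSERT INTO history VALUES (?, ?)",
    [(118, s) for s in range(13595)]
    + [(137, s) for s in range(12000)]
    + [(42, s) for s in range(12)],
)

# VMs with the most history records first
top = con.execute(
    "SELECT vid, COUNT(*) AS n FROM history GROUP BY vid ORDER BY n DESC"
).fetchall()
print(top)  # → [(118, 13595), (137, 12000), (42, 12)]
```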


The history listing of one of these VMs is one monitor action after another:

 12 -    -     node03 monitor       0  03/01 16:58:22   0d 00h05m   0h00m00s
 13 -    -     node03 monitor       0  03/05 15:07:14   0d 00h00m   0h00m00s
 14 -    -     node03 monitor       0  03/05 15:09:15   0d 00h00m   0h00m00s
 15 -    -     node03 monitor       0  03/05 15:11:15   0d 00h00m   0h00m00s
 16 -    -     node03 monitor       0  03/05 15:13:15   0d 00h03m   0h00m00s
 17 -    -     node03 monitor       0  03/05 15:16:53   0d 00h00m   0h00m00s
 18 -    -     node03 monitor       0  03/05 15:18:53   0d 00h00m   0h00m00s
 19 -    -     node03 monitor       0  03/05 15:20:54   0d 00h00m   0h00m00s
 20 -    -     node03 monitor       0  03/05 15:22:54   0d 00h00m   0h00m00s
 21 -    -     node03 monitor       0  03/05 15:24:54   0d 00h00m   0h00m00s
 22 -    -     node03 monitor       0  03/05 15:26:57   0d 00h00m   0h00m00s
 23 -    -     node03 monitor       0  03/05 15:28:55   0d 00h00m   0h00m00s
 24 -    -     node03 monitor       0  03/05 

So we tried to delete some of the records, but that made the VM's properties unavailable.
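In hindsight, a safer delete would probably keep the newest history record per VM, so oned still has a record to read. A minimal sketch with an in-memory SQLite mock (column names `vid`/`seq`/`stime`/`etime` assumed from the dump above; the real table also carries a `body` column that oned depends on):

```python
import sqlite3

# Mock of the history table for VM 118 with a run of monitoring records
# (schema and column names assumed, timestamps illustrative).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE history (vid INTEGER, seq INTEGER, stime INTEGER, etime INTEGER)"
)
con.executemany(
    "INSERT INTO history VALUES (?, ?, ?, ?)",
    [(118, 13581 + i, 1553885703 + 60 * i, 1553885765 + 60 * i) for i in range(14)],
)

# Delete every record except the one with the highest seq for each vid,
# so each VM keeps its most recent history entry.
con.execute(
    "DELETE FROM history WHERE seq < "
    "(SELECT MAX(h2.seq) FROM history h2 WHERE h2.vid = history.vid)"
)
kept = con.execute("SELECT vid, seq FROM history").fetchall()
print(kept)  # → [(118, 13594)]
```

If your OpenNebula version ships the `onedb purge-history` subcommand, that looks like the supported way to do the same thing (stop oned and back up the database first); we have not verified its availability in 5.6.2.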

How can we reduce the number of these records while keeping the VM properties available?

Please advise.
