Hi guys,
we run an 8-node Raft cluster with OpenNebula 5.6.
Yesterday we hit a situation where the oned process on the leader node kept consuming more and more memory in the system slice until the OOM killer stepped in and killed it. A leader election followed, and the same thing happened again on the next leader. The oned.log from around that time looks like this:
[Z0][InM][D]: Monitoring host BQB974790004-22 (0)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790004-20 (1)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790004-26 (2)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790003-26 (3)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790004-24 (4)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790003-20 (5)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790003-24 (6)
[Z0][InM][E]: Could not find information driver kvm
[Z0][InM][D]: Monitoring host BQB974790003-22 (7)
[Z0][InM][E]: Could not find information driver kvm
[Z0][AuM][E]: Auth Error: Could not find Authorization driver
[Z0][SQL][W]: Slow query (0.66s) detected: SELECT body FROM history WHERE vid = 137 AND seq = 11962
[Z0][SQL][W]: Slow query (0.65s) detected: SELECT body FROM history WHERE vid = 118 AND seq = 12118
[Z0][SQL][W]: Slow query (0.62s) detected: SELECT body FROM history WHERE vid = 137 AND seq = 11002
[Z0][SQL][W]: Slow query (0.67s) detected: SELECT body FROM history WHERE vid = 118 AND seq = 11350
[Z0][SQL][W]: Slow query (0.66s) detected: SELECT body FROM history WHERE vid = 118 AND seq = 10582
[Z0][SQL][W]: Slow query (0.62s) detected: SELECT body FROM history WHERE vid = 137 AND seq = 10042
[Z0][SQL][W]: Slow query (0.56s) detected: INSERT INTO logdb (log_index, term, sqlcmd, timestamp, fed_index) VALUES (43496486,9888,'eJyFVWtvozgU/SveT22lPIwxBCSElhJPgxpIFpNsR6MRyjTeSVQCGUI67f76tc3DJNPVqCo+557rY+Pca2KynHs+AUGULMCuOFXpsSgycFvstwOQbw5sAL4V2/cBOFWbipNsw1MORZ5We6GdRdp38Sh+5qxMz5yVxfkoQFHt6sjzfnsH1t58RSi4hYOb+7/u7Qme2BBCPEToZnDjzBY0cZ1g6kJnzJ9O5IXEvcpzxjLq0MRLiMtpDZwgTEOPz/njiz/1Eu/Ly+vh61eX+zTx9f/obXzu0SQNF1GaBNxeMwzdsrCm8wUuFcefr2hC4rTeZo+1ins4vQxPu7fh9lunu/LdUjrzYm4xDehjuqLeAxEWPeaEJGyghgwL2RpfX8Ucf7lqoAnF4h11kkXizVOe6hrIgpaGsOWMVbDR+QTXssRUxZ3Qe0rFHviSmoEMvmAbkJL0hHwrBka4FqWjANIPC7+WOZ9iQho7aEKb719FalHMxraJNN0yzEaWjhJJS1Os1FFnRcm08TSwrjtjFag1MV3XLHOi22ajSkeJhIU2aePSMF5FURA9pOuQihLqU0dUB00WMeF43CdLP0inZB34Urlk/V9XuvB60DStLmOBcYPHUk1IyBtOlK0X+zNVlG+WmZpY1mUt8N3SJSG9ukUT25YJSvpEIp+kwVIlaXCk8X/dGjVuKkXstO6gNvmqwWS+ypp9XpJ4HdBF/Gvv9DWFf92JMYIjC11NkWm/69pH3p6iysPFlMypysOWCY4sr/bnQzuiFugtOJYFeC5Ktj0XIEcTKAkS7Ac7nHUE+Gr8+Xw8m3hY7lhm9LApVBPLVD74RV4WDCxZXr7nIGK7TcYO4G92qg6sZIBu8u37fbnffmcgeG3RbHP6ybJsmBcJfWoZuC+LzbYX7zigL+/Z5oUN/WzP36CjlJWvrASbapcVOTjuWF4cwOJYsZLfwA+agkhBXUGsoAHI8rMvT/j6ZB/lRejPgoj0zvn4PNxjDP95k2cyGeERBMdncB2GPCwPbaR3CHcIykmXM7QPYqiO/dCNJqDMjGtJuHPWyGaXqHUIfbACN+xeXr2sPILLngjyimW38R14YkUuRn5UgBhDZNo2eMXgT4BGCD7M/pV2PYOIJPGT8qmXa4J8SD7SRJDfMLwv2vupK3NYJ12qHZN3XHcxGJrJPxn4coJMWZOYBotI5Rr8xOp+7KTffR/H6s6St4N7M0AD9ZkcQP6nySe8+w8TvWgZ',0,-1)
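All of the slow queries hit the history and logdb tables, so our next step is to check how big those tables actually are. A quick sketch of what we plan to run, assuming the MySQL backend (the schema name opennebula is an assumption, adjust it to your DB_NAME):

-- Row counts for the two tables showing up in the slow-query warnings
SELECT COUNT(*) AS history_rows FROM history;
SELECT COUNT(*) AS logdb_rows FROM logdb;

-- On-disk size per table, to see whether the DB has outgrown the buffer pool
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
  FROM information_schema.tables
 WHERE table_schema = 'opennebula'
   AND table_name IN ('history', 'logdb');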
After we reduced the Raft cluster to 3 members, the situation did not change. Here is our Raft configuration:
RAFT = [
    LIMIT_PURGE          = 100000,
    LOG_RETENTION        = 1000,
    LOG_PURGE_TIMEOUT    = 600,
    ELECTION_TIMEOUT_MS  = 10000,
    BROADCAST_TIMEOUT_MS = 500,
    XMLRPC_TIMEOUT_MS    = 1500
]
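If we understand LOG_RETENTION correctly, with LOG_RETENTION = 1000 the logdb table should be purged down to roughly the last 1000 records every LOG_PURGE_TIMEOUT (600 s). The log_index 43496486 in the INSERT above makes us wonder whether the purge is keeping up. A sketch to check that on the MySQL backend:

-- If purging keeps up, the record count should stay near LOG_RETENTION (1000)
SELECT MIN(log_index) AS oldest,
       MAX(log_index) AS newest,
       COUNT(*)       AS records
  FROM logdb;

A record count far above 1000 would mean the purge is lagging, which might at least explain the slow INSERTs into logdb.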
Any advice would be much appreciated!