[Server-cvs] engine/core server_engine.cpp, 1.14.2.4.14.2, 1.14.2.4.14.3
srao at helixcommunity.org
Update of /cvsroot/server/engine/core
In directory cvs01.internal.helixcommunity.org:/tmp/cvs-serv30906
Modified Files:
Tag: SERVER_11_1
server_engine.cpp
Log Message:
Synopsis
========
After running for 3 days, the server stops responding to client requests.
Branches: SERVER_11_1_RN
Suggested Reviewer: Atin, anyone
Description
===========
After running for 3 days, the server stops responding to client requests.
From the code we observed that this scenario can occur if select() returns -1 in the main loop.
We confirmed this by modifying the select() call to return -1 in the main loop. select() returns -1 if one of the provided descriptors is invalid.
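The failure mode described above can be reproduced in isolation. The following is a minimal sketch for POSIX systems, not code from the server: the helper name SelectErrnoForClosedFd is hypothetical, and on POSIX the error reported for an invalid descriptor is EBADF, whereas the Windows build in this change checks for WSAENOTSOCK.

```cpp
#include <sys/select.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical helper (not part of the server code): returns the errno that
// select() reports when the read set contains a closed descriptor, 0 if
// select() unexpectedly succeeds, or -1 if the pipe could not be created.
int SelectErrnoForClosedFd()
{
    int fds[2];
    if (pipe(fds) != 0)
    {
        return -1;
    }

    close(fds[0]);                  // fds[0] is now an invalid descriptor

    fd_set readSet;
    FD_ZERO(&readSet);
    FD_SET(fds[0], &readSet);       // the stale descriptor is still in the set

    struct timeval tv = {0, 0};     // poll, do not block
    int n = select(fds[0] + 1, &readSet, nullptr, nullptr, &tv);
    int err = (n < 0) ? errno : 0;  // save errno before any further calls

    close(fds[1]);
    return err;
}
```

A loop that keeps passing the same stale descriptor to select() gets this error on every iteration, which is consistent with the unresponsive server described above.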
Fix
===
Implemented a new function, CallbackContainer::HandleBadFds(), to delete bad descriptors. This function is called whenever select() returns -1 in the main loop.
The function traverses the descriptor list and calls getsockname() on each descriptor. Whenever getsockname() returns -1, it calls Callbacks::Remove() to remove the corresponding descriptor. The existing Callbacks::Remove() deleted a descriptor only if it was present in the list m_map, but a bad descriptor may be missing from that list because of corruption. Remove() was therefore modified to remove a bad descriptor even when it is not present in m_map.
The error statement inside HandleBadFds() prints an error message only if getsockname() fails with WSAENOTSOCK. For that reason we also print an error message in the main loop, which is printed irrespective of the error code. Added an error counter; the message is printed only while the counter is < 1000.
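The descriptor scan described above can be sketched as follows. This is an illustration for POSIX systems under stated assumptions, not the actual CallbackContainer::HandleBadFds() implementation: the function name RemoveBadFds and the std::set standing in for the server's descriptor list are hypothetical.

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <set>
#include <vector>

// Hypothetical stand-in for the scan: probe every tracked descriptor with
// getsockname() and drop those for which the call fails. Returns the
// descriptors that were removed.
std::vector<int> RemoveBadFds(std::set<int>& fds)
{
    std::vector<int> removed;
    for (auto it = fds.begin(); it != fds.end(); )
    {
        sockaddr_storage addr;
        socklen_t len = sizeof(addr);
        if (getsockname(*it, reinterpret_cast<sockaddr*>(&addr), &len) < 0)
        {
            // getsockname() returned -1: treat the descriptor as bad.
            removed.push_back(*it);
            it = fds.erase(it);     // erase returns the next valid iterator
        }
        else
        {
            ++it;
        }
    }
    return removed;
}
```

After such a scan the set passed to the next select() call contains only descriptors that the kernel still recognizes, so the main loop can resume accepting connections.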
Files Affected
==============
/server/engine/core/server_engine.cpp
/server/engine/core/pub/platform/win/callback_container.h
/pub/platform/win/servcallback.h
Testing Performed
=================
As this is an uptime issue, I modified the code to make select() return -1 and verified that the descriptor gets deleted properly and the main loop continues to accept new connections.
Build verified: win32-i386-vc7
QA Hints
========
QA to do regression testing.
Index: server_engine.cpp
===================================================================
RCS file: /cvsroot/server/engine/core/server_engine.cpp,v
retrieving revision 1.14.2.4.14.2
retrieving revision 1.14.2.4.14.3
diff -u -d -r1.14.2.4.14.2 -r1.14.2.4.14.3
--- server_engine.cpp 15 Sep 2006 20:58:20 -0000 1.14.2.4.14.2
+++ server_engine.cpp 18 Jan 2007 07:23:03 -0000 1.14.2.4.14.3
@@ -87,6 +87,7 @@
#include "server_context.h"
#include "globals.h"
+#include "safestring.h"
#ifdef PAULM_SOCKTIMING
#include "sockettimer.h"
@@ -492,6 +493,8 @@
Timeval last_left_select_time;
Timeval last_in_select_time;
volatile unsigned int guard2 = 0xcc110088;
+ UINT32 ulErrCode = 0;
+ UINT32 ulErrCounter = 0;
#ifndef _WIN32
/*
@@ -686,6 +689,15 @@
n = callbacks.Select((struct timeval*)timeoutp);
+ if (n < 0)
+ {
+#ifdef _WIN32
+ ulErrCode = WSAGetLastError();
+#else
+ ulErrCode = errno;
+#endif
+ }
+
m_ulMainloopIterations++;
GETTIMEOFDAY(now);
@@ -703,6 +715,15 @@
m_pSCurrentElem = schedule.get_execute_list(now);
if (n < 0)
{
+ ulErrCounter++;
+ if (ulErrCounter < 1000)
+ {
+ char buf[256] = "\0";
+ SafeSprintf(buf, 256, "select error in mainloop = %u , timeout: %ld.%06ld, procnum = %d\n",
+ ulErrCode, timeout.tv_sec, timeout.tv_usec, proc->procnum());
+ proc->pc->error_handler->Report(HXLOG_ERR, HXR_FAIL, (ULONG32)HXR_FAIL, buf, 0);
+ }
+
m_bMoreReaderOrWriter = FALSE;
m_bMoreTSReaderOrWriter = FALSE;
@@ -714,6 +735,12 @@
callbacks.HandleBadFds(proc->pc->error_handler);
}
#endif
+#ifdef _WIN32
+ if (ulErrCode == WSAENOTSOCK)
+ {
+ callbacks.HandleBadFds(proc->pc->error_handler);
+ }
+#endif
}
else if (n == 0)
{