Join 34,000+ subscribers and receive articles from our blog about software quality, testing, QA and security.
 

Restarting a hung console and loss of subsequent log messages


#1

I’ve reported this before but never really got a solution, it is pretty uncommon but can be serious for a production mission critical system.

Basically from time to time the console will freeze - there are are often a few apps running for several days before this happens, we understand this is a known behaviour for the console app when left running for days on end - but it can happen that staff just leave it running sometimes.

When we discover this we kill the console process (it is unresponsive at the GUI level) using task manager and then restart it.

The restarted console then begins and indicates (in its status bar) that there are client connections, in our case TCP (0) and Pipes (1) - but we expect to see Pipes (2) because there are two clients logging.

However we see no messages displayed in the console - we expect to see some as the app’s by design reports periodic status/health info and so on.

Thus under these circumstances we lose the ability to see vital application status info (like errors or exceptions etc) that might occur from time to time and this impedes our ability to provide prompt status info to support personnel or customers.

If we stop/start the application (a stable reliable service app) then the first log msg issued by the restarted server does appear in the console and all seems fine thereafter.

At this stage the status now shows TCP (0) and Pipes (2) - because there were (all along) two client apps.

The two applications are:

  1. A custom TCP/IP server service - mature, stable etc.
  2. A SQL Server instance running SQL/CLR managed procedures.

Restarting 1. is doable but undesirable (as we impact users etc) and restarting 2. is also highly undesirable - yet we currently have no way to resume seeing log msgs.

Now Dennis did suggest trapping an OnError event within a client process but we have not taken this step as it is not trivial (for example logging any such event info to a file from SQL/CLR is rather involved and itself needs to be tested due to file access permissions etc etc).

So is there any list of likely causes for what we are seeing?

What typically might we expect to see passed to the OnError event?

What typically causes this event to get invoked when developing/testing SI itself?

Has anyone else here ever had to trap and report anything in this OnError event?

Might this be a true bug somewhere in SI and if so, does the above info shed some light in what this might be?

Any help greatly appreciated.

Hugo


#2

Hello Hugo,

Thanks for your posting. In the Error event you would usually see the errors from the SmartInspect library that occur when the connection is interrupted/aborted for some reason (e.g. when the Console is restarted) or cannot be established in the first place. You are passed the original Exception object that is caught by the respective protocol in the library (pipe/TCP/file etc.).

We usually recommend using log files for production systems as this is the most reliable protocol and doesn’t need a running Console and you can’t run into memory limit issues.

Could you post your connection settings for your two applications (to see the ‘reconnect’ related options)? Also, if this situation happens again, could you take a look a the connection log in the Console (File | Connection Log) to check which of your applications can reconnect?

Regards,
Tobias


#3

[quote=tgurock]Hello Hugo,

Thanks for your posting. In the Error event you would usually see the errors from the SmartInspect library that occur when the connection is interrupted/aborted for some reason (e.g. when the Console is restarted) or cannot be established in the first place. You are passed the original Exception object that is caught by the respective protocol in the library (pipe/TCP/file etc.).

We usually recommend using log files for production systems as this is the most reliable protocol and doesn’t need a running Console and you can’t run into memory limit issues.

Could you post your connection settings for your two applications (to see the ‘reconnect’ related options)? Also, if this situation happens again, could you take a look a the connection log in the Console (File | Connection Log) to check which of your applications can reconnect?

Regards,
Tobias[/quote]

Thanks Tobias - I will take a look at the connection log (was unaware there was such a log) and see if that sheds any light.

While I’m on this subject - I’ve mentioned this before in e-mail exchanges with Dennis - but it really would be immensely helpful to eliminate the memory limit issues with the console.

Ideally the growth of the log file should not result in degradation of the console application itself.

Functionally it would be great if the console app could simply freely scroll up/down the log file and just read messages posted to the log as they get written, no matter how large the log file grows.

If this is a large effort for the existing app, then perhaps a new mini console-listener could be created that is a subset of the console (showing only log messages, no watches etc) that can be left running 24/7 and simply appends new messages to the screen and allows scrolling/searching up/down the log and perhaps filtering too.

This is a requirement we have and its been awkward trying to get the most from SI sometimes on these mission criticial 24/7 systems.

Thanks

Hugo


#4

Hello Hugo,

Thanks for the update. The memory limit situation only occurs when using the live logging features of TestRail (TCP/named pipes) and is not related to log files. We recommend using log files for production systems (with log rotation options etc.) as these have the lowest impact on your application. You can then just open the latest log files in the Console (even while your application is writing to them) or even write tools to analyze the log files and take further actions based on particular events (for example, it would be possible to write a tool that periodically checks generated log files for errors and then emails those errors to administrators). The log file format is fully documented and there’s also an SDK which can read SmartInspect log files:

http://www.gurock.com/smartinspect/smartinspect-log-formats-and-protocols.a.html
http://www.gurock.com/smartinspect/extras/

Making the Console memory independent is currently not planned unfortunately as this would affect too many parts of the application.

Regards,
Tobias


#5

[quote=tgurock]Hello Hugo,

Thanks for the update. The memory limit situation only occurs when using the live logging features of TestRail (TCP/named pipes) and is not related to log files. We recommend using log files for production systems (with log rotation options etc.) as these have the lowest impact on your application. You can then just open the latest log files in the Console (even while your application is writing to them) or even write tools to analyze the log files and take further actions based on particular events (for example, it would be possible to write a tool that periodically checks generated log files for errors and then emails those errors to administrators). The log file format is fully documented and there’s also an SDK which can read SmartInspect log files:

http://www.gurock.com/smartinspect/smartinspect-log-formats-and-protocols.a.html
http://www.gurock.com/smartinspect/extras/

Making the Console memory independent is currently not planned unfortunately as this would affect too many parts of the application.

Regards,
Tobias[/quote]

Ho Tobias;

Did you mean to write “SmartInspect” above - not “TestRail” ?

Your reply has confused me a bit - FYI here is the connection string used within the service startup logic of a network server of ours:

[color=red]“pipe(reconnect=true, reconnect.interval=5),file(filename=|” + logpath + “|,append=true,rotate=daily)”; [/color]

(Ignore the vertical bar - we replace that with a " char as a later step before using the string, just to ease readability of the C# source).

If we start the service it’s fine - the log file is updated and we see these within the file.

So basically we ALWAYS have and use a log file and these appear to always be correct.

However we often have the SI Console up when we start the system and so we see the logged messages in that console viewer - this is quite intuitive.

What I’m getting at here is that the eventual freeze of the console (after several days) is a bug - because we are forced to kill it via task manager - if there are some internal limits I’d expect the console to check this from time to time and behave more robustly, perhaps displaying a message box etc - but allowing us to exit it cleanly.

And to recap I don’t think it is ideal to allow the console to have any limit at all, whcy can’t that app be improved to run forever, just reading and scrolling log messages as they get added to the file by the applications?

I assumed early on that the console was just a viewer - able to scroll arbitrarily through any sized log file and automatically scroll as new messages are added.

Finally when it does freeze and we restart it (as described in the first post in this thread) the apps never seems to be able to reconnect - I am seeing if my customer can see anything in the connecttion log but yet to hear back from them on that front.

Might there be a bug in reconnect that only occurs when a console fills up and freezes and gets forcibly restarted?

Have you or can you possibly test this scenario?

Start console, start a logging app that logs a great deal, wait (even a few days if needed) for the console to freeze up, kill it then restart it and see if the app reconnects and log messages appear?

PS: Here is a typical console view of the service startup info:

As you can see important info is reported and these kinds of messages along with ocassional exceptions, status, metrics, connects, disconnects etc are often needed in an instant by support staff throught the day - the SI product does not state (so far as I can recall) that the console is not designed to run continuously and so I never conveyed that to my customers when we were desiging and putting these large mission critical systems together.

Thanks


#6

Hello Hugo,

Thanks for the update. Yes, I meant SmartInspect and not TestRail, of course, sorry for the confusion.

This isn’t actually a limitation of the Console, it’s just that the memory limit will be reached eventually and that the Console is no longer able to receive additional messages then. That said, it shouldn’t freeze but just show a out-of-memory message or stop displaying new messages. We tested this here and in our tests the Console just stops displaying new messages (at around 12 million simple messages in our case) but doesn’t freeze. A restart of the Console and reconnect worked fine after that.

The messages that are displayed live in the Console (as they are generated by your application) are sent via the named pipe connection and the Console doesn’t use the log file for this (this is just a second/separate log destination you’ve configured for your application). This also means that the log messages aren’t stored on disk but only in memory.

You are actually using two logging destinations: log files and live logging via the pipe protocol. For production environments with long running applications we recommend just using log files (with appropriate rotation options) and not using live logging for longer periods. So we recommend dropping the pipe log protocol in your case and using the log files.

I hope this helps and please let me know in case you have any further questions.

Regards,
Tobias


#7

Hi Hugo,

Are your using Delphi or .NET? If .NET I might have a fix for you. You can reach me at: for more info.


#8

Hello Hugo,

As an update, another option would be to periodically send a “clear log” command to the Console to free the memory (e.g., once a day at midnight). You can send such a command with the ClearLog method of the session class (e.g. SiAuto.Main.ClearLog).

tam1000: I’ve removed your email address from your posting to prevent spam, please post any code snippets directly in this forum.

I hope this helps,
Tobias