Shashank (Guest)
Posted: Mon Nov 03, 2008 11:44 pm
Post subject: distributed measurement problem
Hi,
I am working on a distributed measurement project with a centralized
data collection node (server) and 28 clients, each with a different
number of interfaces (1-4).
I've written C code that captures packets on all the interfaces of the
node it runs on, computes statistics (pps, Mbps, etc. for different
subsets of traffic), and sends them to the server every second. The
server creates a file for each interface on each client and writes
these statistics into the respective files.
I've used Python to automate and synchronize the runs; it runs the
C program in the background on each of the interfaces.
The problem is:
If I initiate the client program to run for, say 200 seconds, the
clients run for the entire period sending statistics per second to the
server. However, files corresponding to some interfaces do not show
the entire 200 seconds even though the client finishes execution and
the server closes the file after the client has finished execution.
I don't think this is an issue with the server being flooded with data
(it's multithreaded, and the example below was run one node at a time)
or with packets being dropped (that wouldn't make sense for this
problem; besides, ifconfig doesn't show dropped packets and I am using
TCP sockets). I am not sure whether there is a bug in my code, since
it's essentially the same client code on all systems.
Here is the wc -l output for three nodes, run one at a time for 200
seconds:
wc -l *.log
  44 core1.10.1.11.2.log
 200 core1.10.1.3.2.log
  49 core1.10.1.32.3.log
 200 core1.10.1.9.2.log
  49 core2.10.1.13.2.log
  49 core2.10.1.15.2.log
 200 core2.10.1.3.3.log
 200 core2.10.1.5.2.log
  49 core3.10.1.17.2.log
 200 core3.10.1.18.2.log
 200 core3.10.1.30.3.log
 200 core3.10.1.5.3.log
1640 total
Each node has 4 interfaces, and although the experiment ran for 200
seconds, some files have only 44 or 49 lines. ifconfig on the server
shows no dropped packets.
Does anyone have pointers on this?
Sorry for the long post,
Thanks,
Shashank
David Schwartz (Guest)
Posted: Tue Nov 04, 2008 2:21 am
Post subject: Re: distributed measurement problem
On Nov 3, 3:44 pm, Shashank <shashank.shanb...@gmail.com> wrote:
| Quote: | The problem is:
If I initiate the client program to run for, say 200 seconds, the
clients run for the entire period sending statistics per second to the
server. However, files corresponding to some interfaces do not show
the entire 200 seconds even though the client finishes execution and
the server closes the file after the client has finished execution.
|
This doesn't fit the pattern for any "typical mistake" that I'm
familiar with. I'd suggest trying to localize the problem bit by bit.
For example, first modify the client software to checkpoint how many
reports it has sent to the server. Have a client log file, and have it
write a 'checkpoint' after every ten messages. Open the log file in
append mode, assemble the checkpoint message in a buffer, and send it
with a single call to 'write'. If the checkpoints don't show the 200
messages, then you know the client is the issue.
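As a rough illustration of the client-side checkpointing described above, here is a minimal Python sketch (the same pattern carries over to C with open(2) and write(2)). The record format, the function names, and the `send_report` callback are all hypothetical; nothing here is taken from the original poster's code.

```python
import os
import time

CHECKPOINT_EVERY = 10  # messages between checkpoints, as suggested above


def open_checkpoint_log(path):
    """Open the log in append mode so every write lands at the end of the file."""
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)


def write_checkpoint(fd, interface, sent_count):
    """Assemble the whole record in one buffer and emit it with one write call."""
    record = "%.3f %s sent=%d\n" % (time.time(), interface, sent_count)
    os.write(fd, record.encode("ascii"))


def run_client(fd, interface, total_messages, send_report):
    """Send reports and checkpoint after every CHECKPOINT_EVERY of them."""
    for sent in range(1, total_messages + 1):
        send_report(sent)  # hypothetical: whatever sends one report to the server
        if sent % CHECKPOINT_EVERY == 0:
            write_checkpoint(fd, interface, sent)
```

With a 200-message run this should leave 20 checkpoint lines per interface, the last one reading sent=200. A shorter log would point at the client.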
Then add similar checkpointing in the software that talks to the
client. Make sure the server software sees 200 messages. If not, then
you know something is screwy in that piece of software. (Perhaps the
client isn't really sending the messages? Perhaps the server is
dropping some of them?)
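A server-side counterpart might look like the sketch below. It assumes each report is a newline-terminated record, which the thread does not confirm; the class and function names are made up for illustration. Since TCP delivers a byte stream, a single recv() may return part of a report or several reports at once, so the counter buffers any incomplete tail between calls.

```python
class ReportCounter:
    """Counts complete newline-terminated reports in a byte stream."""

    def __init__(self):
        self.count = 0
        self._partial = b""  # bytes of a report whose newline has not arrived yet

    def feed(self, chunk):
        """Accept one recv() result; return the reports it completed."""
        data = self._partial + chunk
        *complete, self._partial = data.split(b"\n")
        self.count += len(complete)
        return complete


def count_connection(conn):
    """Read a connected socket until the peer closes it; return the report count."""
    counter = ReportCounter()
    while True:
        chunk = conn.recv(4096)
        if not chunk:  # empty result means the peer closed the connection
            break
        counter.feed(chunk)
    return counter.count
```

Comparing this count with the client's checkpoints and with the number of lines written to the file shows which stage loses reports.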
Keep going until you localize the problem.
DS
Joe Beanfish (Guest)
Posted: Wed Nov 05, 2008 12:21 am
Post subject: Re: distributed measurement problem
Also, timestamp your messages and check which ones are missing.
That may give you a clue about where to look for the problem.
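One way to do that check, sketched in Python under the assumption that every report carries an integer timestamp in seconds and that the run's start time and duration are known (both function names are hypothetical):

```python
def missing_seconds(timestamps, start, duration):
    """Return the seconds in [start, start + duration) with no logged report."""
    seen = {int(t) for t in timestamps}
    return [s for s in range(start, start + duration) if s not in seen]


def summarize_gaps(missing):
    """Collapse consecutive missing seconds into (first, last) ranges."""
    ranges = []
    for s in missing:
        if ranges and s == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], s)  # extend the current gap
        else:
            ranges.append((s, s))  # start a new gap
    return ranges
```

A single gap running to the end of the run suggests the stream stopped early, while scattered gaps suggest individual reports being skipped.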
Shashank (Guest)
Posted: Tue Nov 11, 2008 8:34 am
Post subject: Re: distributed measurement problem
Hello,
Thanks to both of you for the suggestions.
The problem was actually in one of the anomaly detection algorithms I
was using.
I have sorted it out.
Thanks,
Shashank