n3eg New Member
CWOP weather intermittancy « Thread Started on Feb 8, 2008, 7:52pm »
Here we go again...I haven't been able to get rotate.aprs.net, first.aprs.net, and third.aprs.net to work in the last few days.
From the APRS SIG...
Allow me to jump in here and give an explanation of what is occurring here. This will be a long message, and I apologize for that. For others aside from you that will reply to this, I will only continue to be a part of the discussion if the discussion is civilized, otherwise I will just unsubscribe from the list. If any of you would like to discuss this further with me in private e-mail, you are more than welcome to, my address is dave at aprsfl.net.
This requires a bit of history to be understood completely. The CWOP program (Citizen Weather Observer Program) runs on top of the APRS-IS network. While licensed ham radio operators are a portion of the CWOP participants, the lion's share of the participants are not hams and participate only via the internet. For years this hasn't been a problem, aside from some of you who've voiced concerns about the "sea of blue" on your map with all the CW stations.
Up until last year, the tier2 network had been supporting the CWOP program. They started having stability issues with their systems, and were crashing enough that the CWOP users were complaining a lot in the forums. Many were told to use the core, knowing that the systems that ran the core were considerably larger servers, not home-level PCs. About this same time (these facts are not disputable; I'll pass on the letter Phil sent to Russ for any who doubt my message here), Phil with tier2 sent a message to the head of the CWOP program, Russ Chadwick, simply stating that tier2 was pulling support and that it was better for everyone involved. This was sent late on a Friday.
We (the core sysops) found out about this shortly afterwards and jumped into action. Some of us who run core servers added additional memory to servers, and in my own case, I dropped $4500 on a new 8-core box to run one of the core servers, moving my older dual-CPU Xeon server to a co-lo I have access to in Texas, which increased the core server count from 3 to 4. We knew that handling 3500 CWOP stations with this much horsepower was just not a problem.
Shortly after the CWOP users started connecting to the core, I, being a network admin of a server farm with over 100 systems in it, noticed several MASSIVE shortcomings in the software these guys use that I knew even back then were eventually going to become a problem.
I decided to write a whitepaper with the help of the server developer, other sysops, and the CWOP management to assist developers in writing code that would be network friendly, as no such guidance had been given to these developers in the past. I'll outline the major flaws here that all needed to be addressed immediately to preserve APRS-IS network reliability and server availability to the CWOP members:
* CWOP software was, in most cases, only doing a lookup on the server name the first time you loaded the software. That meant that if the IP of the server changed, or, in the case of rotate.aprs.net, where a round-robin A record in DNS was used for pseudo load balancing, the user would be "stuck" on one server until they reloaded the software. This problem affected some 18 of the CWOP apps, including the largest, by Davis.
* CWOP software was allowing users to set their polling interval down to 1 minute. Considering these folks' only purpose was to get weather information to MADIS via the Findu funnel, reporting any more often than every 15 minutes, the interval at which Findu passes weather data upstream to the NWS, was not needed. Steve had prepped Findu for a 5 minute interval after a bad windstorm went up the Chesapeake; however, the folks at MADIS apparently didn't want the data any faster, citing that they did not have the processing power to handle quality control of the data more often than every 15 minutes. So with this, a recommendation was made to hard code CWOP-only software so the interval could never be set faster than every 5 minutes, as we had well over 400 of them sending data at 1 minute intervals.
* CWOP software, and this is the BIGGEST problem, was using the local computer's CLOCK for the polling interval. Take UI-View: if you set a 5 minute interval, its interval is based on the load time of the program. As hams, we've known for years that if we all beaconed based on an exact time interval on a clock, the network could not support it. Well, CWOP was never told this, so some of the worst possible network programming occurred, and these software packages, if set for every 5 minutes (the default of most), sent their data at the top of the hour, 5 past, 10 past, etc. Keep in mind at the time there were 3500 stations!!! (now 4500 of them)
* Finally, knowing the network was growing, it was decided that CWOP members should use cwop.aprs.net, a hostname we created that at first would mirror rotate.aprs.net; later, when the two needed to be separated into separate networks, moving the non-hams to a separate network would be as simple as changing a DNS record. (forward thinking)
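The first three items in the list above can be sketched in a few lines of Python. This is only an illustration of the recommendations as described, not code from the whitepaper or from any CWOP package; the function names, the 5 minute floor constant, and the injectable `resolve` and `rng` parameters are assumptions made for the example.

```python
import random
import socket

# Floor recommended above for CWOP-only software (assumed constant name).
MIN_INTERVAL_S = 5 * 60


def clamp_interval(requested_s):
    """Never allow a reporting interval shorter than the 5 minute floor."""
    return max(requested_s, MIN_INTERVAL_S)


def send_times(start_s, interval_s, count):
    """Schedule reports relative to program start, not the wall clock.

    A program started 17 seconds past a 5 minute mark keeps reporting
    17 seconds past each mark instead of piling onto the "5's".
    """
    interval_s = clamp_interval(interval_s)
    return [start_s + n * interval_s for n in range(1, count + 1)]


def pick_server(hostname, port, resolve=socket.getaddrinfo, rng=random):
    """Look the hostname up again on every (re)connect and pick one address.

    Calling this before each connection, rather than once at program load,
    lets a changed IP or a round-robin A record take effect.
    """
    infos = resolve(hostname, port, type=socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    return rng.choice(addresses)
```

With these helpers, a user who asks for a 1 minute interval still gets 5 minutes, and the send times depend on when the program was loaded, which is how the post describes UI-View behaving.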
Well, these changes were proposed to the CWOP management. A gent by the name of Dave Helms was the point man to the 23 or so developers of CWOP software. He immediately balked at the changes, citing that he had just gone through having all of them move to rotate.aprs.net, that this was too much to insist they change, and that developers would pull support from CWOP if they knew this many changes were needed. He further went on to say these "changes" needed to be "tested" first (eh?). Since most of these changes are simply good software coding, most of which every piece of ham software already did, I couldn't see the problem. After much discussion he finally said he'd never pass this on to the developers, so I gave up trying to complete this document after the third draft of it was done.
Jump to December of 07. By now we're at 4000 CWOP stations, all following poorly written code, and all of a sudden Christmas hits. Almost 500 additional stations joined CWOP in the period of a week from gifts.
Any of you running an APRS-IS server, if you watch the CPU load of the task running the server, you'll see that on the "5's" your CPU load spikes, and so do the BPS and PPS of your server. That's the CWOP users dropping off their packets. Some simple math was done: CWOP stations represent a small 1/8 of the APRS objects, yet account for a full 1/4 of the bandwidth.
All of a sudden the core servers started running into some odd issues. We had servers with input queues stacking up into the hundreds of seconds, and servers that simply would not even stay running without crashing or locking up the system at 100%. Now it's January, and Pete Loveall, who also has a day job, is releasing private builds of server code left and right to help the core network out of this problem. The build he finally released a few days ago has as many optimizations as could be thought of. He managed to reasonably keep servers from crashing, and keep the queues under 30 seconds, but we're far from having the problem under control, as testing shows we still have loss of packets occurring.
What's worse is that this is not just affecting the core. Let me explain why. Take any server, say APRSfl.net (or any tier2 server, etc.), with 150 users (all filtered), that gets a CWOP packet. It takes that -one- packet and has to multiply it by 150, and then run it thru several hash tables, and finally the write thread with the filter command will either send or not send that packet to that user. So even though APRSfl takes only 30 or so CWOP stations (based on log history), it still goes from running 2-3% CPU load on a dual-CPU Xeon box to 20-25% on the 5's of the clock. All of this, again, is due to CWOP software sending their reports in using the computer's clock for the polling time.
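A rough way to picture that per-packet cost is the sketch below. It is not how the actual server software is written (the post only says hash tables and a write thread are involved); `fan_out`, the filter callables, and the user names are hypothetical, and the only point is that one incoming packet costs one filter check per connected user whether or not anyone wants it.

```python
def fan_out(packet, client_filters):
    """Return (recipients, filter_checks) for one incoming packet.

    client_filters maps a client name to a callable that says whether
    that client's filter accepts the packet.
    """
    checks = 0
    recipients = []
    for client, wants in client_filters.items():
        checks += 1  # every connected client costs one evaluation
        if wants(packet):
            recipients.append(client)
    return recipients, checks


# 150 filtered users, only 3 of whom actually want the packet.
filters = {f"user{i}": (lambda p, i=i: i < 3) for i in range(150)}
recipients, checks = fan_out("cwop-packet", filters)
```

Here `checks` is 150 even though only three users receive the packet, which is why a server that accepts few CWOP stations directly still sees the load.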
This onslaught of traffic is a massive spike to the APRS-IS network every 5 minutes. I don't care who you are, if you take a full feed, you'll feel the pinch of this. During the "drive by 5's," as I've affectionately started calling them, the APRS-IS servers start dropping packets. I'm not 100% sure why, aside from the fact that every core server, even fourth with its 8 core system, buries the CPU for 30-45 seconds during those 5 minute intervals.
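Some back-of-the-envelope arithmetic shows why the clock alignment matters. The sketch assumes, as an upper bound that the post does not claim, that all 4500 stations report every 5 minutes and that the clumped case lands entirely within a 30 second window; the function name is made up for the example.

```python
def packets_per_second(stations, window_s):
    """Average arrival rate if `stations` packets land within `window_s` seconds."""
    return stations / window_s


stations = 4500

# Reports spread evenly across the whole 5 minute interval.
spread = packets_per_second(stations, 5 * 60)

# The same reports clumped into a 30 second window at each 5 minute mark.
burst = packets_per_second(stations, 30)
```

Under those assumptions the evenly spread rate is 15 packets per second and the clumped rate is 150, a tenfold difference in peak load for exactly the same amount of data.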
Come to last week. This had not been posted here as we (Greg, Gerry, Pete and myself) were working our tails off trying to come up with a solution for this. After Dave once again refused to help out here, I took a rather draconian step to prove a point, and had Steve remove third and fourth (my two servers) from rotate. Well, it definitely brought the problem to light, and in the weather quality forum, many developers contacted me directly, and most wanted copies of this draft whitepaper once they found out it existed. I explained why it hadn't been completed, and Dave Helms decided to publicly accuse me of being the reason why it was not completed. Dave was concerned about developers pulling support, when he should have been worried about the network doing the same. Well, I went off the deep end, and pulled my support from CWOP completely at that point. Apparently it wasn't enough that the NWS had a pool of volunteer sysops running this network who dump thousands of dollars' worth of bandwidth and hardware into it and never get reimbursed a penny for doing so, but now it was my fault that this whitepaper wasn't done? I guess he forgot that it was he who refused to publish it.
What was worse is that he publicly said that if non-hams had to reduce their polling interval, hams had to do the same. Excuse me? CWOP users are guests on our network. APRS-IS is built by, run by, and exists to support hams. Non-hams using it are nothing more than guests who use us as a way to get their weather reports to the NWS. That offended many hams; I was at over 200 private e-mails yesterday from hams pulling CWOP support over that callous attitude that Dave had.
Gerry of first said it's time to get the CWOP users off of the APRS-IS, and that he would get a non-ham set of servers up and running post haste. That's a great solution, and definitely gets their traffic off of our network, but a full 1/5 of the CWOP users still use tier2 even though the CWOP management has been contacting users for a year to move over. So even if a separate set of servers exists, it'll take forever to get them off our network, given that the CWOP management have said they refuse to contact users to switch to a separate set of servers.
So while this was going on, and I was now being blamed for this problem, thingy from Tier2 decided to go start the tier2 vs core debate in the qc forum again, telling folks there is no need to change their polling intervals, that nothing is wrong with the software, and to come back to the tier2 servers where all will be well.
thingy has zero network background and has no idea just how detrimental CWOP has become to the APRS-IS. He doesn't realize that a full 3 out of 10 packets that flow thru the core right -now- are not making it end to end on these 5 minute polling intervals. It will not make ANY difference where these stations -enter- the network; the load comes from the fact that, in the case of the core, a load of 50 packets per second worth of CWOP stations turns into 10,000 packets per second by the time it's been processed for all the stations connecting to a core server. It doesn't, again, matter -where- the data comes from; since all data flows THRU the core, it's a problem. CWOP simply needs to be separated off our network, since they clearly will not write software that will not detrimentally affect our network, and the CWOP management views hams as more of the problem here.
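The 50 to 10,000 figure works out as follows. The two helper functions are hypothetical, and the per-server client count of 200 is only what those two numbers imply, not a figure stated in the post.

```python
def amplification(inbound_pps, outbound_pps):
    """How many times each inbound packet is effectively handled."""
    return outbound_pps / inbound_pps


def processed_rate(inbound_pps, connected_clients):
    """Worst case: every inbound packet is processed once per connected client."""
    return inbound_pps * connected_clients


factor = amplification(50, 10_000)   # implied fan-out per packet
total = processed_rate(50, 200)      # 200 clients is implied, not stated
```

The factor comes out to 200, and it applies regardless of which server the packet first entered through, which is the point being made about tier2 versus core.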
So with the whole tier2 vs core issue now having come back to light, and the fact that I alone had to stick my neck out to bring a problem to light, I've reached a point where I have decided to exit the APRS-IS core network.
Clearly, in the eyes of some, 25+ years of networking experience isn't worth anything. The fact that I have $9000 worth of servers and about 10MB/sec of bandwidth 24HR a day that I've donated to the cause wasn't good enough either. Unlike some who do this with spare gear laying around and use academic bandwidth at zero cost out of their pocket, I run a commercial business and this -had- a fixed cost to me. I had no problem doing this!!! I was glad to, but when I started being blamed for things I had no control over, that was it.
So after years of abuse, no appreciation, and finally being accused of creating a problem I was trying to fix, I have informed the core server sysops that I will be shutting off third and fourth come 03/01/08 midnight EST.
When I exit, I take with me a full 1/2 of the APRS-IS core server capacity. I know this will be painful to all APRS-IS users. I only hope there are some others who are willing to step up and run a core server (even though I have to warn them it is a zero-appreciation job that will take years off your life from mental abuse).
I advise the APRS community NOW that future core sysops should not have ANY affiliation with the CWOP program, or the drastic measure that will ultimately need to be taken, filtering and blocking CWOP 100% out of our network, will never happen. I personally think this will be the only solution as well, given current events. If we want stability to come back to our network, this is right now the only option.
I'm sorry I have to leave, and I'm sorry to the other core sysops and Pete for having taken this public, but this dirty laundry needs to be aired. The entire APRS community needs to be aware of how much damage CWOP is causing to our network.
As I mentioned earlier, I will not participate in this dialog further if this turns into a flame war, but I will offer my advice, help and assistance otherwise. Again, you can contact me privately at dave at aprsfl.net for more information or a copy of the draft CWOP white paper.
Regards, Dave Anderson KG4YZY