CRITICAL...xTuple database connection continually drops sporadically. Also sporadically connects. Not consistent. CRITICAL
Hello. Hopefully someone here can offer some advice. We have just started implementing xTuple on our network and have been having nothing but issues. Our current setup is as follows:
DELL PowerEdge T110 II Tower Server - Windows Server 2012 Essentials (Running xTuple PostgreSQL database)
10 DELL Vostro Desktops - Windows 7 64-bit (Running xTuple GUI clients)
2-3 other laptops and desktops (running Windows XP 32-bit and xTuple GUI clients)
TELUS VDSL2 Modem
ASUS Dark Knight Gigabit Router
TP-LINK 16 port Gigabit Unmanaged Switch
I have set up the PG database and restored the "empty" database from sourceforge.net (I also tried quickstart and demo). When I then connect via port 5432 from a Windows box, it will not connect the first or second time (I get a "not listening on port 5432" message), and then it connects on the third attempt (sometimes on the second). The database connection then aborts (citing either application disconnect or network failure). If I try to reconnect, it will usually reconnect right away.
I have switched out the modem, router and switch (there are no other hubs or switches between the main switch and the computers). I also tested the network with a LAN tester and found no serious wiring faults. I did replace a couple of CAT5e cables, as 1 or 2 wires were loose (not even pairs that Ethernet uses, but I still replaced them). I ran a series of network tests, including Wireshark and Microsoft Network Manager, and saw no serious issues (high traffic volumes, routing errors, etc). Ping tests from clients to server came back with expected results (<=1ms) as well.
When the connection to the PG database fails, a ping timeout can be seen in a concurrently running ping * -t test. The server is NOT running any advanced settings. No domain is set up; clients are running in a Windows Workgroup. The DNS server service on the DELL server is not running (because it is not needed - no other reason). NIC settings are set to never turn off. Power management settings for the server are set for performance, never to power off any part of the server's hardware.
I know there are other steps I took, and if anyone has any ideas or needs any other information, I would be glad to try anything or, if I forgot to mention it, tell you I already tried that. This, as you can probably assume, is exceedingly frustrating and I truly hope someone here has a thought that I have not had.
I am very IT adept and will not mind how technical any ideas might be. Thank you again in advance and have a great day!
How about setting up the xTuple database on a workstation or temporary server? That would be my next troubleshooting step.
Also, I notice that you are running Windows Server 2012 Essentials on the server. I recently installed a new Dell server for a customer on Windows Server 2012 Essentials and had nothing but problems getting their accounting software working. Their accounting is Sage 50 with a MySQL database. I installed Windows 2008 on the server and everything worked fine.
Windows Server 2012 Essentials is no longer an option for me. Microsoft stripped it down to the point that it can't be called a server OS, in my opinion.
You do make a good point about Server Essentials; however, even though the server software and included packages may be stripped down, the NIC functionality shouldn't be affected. That is a very basic function of any server, or indeed any desktop. If a Windows Server 2008 must be purchased in order to run a stable network, then Microsoft has bigger issues than can be listed here :)
Also, it should be known that although I am running Server 2012 Essentials currently, we started down this rabbit hole with Ubuntu 12.04 LTS (Precise Pangolin) and then switched to Red Hat Enterprise Server. No joy on either occasion. It seems to me that the issue is centered around the actual tower, which of course the manufacturer is avoiding entirely. The BCOM NIC that the box currently has could be the issue, but no amount of troubleshooting has been able to pinpoint it.
The other interesting thing is that Alfresco ECM works perfectly fine, which also runs on a PostgreSQL 8.4 server. Listening on port 5433 with no problems.
As for your other idea of using a desktop client to act as a "pg server", it is an excellent idea, and one that I have already implemented in order to set up xTuple. However, the goal is to have it running not on a desktop but on a server, to avoid user interfacing as much as possible. As for the processing power required, xTuple and PG require very, very little. I am currently trying to decide if there is any real, massive downside to running xTuple on a desktop. Scalability? Maybe, but the concept of running virtualization in this office is a little down the road.
Anyway, I am still plugging away fruitlessly at this; we strongly want to go with xTuple, which includes paid and unpaid options, but these technical difficulties are making it completely unfeasible. At the risk of sounding like a broken record, ANY AND ALL HELP AND/OR SUGGESTIONS ARE GREATLY GREATLY APPRECIATED!!!!
I should point out that the workstation running xTuple runs perfectly.
I have seen on newer machines that the Ethernet is set to use gigabit by default, and when connecting to an older ISP they had intermittent connections due to the routers at the ISP choking on the gigabit Ethernet. Your problem may be that something on the network the traffic passes through can't handle that. You can actually set options on that card via DOS to disable it. It has been a while, but that might explain why a PING or trace doesn't pick it up; only pushing a lot of data across it would cause the router to drop the connection.
In odd cases like this you have to play "one of these things is not like the other" or you will just chase it. If you had a slot and an extra Ethernet card lying around (a 10 Mb?), I would try that. If it works on the workstation, what is different?
Tom
One of our guys had a suggestion. You might want to try disabling TCP checksum offloading in the properties of your Ethernet adapter and see if that helps.
Tom
Is this server's warranty expired, or is there some other reason that Dell doesn't want to service it? I think, given the fact that you have issues where sometimes PING doesn't even work, you should be able to get them to fix it if it's under warranty.
Regarding the Alfresco: Is the Alfresco server running on the same box, or on a different box? If the Alfresco is running on the same box, then it kind of makes sense; if Alfresco is talking to Postgres through the OS "localhost" interface, then any problem with the hardware eth0 would not show itself in the Alfresco <-> Postgres traffic.
When you had tried Ubuntu and RHEL, and those didn't work, was it the same problem as you are experiencing now?
Check your Postgres config files. What do you have in pg_hba.conf? That is where you set up Postgres to be accessed from the network.
host all all 127.0.0.1/32 trust
host all all :0/128 md5
host all all 0.0.0.0/0 trust
If anything shuts off the connection to the server you'll have issues of this sort - so my first start would be some basics - like the power saver settings where network cards get shut down etc.
As Tom points out, it would be very helpful to see the text of your pg_hba.conf file on the server - that's an easy place to make a typo and have the "Server not listening" type response.
Lastly - you did use a fixed IP address on your server right, and you're attaching to the server via workstation on that IP address, and not via some DNS name?
I doubt your hardware itself would have this many faults being all new.
Hello. As I mentioned earlier, all of the Power Management settings have been altered on the server. They are set to custom with high performance settings, including never shutting off the NIC, going to sleep or hibernating, etc. The pg_hba.conf file contents I have listed above, and they look right to me so far. Maybe I have missed something. And I did neglect to mention that yes, all clients and the server are using DHCP reserved IP addresses (equivalent to static IP addresses), assigned by the ASUS router (i.e. the server will always run on 192.168.1.8, with a Windows box always running on 192.168.1.5 or some such IP). And you are absolutely right. I have effectively ruled out hardware being the issue.
Also, I should say that the NIC the server is using is a Broadcom NetXtreme Gigabit Controller, for which I have updated both the drivers and the firmware.
Thank you very much for the input, hopefully we can figure something out!
Have you tried looking at the event viewer on the server to see if there are any reported issues? The driver for the NIC might be able to report problems with the network.
This doesn't seem like a postgresql conf file issue; if that were the case, it just wouldn't work. But you report that it works sometimes.
Do you have an ethernet crossover cable? You might try connecting a laptop directly to the server's NIC via crossover cable to see if you get the same results. I have seen what you are describing in networks where the switch is flaky; eventually some number of ports on the switch just stop working. I think there is a big clue in the fact that PINGs either work or timeout, depending on if you can also connect to xtuple.
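One way to line up the ping timeouts with the database drops is to log timestamped TCP connection attempts against the PostgreSQL port while `ping -t` runs in another window. The following is only a minimal sketch of that idea; the function names `probe_tcp` and `monitor` are made up for illustration, and it only opens and closes a TCP connection rather than speaking the PostgreSQL protocol.

```python
import socket
import time
from datetime import datetime


def probe_tcp(host, port, timeout=2.0):
    """Try one TCP connection to host:port and return (ok, detail)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.monotonic() - start) * 1000.0
            return True, f"connected in {elapsed_ms:.1f} ms"
    except OSError as exc:
        # OSError covers refused connections, timeouts and unreachable hosts.
        return False, f"{type(exc).__name__}: {exc}"


def monitor(host, port, attempts, interval=1.0):
    """Probe repeatedly, print one timestamped line per attempt,
    and return the number of failed attempts."""
    failures = 0
    for _ in range(attempts):
        ok, detail = probe_tcp(host, port)
        if not ok:
            failures += 1
        stamp = datetime.now().isoformat(timespec="seconds")
        status = "OK  " if ok else "FAIL"
        print(f"{stamp} {status} {detail}")
        time.sleep(interval)
    return failures
```

Running something like `monitor("192.168.1.8", 5432, attempts=3600)` from a client, and again through the crossover cable, would show whether the failures follow the server or the switch. If the FAIL lines match the ping timeouts to the second, the problem is below PostgreSQL.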
host all all 0.0.0.0/0 trust
You're aware, I hope, that setting the last item to "trust" means you've pretty much turned off security - it'll take any password.
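To make the "trust" point concrete, here is a rough sketch that scans pg_hba.conf text and flags `host` rules that use `trust` for anything other than a loopback address. The helper name `risky_trust_rules` is hypothetical, and the sketch only understands the five-field CIDR form (`host database user address method`); hostnames, keywords and the separate netmask form are simply skipped.

```python
import ipaddress


def risky_trust_rules(hba_text):
    """Return the host rules that use 'trust' for a non-loopback CIDR address."""
    flagged = []
    for raw in hba_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        fields = line.split()
        # Assumed layout: host database user address method
        if len(fields) < 5 or not fields[0].startswith("host"):
            continue
        address, method = fields[3], fields[4]
        try:
            network = ipaddress.ip_network(address, strict=False)
        except ValueError:
            continue  # hostnames, keywords or malformed addresses are skipped
        if method == "trust" and not network.is_loopback:
            flagged.append(line)
    return flagged
```

Fed the three lines quoted in this thread, it flags only the `0.0.0.0/0 trust` rule. A narrower replacement might be something like `host all all 192.168.1.0/24 md5`, though that subnet is an assumption based on the 192.168.1.x addresses mentioned earlier.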
I forgot to ask - usually the Windows installer would be from EnterpriseDB - but they're currently releasing PostgreSQL 9.2 - you did remember to get the 9.1 installer, right?
And you didn't have any residual "test" installs of xTuple / PostgreSQL on that server?
Yeah, the last setting is to accept all passwords. I was simply ruling out security issues as the cause of the initial connection problems. I plan on removing that line. But the 127.0.0.1/32 line should be the one accepting connections with a password, and it should work from what I know of PG.
The PostgreSQL currently installed is 8.4. It was set up via the command line but is managed with pgAdmin 1.10.5.
And no other database installs on server. Drive was formatted, server OS installed and PG installed.
And the Xtuple GUI running is 3.5.4.
Are your servers, switches, and workstations all plugged into battery backups? Network dropout can be caused by inconsistent line voltage.
From a troubleshooting standpoint, I'd shut everything down except the server, switch, and one workstation. See if the problem persists. Bring things online one at a time and see if one particular station is causing a fault somehow.
That is an interesting point. To troubleshoot, I ran CPU-Z on the server to check VID, as well as on one of the Windows boxes. Interestingly enough, the server, rated at 3.10 GHz, is fluctuating considerably between 3.10 GHz and 3.5 GHz. Now, this is slightly outside my field of expertise, but it seems to me that this is a significant jump. Why would the server be overclocking? Would this affect NIC connectivity? Bus speed seems to be remaining the same throughout. I am unfamiliar with the tolerance ranges of regulators on motherboards. Would this cause an issue? It would never be noticed in ping tests, network analyzers or web browser connections, as those are normally cached, but a database connection requiring constant connectivity could possibly be affected by this type of behavior.
In any case, using a UPS wouldn't solve the issue, as it only creates a backup for power failures and doesn't actually regulate the electrical power delivered to the devices on the UPS. Not any affordable options I have heard of, anyway. Thanks again for your help. Regardless, I did put on a UPS just now and it does not seem to help the issue or the CPU-Z stats.
Thank you again for your advice. I will keep working on it. Hopefully we can come up with something. I can send you any data you require about the configuration of the systems here.
You can also take a look at your postgresql.conf file and make sure you are listening. Also try to connect from the server itself as well as from clients installed on the network. Make sure your firewall is not blocking this traffic either. If you still have problems, you could also change the port number in postgresql.conf to see if anything else is using that port. It should not be, but stranger things have happened. Also make sure you only have one version of Postgres on a particular port number; if you have an old 8.4 version or some other Postgres install, check the port it is configured for. I like to change the port number, just to avoid that type of thing.
Tom
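As a quick way to double-check what postgresql.conf actually says, the sketch below pulls out `listen_addresses` and `port`, ignoring commented-out lines and letting a later line override an earlier one. The function name `effective_settings` is invented for this example, the parsing is deliberately simplified (it ignores include directives, for instance), and the fallback values are the stock PostgreSQL defaults.

```python
import re

# Stock PostgreSQL defaults, used when a setting is absent or commented out.
DEFAULTS = {"listen_addresses": "localhost", "port": "5432"}

# Matches "name = value" or "name value"; value is quoted or a bare token.
_SETTING = re.compile(r"^\s*(\w+)\s*=?\s*('(?:[^']|'')*'|[^\s#]+)")


def effective_settings(conf_text):
    """Return listen_addresses and port as they appear in the config text."""
    values = dict(DEFAULTS)
    for line in conf_text.splitlines():
        match = _SETTING.match(line)  # lines starting with '#' never match
        if match and match.group(1) in values:
            values[match.group(1)] = match.group(2).strip("'")
    return values
```

If `listen_addresses` comes back as `localhost`, only connections made on the server itself will be accepted, which is worth ruling out before blaming the network.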
Thanks for the advice. The postgresql.conf file says "*" for listening to all clients and 5432 for the port number. I will try changing port numbers to eliminate that possibility. I have checked firewalls on the server, the client boxes and the router. There is also no other version of Postgres running on the server. Like I said, fresh install. Sorry, I don't mean to shoot everything down, and thank you so much for your help. I can post any files you might want to see too, if you would like.
Sounds like you got a handle on it. Other possibilities would include collisions on the network. I have added new hubs before to find they don't play well together and had those kinds of problems on servers I had never had problems with. From the clients you can also run stuff like tracert from windows machines to see how you get to the server. On the same network they should hit fast.
Tom
Thanks for the pointer. tracert came back at <1ms round trip, no problems there. But it is still dropping. I had also bought all brand new switches and a router (TP-Link switch and ASUS Dark Knight router). There were no known conflicts, but for troubleshooting I did replace the TP-Link with a D-Link and then a Cisco, still no joy. I thought collisions might be the cause, but Wireshark didn't pick up anything like that. As I said before, the only really strange thing is how the server's CPU is acting like it has been overclocked, running 300-400 MHz higher than its rating, as well as the core's VID alternating significantly. I have never heard of this causing any NIC issues, but it is the only atypical thing I have found through testing so far.







