I’m working on a site for a customer, which is also hosted at 1&1 (the same company I use for my own hosting). I ran into an issue earlier in the week where I was getting an HTTP 500 (Internal Server Error) every time I would run a certain long-running script, and couldn’t figure out why. Normally, when you get a 500 error in PHP, you get it instantaneously, and it’s usually due to a malformation in a server directive in an .htaccess file (Apache error) or a php.ini file (PHP error). However, in this particular instance, I was getting the error after the script had been running for a little while.
This particular site is actually two sites with different URLs that share the same codebase on the same web server and point to two different databases. By the same codebase, I don’t mean two different directories running copies of the same code – I mean literally the same folder serving both sites. Switches in the config code, keyed on which URL you use to access the site, determine which database is connected to and which presentation is shown. I was able to scratch bad code off the list of possible causes because the script ran fine on one site but not on the other. The script takes a user-provided ZIP file containing specially formatted text files, created from a database dump out of the back office, and translates that data into objects, which are passed to database functions that load the information into the database. The primary difference between the two sites is that the data files were bigger on the failing site than on the working one, which meant a longer execution time on the failing site.
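A host-based switch like the one described – one codebase, two URLs, two databases – is commonly implemented along these lines. This is only a sketch: the hostnames, database names, and config keys below are hypothetical, not the site’s actual values.

```php
<?php
// Hypothetical host-based config switch; the hostnames, database names,
// and config keys here are illustrative, not the site's actual values.
function getSiteConfig(string $host): array
{
    switch ($host) {
        case 'www.site-a.example':
            return ['db' => 'site_a_db', 'theme' => 'site-a'];
        case 'www.site-b.example':
            return ['db' => 'site_b_db', 'theme' => 'site-b'];
        default:
            // Fail loudly rather than silently serving the wrong site.
            throw new RuntimeException("Unknown host: $host");
    }
}

// Both URLs hit this same file; the Host header decides everything else.
$config = getSiteConfig($_SERVER['HTTP_HOST'] ?? 'www.site-a.example');
```

Keying everything off the Host header is what makes "literally the same folder" workable for both sites.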
I did some significant work to increase the performance and security of the database import – converting the database class to use prepared statements instead of raw queries, freeing objects from memory once they were no longer needed in the page display class that managed the import, that sort of thing. However, there is simply a lot going on in the import, and I can’t improve performance any further without cutting out essential elements of the import function. So, I called 1&1 support to see if they could get me more information on the 500 error, since they use customized ErrorDocuments that don’t give details. Apache writes those details to a log, usually located in /var/log/httpd/error_log (or similar). When I SSH into the 1&1 server, I don’t have access to /var/log/httpd, so I can’t read the log myself, and there is no way to view it through the administrative interface.
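For what it’s worth, the raw-query-to-prepared-statement conversion mentioned above generally looks something like this sketch, using mysqli; the table, the columns, and this insertCustomer() helper are hypothetical, not the site’s actual code.

```php
<?php
// Sketch of the raw-query-to-prepared-statement conversion; the table,
// columns, and this insertCustomer() helper are hypothetical.
function insertCustomer(mysqli $db, string $name, string $email): bool
{
    // Before: values interpolated straight into the SQL string, e.g.
    //   $db->query("INSERT INTO customers (name, email)
    //               VALUES ('$name', '$email')");
    // After: a prepared statement – the values never touch the SQL text,
    // and the statement can be prepared once and reused across a loop.
    $stmt = $db->prepare('INSERT INTO customers (name, email) VALUES (?, ?)');
    $stmt->bind_param('ss', $name, $email);
    $ok = $stmt->execute();
    $stmt->close();
    return $ok;
}
```

Beyond the security benefit, preparing once and executing many times is what helps a bulk import like this one.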
The call, dear reader, was one of the worst support calls I’ve ever been on.
I explained that I was running a script, getting a 500 error, and needed details about the error, which I wasn’t able to get on my own, or didn’t know how to get on my own. I explained that all I needed the support consultant to do was to tell me how to access the details of the error so I could go about debugging my code and fixing it. Ultimately, I figured I was running into some sort of limit – time limit, memory limit, database connection time limit, database query limit, something like that. Getting the log would let me figure out which specific limit I ran into, so I could fix the code to never hit that limit. However, without knowing which limit I was hitting, I wouldn’t know which part of the code to modify.
It took me an hour and ten minutes to get the tech to understand what I was asking of her. Now, I’m calling a web hosting company’s technical support line, so you’d think that the person on the other end of the phone would know what an HTTP 500 error was, and would be familiar with people needing to get info about what caused the error. Under normal operating conditions, PHP will throw errors when it fails – which I am already catching and logging to a file using customized set_error_handler and trigger_error code. I was getting the 500 error after about 60-90 seconds of processing time, and PHP wasn’t logging any errors, so Apache was the only source of error information. During the course of our conversation, she:
- tried to tell me that the “details” I was looking for were the type of error (“Internal Server Error”) and the description (“Please try again later.”)
- tried to update my .htaccess file for the production site to force it to use PHP5, despite the fact that the problem was happening in the DEV site, and production hasn’t been upgraded to PHP5 yet, which would likely break it – which is why we are running a new version of the site through DEV. She tried to do this without my permission, and despite my repeated explanation that the problem was happening on the DEV site.
- initially couldn’t understand why she couldn’t view the page that was throwing the error, until I was finally able to get her to understand that it was an administrative function behind a site-level login, and she would not only need credentials to log into the site but also a specially formatted ZIP file to upload in order to get the error in the first place.
- made me create a login for her and send her the file to upload, despite my insistence that I didn’t need her to verify that I was getting the error, I needed her to get me access to the details about the error.
- failed to successfully upload the ZIP file I sent her – presumably because she either didn’t save it properly or tried to upload one of the .txt files inside the ZIP file by itself.
- very demeaningly pointed me to an article in their knowledgebase about how to use PHP to log PHP errors to a file, which I pointed out that I was already doing, and which I demonstrated was not logging any errors.
- ultimately failed to get me the information I requested.
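For reference, the kind of custom error logging I already had in place – set_error_handler writing to a file, with trigger_error raising application-level errors – can be sketched like this; the log file path is an example, not the site’s actual location.

```php
<?php
// Sketch of custom error logging via set_error_handler; the log file
// location is an example, not the site's actual path.
define('ERROR_LOG_FILE', __DIR__ . '/php-errors.log');

set_error_handler(function (int $errno, string $errstr,
                            string $errfile, int $errline): bool {
    $entry = sprintf("[%s] %d: %s in %s on line %d\n",
                     date('Y-m-d H:i:s'), $errno, $errstr, $errfile, $errline);
    file_put_contents(ERROR_LOG_FILE, $entry, FILE_APPEND);
    return true; // true = don't also run PHP's internal handler
});

// Application code can then raise its own loggable errors:
trigger_error('Import step failed', E_USER_WARNING);
```

The point is that any error PHP itself raised would have landed in this file – and nothing did, which is why Apache’s log was the only remaining source of information.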
While she had me on hold for about 15 minutes “researching” my issue, I went a-Googling (and yes, Google, I’m using this term correctly, since I used Google for searching) and came up with details on the Apache directives they use to redirect users to customized HTTP error pages, and the specific location of the error logs that Apache generates. Since it’s a shared hosting setup, I don’t have root access to the box, and I knew I wouldn’t be able to view those files. During the last five minutes of the conversation, I gave her the specific path to the file I needed, and she said she’d ask the sysadmins to get it. It’s been a few days now, and I still don’t have a copy of it, and doubt I’ll get one.
Frustrated, I decided to do some testing on the script to see where it was choking. The site architecture involves a master Page class that contains common headers and footers for all pages, with customized Page classes (as class extensions) for the different site pages. For example, the class that controlled this page was called PageRefreshDB. The class is divided into functions that handle different files and load data into different tables, all called by a display function that prints out what the script is doing and how long each step took in seconds. I commented out all of the function calls except for the drop and create table functions, uploaded it to the server, ran it, and it worked. I then went line by line, adding one function call at a time, and everything worked – until I uncommented the last function call, and it 500’ed on me again.
It’s important to point out that there are three environments for this website: my local machine (OS X, Apache 2, PHP 5, MySQL 5), the DEV site, and (eventually) the production site. I thoroughly test all of my code locally before uploading it to DEV – checking it on DEV is to ensure there are no differences in how the code executes on the live server, no problems caused by download delays, that sort of thing. It’s nothing but a validation run of code I know to work properly. What made this error particularly frustrating to debug – beyond the fact that each test iteration meant 60 to 90 seconds of waiting for the function to fail – was that it worked just fine on both test sites on my local machine, but only on one of the two sites on the live server.
Okay, so everything worked fine until the last function call, so the problem had to be in the last function, right? Well, the functions can basically be grouped into three blocks:
- Drop and create refreshable tables;
- Populate the product and product-related tables;
- Populate the people-related tables (salespersons and customers).
So, if I commented out the people-related tables, everything worked fine. Since there are no dependencies between the product tables and the people tables, I commented out the block dealing with products, and left the block dealing with people, which included the last function that was giving me problems. Well, that import worked just fine too, which meant that the problem wasn’t with the last function, but was with some kind of limit I was hitting.
I’ve been writing PHP code for 10 years, so I’m not exactly new at this. I took care of all the usual “gotchas” – increased the script timeout limit to 600 seconds, checked timeout limits on MySQL connections (unlimited), checked limits on upload data (38x more than what I was sending), etc. Everything checked out. So, I tried putting in some diagnostic code to figure out exactly where the script was failing.
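Those checks amount to a few lines like the following; the values are examples, and on shared hosting the set_time_limit()/ini_set() overrides may be silently ignored.

```php
<?php
// The usual limit checks; the values are examples, and on shared hosting
// these set_time_limit()/ini_set() overrides may be silently ignored.
set_time_limit(600);             // raise the script timeout to 600 seconds
ini_set('memory_limit', '128M'); // if the host allows runtime overrides

// Confirm what the server actually honors:
echo 'max_execution_time:  ', ini_get('max_execution_time'), "\n";
echo 'memory_limit:        ', ini_get('memory_limit'), "\n";
echo 'upload_max_filesize: ', ini_get('upload_max_filesize'), "\n";
echo 'post_max_size:       ', ini_get('post_max_size'), "\n";
```

Reading the values back with ini_get() matters precisely because a shared host can override or ignore what you set.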
I ended up narrowing it down to where the script was building customer contact information by looping through a contact file. The script built the contact information for all of the customers, then passed the array of Customer objects to a database function that imported all of the contact information at once. The script was bombing out during the file loop, which meant I wasn’t hitting any limits with MySQL. That left script timeout or memory usage. I started outputting total script time and total memory usage, trying to see if I was hitting any limits. As I said, I had optimized the code to use less memory, so memory shouldn’t have been an issue – and the memory I was using was significantly below the maximum allowed memory usage for PHP scripts as reported by phpinfo(). What I did discover was that the script would bomb out at different execution times and different memory usage values, so it wasn’t clear that I was hitting a fixed ceiling and having my script killed by PHP.
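The diagnostic instrumentation amounted to something like this sketch – print elapsed time and memory usage after each step; the step labels and reportStep() helper are hypothetical.

```php
<?php
// Diagnostic instrumentation like that described above: print elapsed
// time and memory usage after each step; the labels are examples.
function reportStep(string $label, float $start): void
{
    printf("%-12s %7.2fs  %6.2f MB used (%6.2f MB peak)\n",
           $label,
           microtime(true) - $start,
           memory_get_usage(true) / 1048576,
           memory_get_peak_usage(true) / 1048576);
    flush(); // push progress to the browser even if the script dies later
}

$start = microtime(true);
// ... load the product tables ...
reportStep('products', $start);
// ... loop through the customer contact file ...
reportStep('customers', $start);
```

Flushing after every step is what let me see how far the script got, and at what time and memory cost, before it was killed.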
Plus, if my script were being killed by PHP rules, PHP would throw an error, which would go in my error log. That’s what was so weird about the whole thing – it was almost as if there was some sort of “sentinel” program on the server, checking for long-running PHP scripts and killing them without a word of explanation when it found them. Once I got to this point, I was beyond my expertise, so I did some Googling to figure out whether there was an explanation for PHP scripts that randomly stop executing, and whether there was some way of logging or diagnosing it. One suggestion was that browsers sometimes give up loading pages after a certain amount of time passes without receiving any output, which wasn’t the case here, since output was being sent at the end of every function execution – and the customer load took less time than some of the other function calls. Then I stumbled on a Stack Overflow post that said this:
I think this is caused by some process monitor killing off “zombie processes” in order to allow resources for other users.
There were other users, also on 1&1, having the same problem with scripts being seemingly arbitrarily killed after running for a while. The user who posted the comment above was on a different host and had access to the log files, which indicated that the process was killed rather than allowed to finish. I have a strong suspicion that the same is true of 1&1, given the other users’ similar experiences.
At this point, I will have to rewrite the code to execute in blocks, and have the page reload a few times in order to keep execution times below the threshold of the process monitor, which I expect is 60 seconds. This solution is absolutely terrible, because it means I can’t enclose my database actions in a transaction that could be rolled back in case something goes wrong. This way, if something goes wrong, the website will be stuck in mid-load, which would be disastrous for production. The only way to get out of mid-load would be to load an older export file, so I could either rely on the administrator to have one sitting around, or I could store the last successful load on the system, and have an auto-rollback feature that would process the last successful load on load failure. Using either of these methods is inferior to using a database transaction, because transactions were created for this purpose – to prevent the need for manual rollback code.
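The block-by-block rewrite will look roughly like this sketch – one bounded block of work per page request, then a redirect to the next step, keeping each request under the suspected threshold. The step names and the runImportStep() stub are hypothetical.

```php
<?php
// Rough sketch of the chunked-import workaround: one bounded block of
// work per page request, then a redirect to the next step, keeping each
// request under the suspected ~60-second kill threshold. The step names
// and the runImportStep() stub are hypothetical.
$steps = ['recreate_tables', 'load_products', 'load_people'];

function nextStep(int $step, int $total): ?int
{
    // The step to redirect to next, or null when the import is finished.
    return ($step + 1 < $total) ? $step + 1 : null;
}

function runImportStep(string $name): void
{
    // ... perform this block's table loads here. Note the drawback from
    // above: without one wrapping transaction, a failure in a later step
    // leaves the database mid-load.
}

$step = (int)($_GET['step'] ?? 0);
runImportStep($steps[$step]);

$next = nextStep($step, count($steps));
if ($next !== null) {
    header('Location: ?step=' . $next); // browser re-requests immediately
} else {
    echo "Import complete\n";
}
```

Each request stays short, but as noted above, the price is losing the single wrapping transaction that would make a failed import harmless.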
The support technician I spoke to on the phone should have known where to go for HTTP 500 error logs, she should have known about the process monitor that kills long-running scripts, and the call should have taken one minute and fifteen seconds, not one hour and fifteen minutes. As Rob Hardgood said on the Stack Overflow post:
I … am supremely pissed at 1and1.