1&1 Rant

I’m working on a site for a customer, which is also hosted at 1&1 (the same company I use for my own hosting). I ran into an issue earlier in the week where I was getting an HTTP 500 (Internal Server Error) every time I would run a certain long-running script, and couldn’t figure out why. Normally, when you get a 500 error in PHP, you get it instantaneously, and it’s usually due to a malformation in a server directive in an .htaccess file (Apache error) or a php.ini file (PHP error). However, in this particular instance, I was getting the error after the script had been running for a little while.

This particular site is actually two sites that have different URLs that use the same codebase on the same web server and point to two different databases. When I say the same codebase, I don’t mean two different directories running the same code, I mean literally the same folder used for both sites. There are switches in the config code based on which URL you are using to access the code, which results in connections to different databases and a different presentation. I was able to scratch bad code off of the list of possible causes for the problem because the script ran fine on one site, but not on the other. The script takes a user-provided ZIP file containing specially formatted text files created from a database dump from the backoffice and translates that data into objects, which are passed to database functions that load the information into the database. The primary difference between the two sites is that the data files were bigger on the site that wasn’t working than on the site that was, which resulted in a longer execution time for the site that wasn’t working.

I did some significant work to increase the performance and security of the database import – converting the database class to use prepared statements instead of raw queries, freeing objects from memory once they were no longer in use in the page display class that managed the import, that sort of thing. However, there is simply a lot going on in the import, and I can’t improve performance any further without cutting out essential elements of the import function. So, I called 1&1 support to see if they could get me some more information on the 500 error, since they use customized ErrorDocuments that don’t give details. Apache writes these details to a log, usually located in /var/log/httpd/error_log (or similar). When I SSH into the 1&1 server, I don’t have access to /var/log/httpd, so I can’t look at the contents of this log file myself, and there is no way to view the log through the administrative interface.

The call, dear reader, was one of the worst support calls I’ve ever been on.

I explained that I was running a script, getting a 500 error, and needed details about the error, which I wasn’t able to get on my own, or didn’t know how to get on my own. I explained that all I needed the support consultant to do was to tell me how to access the details of the error so I could go about debugging my code and fixing it. Ultimately, I figured I was running into some sort of limit – time limit, memory limit, database connection time limit, database query limit, something like that. Getting the log would let me figure out which specific limit I ran into, so I could fix the code to never hit that limit. However, without knowing which limit I was hitting, I wouldn’t know which part of the code to modify.

It took me an hour and ten minutes to get the tech to understand what I was asking of her. Now, I’m calling a web hosting company’s technical support line, so you’d think that the person on the other end of the phone would know what an HTTP 500 error was, and would be familiar with people needing to get info about what caused the error. Under normal operating conditions, PHP will throw errors when it fails – which I am already catching and logging to a file using customized set_error_handler and trigger_error code. I was getting the 500 error after about 60-90 seconds of processing time, and PHP wasn’t logging any errors, so Apache was the only source of error information. During the course of our conversation, she:

  • tried to tell me that the “details” I was looking for were the type of error (“Internal Server Error”) and the description (“Please try again later.”)
  • tried to update my .htaccess file for the production site to force it to use PHP5, despite the fact that the problem was happening in the DEV site, and production hasn’t been upgraded to PHP5 yet, which would likely break it – which is why we are running a new version of the site through DEV. She tried to do this without my permission, and despite my repeated explanation that the problem was happening on the DEV site.
  • initially couldn’t understand why she couldn’t view the page that was throwing the error, until I was finally able to get her to understand that it was an administrative function behind a site-level login, and she would not only need credentials to log into the site but also a specially formatted ZIP file to upload in order to get the error in the first place.
  • made me create a login for her and send her the file to upload, despite my insistence that I didn’t need her to verify that I was getting the error, I needed her to get me access to the details about the error.
  • failed to successfully upload the ZIP file I sent her – presumably because she either didn’t save it properly or tried to upload one of the .txt files inside the ZIP file by itself.
  • condescendingly pointed me to an article in their knowledgebase about how to use PHP to log PHP errors to a file – which I pointed out I was already doing, and which I demonstrated was not logging any errors.
  • ultimately failed to get me the information I requested.
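For reference, the custom error logging I mentioned – which was already in place and capturing nothing for this failure – looks roughly like this (the handler name and log path are illustrative, not my actual code):

```php
<?php
// Sketch of a custom PHP error logger (names and log path are illustrative).
set_error_handler('logErrorToFile');

function logErrorToFile($errno, $errstr, $errfile, $errline) {
    $entry = sprintf("[%s] %d: %s in %s on line %d\n",
        date('Y-m-d H:i:s'), $errno, $errstr, $errfile, $errline);
    error_log($entry, 3, dirname(__FILE__) . '/php-errors.log');
    return true; // don't fall through to PHP's default handler
}

// Application code reports its own failures through the same channel:
trigger_error('Import failed: could not open contact file', E_USER_WARNING);
```

Crucially, if the process is killed from outside PHP, a handler like this never fires – which is consistent with my empty error log.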

While she had me on hold for about 15 minutes “researching” my issue, I went a-Googling (and yes, Google, I’m using this term correctly, since I used Google for searching) and came up with details on the Apache directives they use to redirect users to customized HTTP error pages, and the specific location of the error logs that Apache generates. Since it’s a shared hosting setup, I don’t have root access to the box, and I knew I wouldn’t be able to view those files. During the last five minutes of the conversation, I gave her the specific path to the file I needed, and she said she’d ask the sysadmins to get it. It’s been a few days now, and I still don’t have a copy of it, and doubt I’ll get one.

Frustrated, I decided to do some testing on the script to see where it was choking out. The site architecture involves a master Page class that contains common headers and footers for all pages, and customized Page classes (as class extensions) for the different site pages. For example, the class that controlled this page was called PageRefreshDB. It is divided up into different functions that handle different files and load data into different tables, which are called by a display function that prints out what the script is doing and how long each step took in seconds. I commented out all of the function calls except for the drop and create table functions, uploaded it to the server, ran it, and it worked. I then went line by line, adding one function call at a time, and everything worked, until I uncommented the last function call, and it 500’ed on me again.

It’s important to point out that there are three environments for this website: my local machine (OSX, Apache 2, PHP 5, mySQL 5), the DEV site, and (eventually) the production site. I thoroughly test all of my code locally before uploading it to DEV – checking it on DEV is just to ensure there are no differences in how the code executes on the live server, no problems caused by download delays, that sort of thing. It’s nothing but a validation run of code I know to work properly. What made this error particularly frustrating to debug – beyond the fact that each test iteration meant 60 to 90 seconds of waiting for the function to fail – was that the script worked just fine on both test sites on my local machine, and only worked on one of the two sites on the live server.

Okay, so everything worked fine until the last function call, so the problem had to be in the last function, right? Well, the functions can basically be grouped into three blocks:

  1. Drop and create refreshable tables;
  2. Populate the product and product-related tables;
  3. Populate the people-related tables (salespersons and customers).

So, if I commented out the people-related tables, everything worked fine. Since there are no dependencies between the product tables and the people tables, I commented out the block dealing with products, and left the block dealing with people, which included the last function that was giving me problems. Well, that import worked just fine too, which meant that the problem wasn’t with the last function, but was with some kind of limit I was hitting.

I’ve been writing PHP code for 10 years, so I’m not exactly new at this. I took care of all the usual “gotchas” – increased the script timeout limit to 600 seconds, checked timeout limits on mySQL connections (unlimited), checked limits on upload data (38x more than what I was sending), etc. Everything checked out. So, I tried putting in some diagnostic code to figure out exactly where the script was failing.
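The gotcha checks amount to a few one-liners in PHP – raise the execution limit, then read back the other limits to see what you’re up against (a sketch; the actual values depend on the host’s configuration):

```php
<?php
// Raise the script timeout, then inspect the other usual limits (sketch).
ini_set('max_execution_time', 600);
set_time_limit(600);

echo 'max_execution_time:  ', ini_get('max_execution_time'), "\n";
echo 'memory_limit:        ', ini_get('memory_limit'), "\n";
echo 'upload_max_filesize: ', ini_get('upload_max_filesize'), "\n";
echo 'post_max_size:       ', ini_get('post_max_size'), "\n";
```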

I ended up narrowing it down to where the script was building customer contact information by looping through a contact file. The script built the contact information for all of the customers, then passed the array of Customer objects to a database function that imported all of the contact information at once. The script was bombing out during the file loop, which meant that I wasn’t hitting any limits with mySQL. That left script timeout or memory usage. I started outputting total script time and total memory usage, trying to see if I was hitting any limits. As I said, I had optimized the code to use less memory, so memory shouldn’t have been an issue – and the memory I was using was significantly below the maximum allowed memory usage for PHP scripts as reported by phpinfo(). What I did discover was that the script would bomb out at different execution times and different memory usage values, so it wasn’t clear that I was hitting a certain ceiling and my script was being killed by PHP.
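The diagnostics were nothing fancy – timing and memory printouts between loop iterations, something like this sketch:

```php
<?php
// Print elapsed time and memory usage periodically during the import (sketch).
$scriptStart = microtime(true);

// ... inside the contact-file loop, every few hundred rows:
printf("%.1fs elapsed, %.2f MB in use (peak %.2f MB)<br>\n",
    microtime(true) - $scriptStart,
    memory_get_usage() / 1048576,
    memory_get_peak_usage() / 1048576);
flush(); // push partial output to the browser immediately
```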

Plus, if my script were being killed by PHP rules, PHP would throw an error, which would go in my error log. That’s what was so weird about the whole thing – it was almost as if there were some sort of “sentinel” program on the server, checking for long-running PHP scripts and killing them without a word of explanation when it found them. Once I got to this point, I was beyond my expertise, so I did some Googling to figure out whether there was any explanation for PHP scripts that randomly stop executing, and whether there was some way of logging or diagnosing the cause. One suggestion was that browsers sometimes give up loading pages after a certain amount of time has passed without receiving any output data, which wasn’t the case here, since output data was being sent at the end of every function execution – and the customer load took less time than some of the other function calls. Then, I stumbled on a Stack Overflow post that said this:

I think this is caused by some process monitor killing off “zombie processes” in order to allow resources for other users.

There were other users, also using 1&1, that were having the same problems with scripts being seemingly arbitrarily killed after running for a while. The user that posted the comment above was using a different host, and had access to the log files, which indicated that the process was killed instead of being allowed to finish. I have a strong suspicion that the same is true with 1&1, due to other users having similar experiences.

At this point, I will have to rewrite the code to execute in blocks, and have the page reload a few times in order to keep execution times below the threshold of the process monitor, which I expect is 60 seconds. This solution is absolutely terrible, because it means I can’t enclose my database actions in a transaction that could be rolled back in case something goes wrong. This way, if something goes wrong, the website will be stuck in mid-load, which would be disastrous for production. The only way to get out of mid-load would be to load an older export file, so I could either rely on the administrator to have one sitting around, or I could store the last successful load on the system, and have an auto-rollback feature that would process the last successful load on load failure. Using either of these methods is inferior to using a database transaction, because transactions were created for this purpose – to prevent the need for manual rollback code.
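The block structure enumerated earlier maps naturally onto a step parameter. A minimal sketch of the reload approach (function names hypothetical – the real ones live in PageRefreshDB) might look like this:

```php
<?php
// Hypothetical sketch: run the import one block at a time, reloading the page
// between blocks to stay under the suspected 60-second kill threshold.
// The step functions are stubs here; the real ones live in PageRefreshDB.
function dropAndCreateTables() { /* block 1: drop/create refreshable tables */ }
function loadProductTables()   { /* block 2: products and related tables */ }
function loadPeopleTables()    { /* block 3: salespersons and customers */ }

$steps = array('dropAndCreateTables', 'loadProductTables', 'loadPeopleTables');
$step  = isset($_GET['step']) ? (int) $_GET['step'] : 0;

if ($step < count($steps)) {
    $start = microtime(true);
    call_user_func($steps[$step]);
    printf('Step %d (%s) finished in %.1f seconds.', $step, $steps[$step],
        microtime(true) - $start);
    // Bounce the browser to the next block; no transaction survives the reload.
    echo '<meta http-equiv="refresh" content="1;url=?step=' . ($step + 1) . '">';
} else {
    echo 'Import complete.';
}
```

The fatal flaw is right there in the comment: each reload is a fresh PHP process and a fresh database connection, so nothing can be rolled back across step boundaries.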

The support technician I spoke to on the phone should have known where to go for HTTP 500 error logs, she should have known about the process monitor that kills long-running scripts, and the call should have taken one minute and fifteen seconds, not one hour and fifteen minutes. As Rob Hardgood said on the Stack Overflow post:

I … am supremely pissed at 1and1.

New Website, Work in Progress

After switching my family website (http://www.kelseyandkevin.com) to a WordPress site, I decided that I wanted to make this site a WordPress site also. The release of WordPress 3 got me motivated to make the change. Converting the site to WordPress will allow me to keep a blog here, and allow me to manage my pages more easily.

I upgraded the K&K site to WordPress using the automatic upgrader, which worked like a charm. I have yet to play around with the new features of WordPress 3 in any significant way. This blog will end up being about technology (mostly web development on the LAMP stack, HTML, CSS, standards issues, that sort of thing) and my academic work in Science and Technology Studies (STS), particularly my dissertation focus (website design processes and the impact on users with disabilities).

As of writing, I haven’t done anything with the WordPress theme, so you’re getting plain vanilla right now. The K&K site has a fully customized theme, which I will end up employing here at some point, when I get a moment to sit down and write it.

Gallery 3 Rant

The architecture of the K&K site is pretty simple – WordPress 3 as a CMS for blogging, page administration, and theme presentation, and Gallery 3 for photos and (soon) video. At the moment, I’m just loading Gallery 3 with a custom theme in an iframe on a photos page. I experimented with direct integration with WordPress (nightmare) and custom themes for Gallery (basically impossible due to the way their CSS loads).

Overall, I’m pretty happy with Gallery 3. It’s a huge improvement over Gallery 2, which I was using before, and which was virtually unusable for what I wanted (batch-uploading photos while retaining uncompressed full-size images). Given that G3 is in RC1 stage, I was expecting some problems with the software. I was also expecting that the developers would be willing to work through those problems as they arose. So, when I discovered that entering quotes in any field led to escaped quotes (\' or \") being inserted into the database, I opened a trouble ticket on their SourceForge page, and provided helpful instructions on how to fix the problem. The response I got back was that they wouldn’t fix the problem, because

magic quotes have been depreciated as of PHP 5.3 so we won’t be putting any code in gallery3 to support it.

Okay, yes, that’s true, and no one who has control over their server should have PHP 5 running with magic_quotes_gpc on. However, not all of us have control over our servers, PHP 5.3 is pretty new (not even a year old as of this posting), so many hosts haven’t upgraded to it yet, and the average user isn’t going to know jack about magic_quotes_gpc or how to turn it off. Being able to write PHP code isn’t a prerequisite for using Gallery, because it is created to be as easy to use as possible.

Gallery 3 does include an .htaccess file with instructions to Apache to tell PHP to turn magic_quotes_gpc off, but if Apache-level PHP config is disabled by the server (such as when it is being run as CGI, as in the case of my server), this method doesn’t work. You’d think, at this point, that the Gallery team would include a bit of code to check if magic_quotes_gpc is actually off, and if not, either:

  1. run stripslashes() on global variables to fix the problem, or at the VERY least
  2. inform the user that they need to take steps to turn magic_quotes_gpc off.

Since they do neither of these, the user will end up inserting slashes into their database, and this is somehow the user’s “fault” for not tracking down magic_quotes_gpc on their own. The attitude of the developers on this issue is a clear disdain for any non-cutting-edge technology: They don’t support Apache 1, they don’t support earlier than the latest version of PHP, they don’t support magic_quotes_gpc, despite the fact that having magic quotes on was the default configuration in PHP up until 5.3.0, which (as I said) was released 06/30/2009, less than a year ago.

Now, there’s some blame to go around here – my host, 1&1, is somewhat notorious for being a slow adopter of technology. They are still on Apache 1, they were slow to adopt PHP5, they were slow to adopt mySQL5, and they are slow on the PHP updates. They also have magic_quotes_gpc turned on by default, and they are running PHP5 as a CGI module, instead of as an Apache module (which is the preferred configuration), so the .htaccess directives in Gallery 3’s distribution don’t work properly. Because of these issues, plus crappy support, plus cost, I’m strongly considering moving my websites to a different host.

That said, it should be the responsibility of the Gallery developers to inform the user when their server configuration doesn’t meet specs (if they require Apache 2, warn the user if they are running Apache 1, etc). Or, they could just make sure that their software works with magic_quotes_gpc on – to do so is about 10 lines of code. I’ve been escaping superglobals for years now. If we, as web developers, need to continue to design websites that work with older browsers, they should certainly design their software to work with older server software. Not all of us have the knowledge, ability, or resources to manage our own servers with root access.
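For the record, the shim I have in mind is essentially the classic recursive-stripslashes snippet from the PHP manual – PHP 5-era code, since get_magic_quotes_gpc() was later removed in PHP 8:

```php
<?php
// Undo magic_quotes_gpc at runtime if the server has it enabled.
// (PHP 5-era code: get_magic_quotes_gpc() was removed in PHP 8.)
if (get_magic_quotes_gpc()) {
    function stripslashes_deep($value) {
        return is_array($value)
            ? array_map('stripslashes_deep', $value)
            : stripslashes($value);
    }
    $_GET     = array_map('stripslashes_deep', $_GET);
    $_POST    = array_map('stripslashes_deep', $_POST);
    $_COOKIE  = array_map('stripslashes_deep', $_COOKIE);
    $_REQUEST = array_map('stripslashes_deep', $_REQUEST);
}
```

That’s the whole fix. Dropping it into a bootstrap file would have made the magic quotes question moot for every user.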

Given my configuration settings, the solution to my problem was to override the magic quotes setting using php.ini files. I say files and not file, because under CGI, php.ini doesn’t cascade – it only affects the directory the executing script lives in. Therefore, I needed to put a php.ini file disabling magic quotes in every directory in my Gallery installation that contains PHP files. There are 108 of them. Plus, any time I upgrade Gallery, there’s a strong likelihood that these folders will get replaced, added to, changed, etc. – so I would need to re-check for the php.ini files after every upgrade. Since I do happen to know what I’m doing with PHP, I wrote a PHP script to do this for me, which I will have to run at the end of every upgrade:
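My original script isn’t reproduced here, but the approach is simple – walk the Gallery tree and drop a php.ini disabling magic quotes into every directory that contains .php files (the install path below is a placeholder):

```php
<?php
// Sketch: recurse through the Gallery tree and write a php.ini disabling
// magic quotes into every directory that contains .php files.
function writeMagicQuotesIni($dir) {
    $hasPhp = false;
    foreach (scandir($dir) as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue;
        }
        $path = $dir . '/' . $entry;
        if (is_dir($path)) {
            writeMagicQuotesIni($path); // recurse into subdirectories
        } elseif (substr($entry, -4) === '.php') {
            $hasPhp = true;
        }
    }
    if ($hasPhp) {
        file_put_contents($dir . '/php.ini', "magic_quotes_gpc = Off\n");
    }
}

if (is_dir('/path/to/gallery3')) {        // placeholder install path
    writeMagicQuotesIni('/path/to/gallery3');
}
```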

What’s worse, I had to plumb the database for backslash characters, and manually edit every entry that had them. This process was actually harder than I anticipated – apparently, when using the SQL LIKE clause, you have to double-escape your backslashes, like so:
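The query I ended up with looked something like this (table and column names changed) – each literal backslash has to be escaped once for the string literal and once more for the LIKE pattern:

```sql
-- Four backslashes in the SQL text match ONE literal backslash in the data.
SELECT id, description
FROM products
WHERE description LIKE '%\\\\%';
```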

Everything is all cleaned up now, but it’s going to require ongoing vigilance on my part to ensure that this problem stays fixed, because the Gallery team didn’t think it was important to support users that have magic_quotes_gpc turned on and can’t do anything about it, even at the .htaccess level. Hopefully the developers hear about this issue from other users and consider doing something more concrete about it in the future. In the meantime, I’m going to go shopping for new hosting companies.