Wednesday, October 31, 2012

Improving My Use of Database Lookup Tables in PHP


When you're deep in code and realize that each item of a batch needs to check some other info from somewhere else, the most straightfoward way to get the info you need is to just fetch what you need right there.  However, this tends not to scale well... doing one-hit lookups for each item from a batch of 100 isn't bad, but when that batch becomes 10,000?  100,000?  A database connection can execute an individual SQL only so fast... 10,000 of them will add up on the clock.

When the size of the lookup table involved is itself relatively small compared to your dataset, it's worth considering a lazy loaded singleton lookup array.

By initializing your lookup array variable to something like $lookup = "NOT_SET_YET", a string, you can lazy load it at the first point your code needs it, just by checking is_array($lookup).  This is a small optimization, in that if your to-process data loop never gets as deep as needing this particular lookup, you don't waste time loading it in the first place.  This is_array($lookup) piece is also the step that makes it a singleton... once it's loaded, it doesn't have to be reloaded.  Just be sure you did that $lookup = "NOT_SET_YET" initialization outside the scope of the data processing loop... otherwise, you are indeed reloading it every time.

For the data loading itself, I try to structure it more akin to a database index than a table.  When my lookup needs to know that a given row does or does not exist in that table, I combine the column values I need into the array keys rather than values.  This allows me to just use isset() when doing the lookup.  I wrestled often with having to add isset() to my original array-checking function usage, where I'd get Notices for undefined items.  It occurred to me that if my lookup itself could short-circuit directly to the member or not, then it would be like having a database index to check.  Most of the time now, I check for isset($lookup['DATA-I-WANT']) and I'm done.

I at least put TRUE in the values because isset() won't see a key if its value is NULL.  Further, in some of my use cases, I have found that using both TRUE and FALSE could be helpful, depending on how I needed to read the data itself.  In these cases, I'm back to doing both isset() and checking the array value, but it's still better than doing a SQL query.

In a recent use case, I needed to know if a given foreign key in my lookup table had any rows at all... then and only then did the lookup for a specific row need to be done.  In order to roll this capability into the same array that had keys like $lookup[$oneGivenFK . $oneGivenPK], I also loaded in $lookup['FK_exists--' . $oneGivenFK].  This allowed me to do the isset($lookup['FK_exists--' . $oneGivenFK]), and if TRUE, proceed with the isset($lookup[$oneGivenFK . $oneGivenPK]) check.

A lesson learned here is to put a painfully obvious delimiter in between the values used to construct these big keys.  I tend to use three dashes, "---", to avoid any coincidence of my delimiter from being a part of the actual values themselves.

2 comments:

Unknown said...

IMHO the best approach is leaving that kind of work to the database: take use of left OUTER join, use views, create triggers to update specific information regarding a related sub-table...
Avoid doing with PHP what the database can provide with little effort, and specially always ask yourself if you really need those querys inside a cycle.

The Nazg said...

Yes indeed, when you can be satisfied that you accept that your business logic is thus tied to your database of choice, that approach can succeed.

I should probably caveat my own views as being from the perspective of PHP itself being my "insulation" against platform migrations. This viewpoint is a holdover from my Java days, when I experienced migrations of database vendors as well as OS vendors. As such, I tend to limit my database dependencies to basic constraints and ANSI SQL.

To your point, Hugo, when I can indeed structure all of my data gathering into one large SQL, I agree that it's best to do that versus query->dataset->loop->repeatEverDeeper.

This singleton lookup array is one of my first steps in refactoring my way out of that scenario ;-)