Wednesday, September 18, 2013

The Cardinality And Selectivity Of SQL Conditions

The Cardinality And Selectivity Of SQL Conditions

Source Link/Original post:

The selectivity is what goes with the cardinality concept. 
The "cardinality" refers to the number of "distinct" values. For a column "SEX". the possible values are "male" and "female" so, your cardinality for that column would be 2, no matter how many rows you have in that table.
The selectivity is the "number of rows" / "cardinality", so if you have 10K customers, and search for all "female", you have to consider that the search would return 10K/2 = 5K rows, so a very "bad" selectivity.
The column for the primary key on the other side is "unique", and hence the cardinality is equal to the number of rows. so, the selectivitiy for searching a value in that column will be 1, by definition, which is the best selectivity possible
   Let's take a look at the example (and follow-up comments) about the          database as a bottleneck. 
Example: I had a Job table that has the following keys (the relevant ones at least):
  • contact_id (int)
  • institution_id (int)
  • contact_role_id (int)
  • date_started (date/time)
  • date_ended (date/time)
From this table, I was trying to select all contacts that had the Job type (contact_role_id) as "CEO". For this one, I am going to select the jobs in which a given contact has a given role type. Before I really thought about any of the selectivity of my statements, I would have done something like this:
  • FROM
  • contact c
  • INNER JOIN
  • job j
  • ON
  • (
  • c.is_deleted = 0
  • AND
  • c.id = j.contact_id
  • AND
  • j.is_deleted = 0
  • AND
  • j.contact_id = #FORM.contact_id#
  • AND
  • j.contact_role_id = #ContactRoles.CEO#
  • AND
  • j.date_started <= NOW()
  • AND
  • (
  • j.date_ended IS NULL
  • OR
  • j.date_ended > NOW()
  • )
  • )

My database uses logical rather than physical deletes, so as a matter of practice, I put the is_deleted checks on either side of the Join so that I would not forget to include them. After that, I then included the contact ID (assuming we are looking for jobs for a given contact) and the role, followed lastly by the date/time constraints (assuming we're looking for active jobs only).
Now that that is out there, let's walk through my choices in terms of Cardinality and Selectivity. I am going to throw out some estimated and hypothetical values. The inaccuracy of these values may affect the specifics, but I think it will still be relevant. For the following thought experiment, I am going to assume that I have a 1,000 contacts and 1,2000 jobs. I am also going to assume that there will always be more jobs than contacts and that, in fact, this 1:1.2 ratio will remain fairly constant.
is_deleted. This field is a yes/no toggle, therefore the Cardinality of the field is 2. So, if we take a sample size of 1,000 contact records, we get a Selectivity of 1000 / 2 = 500. If we take the sample size of 1,200 jobs, we get a Selectivity of 1,2000 / 2 = 600. The relative selectivity here tells us that our logical delete should be checked on the contact first and then on the job second. However, neither of these selectivities are very good, so they should come later in our JOIN clause, probably not at the top where I have them.
contact_id. This field is a foreign key to the contact table. This is an interesting one because I could check the contact_id in the contact table OR I could check it in the job table. Let's see if that matters; if we put the filter on the contact table, we will have a separate ID for each contact. This means that for 1,000 contacts, our cardinality will also be 1,000 which will yield a selectivity of 1 (the best we can hope for). On the other hand, if we go with the contact_id in the job table (as we are in the example above), we have a cardinally of 1,000 and a row count of 1,200 which yields a selectivity of 1,2000 / 1,000 = 1.2. This is still very good, but not as good as the first choice. Clearly, our FORM.contact_id should be checked against contact.id rather than job.contact_id.
contact_role_id. There are only about 10 job types in my system, giving us a cardinality of 10. And, since this column only exists in the job table, this yields a selectivity of 1,2000 / 10 = 120. This is much better than the is_deleted checks, but worse than the contact_id check; it should therefore go between the above two conditions.
date_started. This is a strange one to think about. Assuming that the system does not allow anyone to create future jobs (which mine does not currently do), it means that all rows in the job table will have a past date_started column value. But, how does that affect Cardinality? How can one put a solid number on this "set" of data that will change with time? For the sake of argument (and please correct me if this is bad), since the date lives in a binary context - either before today or after today - I'm going to give it a cardinality of 2 yielding a selectivity of 1,200 / 2 = 600. This selectivity is pretty bad, among the worst that we have seen so far, and should therefore go at the end of the conditions.
date_ended. At first, I started to think about this differently than the date_started column since it helps us to return meaningful, active jobs. But, is this really any different? What are the possible values for this column? Yes, there can be many dates, but what are the sub-sets of possible values: NULL, before today, and after today. There are three sets of categorical values. I think we can translate this to a cardinality of 3 yielding a selectivity of 1,200 / 3 = 400.
Things are starting to get interesting when we think about the selectivity of our different conditions, especially with cases like "contact_id" which can be tested in several different tables. Based only on the selectivities calculated above, I suppose I would order the JOIN conditions as such:
  • FROM
  • contact c
  • INNER JOIN
  • job j
  • ON
  • (
  • c.id = #FORM.contact_id#
  •  
  • AND
  • j.contact_role_id = #ContactRoles.CEO#
  •  
  • AND
  • (
  • j.date_ended IS NULL
  • OR
  • j.date_ended > NOW()
  • )
  •  
  • AND
  • c.is_deleted = 0
  •  
  • AND
  • j.is_deleted = 0
  •  
  • AND
  • j.date_started <= NOW()
  •  
  •         
  • AND
  • c.id = j.contact_id
  • )