If I Can't Who Else.....

Wednesday, September 18, 2013

The Cardinality And Selectivity Of SQL Conditions

Source Link/Original post:

http://www.bennadel.com/blog/1424-Exploring-The-Cardinality-And-Selectivity-Of-SQL-Conditions.htm

The selectivity is what goes with the cardinality concept.

The "cardinality" refers to the number of "distinct" values. For a column "SEX". the possible values are "male" and "female" so, your cardinality for that column would be 2, no matter how many rows you have in that table.

The selectivity is the "number of rows" / "cardinality", so if you have 10K customers, and search for all "female", you have to consider that the search would return 10K/2 = 5K rows, so a very "bad" selectivity.

The column for the primary key on the other side is "unique", and hence the cardinality is equal to the number of rows. so, the selectivitiy for searching a value in that column will be 1, by definition, which is the best selectivity possible

Let's take a look at the example (and follow-up comments) about the database as a bottleneck.

Example: I had a Job table that has the following keys (the relevant ones at least):

contact_id (int)
institution_id (int)
contact_role_id (int)
date_started (date/time)
date_ended (date/time)

From this table, I was trying to select all contacts that had the Job type (contact_role_id) as "CEO". For this one, I am going to select the jobs in which a given contact has a given role type. Before I really thought about any of the selectivity of my statements, I would have done something like this:

FROM
contact c
INNER JOIN
job j
ON
(
c.is_deleted = 0
AND
c.id = j.contact_id
AND
j.is_deleted = 0
AND
j.contact_id = #FORM.contact_id#
AND
j.contact_role_id = #ContactRoles.CEO#
AND
j.date_started <= NOW()
AND
(
j.date_ended IS NULL
OR
j.date_ended > NOW()
)
)

My database uses logical rather than physical deletes, so as a matter of practice, I put the is_deleted checks on either side of the Join so that I would not forget to include them. After that, I then included the contact ID (assuming we are looking for jobs for a given contact) and the role, followed lastly by the date/time constraints (assuming we're looking for active jobs only).

Now that that is out there, let's walk through my choices in terms of Cardinality and Selectivity. I am going to throw out some estimated and hypothetical values. The inaccuracy of these values may affect the specifics, but I think it will still be relevant. For the following thought experiment, I am going to assume that I have a 1,000 contacts and 1,2000 jobs. I am also going to assume that there will always be more jobs than contacts and that, in fact, this 1:1.2 ratio will remain fairly constant.

is_deleted. This field is a yes/no toggle, therefore the Cardinality of the field is 2. So, if we take a sample size of 1,000 contact records, we get a Selectivity of 1000 / 2 = 500. If we take the sample size of 1,200 jobs, we get a Selectivity of 1,2000 / 2 = 600. The relative selectivity here tells us that our logical delete should be checked on the contact first and then on the job second. However, neither of these selectivities are very good, so they should come later in our JOIN clause, probably not at the top where I have them.

contact_id. This field is a foreign key to the contact table. This is an interesting one because I could check the contact_id in the contact table OR I could check it in the job table. Let's see if that matters; if we put the filter on the contact table, we will have a separate ID for each contact. This means that for 1,000 contacts, our cardinality will also be 1,000 which will yield a selectivity of 1 (the best we can hope for). On the other hand, if we go with the contact_id in the job table (as we are in the example above), we have a cardinally of 1,000 and a row count of 1,200 which yields a selectivity of 1,2000 / 1,000 = 1.2. This is still very good, but not as good as the first choice. Clearly, our FORM.contact_id should be checked against contact.id rather than job.contact_id.

contact_role_id. There are only about 10 job types in my system, giving us a cardinality of 10. And, since this column only exists in the job table, this yields a selectivity of 1,2000 / 10 = 120. This is much better than the is_deleted checks, but worse than the contact_id check; it should therefore go between the above two conditions.

date_started. This is a strange one to think about. Assuming that the system does not allow anyone to create future jobs (which mine does not currently do), it means that all rows in the job table will have a past date_started column value. But, how does that affect Cardinality? How can one put a solid number on this "set" of data that will change with time? For the sake of argument (and please correct me if this is bad), since the date lives in a binary context - either before today or after today - I'm going to give it a cardinality of 2 yielding a selectivity of 1,200 / 2 = 600. This selectivity is pretty bad, among the worst that we have seen so far, and should therefore go at the end of the conditions.

date_ended. At first, I started to think about this differently than the date_started column since it helps us to return meaningful, active jobs. But, is this really any different? What are the possible values for this column? Yes, there can be many dates, but what are the sub-sets of possible values: NULL, before today, and after today. There are three sets of categorical values. I think we can translate this to a cardinality of 3 yielding a selectivity of 1,200 / 3 = 400.

Things are starting to get interesting when we think about the selectivity of our different conditions, especially with cases like "contact_id" which can be tested in several different tables. Based only on the selectivities calculated above, I suppose I would order the JOIN conditions as such:

FROM
contact c
INNER JOIN
job j
ON
(
c.id = #FORM.contact_id#
AND
j.contact_role_id = #ContactRoles.CEO#
AND
(
j.date_ended IS NULL
OR
j.date_ended > NOW()
)
AND
c.is_deleted = 0
AND
j.is_deleted = 0
AND
j.date_started <= NOW()
AND
c.id = j.contact_id
)

Tuesday, November 23, 2010

Oracle: Steps to find session and kill a session

SELECT
vs.username, vs.osuser , vs.sid , serial#, vsql.sql_text
FROM v$session vs , v$sqlarea vsql
WHERE
vs.sql_hash_value = vsql.hash_value
AND vs.sql_address = vsql.address
AND vs.status = 'ACTIVE'
AND vs.username = 'SUBHENDU'

ALTER SYSTEM KILL SESSION 'sid,serial#' IMMEDIATE;

Wednesday, June 17, 2009

Spring Help

http://static.springframework.org/docs/Spring-MVC-step-by-step/
http://www.springsource.org/documentation
http://www.ibm.com/developerworks/web/library/wa-spring1/

http://www.martinfowler.com/articles/injection.html
http://java.dzone.com/print/5878

http://static.springframework.org/spring/docs/2.0.x/reference/index.html

Tuesday, May 26, 2009

Analyze and Increase the Table Space

1. Query to get TABLE SPACES Information

Table DBA_TABLESPACES describes all tablespaces in the database.

· Tablespace Name - Name of the tablespace

· Initial Extent - Default initial extent size

· Next Extent - Default incremental extent size

· Min Extents - Default minimum number of extents

· Max Extents - Default maximum number of extents

· PCT Increase - Default percent increase for extent size

· Status - Tablespace status: ONLINE, OFFLINE, or INVALID

select  TABLESPACE_NAME, INITIAL_EXTENT, NEXT_EXTENT, MIN_EXTENTS, MAX_EXTENTS,PCT_INCREASE,STATU

from    dba_tablespacesorder   by TABLESPACE_NAME

2. Query to get DATA FILES Information

Table DBA_DATA_FILES describes all data files in the database

· FILE_NAME Name of the database file along with the path

· FILE_ID File identifier number of the database file

· TABLESPACE_NAME Name of the tablespace to which the file belongs

· BYTES Size of the file in bytes

· BLOCKS Size of the file in Oracle blocks

· STATUS File status: AVAILABLE or INVALID

· AUTOEXTENSIBLE Auto extensible indicator

· MAXBYTES Maximum file size in bytes

· MAXBLOCKS Maximum file size in blocks

· INCREMENT_BY Number of tablespace blocks used as auto extension increment. Block size is contained in the BLOCK_SIZE column of the DBA_TABLESPACES view.

· USER_BYTES The size of the file available for user data. The actual size of the file minus theUSER_BYTES value is used to store file related metadata.

· ONLINE_STATUS Online status of the file:

3. Query to get DBA FREE SPACE Information

Table DBA_FREE_SPACE describes the free extents in the tablespaces accessible to the current user

· TABLESPACE_NAME Name of the tablespace containing the extent

· FILE_ID File identifier number of the file containing the extent

· BLOCK_ID Starting block number of the extent

· BYTES Size of the extent (in bytes)

· BLOCKS Size of the extent (in Oracle blocks)

4. Query to get TABLESPACE USAGE Information

·TABLESPACE_NAME - Name of the tablespace

·Bytes Used - Size of the file in bytes

·Bytes Free - Size of free space in bytes

·Largest - Largest free space in bytes

·Percent Used - Percentage of tablespace that is being used

select  a.TABLESPACE_NAME, a.BYTES bytes_used, b.BYTES bytes_free, b.largest, round(((a.BYTES-b.BYTES)/a.BYTES)*100,2)||'%' percent_used

from    ( select TABLESPACE_NAME, sum(BYTES) BYTES  from dba_data_files group by TABLESPACE_NAME) a,

( select TABLESPACE_NAME, sum(BYTES) BYTES ,max(BYTES) largest from dba_free_space group by TABLESPACE_NAME) b where   a.TABLESPACE_NAME=b.TABLESPACE_NAME

order   by ((a.BYTES-b.BYTES)/a.BYTES) desc

N.B. - Above query may not display information about some Tablespace because table DBA_FREE_SPACE may not contain information of a table space, if a datafile or entire tablespace is offline in a locally managed tablespace or a tablespace is full.

If Percent Used more than 85% we have to take care of it.

5. DDL Query to increase TABLESPACE

A. Increase table space by creating a data file (*dbf) and associate with table space.

1. Find datafile storage path (one way by querying dba_data_files. FILE_NAME concern with the data file storage path)

2. Chose a unique datafile(.dbf) name that doesn’t exists in database (By querying dba_data_files)

3. Chose one TABLESPACE whose size you need to increase

4. Chose the initial size of datafile you need to create (K, M, G, and T i.e. KB, MB…….)

5. Chose either Automatically extend datafile when full (AUTOEXTEND) or not if auto extendable then what should be the increment size and what should be the max size (or UNLIMITED)

With AUTOEXTEND property

ALTER TABLESPACE "USERS" ADD DATAFILE '/Oracle/oradata/subhendu/USER05' SIZE 10GAUTOEXTEND ON NEXT 123123K MAXSIZE UNLIMITED

Without AUTOEXTEND property

ALTER TABLESPACE "USERS" ADD DATAFILE '/Oracle/oradata/subhendu/USER05' SIZE 10G

B. Increasing table space by resizing an existing data file (*dbf)

ALTER DATABASE DATAFILE '/Oracle/oradata/subhendu/USER05' RESIZE 50G;

ALTER DATABASE DATAFILE '/Oracle/oradata/subhendu/USER05' AUTOEXTEND ON NEXT SIZE10G MAXSIZE UNLIMITED;

N.B. - Beware when decrease the size of the datafile with the space that is free between highest used block of the datafile and the last block of the file.

Monday, May 25, 2009

How to start isqlplus

To start isqlplus, a web version of sqlplus, issue

isqlplusctl start

Thereafter, the isqlplus is accessible from http://hostip:5560/isqlplus.

If I Can't Who Else.....

Wednesday, September 18, 2013

The Cardinality And Selectivity Of SQL Conditions

The Cardinality And Selectivity Of SQL Conditions

Tuesday, November 23, 2010

Oracle: Steps to find session and kill a session

Wednesday, June 17, 2009

Spring Help

http://static.springframework.org/docs/Spring-MVC-step-by-step/
http://www.springsource.org/documentation
http://www.ibm.com/developerworks/web/library/wa-spring1/

http://www.martinfowler.com/articles/injection.html
http://java.dzone.com/print/5878

http://static.springframework.org/spring/docs/2.0.x/reference/index.html

Tuesday, May 26, 2009

Analyze and Increase the Table Space

Monday, May 25, 2009

How to start isqlplus

About Me

Blog Archive