Re: Q on views and performance

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



So, this email is directed much more towards Postgres Powers That Be. I came across this problem a while ago, and I haven't checked whether it has been improved.

On Mon, 25 Feb 2008, I wrote:
Hi.  I'm trying to optimize...

(Q1)   SELECT a1.word, a2.word
        FROM T a1 JOIN T a2 USING ( zipk )
       WHERE a1.type = <int1>
         AND a2.type = <int2>;

Create an index on T(type, zipk), and then CLUSTER on that index. That will effectively group all the data for one type together and sort it by zipk, making a merge join very quick indeed. I'm not sure whether Postgres will notice that, but it's worth a try.

Statistics are generated on fields in a table, and the one I'm interested in is the correlation coefficient which tells Postgres how costly an index scan sorted on that field would be. This entry is ONLY useful when the result needs to be sorted by that exact field only. For example:

CREATE TABLE test (a int, b int);
// insert a bazillion entries
CREATE INDEX testIndex ON test(a, b);
CLUSTER test ON testIndex;
ANALYSE;

So now we have a table sorted by (a, b), but the statistics only record the fact that it is sorted by a, and completely unsorted by b. If we run:

SELECT * FROM test ORDER BY a;

then the query will run quickly, doing an index scan. However, if we run:

SELECT * FROM test ORDER BY a, b;

then Postgres will not be able to use the index, because it cannot tell how sequential the fetches from the index will be. Especially if we run:

SELECT * FROM test WHERE a = <something> ORDER BY b;

then this is the case.

So, these observations were made a long time ago, and I don't know if they have been improved. A while back I suggested a "partial sort" algorithm that could take a stream sorted by a and turn it into a stream sorted by (a, b) at small cost. That would fix some instances of the problem. However, now I suggest that the statistics are in the wrong place.

At the moment, the correlation coefficient, which is an entry purely designed to indicate how good an index is at index scans, is a statistic on the first field of the index. Why not create a correlation coefficient statistic for the index as a whole instead, and store it elsewhere in the statistics data? That way, instead of having to infer from the first field how correlated an index is, and getting it wrong beyond the first field, you can just look up the correlation for the index.

Opinions?

Matthew

--
If you let your happiness depend upon how somebody else feels about you,
now you have to control how somebody else feels about you. -- Abraham Hicks

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
      choose an index scan if your joining column's datatypes do not
      match

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux