5

I need to retrieve certain rows from a table depending on certain values in a specific column, named columnX in the example:

select *
from tableName 
where columnX similar to ('%A%|%B%|%C%|%1%|%2%|%3%')

So if columnX contains at least one of the values specified (A, B, C, 1, 2, 3), I will keep the row.

I can't find a better approach than using similar to. The problem is that the query takes too long for a table with more than a million rows.

I've tried indexing it:

create index tableName_columnX_idx on tableName (columnX) 
where columnX similar to ('%A%|%B%|%C%|%1%|%2%|%3%')

However, if the condition is variable (the values could be other than A, B, C, 1, 2, 3), I would need a different index for each condition.

Is there any better solution for this problem?

EDIT: Thanks everybody for the feedback. Looks like I've achieved to this point maybe because of a design mistake (topic I've posted in a separated question).

3
  • 1
    are you always going to search over single-character values? Is a query like '%ABC%|%BCD%' possible? Commented Nov 1, 2012 at 19:01
  • 1
    Yes, is a single-character based search. Commented Nov 1, 2012 at 19:07
  • 1
    @FedericoCristina effectively, columnX is a set represented by single-character codes in a text field. Correct? This implies that the order of values in columnX does not matter. Commented Nov 2, 2012 at 0:06

4 Answers 4

4

If you are only going to search lists of one-character values, then split each string into an array of characters and index the array:

CREATE INDEX
        ix_tablename_columnxlist
ON      tableName
USING   GIN((REGEXP_SPLIT_TO_ARRAY(columnX, '')))

then search against the index:

SELECT  *
FROM    tableName
WHERE   REGEXP_SPLIT_TO_ARRAY(columnX, '') && ARRAY['A', 'B', 'C', '1', '2', '3']
Sign up to request clarification or add additional context in comments.

3 Comments

A GIN index can be quite expensive to maintain, so it isn't particularly suitable for tables with high rates of inserts/updats/deletes. This may or may not be a problem for this particular application.
Thanks for your answer. Unfortunately these tables are constantly changing (inserts & updated)
@FedericoCristina: not aware of your scenario's details, however, in most cases frequent full scans imply more load on the server than index updates (even GIN). Have you actually tried both solutions? Also if you consider a redesign and are able to install extensions, you could store your flags in a native INTARRAY and index it with much more update-friendly GIST.
3

I agree with Quassnoi, a GIN index is fastest and simplest - unless write performance or disk space are issues, because it occupies a lot of space and adds cost for INSERT, UPDATE and DELETE.

My additional answer is triggered by your statement:

I can't find a better approach than using similar to.

If that is what you found, then your search isn't over, yet. SIMILAR TO is a complete waste of time. Literally. Postgres only includes it to comply to the (weird) SQL standard. Inspect the output of EXPLAIN ANALYZE for your query and you will find that SIMILAR TO has been replaced by a regular expression.

Internally every SIMILAR TO expression is rewritten to a regular expression. Consequently, for each and every SIMILAR TO expression there is at least one regular expression match that is a bit faster. Let EXPLAIN ANALYZE translate it for you, if you are not sure. You won't find this in the manual, PostgreSQL does not promise to do it this way, but I have yet to see an exception.

Further reading:

1 Comment

Thanks for the link! You're right about using SIMILAR TO. Maybe I should redesing the solution in order to avoid this scenario. I've edited my question with a reference to a new domain-specific question.
2

This strikes me as a data modelling issue. You appear to be using a text field as a set, storing single character codes to identify values present in the set.

If so, I'd want to remodel this table to use one of the following approaches:

  • Standard relational normalization. Drop columnX, and replace it with a new table with a foreign key reference to tableName(id) and a charcode column that contains one character from the old columnX per row, like CREATE TABLE tablename_columnx_set(tablename_id integer not null references tablename(id), charcode "char", primary key (tablename_id, charcode)). You can then fairly efficiently search for keys in columnX using normal SQL subqueries, joins, etc. If your application can't cope with that change you could always keep columnX and maintain the side table using triggers.

  • Convert columnX to a hstore of keys with a dummy value. You can then use hstore operators like columnX ?| ARRAY['A','B','C']. A GiST index on the hstore of columnX should provide fairly solid performance for those operations.

  • Split to an array as recommended by Quassnoi if your table change rate is low and you can pay the costs of the GIN index;

  • Convert columnX to an array of integers, use intarray and the intarray GiST index. Have a mapping table of codes to integers or convert in the application.

Time permitting I'll follow up with demos of each. Making up the dummy data is a pain, so it'll depend on what else is going on.

1 Comment

Looks like you're right about the modelling issue. I should do some refactor there. I've edited my question with a reference to a new domain-specific question.
2

I'll post this as an answer because it may guide other people in the future: Why not have 6 columns, haveA, haveB ~ have3 and do a 6-part OR query? Or use a bitmask?

If there are too many attributes to assign a column each, I might try creating an "attribute" table:

(fkey, attr) VALUES (1, 'A'), (1, 'B'), (2, '3')

and let the DBMS worry about the optimization.

3 Comments

Thanks for your answer. Maybe I should have explained better, those values (A, B, C, 1, 2, 3) where just for the example. The complete set of options is about 24 different ones.
I'm not sure the last bit is right. If columnX has value ZAB and you're searching for A|B that won't match, but should.
oh yeah, sorry. Attribute it to ~40 hours of sleeplessness please :)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.