
I would like to extract the column names of a resulting table directly from the SQL statement:


query = """

select 
    sales.order_id as id, 
    p.product_name, 
    sum(p.price) as sales_volume 
from sales
right join products as p 
    on sales.product_id=p.product_id
group by id, p.product_name;

"""

column_names = parse_sql(query)
# column_names:
# ['id', 'product_name', 'sales_volume']

Any idea what to do in parse_sql()? The resulting function should be able to recognize aliases and remove the table aliases/identifiers (e.g. "sales." or "p.").

Thanks in advance!

    You could try the python-sqlparse library. Commented Jan 17, 2022 at 17:28

3 Answers


I've done something like this using the sqlparse library. Basically, this library takes your SQL query and tokenizes it. Once that is done, you can search for the SELECT keyword token and parse the tokens that follow it. In code, that reads:

import sqlparse


def find_selected_columns(query) -> list[str]:
    # Only the top-level tokens of the first statement are inspected.
    tokens = sqlparse.parse(query)[0].tokens
    found_select = False
    for token in tokens:
        if found_select:
            # The first IdentifierList after SELECT is taken as the column list.
            if isinstance(token, sqlparse.sql.IdentifierList):
                return [
                    # Keep the last word (the alias, if any), strip backticks,
                    # then drop a table qualifier such as "p.".
                    col.value.split(" ")[-1].strip("`").rpartition('.')[-1]
                    for col in token.tokens
                    if isinstance(col, sqlparse.sql.Identifier)
                ]
        else:
            found_select = token.match(sqlparse.tokens.Keyword.DML, ["select", "SELECT"])
    raise Exception("Could not find a select statement. Weird query :)")

This code should also work for queries with common table expressions, i.e. it only returns the columns of the final select. Depending on the SQL dialect and the quote characters you are using, you might have to adapt the line col.value.split(" ")[-1].strip("`").rpartition('.')[-1].
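
To make that clean-up expression easier to adapt, it can be pulled out into a small helper. The following is only a minimal sketch and not part of the answer's code: the name clean_column and its quote_chars parameter are hypothetical, it uses only the standard library, and it assumes an alias never contains whitespace. Unlike the one-liner, it drops the table qualifier before stripping quote characters, so a qualified, quoted name such as p.`product_name` loses both backticks.

```python
def clean_column(expr: str, quote_chars: str = '`"[]') -> str:
    """Reduce one select-list entry to its bare output name.

    Hypothetical helper: assumes the alias (if any) is the last
    whitespace-separated word and contains no whitespace itself.
    """
    # The last word is the alias when one is given, else the column itself.
    last_word = expr.split()[-1]
    # Drop a table qualifier such as "sales." or "p.".
    unqualified = last_word.rpartition(".")[-1]
    # Remove dialect-specific quoting (backticks, double quotes, brackets).
    return unqualified.strip(quote_chars)


if __name__ == "__main__":
    for expr in [
        "sales.order_id as id",
        "p.product_name",
        "sum(p.price) as sales_volume",
    ]:
        print(clean_column(expr))
```

For the three select-list entries of the query in the question this prints id, product_name and sales_volume. An expression without an alias, such as sum(p.price), is not handled sensibly by this sketch, and neither is a quoted alias that contains spaces.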


2 Comments

Nice, this does exactly what I was looking for. Thanks a lot! I was aware of sqlparse, but didn't know how to utilize it properly.
This does not work as intended: sqlparse apparently splits the columns into batches, so this function only returns a subset of the columns.

Try out SQLGlot

It's much easier and less error-prone than sqlparse.

import sqlglot
import sqlglot.expressions as exp

query = """
select
    sales.order_id as id,
    p.product_name,
    sum(p.price) as sales_volume
from sales
right join products as p
    on sales.product_id=p.product_id
group by id, p.product_name;

"""

column_names = []

for expression in sqlglot.parse_one(query).find(exp.Select).args["expressions"]:
    if isinstance(expression, exp.Alias):
        column_names.append(expression.text("alias"))
    elif isinstance(expression, exp.Column):
        column_names.append(expression.text("this"))

print(column_names)

4 Comments

Is there any way to get both the column name and the alias?
Yup, check the "metadata" section of the GitHub README: github.com/tobymao/sqlglot
When the alias is set up as "select [AliasColumnName] = ColumnName from tble", sqlglot does not see it. It does work if the alias is set up as "select ColumnName AS AliasColumnName from tble". Is there a workaround?
You can do it even more cleanly with a list comprehension: [col_obj.name or col_obj.alias for col_obj in parse_one(query).expressions]

Use sql_metadata, which leverages sqlparse:

from sql_metadata import Parser


def get_query_columns(query) -> dict[str, str]:
    return Parser(query).columns_aliases
