SQLAlchemy and why I used to despise ORMs

It so happened that early in my IT career I was bitten by Oracle: I didn't know any ORMs yet, but I already knew SQL and understood just how much a database is capable of.

My introduction to Django ORM left me deeply frustrated. Instead of features there was next to nothing: no composite primary keys, no window functions. Database-specific features you may as well forget. What really killed me was that, at the price of zero flexibility, I was also sold a drop in performance -- ORM query building is not free. And the cherry on the cake: on top of SQL syntax, you have to learn the syntax of the ORM that generates that SQL. A disadvantage bought at the cost of extra cognitive load -- quite an achievement for the industry. So I seriously believed that life without an ORM was easier, more flexible and many times more productive: after all, you have the full power of the database at your fingertips.

So this SQLAlchemy story is the happy story of how I rediscovered ORMs. In this article I will tell you how I ended up here, cover some of SQLAlchemy's pitfalls, and finally get to the things that genuinely excited me, which I will try to share with you.

Experience, and the subjective system of views that grew out of it

I used to optimise SQL queries, and I managed to cut query costs a hundredfold or more, mainly on Oracle and Firebird. I did research and experimented with indexes. I have seen plenty of database schemas in my life: some were crap, and some were thoughtful, flexible and extensible engineering solutions.

This experience has shaped my system of views about databases:

  • An ORM doesn't let you forget about database engineering -- unless you want to bury the project tomorrow
  • Portability is a myth, not an argument
  • If your project talks to Postgres via an ORM, then on your local machine you deploy Postgres in Docker -- you don't develop against SQLite
  • Have you actually migrated databases many times? "One day we decided to move..." happened exactly once. If it happens often, or if it is a declared feature justified by the different operating conditions of your different customers -- welcome to the discussion
  • Different databases have their own strengths and pains, all stemming from different data structures and different engineering decisions. And if we use our heads when writing an application, we try to steer around those sore spots. The topic is deep, and is covered in Vladimir Kuznetsov's lecture series Databases for the Programmer and Transactions
  • The structure of tables is determined by your data, not by the limitations of your ORM

Of course, I also wrote code outside the database, and I have formed a system of views about this code too:

  • The controller should be thin, and the best code is the code that doesn't exist. ORM code is part of the controller, and the fact that it is hidden inside a library doesn't make the controller thin -- that code still executes
  • A controller that makes many database round trips within a single session is standing on very thin ice
  • I avoid widespread use of ActiveRecord -- it's a surefire way both to work with inconsistent data and to silently generate an uncontrollable number of database calls
  • Optimising work with the database comes down to not reading unnecessary data, so it makes sense to request only the columns we are actually interested in
  • The front end still requests some data during initialisation -- most often categories. In such cases it is enough to return only the id
  • Debugging every new ORM query is mandatory. You should always check what the ORM spits out (here are a couple of juicy examples), so you don't end up staring at it wide-eyed later. Even while writing this article I slipped up on this very point

The idea of reducing the amount of code executed in the controller leads me to a simple conclusion: it is easier not to deal with entities, but to request data from the database already in the required form, so the output can be handed straight to the JSON serialiser.
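
A minimal sketch of what I mean, assuming a hypothetical Product model and an async session (both names are illustrative, not from the original project):

import json

from sqlalchemy import select


async def product_listing(session):
    # Request only the columns the front end actually needs; no entity
    # objects are materialised along the way.
    query = select(Product.id, Product.name)
    result = await session.execute(query)
    # .mappings() yields dict-like rows that can go straight to JSON.
    return json.dumps([dict(row) for row in result.mappings()])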

Everything in this article grows out of this experience and system of views.

Introduction to SQLAlchemy

The first thing that caught my eye was the ability to write SQL-style DML queries, but in Python syntax:


from sqlalchemy import bindparam, func, select

# Product, Order and Customer are the project's mapped models.


def order_summary_query():
    # "order_id" is a placeholder; the actual value is supplied
    # only when the query is executed.
    order_id = bindparam("order_id", required=True)
    return (
        select(
            func.count(Product.id).label("product_count"),
            func.sum(Product.price).label("order_price"),
            Customer.name,
        )
        .select_from(Order)
        .join(
            Product,
            onclause=(Product.id == Order.product_id),
        )
        .join(
            Customer,
            onclause=(Customer.id == Order.customer_id),
        )
        .where(
            Order.id == order_id,
        )
        .group_by(
            Order.id,
        )
        .order_by(
            Product.id.desc(),
        )
    )

With this code sample I want to show that the ORM doesn't try to invent its own criteria language; instead, it tries to stay as close to SQL as possible. Unfortunately, for NDA reasons I had to replace the actual ORM query fragment from my current project. The example is extremely primitive -- it doesn't even have subqueries -- and queries this simple are a rarity in my current project.

Naturally, I immediately went looking for composite primary keys -- and there they are! There are window functions, CTEs, explicit JOINs, and much more. You can even add SQL hints for particularly hard cases! Further immersion kept bringing good news: I didn't encounter a single problem that was impossible to solve because of architectural limitations, although I did solve a few of mine through monkey-patching.
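
A minimal sketch of the first two features, with a made-up OrderLine model (the model and columns are illustrative):

from sqlalchemy import Column, Integer, String, func, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class OrderLine(Base):
    __tablename__ = "order_line"

    # A composite primary key: simply mark several columns as primary_key.
    order_id = Column(Integer, primary_key=True)
    line_no = Column(Integer, primary_key=True)
    product = Column(String(100))


# A window function: number the lines within each order.
query = select(
    OrderLine.product,
    func.row_number()
    .over(partition_by=OrderLine.order_id, order_by=OrderLine.line_no)
    .label("rn"),
)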

Performance

No matter how cool and flexible the API is, the cornerstone issue is performance. Today 10 rps may be enough for you, but tomorrow you try to scale, and if the database stalls -- congratulations, you're dead.

The query builder's performance in SQLAlchemy leaves much to be desired. Thankfully, this is the application level, so scaling can save you here. But can it be worked around? Can the poor performance of the query builder be compensated for? No, seriously, what's the point of burning CPU just to increase the entropy of the universe?

In principle, we Python folks are no strangers to workarounds: Python is unsuitable for implementing number crunchers, so the calculations are usually pushed down into C libraries.

SQLAlchemy has workarounds too -- two of them at once, and both boil down to caching with different strategies. The first is bindparam plus lru_cache. The second is the one the documentation offers: future_select. Let's consider their advantages and disadvantages.

bindparam + lru_cache

This is the simplest and at the same time the most performant solution. We buy performance at the price of memory: we simply cache the assembled query object, which in turn caches the rendered query inside itself. This pays off as long as we are not in danger of a combinatorial explosion, that is, as long as the number of query variations stays within reasonable limits. In my project I use this approach in most views. For convenience I use a cached_classmethod decorator, which composes the classmethod and lru_cache decorators:

from functools import lru_cache


def cached_classmethod(target):
    # Compose lru_cache and classmethod: the target is cached first,
    # then exposed as a classmethod.
    cache = lru_cache(maxsize=None)
    cached = cache(target)
    cached = classmethod(cached)
    return cached
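
In use it looks roughly like this -- a sketch with a hypothetical Product model; the query object is built once, while the actual values travel through bindparam at execution time:

from sqlalchemy import bindparam, select


class ProductQueries:
    @cached_classmethod
    def by_id(cls):
        # Built exactly once, then served from the lru_cache.
        return select(Product.id, Product.name).where(
            Product.id == bindparam("product_id"),
        )


# Usage: the same cached statement object is reused on every call.
# await session.execute(ProductQueries.by_id(), {"product_id": 42})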

For static views everything is clear: the function that builds the ORM query should simply take no parameters. For dynamic views you can add arguments to the function. Since lru_cache uses a dict under the bonnet, the arguments must be hashable. I settled on a scheme where the request handler produces a "summary" of the query plus the parameters that are passed to the generated query at execution time. The query "summary" acts as something like an ORM query plan, from which the query object itself is generated -- a hashable frozenset instance, called query_params in my example:



class BaseViewMixin:
    def build_query_plan(self):
        # Subclasses fill these in: query_params drives which SQL is
        # generated, query_kwargs carries the bound parameter values.
        self.query_kwargs = {}
        self.query_params = frozenset()

    async def main(self):
        self.build_query_plan()
        query = self.query(self.query_params)

        async with BaseModel.session() as session:
            response = await session.execute(
                query,
                self.query_kwargs,
            )
            mappings = response.mappings()

        return self.serialize(mappings)

Some clarification on query_params and query_kwargs

In the simplest case, query_params can be obtained by simply converting query_kwargs to a frozenset. Note that this is not always true: flags in query_params can easily change the generated SQL itself while query_kwargs stays unchanged, as the sketch below illustrates.
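
A hypothetical example of such a flag (self.request and the flag name are made up for illustration): the flag adds a WHERE clause, changing the SQL, yet introduces no new bound parameter.

class ProductListView(BaseViewMixin):
    def build_query_plan(self):
        # category_id is a bound parameter: same SQL, different value.
        self.query_kwargs = {"category_id": self.request["category_id"]}
        # with_archived changes the generated SQL itself, so it must
        # be part of the cache key, even though query_kwargs is unchanged.
        self.query_params = frozenset(
            {"with_archived"} if self.request.get("with_archived") else (),
        )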

Just in case: I warn you against blindly copying this code. Figure it out and adapt it to your project. Even my real code looks a little different -- it has been deliberately simplified, with some unimportant details removed.

How much memory did I pay for this? Not much: all query variations together take no more than a megabyte.

future_select

Unlike the rather banal first option, future_select caches the pieces of SQL from which the final query is assembled very quickly. This option has it all: high performance and low memory consumption. But the code is hard to read, and maintaining it is wilder still:


from sqlalchemy.future import select as future_select
from sqlalchemy.sql import lambdas

# id_ is a closure variable; the lambda statement tracks it and
# turns it into a bound parameter.
stmt = lambdas.lambda_stmt(lambda: future_select(Customer))
stmt += lambda s: s.where(Customer.id == id_)

This is the option I will definitely reach for when things start to smell of a combinatorial explosion.

Intermediate conclusion: the poor performance of the query builder in SQLAlchemy can be offset by a query cache, and the wild future_select syntax can be hidden behind a facade.
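
A minimal sketch of such a facade, reusing the Customer model from above -- callers get back an ordinary executable statement and never see the lambda plumbing:

from sqlalchemy.future import select as future_select
from sqlalchemy.sql import lambdas


def customer_by_id(id_):
    # The lambda pieces are cached internally; id_ becomes a bound
    # parameter, so each call is cheap.
    stmt = lambdas.lambda_stmt(lambda: future_select(Customer))
    stmt += lambda s: s.where(Customer.id == id_)
    return stmt


# Usage: session.execute(customer_by_id(42)) behaves like any statement.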

I also haven't yet paid enough attention to prepared statements; I will get to that research a bit later.

How I rediscovered ORM

We have reached the main point -- the reason I wrote this article. In this section I will share the revelations that came to me along the way.

Modularity

Back when I implemented wild analytics in raw SQL, the lack of modularity and introspection was a long-standing pain. When porting it to the ORM, I finally had the opportunity to move an entire FROM-clause subquery into a separate function (a class method, actually); later those functions were easy to combine, to build the Strategy pattern on top of flags, and to avoid duplicating functionality through inheritance.
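
An illustrative sketch of the idea (the models, columns and flag are all made up, not the project's code):

from sqlalchemy import func, select


class SalesReport:
    @classmethod
    def base_subquery(cls):
        # The FROM-clause subquery lives in its own method...
        return (
            select(Order.customer_id, func.sum(Order.total).label("spent"))
            .group_by(Order.customer_id)
            .subquery()
        )

    @classmethod
    def query(cls, only_active=False):
        # ...and strategies are combined on top of it via flags.
        sub = cls.base_subquery()
        query = select(Customer.name, sub.c.spent).join(
            sub, sub.c.customer_id == Customer.id,
        )
        if only_active:
            query = query.where(Customer.active.is_(True))
        return query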

Custom types

If data behaves in a tricky way, or is tricky to transform, it clearly needs to be handled at the model level. I ran into two such cases: storing colours and working with ENUMs. Let's take them in order.

Creating your own simple types is discussed in the documentation:



from sqlalchemy import Integer
from sqlalchemy.types import TypeDecorator


class ColorType(TypeDecorator):
    # Stored as INTEGER in the database; color() is the project's own
    # wrapper class around the integer value.
    impl = Integer
    cache_ok = True

    def process_result_value(self, value, dialect):
        if value is None:
            return

        return color(value)

    def process_bind_param(self, value, dialect):
        if value is None:
            return

        value = color(value)
        return value.value

The only reason for this type is that I wanted to store colours as ints instead of strings: it rules out invalid data, at the cost of slightly more involved serialisation and deserialisation.
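
In a model it would be used like any other column type -- a sketch assuming the usual Column/Integer imports and the project's BaseModel (the Palette model itself is made up):

class Palette(BaseModel):
    __tablename__ = "palette"

    id = Column(Integer, primary_key=True)
    # Stored as a plain INTEGER; application code sees the color
    # wrapper on the way in and out.
    accent = Column(ColorType, nullable=True)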

Now about ENUM. I was strongly displeased that the documentation suggests storing ENUMs in the database as VARCHAR. Integer Enums, in particular, deserve to be stored as ints. Obviously, such a type should be declared by passing it the target Enum as an argument -- and since String already requires a length at declaration time, parameterised types are clearly a solved problem. Digging through the source code brought me to TypeEngine, where instead of usage examples you are greeted with "our source code is open 24/7". But everything turns out to be simple:



from sqlalchemy.types import TypeEngine


class IntEnumField(TypeEngine):
    def __init__(self, target_enum):
        self.target_enum = target_enum
        self.value2member_map = target_enum._value2member_map_
        self.member_map = target_enum._member_map_

    def get_dbapi_type(self, dbapi):
        return dbapi.NUMBER

    def result_processor(self, dialect, coltype):
        # Database -> Python: map the stored number to the member name.
        def process(value):
            if value is None:
                return

            member = self.value2member_map[value]
            return member.name

        return process

    def bind_processor(self, dialect):
        # Python -> database: map the member name to its number.
        def process(value):
            if value is None:
                return

            member = self.member_map[value]
            return member.value

        return process

Note: both functions -- result_processor and bind_processor -- must return a function.
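
A usage sketch, with a made-up OrderStatus enum and Order model (again assuming the usual Column/Integer imports and the project's BaseModel):

import enum


class OrderStatus(enum.IntEnum):
    NEW = 1
    PAID = 2
    SHIPPED = 3


class Order(BaseModel):
    __tablename__ = "order"

    id = Column(Integer, primary_key=True)
    # The column stores the numeric value; with the processors above,
    # reads come back as the member name and writes accept the name.
    status = Column(IntEnumField(OrderStatus))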

Custom functions, type hints, and type inference

Next up. I ran into an oddity in MariaDB's implementation of json_arrayagg: on an empty set it returns the string "[NULL]" instead of NULL -- which is no good under any sauce. As a temporary workaround I cobbled together a combination of group_concat, coalesce and concat. That's not bad in principle, but:

  • When reading the result, I want native conversion of the string to JSON
  • If you build something universal, it turns out the strings must be escaped. Thankfully, there is a built-in function json_quote -- but SQLAlchemy doesn't know about it
  • And I also want my workaround functions to be discoverable through the sqlalchemy.func object

It turns out that SQLAlchemy solves these problems quite easily. And while I found the type hints merely convenient, the type inference delighted me: type-dependent behaviour can be encapsulated in the function itself, which then generates the correct SQL.
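
For reference, one documented way to teach func about a new function is to subclass GenericFunction; a minimal sketch for json_quote (the String return type here is my assumption, not taken from the project's module):

from sqlalchemy.sql.functions import GenericFunction
from sqlalchemy.types import String


class json_quote(GenericFunction):
    """MariaDB's json_quote, unknown to SQLAlchemy out of the box."""

    name = "json_quote"
    type = String()
    inherit_cache = True


# Once the subclass is defined, it is available via func:
# func.json_quote(Customer.name) renders as json_quote(customer.name)
# and carries a String return type for further type inference.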

The customer even allowed me to publish the code of an entire module!

As part of the experiment I also wrote a json_object_wrapper function that builds JSON from the passed fields, with the field names as keys. Whether I'll use it or not, I don't know. The fact that these macro substitutions don't just work but actually work correctly scares me a little.

Examples of what ORM generates

These macro substitutions let you generate a huge pile of SQL that implements all the logic of building the views -- and what I find fascinating is that this pile of code runs efficiently. Another nice property is that the macro substitutions make the Strategy pattern transparent: I hope json_arrayagg's behaviour will be fixed in an upcoming MariaDB release, and then I can swap my crutch for json_arrayagg+coalesce without the client code noticing.

Summing Up

SQLAlchemy lets you take advantage of inheritance and polymorphism (and even a bit of encapsulation, though rather loosely) in SQL. At the same time, it doesn't lock you into Hello, World!-level tasks with architectural limitations; on the contrary, it gives you the maximum of possibilities.

Subjectively, this is a breakthrough. I adore relational databases, and at last I genuinely enjoy implementing intricately twisted analytics: I have all the advantages of OOP and all the power of SQL in my hands.
