Is serializability the same as sequential consistency?

I have found people explaining the differences between linearizability and serializability, but nowhere have I found anyone saying whether serializability is the same as sequential consistency or different from it.
I have also been bombarded with different definitions of these terms in different articles, books and web pages, and I have ended up confusing them all.
Could someone please explain the difference between serializability and sequential consistency, if it exists?
I would additionally appreciate formal definitions of these terms if possible (both in plain English and in terms of program or execution histories).

Serializability is stricter than sequential consistency.
The definition of sequential consistency from Wikipedia:
The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
And the definition of serializability from Wikipedia:
A transaction schedule is serializable if its outcome (e.g., the resulting database state) is equal to the outcome of its transactions executed serially, i.e. without overlapping in time.
So, the granularity of Sequential consistency is a single operation (e.g., read or write), while that of Serializability is a transaction (i.e., a sequence of operations).
In other words, if a program satisfies serializability, it also satisfies sequential consistency, but not vice versa.
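To make the granularity point concrete, here is a sketch (hypothetical table and values) of an interleaved schedule: every individual read and write appears in its transaction's program order, so it is acceptable as a sequential order of operations, yet the schedule as a whole is not serializable because it is not equivalent to either serial order of the two transactions.
-- Two transactions, T1 and T2, both read and then update the same row.
-- (Illustrative schema: accounts(id INT PRIMARY KEY, balance INT), starting balance 100.)
--
-- Interleaved schedule (one operation at a time, program order preserved):
--   T1: SELECT balance FROM accounts WHERE id = 1;        -- reads 100
--   T2: SELECT balance FROM accounts WHERE id = 1;        -- also reads 100
--   T1: UPDATE accounts SET balance = 110 WHERE id = 1;   -- 100 + 10
--   T2: UPDATE accounts SET balance = 120 WHERE id = 1;   -- 100 + 20 (lost update)
--
-- Final balance is 120. Running the transactions serially gives 130 in either
-- order (T1;T2 or T2;T1), so no serial order produces this outcome: the schedule
-- is sequential at the operation level but not serializable.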

Related

What's the optimal way to store binary flags / boolean values in each database engine?

I've seen some possible approaches (in some database engines some of them are synonyms):
TINYINT(1)
BOOL
BIT(1)
ENUM(0,1)
CHAR(0) NULL
All major database engines supported by PHP should be covered, but just as a reference it would be even better if other engines were noted as well.
I'm asking for a design that is best optimized for reading.
e.g. SELECTing with the flag field in the WHERE condition, or GROUP BY the flag.
Performance is much more important than storage space (except when the size has an impact on performance).
And some more details:
While creating the table I can't know whether it will be sparse (whether most flags are on or off), but I can ALTER the tables later on, so if there is something I can optimize once I know that, it should be noted.
Also, if it makes a difference whether there is only one flag (or a few) per row versus many flags, it should be noted.
BTW, I've read somewhere in SO the following:
Using boolean may do the same thing as using tinyint, however it has the advantage of semantically conveying what your intention is, and that's worth something.
Well, in my case it isn't worth anything, because each table is represented by a class in my application and everything is explicitly defined in the class and well documented.
This answer is for ISO/IEC/ANSI Standard SQL, and includes the better freeware pretend-SQLs.
The first problem is that you have identified two Categories, not one, so they cannot be reasonably compared.
A. Category One
(1) (4) and (5) contain multiple possible values and are one category. All can be easily and effectively used in the WHERE clause. They have the same storage so neither storage nor read performance is an issue. Therefore the remaining choice is simply based on the actual Datatype for the purpose of the column.
ENUM is non-standard; the better or standard method is to use a lookup table; then the values are visible in a table, not hidden, and can be enumerated by any report tool. The read performance of ENUM will suffer a small hit due to the internal processing.
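For what it's worth, a minimal sketch of the lookup-table alternative (table and column names here are invented for illustration):
CREATE TABLE OrderStatus (
    StatusCode CHAR(2)     NOT NULL PRIMARY KEY,
    Name       VARCHAR(30) NOT NULL      -- the visible, reportable value
);
CREATE TABLE CustomerOrder (
    OrderId    INT     NOT NULL PRIMARY KEY,
    StatusCode CHAR(2) NOT NULL REFERENCES OrderStatus (StatusCode)
);
-- The permitted values now live in a table that any report tool can enumerate,
-- rather than being hidden inside a column definition.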
B. Category Two
(2) and (3) are Two-Valued elements: True/False; Male/Female; Dead/Alive. That category is different to Category One. Its treatment, both in your data model and in each platform, is different. BOOLEAN is just a synonym for BIT; they are the same thing. Legally (SQL-wise) they are handled the same by all SQL-compliant platforms, and there is no problem using it in the WHERE clause.
The difference in performance depends on the platform. Sybase and DB2 pack up to 8 BITs into one byte (not that storage matters here), and map the power-of-two on the fly, so performance is really good. Oracle does different things in each version, and I have seen modellers use CHAR(1) instead of BIT, to overcome performance problems. MS was fine up to 2005 but they have broken it with 2008, as in the results are unpredictable; so the short answer may be to implement it as CHAR(1).
Of course, the assumption is that you do not do silly things such as pack 8 separate columns into one TINYINT. Not only is that a serious Normalisation error, it is a nightmare for coders. Keep each column discrete and of the correct Datatype.
C. Multiple Indicator & Nullable Columns
This has nothing to do with, and is independent of, (A) and (B). What the column's correct Datatype is, is separate from how many you have and whether it is Nullable. Nullable means (usually) the column is optional. Essentially you have not completed the modelling or Normalisation exercise. The Functional Dependencies are ambiguous. If you complete the Normalisation exercise, there will be no Nullable columns, no optional columns; either they clearly exist for a particular relation, or they do not exist. That means using the ordinary Relational structure of Supertype-Subtypes.
Sure, that means more tables, but no Nulls. Enterprise DBMSs have no problem with more tables or more joins; that is what they are optimised for. Normalised databases perform much better than unnormalised or denormalised ones, and they can be extended without "re-factoring". You can ease the use by supplying a View for each Subtype.
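As a rough sketch of that Supertype-Subtype structure (the names are illustrative, not from the question), each optional group of columns moves into its own Subtype table instead of being Nullable in the parent:
CREATE TABLE Person (
    PersonId INT         NOT NULL PRIMARY KEY,
    Name     VARCHAR(60) NOT NULL
);
CREATE TABLE Employee (        -- Subtype: a row exists only when the facts exist
    PersonId INT  NOT NULL PRIMARY KEY REFERENCES Person (PersonId),
    HireDate DATE NOT NULL
);
CREATE VIEW PersonEmployee AS  -- one View per Subtype, as suggested above
SELECT p.PersonId, p.Name, e.HireDate
FROM   Person   p
JOIN   Employee e ON e.PersonId = p.PersonId;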
If you want more information on this subject, look at this question/answer. If you need help with the modelling, please ask a new question. At your level of questioning, I would advise that you stick with 5NF.
D. Performance of Nulls
Separately, if performance is important to you, then exclude Nulls. Each Nullable column is stored as variable length; that requires additional processing for each row/column. The enterprise databases use a "deferred" handling for such rows, to allow the logging, etc. to move through the queues without impeding the fixed rows. In particular, never use variable length columns (that includes Nullable columns) in an Index: that requires unpacking on every access.
E. Poll
Finally, I do not see the point in this question being a poll. It is fair enough that you will get technical answers, and even opinions, but polls are for popularity contests, and the technical ability of responders at SO covers a very wide range, so the most popular answers and the most technically correct answers are at two different ends of the spectrum.
I know this is not the answer you want, but the difference is really negligible in all but the most extreme special cases. And in each such specific case, simply switching datatype won't be enough to fix a performance problem.
For example, here are some alternatives that will outperform any datatype changes by a large factor. Each carries with it a downside of course.
If you have 200 optional flags and you query for at most 1-2 at a time for lots of rows, you would get better performance by having each flag in its own table. If the data is really sparse this gets even better.
If you have 200 mandatory flags and you only perform single record fetches, you should put them in the same table.
If you have a small set of flags, you could pack them in one column using a bitmask, which is efficient storage wise, but you won't be able to (easily) query individual flags. Of course, this doesn't work when flags can be NULL...
Or you could get creative and use a "junk dimension" concept, in which you create a separate table with all 200 boolean flags represented as columns. Create one row for each distinct combination of flag values. Each row gets an autoincrement primary key, which you reference in the master record. Voila, the master table now contains 1 int instead of 200 columns. Hackers heaven, DBA nightmare.
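A rough sketch of that junk-dimension idea, with made-up names and only three flags shown:
CREATE TABLE FlagCombination (
    FlagComboId INT NOT NULL PRIMARY KEY,   -- autoincrement surrogate in practice
    IsActive    BIT NOT NULL,
    IsVerified  BIT NOT NULL,
    IsArchived  BIT NOT NULL
    -- ... one column per flag, one row per distinct combination in use
);
CREATE TABLE MasterRecord (
    RecordId    INT NOT NULL PRIMARY KEY,
    FlagComboId INT NOT NULL REFERENCES FlagCombination (FlagComboId)
);
-- Queries filter by joining to the combination table, e.g.:
-- SELECT m.* FROM MasterRecord m
-- JOIN FlagCombination f ON f.FlagComboId = m.FlagComboId
-- WHERE f.IsActive = 1;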
The point I'm trying to make is that even though it is interesting to argue over which is "the best", there are other concerns that are of much greater importance (like the comment you quoted). Simply because when you encounter a real performance issue, the datatype will neither be the problem nor the solution.
Any of the above is fine, and I have a personal preference for using BOOL if it is properly supported because that best conveys your intent, but I would avoid using ENUM(0,1).
The first problem with ENUM is that it requires its value to be a string. 0 and 1 look like numbers, so programmers have a tendency to send it a number.
The second problem with ENUM is that if you send it a wrong value it defaults to the first enumeration and in some databases it won't even indicate an error (I'm looking at you MySQL). This makes the first problem much worse since if you accidentally send it 1 instead of "1" it will store the value "0" -- very counter-intuitive!
I don't think this affects all database engines (I don't know, I haven't tried them all) but it affects enough of them that I consider avoiding it to be good practice.
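A minimal MySQL sketch of the pitfall described above (the table name is invented, and the exact behaviour may depend on version and SQL mode):
CREATE TABLE t (flag ENUM('0','1'));
INSERT INTO t VALUES ('1');   -- stores the string '1', as intended
INSERT INTO t VALUES (1);     -- the number 1 is treated as an enum index: stores '0'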

Asking for opinions : One sequence for all tables

Here's another one I've been thinking about lately.
We have concluded in earlier discussions : 'natural primary keys are bad, artificial primary keys are good.'
Working with Hibernate earlier, I saw that Hibernate by default creates one sequence for all tables. At first I was puzzled by this: why would you do this? But later I saw the advantage: it makes linking parents and children foolproof. Because no two tables share primary key values, accidentally linking a parent with a table that is not a child gives no results.
Does anyone see any downsides to this approach? I only see one: you cannot have more than 999999999999999999999999999 records in your database.
Depending on how sequences are implemented in the database, always hitting the same sequence can be better or worse. When only one or a few threads request new values, there will be no locking issues. But a bad implementation could cause congestion.
Another problem is rolling back transactions: Sequences don't get rolled back (because someone else might have requested a higher value already), so you can have large gaps which will eat your number space much more quickly than you might expect. OTOH, it will take some time to eat 2 or 4 billion IDs (if you "only" use 32 bit (signed) ints), so it's rarely an issue in practice.
Lastly, you can't easily reset the sequence if you have to. But if you need to have a restarting sequence (say, number of records since midnight), you can tell Hibernate to create/use a second sequence.
A major advantage is that you can uniquely identify objects anywhere in the DB just by the ID. That means you can severely cut down the log information you write in the production system and still find something if you only have the ID.
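As a sketch of the single-sequence approach in roughly standard SQL (names are invented, the exact DEFAULT/NEXT VALUE syntax varies by platform, and Hibernate's default hibernate_sequence plays the same role):
CREATE SEQUENCE global_id_seq;
CREATE TABLE customer (
    id   BIGINT NOT NULL DEFAULT (NEXT VALUE FOR global_id_seq) PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);
CREATE TABLE customer_order (
    id          BIGINT NOT NULL DEFAULT (NEXT VALUE FOR global_id_seq) PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customer (id)
);
-- Every id is drawn from the same sequence, so a given value can identify
-- at most one row anywhere in the schema.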
There could be performance issues with all code getting values from a single sequence - see this Ask Tom thread.
I prefer having one sequence per table. This comes from one general observation: Some tables ("master tables") have a relatively small row count and have to be kept "forever". For example, the customer table in an ERP.
In other tables ("transaction tables"), many rows are generated perpetually, but after some time, those rows can be archived (or simply deleted). The most extreme example is a tracing table used for debugging purposes; it might grow by hundreds of rows per second, but each row is obsolete after a few days.
Small IDs in the master tables make it easier when working directly on the database, e.g. for debugging purposes.
select * from orders where customerid=415
vs
select * from orders where customerid=89461836571
But this is only a minor issue. The bigger issue is cycling. If you use one sequence for all tables, you simply cannot let it restart. With one sequence per table, you can restart the sequences for the transaction tables when you have archived or deleted the old data. Master tables hardly ever have that problem, since they grow much slower.
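For illustration, restarting a per-table sequence after the old rows have been archived looks roughly like this (the sequence name is made up and the exact syntax varies by platform):
ALTER SEQUENCE trace_log_seq RESTART WITH 1;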
I see little value in having only one sequence for all tables. The arguments given so far do not convince me.
There are a couple of disadvantages of using a single sequence:-
reduced concurrency. Handing out the next sequence value involves synchronisation. In practice, I do not think this is likely to be a big problem
Oracle has special code when maintaining btree indexes to detect monotonically increasing values and balance the tree appropriately
The CBO might have a better time estimating range queries on the index (if you ever did this) if most values were filled in
An advantage might be that you can determine the order of inserts amongst different tables.
Certainly there are pros and cons to the one-sequence versus one-sequence-per-table approach. Personally I find the ability to assign a truly unique identifier to a row, making each id column a uuid, to be enough of a benefit to outweigh any disadvantages. As Aaron D. succinctly writes:
you can uniquely identify objects anywhere in the DB just by the ID
And, for most applications, due to the way Hibernate3 batches INSERT statements, this will not be a performance bottleneck unless massive amounts of records are vying for the same db resource (SELECT hibernate_sequence.nextval FROM dual).
Also, this sequence mapping is not supported in the latest release (1.2) of Grails. Though it was supported in Grails 1.1 (!). It now requires subclassing one of the Hibernate dialect classes as a workaround.
For those using Grails/GORM, have a look at this JIRA entry:
Oracle Sequence mappings ignored

Ways to avoid eager spool operations on SQL Server

I have an ETL process that involves a stored procedure that makes heavy use of SELECT INTO statements (minimally logged and therefore faster, as they generate less log traffic). Of the batch of work that takes place in one particular stored procedure, several of the most expensive operations are eager spools that appear to just buffer the query results and then copy them into the table that has just been created.
The MSDN documentation on eager spools is quite sparse. Does anyone have a deeper insight into whether these are really necessary (and under what circumstances)? I have a few theories that may or may not make sense, but no success in eliminating these from the queries.
The .sqlplan files are quite large (160kb) so I guess it's probably not reasonable to post them directly to a forum.
So, here are some theories that may be amenable to specific answers:
The query uses some UDFs for data transformation, such as parsing formatted dates. Does this data transformation necessitate the use of eager spools to allocate sensible types (e.g. varchar lengths) to the table before it constructs it?
As an extension of the question above, does anyone have a deeper view of what does or does not drive this operation in a query?
My understanding of spooling is that it's a bit of a red herring on your execution plan. Yes, it accounts for a lot of your query cost, but it's actually an optimization that SQL Server undertakes automatically so that it can avoid costly rescanning. If you were to avoid spooling, the cost of the execution tree it sits on will go up and almost certainly the cost of the whole query would increase. I don't have any particular insight into what in particular might cause the database's query optimizer to parse the execution that way, especially without seeing the SQL code, but you're probably better off trusting its behavior.
However, that doesn't mean your execution plan can't be optimized, depending on exactly what you're up to and how volatile your source data is. When you're doing a SELECT INTO, you'll often see spooling items on your execution plan, and it can be related to read isolation. If it's appropriate for your particular situation, you might try just lowering the transaction isolation level to something less costly, and/or using the NOLOCK hint. I've found in complicated performance-critical queries that NOLOCK, if safe and appropriate for your data, can vastly increase the speed of query execution even when there doesn't seem to be any reason it should.
In this situation, if you try READ UNCOMMITTED or the NOLOCK hint, you may be able to eliminate some of the Spools. (Obviously you don't want to do this if it's likely to land you in an inconsistent state, but everyone's data isolation requirements are different). The TOP operator and the OR operator can occasionally cause spooling, but I doubt you're doing any of those in an ETL process...
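For instance, a hedged sketch of lowering read isolation on the source of a SELECT INTO (the object names are invented, and NOLOCK is only appropriate if dirty reads are acceptable for your data):
-- Alternative: SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; for the whole batch
SELECT s.OrderId, s.RawOrderDate
INTO   #StagedOrders
FROM   dbo.SourceOrders AS s WITH (NOLOCK)   -- dirty reads must be acceptable here
WHERE  s.LoadBatch = 42;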
You're right in saying that your UDFs could also be the culprit. If you're only using each UDF once, it would be an interesting experiment to try putting them inline to see if you get a large performance benefit. (And if you can't figure out a way to write them inline with the query, that's probably why they might be causing spooling).
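As a hypothetical example of that experiment, replacing a scalar date-parsing UDF with an inline expression (dbo.ParseYyyymmdd and the table/column names are made up):
-- Before: one UDF call per row, which can also hurt the plan
-- SELECT dbo.ParseYyyymmdd(r.RawDate) AS OrderDate INTO #Orders FROM dbo.RawRows AS r;
-- After: the same conversion written inline
SELECT CONVERT(datetime, r.RawDate, 112) AS OrderDate   -- style 112 = yyyymmdd
INTO   #Orders
FROM   dbo.RawRows AS r;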
One last thing I would look at is that, if you're doing any joins that can be re-ordered, try using a hint to force the join order to happen in what you know to be the most selective order. That's a bit of a reach but it doesn't hurt to try it if you're already stuck optimizing.

CQRS query performance

I'd like to know when you should consider using multiple tables in your query store.
For example, consider the problem where a product has its description changed. This change could potentially have a massive impact on the synchronisation of the read-only query store if you had many aggregates that included the product description.
At which point should you consider a slight normalization of the data to avoid lengthy synchronisation issues? Is this a no-no or an acceptable compromise?
Thanks,
CQRS is not about using table-per-view, rather table-per-view is an aspect of a system that CQRS makes easier.
It's up to you and depends on your specific context and needs. I would look at it this way: what is the cost of the eventual consistency of that query vs. the need for high query performance? You may want to consider the following two characteristics of your system:
1) The avg. consistency of that command, i.e., how long it takes to update all of the read models affected by the command (also consider whether an optimized stored-proc for the change would outperform say using an ORM or other abstraction to update your database in this way).
My guess is that unless you are talking millions upon millions of records, the consistency here is sufficient to meet your requirements and user expectations for consistency, maybe a few seconds.
2) The importance of query performance. How many queries are you getting per second? Can you handle doing a SQL join every time?
In most practical scenarios the optimization of either of these things is moot. You can probably do the update, regardless of records, using a good SP in seconds which is more than enough consistency for a UI refresh (keep in mind the UI that issued the command can be consistent as soon as they know the command succeeded).
And you usually don't need so much query scaling in a system that a single join will hurt you. What you may not want is the added internal complexity of performing these joins in your code and stored procs.
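As an illustration of point (1), a single set-based statement (hypothetical table and column names, T-SQL flavour) is usually all it takes to push a product-description change into a denormalized read model:
UPDATE li
SET    li.ProductDescription = p.Description
FROM   dbo.OrderLineReadModel AS li
JOIN   dbo.Product            AS p ON p.ProductId = li.ProductId
WHERE  p.ProductId = 123;   -- the product whose description changed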
As with all things in CQRS, you don't need to use and optimize every aspect of it from day one. You can optimize these things incrementally. Use joins today, and fully denormalize tomorrow, or vice-versa.

What are locking, deadlocking issues in financial operations?

This is a sub-question of: SQL - when should you use "with (nolock)"
At one local financial institution I was rebuked by their programmers for expressing my opinion that their obsession with (b)locking issues in their MS SQL Server 2005 database(s) did not make much sense to me.
What are the possible issues with locking, blocking or deadlocking if financial operations are never updated or deleted, and even incorrect operations (parts of "transactions"?) are corrected by inserting (adding) new correcting records into the database(s)?
What is the term for this in English? In other languages it is called storno, stornoed (?) operations/records.
So, as I understand, the "transactions" are really never rolled back and there are never incorrect/non-existent records, only non-actualized ones.
Update:
I googled for storno and could not find any results with its definition in English or its use in English texts.
I found a definition for storno (in Latin letters) only in Italian.
But accounting was invented in Italy and many Italian accounting terms are used in other languages, for example in Russian accounting (banking).
I also thought that it was an internationally accepted practice in financial accounting, isn't it?
Update2:
S.Lott gave me a link to The way that transactions are reversed in an ERP application is a big deal!, which says that storno is a reversing transaction.
Well, this is not correct. Storno is not only a transaction; it is any operation (part of a transaction) that corrects an incorrect operation, though 2 operations combined might seem to reverse a transaction (consisting of 2 operations - crediting and debiting the target and source accounts).
So, is storno not a common financial accountancy practice throughout the world?
Anyway, I'd like to avoid discussion of accountancy details/techniques/terms and to restrict the question to the context in which records are never deleted or updated.
What are the possible problems with locking, blocking, deadlocking, performance in this context?
"Storno Transactions" or "Reversing Transactions" are summarized nicely. In lots of places.
http://richardatopenbravo.blogspot.com/2010/02/way-that-transactions-are-reversed-in.html
http://help.sap.com/saphelp_46b/helpdata/en/d2/6f921f415e11d182b10000e829fbfe/content.htm
http://forum.wordreference.com/showthread.php?t=1875166
Don't conflate software implementation with accounting. A reasonable implementation can get by with minimal locking. That doesn't mean anything, however. You may have earned a rebuke because the software is (a) badly designed and (b) requires careful locking because of poor design.
What are the possible issues...?
Since we don't know how well or how poorly the software is written, it's impossible to guess. They may know something about their system that you didn't know.
A simple storno transaction system should be easy to implement. Indeed, it should be trivial.
A pair of "insert-only" tables can still encounter deadlocks if page-level locking is used during the inserts.
Table A, page 1 has an insert in transaction X.
Table B, page 2 has an insert in transaction Y.
Table B, page 2 has an insert in transaction X.
Table A, page 1 has an insert in transaction Y.
The only way to avoid deadlocks is to have the entire system use a single table. Or have all transactions limited to a single table. Or use a single database-wide lock.
If you have multiple-table operations (and page-level locking) then you will still have potential deadlocks even with insert-only operations. Clearly it's rare, but still possible.
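A hedged sketch of that interleaving in T-SQL (the tables, columns and the assumption that inserts take page-level locks are all illustrative):
-- Session 1 (transaction X):
BEGIN TRAN;
INSERT INTO dbo.TableA (Id, Amount) VALUES (1, 100);   -- locks a page of TableA
-- Session 2 (transaction Y):
BEGIN TRAN;
INSERT INTO dbo.TableB (Id, Amount) VALUES (2, 100);   -- locks a page of TableB
-- Session 1:
INSERT INTO dbo.TableB (Id, Amount) VALUES (3, 100);   -- waits on session 2's page
-- Session 2:
INSERT INTO dbo.TableA (Id, Amount) VALUES (4, 100);   -- waits on session 1's page: deadlock
-- The engine picks one victim and rolls it back; the other session can COMMIT.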

Resources