Transcript of Greg Young's Talk at Code on the Beach 2014: CQRS and Event Sourcing
This talk, given by Greg Young in 2014, has become an essential piece of content for all who want to learn more about CQRS and Event Sourcing. It has become part of most training guides, and many new starters at companies that use Event Sourcing are encouraged to watch it. It neatly describes the evolution from CQRS to Event Sourcing, covers CQS, and details just how far back in history the core concept of an append-only, immutable log goes. This video helped establish the use of an Event Store as a useful tool for financial records, with the now iconic slide of the ledger with an entry being erased. The video also explains that there are other business domains that can benefit from having the history of business actions as the source of truth. The slides have been redrawn here for clarity, with permission from Greg.
This video is an important artefact in the history of Event Sourcing, and it's important to make the content accessible to all. Take the time to watch the video and refer to this transcript when needed.
Okay guys, we're gonna get started. Okay, so I want everyone to stand up really quick… And now I know you can take directions, so sit down. Listen, for those who don't know me, my name is Greg Young. My day job is designing a database, as in building the actual database engine. I normally do not wear shorts and t-shirts around, but we are on the beach, so that's okay.
What we're going to talk about today is CQRS and Event Sourcing, and this is actually a big problem, just on the slide: everyone always talks about CQRS and Event Sourcing, when really it's Event Sourcing and CQRS. When I first started teaching people about CQRS and Event Sourcing it was advantageous to teach them CQRS first and then teach them Event Sourcing. You can use CQRS without Event Sourcing, but with Event Sourcing you must use CQRS, and we're going to talk about what both of those concepts are.
I've been talking about this subject for a very, very long time; in fact, the first time I ever talked about this was in 2006 at QCon San Francisco, and actually some of the slides are still the same from all the way back then. I have probably done this or a similar talk 50 times over the last couple of years. That first talk was actually awful. So I come in, and I've never really talked at conferences or anything like that before, and I'm talking about domain-driven design and messaging systems and these new patterns. And I come in, and in my front row are Martin Fowler, Eric Evans and Gregor Hohpe. I'd never met any of them before. I think I went through my entire 60-minute talk in something like 17 seconds! We will not do that today.
Now, when we first started looking at things like Event Sourcing, and where the ideas came from: they're not new ideas. In researching the book and trying to trace the history of Event Sourcing, I have managed to bring it all the way back to Mesopotamian Sumeria. So if you come to me afterwards and say "how do I convince my company to use these new technologies?": they're not new, people have been doing this for decades. In fact, when I used to work with mainframes, before SQL databases got popular, this is actually how we tended to build systems. Your database actually works with Event Sourcing internally. At the time I was working with algorithmic trading, and when you're working with algorithmic trading you really, really need a safety net. You need everything in your system to be deterministic, so you can go back to any point in time and figure out exactly what was going on. Why? Because what happens when you're doing 500 trades per second and your algorithm decides it wants to find places it can lose one penny at a time? You can find a whole lot of places to lose money in the stock market. The other thing we needed was to be able to do scientific measurements over time, comparing time periods to time periods, maybe in a deterministic way. But the biggest thing that we needed was an audit log.
How many of you have an audit log today for your system? Raise your hand. Okay, now keep your hand up if you can prove that your audit log is correct. What's the point of your audit log if you can't prove that it's correct? I know some characters that had this problem...
Hansel and Gretel were walking through the forest and they got lost, so they started leaving an audit log... Correct, they started leaving little pieces of bread so they would know where they were and they'd be able to get back. And unfortunately, little woodland critters were coming behind them and eating the bread. They couldn't prove their audit log was correct. I would argue they would have been better off in the story if they had not had an audit log at all, because, well, they would have had the pieces of bread when they got to the house, they wouldn't have been hungry, they never would have gone in, and they wouldn't have ended up in the oven. Which is the real ending of the story, if you read the real story. But this is a big problem: we have audit logs we can't prove are correct, and this is one of the main things we are trying to solve: we need an audit log that we can prove is actually correct.
Now there are a lot of industries that need to do this: anything regulated, you need to be able to prove the audit log is actually what happened. We're gonna go through some cases, real cases, where people failed on this and how you can actually solve them. And what we came up with was that every state transition inside of your domain should be a first-class concept inside of your system. (How many have heard of domain-driven design? When I say domain I'm referring to a domain model similar to what's in domain-driven design: an object model we can imagine, but it may be functional as well.) When we talk about a state transition that occurs inside of that model, it's not that "update table set some column equals value", it's a fact, and a fact is what should come out of your domain model. "Customer, status level, promoted": that's very different from "update table set column equal value", and if you start modelling like this a lot of your problems will actually go away.
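To make the distinction concrete, here is a minimal sketch (not from the talk; the event name and fields are illustrative) of a state transition modelled as a fact rather than as an update:

```typescript
// A fact that came out of the domain model: something that happened,
// expressed in the language of the business.
interface CustomerPromotedToPreferredStatus {
  type: "CustomerPromotedToPreferredStatus";
  customerId: string;
  promotedAt: Date;
}

// Contrast with the structural alternative, which records no intent at all:
//   UPDATE Customers SET StatusLevel = 'Preferred' WHERE Id = @id
```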
What's most interesting for me about Event Sourcing is that not a single mature industry uses the concept of current state. None. Now, when I say "mature", I do not mean Groupon. I'm thinking: finance, banking, insurance, things that have been around for hundreds of years; not a single one of them has the concept of current state inside of it. How many of you have a bank account? Do you think your balance is a column in a table, or is your balance an equation? Your balance is a summation of all the previous transactions' values on your account. What would happen if it were just a column in a table? So you have a disagreement with your bank about what your balance is, you call them up on the phone and say "yes, it says I've got 103 dollars but I think I should have 207." And they go "all hail the mighty column! The column says 103, that's what you have". But today you can call up and you can say "well, I can add up all the things in your account, and we can check whether or not it's actually correct". Your balance is a first-level derivative off of the facts on your account, and we can do this with all forms of current state; it's not just about things like bank account balances, you can do it with any business problem that you will ever come across. We'll talk about this more in a minute, but there are other reasons why they don't keep current state; instead, current state is transient, a first-level derivative.
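As a rough illustration of this point (the types are assumptions, not any bank's actual schema), the balance as a derivative of the transactions might look like:

```typescript
interface Transaction {
  amount: number; // positive for deposits, negative for withdrawals
}

// Current balance is never stored: it is a first-level derivative,
// a summation over the facts on the account.
function balance(transactions: Transaction[]): number {
  return transactions.reduce((sum, tx) => sum + tx.amount, 0);
}

// balance([{ amount: 207 }, { amount: -104 }]) === 103
```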
This is my quintessential structural model. How many of you have a structural model? Today it could be in SQL, it could be in a document database, it could be in an XML file. What it describes is the structure of a piece of data, and this is what we are told to find when we are in University, right? You just got to find the structure of the data. Structure of data isn't that important.
Now let's take this. This is not the only way that I could represent this particular piece of information: I can also represent the same information as a series of events, facts. So now we have "cart created", "three items added" and "shipping information added", which doesn't fit quite right at this resolution. This is actually five events; when I tried to put the three "item added" events in, the boxes got really little and you couldn't read them any more. But I could, at any point in time, take these five events, run them through a piece of code, a function, and pass you back that, correct? I could take my five events that are saved and transform them to give you back a structural model.
And this is what people are doing when they're doing Event Sourcing. Event Sourcing is all about the storing of facts, and any time that you have structural models of state, they are a first-level derivative off of your facts, and they are transient. When I say transient, it does not mean in-memory. Transient means I am willing to delete them. I make them persistent, but I could at any point in time delete them and rebuild them. Could I delete this and then replay those events again? Imagine that: anything in your system you can just delete and replay and get back. There are a lot of differences between these two. How many of you have refactored a domain object before? Renamed something on it? And then you have to go make your SQL migration scripts, correct? What happens when I rename something here, and I'm storing this? When I restart my server, do I need to run a SQL migration script? No. More importantly, what happens when I have two servers that are running side-by-side: one's running version one, the other one is running version two. Could each server have its own view of what these facts are? So one of them has this model, another one has a slightly different model. Sure, there's no problem at all with this. When we start looking at this, our facts come back down to our use cases, and one thing you will learn over time is that structure changes more often than behaviour. Your use cases of a system tend to be reasonably stable over long periods of time. How you interpret your internal data, the structure that you use, has a tendency of changing a lot. But we haven't gotten to the reason why most businesses that are mature are using this, and this slide (I've been using it, literally, since 2006!), this slide took me forever to find, and the reason why is that it shows an accountant erasing something in the middle of their journal.
Accountants do not do this unless they work for Enron. You do not erase something in the middle of your ledger. This is highly illegal. If you took a class in accounting, you were probably told that accountants don't use pencils: they use pens, and it's the same thing when we talk about Event Sourcing. When we talk about Event Sourcing you can never, ever update an event, and you can never delete an event. To be fair, there are some circumstances where you want to delete events over long periods of time, and for instance in our database, we support doing that. Conceptually, you never do it, and I know some people in the room are probably going "but that data must get very big over time". How many of you are doing more than 1 million transactions per day? 5 million? 10 million? So I'll give you guys a hint: if I can take all of your data and put it on a microSD, you don't have big data. If I can put all of your data on a microSD, you do not need to worry about ever deleting any data, providing your system has been running for more than, like, a year. What you need to consider is where you sit relative to Moore's law. Are you gaining data faster or slower than that exponential curve? And, to be fair, I forget the name of the guy, it's another law associated with Moore's law, that actually claims data growth is actually faster than Moore's law, by a little bit. If you are not going faster than Moore's law, then you don't have to worry about the amount of data that you have. Your data will just continually get cheaper and cheaper and cheaper for you to store.
If I were to look three years ago, how much would it cost me to get a one-terabyte SSD? Or one terabyte of SSD? Today, for, what, five hundred dollars, I can go get a Samsung 840 mSATA one-terabyte SSD drive. The thing will do over a hundred thousand IOPS. Where do you think we'll be in three years? Where do you think we'll be in five years? So for most of you, do not worry about deleting events, you probably don't need to, and we're going to talk more about this.
But, accountants don't do this. What do accountants do? Accountants, if they transfer 10,000 to your account by accident and they meant to transfer 1,000, will do one of two things. They will either take 9,000 back, which is known as a partial reversal, and accountants don't like doing this. Okay, sure, if it's 10,000 with 9,000 coming back, it's not too big of a deal, right? What did I intend to do?
[Voice offscreen] Correct the mistake.
Yeah, but what was the amount I intended to send them? Ten thousand, nine thousand, okay, you guys can figure it out pretty quick. What if there are six accounts involved and they're not perfect, even numbers? Now you're gonna be getting out pencil and paper and trying to actually work out what was originally intended by me. So accountants don't like doing partial reversals, because of the pain for auditors, who are actually reading through the books. They tend to do what's known as a full reversal. And in a full reversal I give you 10,000, I take 10,000 back, and then I say here's the 1,000 that I originally intended to send, so now the auditor going through goes "oh, that's a cancellation, and here's what was intended". All of these same things apply in event-sourced systems. Just like in a ledger, you are not allowed to ever go back and update something, you can only add new things.
You can do corrections; let's take an example. So I've got "cart created", "three items added", "one item removed" (which is also screwed up because of the resolution) and then "shipping information added" (which is screwed up because of the resolution), or maybe it's "shipping informatee on added". Now my question for you guys: is this the same as "cart created, two items added, shipping informatee on added"? I see some people going "yes" and some people going "no". It depends on my perception; if I were to go and look at it with this perception, would those two be the same? Yes, but what if I looked at it from the perception of "I want to count how many items were removed"? So I have "product code, number of times removed"; would those come out the same? No, and this is one of the things that really leads to people using Event Sourcing, and in most places I find it's the business that drives the use of Event Sourcing, not the technology reasons. And the reason why people are using it is that it's the only model that does not lose information. No matter what other model you choose, any structural model, I don't care what it looks like, unless your structural model is actually an implementation of Event Sourcing (like many accounting systems do, where they have a transaction table with a transaction type, which they then join out to another table that holds the fact), any structural model like this will lose information.
How many of you have an update or a delete statement in your system? Okay, keep your hands up. How many of you sat and talked with the CEO and Board of Directors of your company about how that data had no value? How many of you can predict where your company will be a year from now, and what they may ask you about today? Do you have a magic 8-ball? What will the company ask me? Will they ask me about this? I think not. So how did you make this decision to destroy data? You personally didn't feel it was valuable, or maybe you didn't think about it. Data is massively, massively valuable, and any time you choose one of these you are losing data. The fun part is figuring out what data you're losing. How many of you have had a businessperson come to you and ask you for a feature on things that you should have had, because you've been doing this behaviour for a long time, but you lost the information? You're updating a table, and then they ask you about the history of that particular value, as an example. We're losing information here; let's go through and look at this with a real business use case, and show how we're losing information here.

So our business user comes in and says "you know what, I want a new report. And this report, what I want it to say: I want to look at how many people are removing items within five minutes before they check out in our system, because I think they're more likely to buy those items in the future than they are the other things that we show them". Why? When do you remove an item from your cart five minutes before you check out? I can tell you when I do it. So I go to my cart and I'm looking, it's like $350, and I'm like "so I've got a choice here: I can either remove two or three of these items from my cart and get the rest shipped to me, or I can get them all shipped to me and my wife will have my head". So the choice is "everything, no head", or "some things with head", and normally I choose the "some things with head". And it doesn't mean I don't want those things anymore, it means that I don't want them right now; I prefer to have my head. I am prioritizing them lower than my head. So in this system, how would we do this report? Maybe we'll add a new thing off the top called "removed line items", then we're gonna write a report which will go to my "removed line items" and then do a subquery to see if I've actually bought that item in the future. Okay, we release to production, the business guy goes and runs the report. What does he see?
[Voice offscreen] Nothing.
Nothing. That report applies from right now forward in time. Okay, maybe he gets one thing, because there was a guy that was really fast and he... he ordered a bunch of stuff, then he immediately put the stuff in his cart again and ordered again, so he gets one thing. Let's try it: same report, but let's do it in an event-sourced model. Now, I haven't talked about the name for this thing yet, but it's called a projection. A projection is some code that goes over a series of events and produces some form of transient state off of it. And projections are very useful. A projection could go to a SQL server, a projection could go to Neo4j, a projection could go to your in-memory domain object: all of them! They're just little bits of projected state off of event streams.
So what we're gonna do here is write a projection, and what I'm gonna look for is "item removed", and in my little state, as I'm going over the event stream, I'm going to say "if I find an item removed, put it in with the ID of the item and the time of the removal. When I get to the checkout, that's the 'shipping information added' (sorry, the 'shipping informatee on added'); when I get that, look to see if any of the removed items were within that five-minute window. If they were, mark them as items to search for in the future." So I put them into a little thing, let's say a map: these are the items I'm looking for, found equals false. And then as I go forward in the event stream, if I see someone actually bought that item in the future, I mark it as true.
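As a rough illustration (the event names and fields are assumptions, and the stream is simplified to a single interleaved sequence of cart events and later purchases), the projection he describes might look like this:

```typescript
type Event =
  | { type: "ItemRemoved"; itemId: string; at: number }
  | { type: "ShippingInformationAdded"; at: number }
  | { type: "ItemPurchased"; itemId: string; at: number };

const FIVE_MINUTES = 5 * 60 * 1000;

// A projection: code that goes over a series of events and produces
// some form of transient state off of it.
function removedThenBoughtLater(events: Event[]): Map<string, boolean> {
  const removed = new Map<string, number>(); // itemId -> time of removal
  const candidates = new Map<string, boolean>(); // itemId -> bought later?

  for (const e of events) {
    switch (e.type) {
      case "ItemRemoved":
        removed.set(e.itemId, e.at);
        break;
      case "ShippingInformationAdded":
        // Checkout: keep only removals within the five-minute window,
        // with "found" initially false.
        for (const [itemId, at] of removed) {
          if (e.at - at <= FIVE_MINUTES) candidates.set(itemId, false);
        }
        removed.clear();
        break;
      case "ItemPurchased":
        // A candidate item bought in the future: mark it true.
        if (candidates.has(e.itemId)) candidates.set(e.itemId, true);
        break;
    }
  }
  return candidates;
}
```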
What I haven't told you is that when you run a projection, it must go all the way back to the very first event that your system wrote, and come forward from there until it hits right now. So we run this, maybe it takes a weekend. Monday morning, we come into the office and run our report. What do we see?
[Voice offscreen] Everything.
We don't only see everything; not only can I see everything at that point, I can also tell you at any point in time what this report would have told you, if we had had this report at that point in time. Imagine that! So the business expert comes in and says "you know, that's great what you've shown me today, but what if I had this report on July 28th 1993 at 4:14 in the afternoon, what would it have told me?" And you can say "sure, I can show you that. I can show you your report at any point in time during the existence of the system."
This is why people like Event Sourcing from a business perspective: it allows you to time travel, and this time travel is immensely valuable. But there are some other things about Event Sourcing, and we can start discussing a little bit of the "programmer pornography" involved with it. How many of you have bought a hard drive before?
Have you ever noticed that there are two speeds that they tell you about for writing? One is random, the other is sequential. How would you write an event-sourced system? It's sequential, right? Which of those two is faster? There are lots of things about Event Sourcing, from a developer perspective, that are nice. So if you can never, ever change an event, and I were to expose it over HTTP, what would I set the cacheability to? How do you scale immutable data? Well, you just make copies of it, right? Because you never have to worry about any synchronization due to change, it's a very easy model to scale up. And there are other things we can run into. I know none of you guys would do this, but how many of you have had a junior put a bug into the code that you work on, and then you had to fix it? Never mind that the junior was you yesterday. The worst ones are the ones where you have a user that's using the system, and the user's got the system into a state where some weird stuff is happening, and then they get the system out of that state. So they're working with it and something's going wrong, and then they manage to get out of that state and then things are working again, and they call you up and say "your system was broken but I fixed it". Then you go "ah okay, thanks for the bug report, we'll try to reproduce this, what did you do to get there?" "I don't know."
What if we were using an event-sourced system; could I go back? Let's imagine I saved all the commands coming into my system. I could find Joe. What commands is Joe issuing? I could go down and read the commands that he was actually issuing against the system, and see the results going back to him. And I've got a special little console app, and what it allows me to do works just like my normal domain model in production; the only difference is that when you use this one, you can tell it to run at a version of the event stream. In other words, if I were to go look at my event stream here, you can imagine I was having a problem at the second "item added". So I tell it "only load up to event two, don't load the entire event stream", and process this command. And now I can step through with my debugger, in the code, exactly as the code was running in production when he had this issue. Was it a bug? By the way, where do I get my unit tests from if it's a bug? It basically gives me my unit test to start with. But there are lots of other cool things that you can do.
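A minimal sketch of that debugging trick, with hypothetical names (nothing here is from a real tool): rebuild the aggregate's state from only the first N events, then step through the command handler exactly as production ran it.

```typescript
// Replay only the first `version` events, ignoring everything after, so the
// debugger sees the aggregate exactly as it was when the command arrived.
function loadAt<S, E>(
  events: E[],
  version: number,
  apply: (state: S, event: E) => S,
  initial: S
): S {
  return events.slice(0, version).reduce(apply, initial);
}

// Hypothetical usage:
//   const cart = loadAt(stream, 2, applyCartEvent, emptyCart());
//   handle(cart, joesCommand); // now step through with the debugger
```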
How many of you do smoke testing? Smoke testing is a series of tests, before you go to production, that make it look like real usage. It's not unit testing, it's not integration testing... it's a form of integration testing, I guess. What I used to like to do: I used to rerun, every week, every command my system ever processed in its history, and I would compare my log this week to last week. What changed? Are these things I expected to change, or things I did not expect to change? Now, this will not save you from black swans, nothing will save you from black swans, but you should feel reasonably comfortable if you've rerun every command your system has ever processed through the new software with no unexpected results. You can't ask for much more than that. There are some other interesting aspects that we can get into.
How many have heard of something called a super user attack before? A super user attack is a rogue developer or system administrator with root access who decides to attack. How do you stop a super user attack? Sounds like it might be a small issue; the guy's got root access, he could delete stuff, right? I actually ran into a super user attack working here in America. It was actually one of my first jobs out of university; I was working on gambling systems. In particular, we were doing horse gambling, pari-mutuel wagering. And we had a super user attack going on. We had a pool, it was called a pick six (in fact it's the "ultra pick six", but we'll leave it at just a pick six). In a pick six you have to pick the winner of six horse races. It's hard enough to pick one, but to pick six in a row; it's a pretty big pool usually. And what we would do is hold the bets, because most people are making combinatorial bets, very, very big tickets, and we don't want to have to send all of those over our old serial-port communications back to the central site. So we'd hold them at the remote sites until after the fourth race, and then we would send only the ones that were still possibly winners to the host track. And we had a guy... and there's actually an HBO special on this, it's called "Criminal Masterminds". For me, I've never understood it, because all the people on there got caught, so they're obviously not the masterminds. And the guy doing the rigging, his name's Chris Harn by the way, and there is a television show about this, there are Wikipedia pages; it was the largest gambling scandal in America in the last hundred years.
So what they were doing was they would go through and put in a bet, let's say at Catskill or at Yonkers, one of these smaller OTBs, and they'd put in a bet: one, two, three, four, all, all. So in the first I want the one, in the second I want the two, in the third I want the three, in the fourth I want the four, and then I want all in the fifth, and all in the sixth. And then they would watch the races on television, and he would get on the maintenance line, eject the tape, go edit the bet manually on disk, and then our scan would hit. When the scan would come through, it was no longer one, two, three, four, all, all; it was the first four winners, all, all. So it'd come through and it'd be like "wow, that's interesting, look, this is a possible winner; oh wait, it's a guaranteed winner, it has all, all at the end". And he got caught, but like most super user attacks he did not get caught because he was stupid, he got caught because he was unlucky. In the fifth and sixth legs, it was like a forty-three to one horse and a fifty-six to one horse that won, and they were essentially the only winning tickets in the world, and this is on Breeders' Cup day. It was like a two or three million dollar winning ticket. Now, you can be damn sure when you've got 150 winners, who are each getting paid out two hundred thousand dollars, no one's going to really look at what happened. When there's basically one ticket in the world with these long shots, and then you look at the bet, and it's one, two, three, four, all, all? No punter would ever bet that ticket, because it's a combinatorial bet: if any of those first four horses lose, your entire bet is gone. So they started looking at it: "well, isn't that a bit fishy? What's going on over there at Catskill? Let's get them on the phone. Oh, there was a developer on the maintenance line, this is interesting!" Apparently the FBI picked him up the next day, and he went to federal prison for a while after having talked, giving up other people that he was dealing with, and if I remember correctly from talking with people I worked with, maybe a year ago, I believe he's now living in Peru.
But this could have been prevented on an event-sourced system. How many have heard of something called a WORM drive? Write once, read many. You physically cannot overwrite data on it; in other words, it guarantees the immutability of my events. And then what I'm going to do is take away your physical access to the machine. Which, I don't even need to do that, I could still give you physical access, we'll talk about how, but we're good at that: security is good at physical access, keeping people away. Now, my current state is derived from my audit log: anytime you ask me about current state, I get it off of my audit log. Anytime you want to make a change, I write it to my audit log, and writing it to my audit log is what makes it become part of my current state. My audit log is on a WORM drive. How do you attack me?
[Voice offscreen] bad data
Ah, you can add bad data, but I've still got a log of your bad data. Now, there is a thing that you can do here if I'm not very careful: if I'm a slow-moving system, you could make an entire copy of my WORM drive with everything except for the changes that you want, and then switch the WORM drives, right? But I can avoid this. I can do a write to my WORM drive, basically a little heartbeat, right? I can do one every 250 milliseconds, so all you have to do is keep up to date and move them fast enough. And what I'm going to do in my heartbeat is basically have a sequence that's coming off of some cryptographic algorithm that uses the last value, plus something, in order to get the next value, based on the time; something that you won't be able to reproduce during those 250 milliseconds while you're switching the cables. At this point I'm going to know that you switched the drive, and I can actually detect it and show that you switched the drive on me. Now, I'm not saying that this is the reason that you should use Event Sourcing, by the way, but if you happen to be in a heavily regulated environment, keep this in mind. Your regulators will love you if you have that kind of strategy.
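A rough sketch of the heartbeat idea, under stated assumptions (this is the scheme as described, not any particular product's implementation; mixing in a secret the attacker does not hold is what makes the sequence irreproducible):

```typescript
import { createHash } from "node:crypto";

// Each heartbeat chains off the previous value, a secret, and the time.
// An attacker substituting a copied drive cannot forge the next entry
// within the 250 ms window without the secret.
function nextHeartbeat(previous: string, secret: string, now: number): string {
  return createHash("sha256")
    .update(previous)
    .update(secret)
    .update(String(now))
    .digest("hex");
}

// Hypothetical usage:
//   setInterval(() => {
//     last = nextHeartbeat(last, secret, Date.now());
//     appendToWormDrive(last); // assumed append to the WORM log
//   }, 250);
```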
Now I always get the question "what happens when we have lots of events?" Normally we do not event source a system as one stream of events for the entire system that we replay. Instead, what we do, for people who know domain-driven design, is have an event stream per aggregate. That way I only need to replay, let's say, 20 events' worth. But some aggregates can get big. Conceptually, we start from the beginning and we go to the end. And by the way, for those who do not know what a domain-driven design aggregate is, "document" is another good way of thinking about it. If you've used a document database, every document represents a stream of events, and it's basically a partition point in the system. Now, conceptually, we always go from the beginning to the end, but what if one of my aggregates or documents was Google's order book for the stock market; how many events do you think happen in Google's order book in the stock market throughout a day of trading? So at 3:30 in the afternoon something goes down and I need to replay it. Do you think there are, like, five events in there, or do you think it's like five million? Replaying five million events to get back an order book sounds like it will take a little while, and it's an expensive operation; I probably don't want to do that. But there's another trick that we can use, and this is called a rolling snapshot. What happens with a rolling snapshot is that, at points in time, I snapshot the state as it was at that point in time for that projection, and now instead of having to replay all of the events, you only have to replay from the snapshot forward. By the way, all of this comes back to functional programming as well. Event Sourcing is a functional data storage mechanism. How many of you have done functional programming before? Okay, for all of you guys I'm going to simplify Event Sourcing for you: current state is a left fold of previous behaviours. Simple! A snapshot is a memoization of your left fold, nothing more. A snapshot is just me saying that at event four, this is what the state of this projection was. How often do our events change? Never: they're immutable. So when would that snapshot ever go bad at event number four? Never.
[Voice offscreen] What if I miscalculate it?
If I miscalculated it, then that's not the same projection, that's now a different projection. By the way, I'll just add this: you can never change a projection, you can only create a new projection. There's no such thing as editing a projection: if you edit a projection, it needs to go all the way back to event zero and come all the way forward, correct? You can never edit a projection.
Now here, instead of going from the beginning to the end, I go from the end to the beginning. Number six, are you a snapshot? Nope. Number five, are you a snapshot? Nope. Number four, are you a snapshot? Yes. Now, from the snapshot, go forward until you hit the end of the stream. But you have to be very, very careful with snapshots. First of all, never implement them like this. This looks really nice in a slide, but it's utter crap inside of your system; there's a really subtle problem here. So let's say that we're at version four right now, and I'm currently taking a snapshot; then I go to write down my snapshot, but then he writes version five. Well, then I get an optimistic concurrency exception, right? So now I'm making a snapshot, and I get to writing it down, and she writes version six: optimistic concurrency exception. What happens when we're getting events written to the stream, let's say, every 20 to 50 milliseconds; will I ever be able to take a snapshot? Eventually it'll succeed, likely, with some probability. So let me ask a fundamental question: when I took that snapshot at version four, is it any less valid at version four because he wrote version five? It's still perfectly valid at version four, so why don't I just write it off on the side and say "this is a snapshot at version four"? And when you read the snapshot, you come back in and it points back to version four.
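A minimal sketch of that "write it off on the side" approach (the names are assumptions; the event's index in the stream is treated as its version): the snapshot records the version it was folded up to, and loading replays only the events after it, so concurrent writers never invalidate it.

```typescript
interface Snapshot<S> {
  version: number; // how many events were folded into this state
  state: S;
}

// Load current state: start from the snapshot if one exists, replay forward.
function load<S, E>(
  snapshot: Snapshot<S> | undefined,
  events: E[], // the full stream; index = version
  apply: (state: S, event: E) => S,
  initial: S
): S {
  const start = snapshot ? snapshot.version : 0;
  const base = snapshot ? snapshot.state : initial;
  return events.slice(start).reduce(apply, base);
}

// A snapshot taken at version four stays valid even if version five is
// written concurrently: it is stored beside the stream, not inside it.
```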
But be very, very careful with snapshots: how many of you have dealt with migrating a SQL database before? Who enjoys it? So now we've got the migration script, and we need to get it out before we do the release. I never do that anymore, by the way; I always do side-by-side releases, never a big bang release. The reason why is that I'm scared shitless of it. I am not scared of going and releasing my software and then I bring it up and nothing works. Why? Because I know I've got a good strategy to deal with that: it's called roll back. Roll back to the old version, we'll figure everything out tomorrow, we'll try again. What worries me is that I'll roll out and everything's going to work for a week, and then it's going to blow up. So how many of you have written a script to bring a database forward a version for a piece of software? How many of you write the script to do the opposite, to take all the data written in the new schema and convert it back into the old schema, so that you don't lose it? So on teams I've worked on, we had a strategy here, when you had this problem of it running for a week and then blowing up: you had your choice, you could either wear the fireman hat or you could wear the cowboy hat. It's a good strategy. Oh come on, how many of you have worked on a production issue before? And so you are in the weeds in production, and then someone wants to come up and start talking about the Christmas party, and you want to turn around and strangle them. The idea is, if everyone on the team knows whether you are wearing the cowboy hat or the fireman hat, they'll basically walk into your office and just go "okay... you don't need me bothering you right now". And it's done in a very subtle way, as opposed to, you know, strangling people.
Now, what's most interesting for me about Event Sourcing is that there are a lot of domains that are naturally event sourced. I forget the name of the company that's out here now… Availity, they work in one of them. How many of you have been to a doctor before? When you go to the doctor, does he... do you walk in, he takes a picture of you, puts the picture into your folder and throws away the old picture? Or does he fill out forms and continually append them to the end of your file? And we wonder why doctors have a hard time understanding CRUD-based systems, when their natural mental model is the appending of facts. Sound familiar? There are a lot of industries like this. How many of you have ever worked with lawyers? I mean, all these age-old businesses, they're all event sourced. So if I have a contract and we decide we're going to make a change to it, do we just go in and edit the text right in the middle of the contract? Or do we do an addendum? And the only way you can know what a contract actually says is to take the original contract and start applying all the addendums on top of it. Sound familiar? There are a vast number of business problems that are naturally event sourced, and if you use Event Sourcing in these kinds of systems, oddly, everyone will understand what you're doing.
Overall, Event Sourcing is a beautiful, beautiful transactional model. It's append-only, it's immutable. How many of you do more than 100 transactions per second? One, two, three... Okay, let's try a thousand. Okay, so... on this laptop, which is not even a current-generation laptop, it's actually the previous version. I don't know if anyone's looked at the ThinkPad X1; they completely destroyed the keyboard, so I will not buy the current version. No, really, they got rid of, like, all the F keys and replaced them with, like, LED pictures. Looks like a McDonald's cash register. So this is like a two-year-old laptop now. I can push between 15 and 20,000 TPS on a two-year-old laptop that was built to be light and portable, not for power. On a reasonable server, let's say about a $1,500 server, you can pretty easily push 30 to 50,000 transactions per second, and it's good for these kinds of systems. But I'm not going to tell you that's the only thing that you should be looking at; it's the business reasons you should be looking at. Append-only, immutable logs are absolutely brilliant for a lot of things, and they are the ideal transactional model.
But there's a slight problem with them. If I event-sourced my system, how do you answer the question "I want to see all the users with the first name Greg"? Do I replay the entire event log for every user in the system to figure out at the end of it whether their first name is Greg? That'd kind of suck, wouldn't it? Let's see, that's going to be Big O of n, and we've got a million users. Let's compare that to doing a search on a binary tree: log n. Oh, this is going to be spectacular. Basically, imagine that every single query you do has to be a table scan of every event your system has ever done. It would be awesome. And this is where we come to CQRS.
CQRS basically says that you don't want one system: reading and writing are different, and you should make different decisions for reads and for writes. CQRS, at its core, is probably the dumbest pattern ever imagined. So CQRS actually comes from CQS, command-query separation, which is from Bertrand Meyer, who's a very, very interesting guy by the way; if you haven't read his work I highly recommend it. 'Object-Oriented Software Construction' I would recommend getting, probably the second edition. I will warn you, his book seems to be event-sourced: it looks like it's append-only and he just keeps adding to it with every edition. If you get the third edition I believe it's 1,300 pages, but on the bright side you will save money because you don't have to go to the gym. And what CQS states is very simple: there are two types of methods. The first type of method has a void return type: it's called a command. In functional code, we could call it a unit return type. It is allowed to mutate state; it is not a pure function. The second type of method has a non-void return type; it is not allowed to mutate state. It is called a query. You'd be amazed, if you just follow that in your code, how many bugs you'll save yourself from. How many of you have run into a problem where you had something in a loop calling a getter on an object, and you said "I know, I'm going to hoist that out of the loop and that way the code will be faster", and then the code stopped working and you're like "what the Hell?! All I did was hoist the getter out of the loop!" And then you go look, and the getter was actually mutating state, and you found a bug.
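In code, CQS is as simple as it sounds; a minimal sketch (the class is illustrative, not from the talk):

```typescript
class Account {
  private transactions: number[] = [];

  // Command: void return type, allowed to mutate state.
  deposit(amount: number): void {
    this.transactions.push(amount);
  }

  // Query: non-void return type, not allowed to mutate state.
  balance(): number {
    return this.transactions.reduce((sum, tx) => sum + tx, 0);
  }
}
```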
It makes it much, much easier to reason about your code, and the reason that it was so important to do this inside of Eiffel, which is the language that he wrote, is because they have something called contracts. And contracts are normally written on top of pure functions. What happens if I call your contract twice; should that somehow alter your behaviour? That'd be weird. What if I realize I don't need to call your contract, and now you work differently because I didn't call your contract? That's weird.
Now, Martin Fowler wrote that CQS is not a principle; it is instead "a rather reasonable suggestion". If you know Martin, you can imagine him saying that in his British voice! And he gives a counterexample to CQS: stack.pop. Does stack.pop return something? Does stack.pop mutate state? Whoops! Now if I wanted to, I could continue with this, and I could do stack.pop and then stack.value, but that would seem awkward. Have any of you ever looked at how IEnumerable works in .NET? MoveNext and Current. Why not have MoveNext return something? It's following CQS. And CQS becomes much more important when we start talking about things in a distributed system, and the reason it becomes so much more important is because we start talking about things like idempotency, where queries are going to be naturally idempotent and commands are not going to be naturally idempotent. CQRS goes one step further than this. And by the way, how many of you have seen something on CQRS that said underneath it "did you mean cars?" This is because back in 2007 or so, if you typed CQRS into Google it would say "did you mean cars?", but now, after years and I guess enough hits throughout the web, it actually gives you CQRS links.
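For illustration, here is a sketch of a CQS-compliant stack along the lines he alludes to (the method names are assumptions): the conventional pop both mutates and returns, so it gets split into a query and a command.

```typescript
class Stack<T> {
  private items: T[] = [];

  push(item: T): void {
    this.items.push(item);
  }

  // Query: look at the top element without mutating anything.
  top(): T {
    if (this.items.length === 0) throw new Error("empty stack");
    return this.items[this.items.length - 1];
  }

  // Command: remove the top element, returning nothing. CQS-compliant,
  // unlike the conventional pop() that both mutates and returns.
  pop(): void {
    if (this.items.length === 0) throw new Error("empty stack");
    this.items.pop();
  }
}

// const value = stack.top(); stack.pop(); // two calls instead of one
```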
CQRS just goes one step further than this, and it basically says we're going to have two objects now. We're gonna take one object which was filled with commands and queries, and we're gonna make two objects out of it: one with all the commands, one with all the queries. And actually they call this a pattern. How weird is that? It's such a simple concept, but it's an enabling pattern. So let's talk about queries a little bit. When we talk about queries, queries tend to have a different perception of your data, correct? Almost always a query is focused on a screen and what a screen looks like. Does a screen have anything at all to do with managing your transactional invariants? Probably not; if they do have similarities it's accidental, there's no causative relationship between those. Queries are screen-related because you want to do one call across the wire, because you'll be perceived as being faster that way. Another thing about queries: in most systems, queries are what you need to scale. Most systems I look at do on the order of one to two orders of magnitude more queries than they do processing of commands. I've seen them all the way up to four or five orders of magnitude. How many of you think that you do one to two orders of magnitude more queries than you do commands? Raise your hands. Okay, now I want to get the set going here, and see how much intersection we have: how many of you have modelled something around a third normal form database?
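The "pattern" itself, sketched with illustrative names (not from the talk): one object with commands and queries becomes two.

```typescript
interface CustomerDto {
  customerId: string;
  firstName: string;
}

// Before: one service mixing commands and queries.
interface CustomerService {
  promote(customerId: string): void;
  findByFirstName(name: string): CustomerDto[];
}

// After CQRS: the same members, split into two objects.
interface CustomerWriteService {
  promote(customerId: string): void; // commands only
}

interface CustomerReadService {
  findByFirstName(name: string): CustomerDto[]; // queries only
}
```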
I've actually been brought in by clients where they were having scalability problems, and we found out they were doing 99 queries for every one command, and they had modelled a third normal form database and couldn't figure out why they were having scaling problems. So 99 percent of your work is reading, and you optimize for write performance? Cool, I wonder what the problem is here? But queries are generally what we need to scale, and most people, what they're doing today, is they build up, let's say, a domain model on top of a database, and when it comes to scaling they talk about how to scale everything. We don't need to do that; we can talk about how to scale our queries, and not talk about scaling our commands. And there's a really interesting property about queries that makes them especially easy to scale: commands are hard to scale, writes are hard to scale, but queries are very easy to scale, because almost all queries can operate with relaxed consistency.
In other words, queries can be eventually consistent. When it comes time to process a command, I really need my current state in order to be able to do that reliably; it gets very complex if I do validation in an eventually consistent way. But queries can almost always be eventually consistent. How many of you make fully consistent queries today? Let's try this another way: how many of you put a pessimistic lock around every query in your system? Guess what: you're already eventually consistent, you just don't know it. So what happens if I query off the SQL server? I was asked by the client to make a DTO for them, so I read up the data, I'm building the DTOs, I go to put them into IIS to send it back, and then he goes and changes the data. Do I have a magical yo-yo packet in IIS that will take that HTTP request and route it back to me, so I can change it before I send it back to the client? So you are already eventually consistent, you're just not taking advantage of it. We can start taking advantage of it now, and queries become super, super easy to scale. When we start looking at a system, it looks a bit like this.
We're gonna have, on one side, the write side, and by the way, I do this talk a lot in Europe and I always feel bad for the English-as-a-second-language people, because the write side is on the left and very often they get confused. So on one side we've got our domain objects with application services, normal DDD-style stuff, and on the other side we just have this thin read layer that goes directly back to the database, no ORMs or anything crazy like that, and just returns a DTO by querying against the database. This system is very, very easy to scale. How many of these could I make? Can I make ten of them that are the exact same thing? Put a load balancer in front of ten databases, with ten thin read layers and ten remote facades on top of them? So long as your data can fit in one of those databases, you are linearly scalable now. You can geographically distribute them, do all sorts of fun things, and it's almost always the queries that are the interesting part with this. To be fair, if anyone here happens to work in finance: yes, I know you get many, many more writes than reads. There are some systems like that, but they're few and far between.
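A sketch of what that thin read layer can amount to (the SQL, DTO shape, and database interface are all assumptions): no domain objects, no ORM, just a query straight against the database, returning a DTO shaped for the screen.

```typescript
interface Db {
  query<T>(sql: string, params: unknown[]): Promise<T[]>;
}

interface OrderSummaryDto {
  orderId: string;
  total: number;
  status: string;
}

// The entire read side for this screen: one query, one DTO, nothing else.
async function getOrderSummaries(
  db: Db,
  customerId: string
): Promise<OrderSummaryDto[]> {
  return db.query<OrderSummaryDto>(
    "SELECT order_id AS orderId, total, status FROM order_summaries WHERE customer_id = $1",
    [customerId]
  );
}
```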
Now, okay, why is that not going to that slide? That's interesting... Okay we'll just leave it like this since it doesn't want to go to that slide in full screen mode for some reason. Awesome. I think that was Impress crashing. You gotta love Linux right? Ah, we can go through these anyway because we got these three slides that we can see here, and we're gonna, we're gonna go to our summary anyway, when we talk about this as well... Oh. Okay, let's try bringing that one back up...
I'll go here and that was lovely. Ah let's see which one do we have? Not that one, not that one, okay, yeah, that one, Okay.
When we talk about this, it's very uncommon that you will have a single read model. Normally you will have multiple read models for doing different kinds of queries. This one I actually talked about more last year, and I felt bad because afterwards most of the questions I got were about basic Event Sourcing, which is why I'm doing a longer talk about basic Event Sourcing this year. But normally you will use many different types of read models. For instance, I may have a UI that needs to get some data on it; a document database works really well for that, where the documents basically align with what my screens look like. But I may have another thing that wants to do OLAP querying, you know, that crap business people want with reporting, and be able to hook up Excel to it. So why not have an OLAP cube for them? I may want to do full-text indexing, so I may use something like Lucene. Generally I'm gonna have more than one of these, and all of those are just projections. And you can have as many different projections as you want in an event-sourced system; at any point in time I can always add a new model, correct? It just starts from event zero and comes forward. Who cares if it takes the weekend? Well, to be fair, you're probably gonna make it faster than that if you're doing it often; there are lots of tricks to make these come up faster, like, for instance, switching to a batch job versus real time. But also, if I start seeing that my OLAP cubes are getting overloaded, I can make three OLAP cubes, correct? And distribute my load between the three. This is how people are using these kinds of event-sourced systems, and remember that it's not one read model that you will have; most systems require two, three, four different read models to actually work well. And there's massive accidental complexity that comes from trying to use only a single read model.
Now, to summarize: state transitions are important concepts in our domains. The result of an operation, what that operation means, is an important concept: it's a fact. Overall, getters and setters on domain models are a code smell: if you start seeing lots of them, start thinking about what you're doing. And the single thing that I want people to walk away from this with is that you cannot, under any circumstances, have a single model that does everything for you, and does it well. It doesn't exist. There are different types of models, and different models are good at different things; I talked about this a lot last year in 'Polyglot Data'. I mean, how many of you have tried using SQL Server full-text indexing and compared that to something like Lucene? How many of you have poorly implemented a graph inside of a SQL Server? All of this is accidental complexity, and you will not end up with a single model that will work well for everything. Just like Event Sourcing doesn't work well for everything: you can't do a query off of your current state in a purely event-sourced system, you need some piece of transient state to be able to query with. Recognizing this will help you and save you a lot of accidental complexity.
So with that I think we have, like, one minute or one and a half minutes for questions.
[Voice offscreen] With your state transitions, your events, in your event store: is there value in saving your commands corresponding with those state transitions?
So, I tend to save my commands, if only so I can see what my external stimuli were, and I use it a lot for looking at how people are using my system, and for things like debugging or smoke testing, but it's not like they're a core part of my system. I do save them, but not everyone does.
[Voice offscreen] In the event-sourced model, how do you handle something like an inventory problem, or over-selling?
When I load up my aggregate, my aggregate is fully consistent. I can tell you the exact number, and it's fully, 100% consistent, so I can avoid that. To be fair, in a warehouse you're never going to avoid that problem, however, and I'd go back and talk with the business people about this: they want to have a fully consistent warehouse, so how do you stop the people that are stealing, and make them check out the inventory appropriately? At the end of the day, the warehouse is the book of record, not the computer system. And that kind of stuff does happen, so you will oversell things no matter what, and I would have that conversation with them. But in general, let's imagine that it was something where I really am the book of record: then when I load up my aggregate for a given inventory item, it is 100% consistent, and I can say that we will never oversell a product. We don't have to have eventual consistency.
[Voice offscreen] In the event that in the future your event model actually changes, you're gonna add more information to your event, so now you've got a history of events that are version A, and a newer version B of that same event, and when you replay them, you've got the system dealing with the two different models.
So how do we version events over time? I'm going to point you to a video on this, because I actually gave, like, a 45-minute answer to this question on a video, because it's not just one strategy; there are multiple strategies. It's at DDDCQRS.com, about halfway through a seven-hour video talking about a lot of this and looking at code. Well, to be fair, I'm putting up about 20 hours of video now; seven hours is the short version, that's only one day's worth. But there's like a twenty- to forty-minute conversation on that. What most people are doing is they actually drop strong serialization and start using things like JSON, where I load up an event but I'm not guaranteed that that stuff's actually there, because we have weak serialization at this point. If it's not there, I get a default, and it gets rid of most of my versioning problems, but there are a couple of rules you have to follow then, in terms of what you're actually doing in your changes.
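A rough sketch of the weak-serialization approach he describes (the event shape and field names are assumptions): deserialize from JSON into the shape you understand today, supplying defaults for fields that older events never had.

```typescript
// Version A events had no `discountCode`; version B added it. With weak
// serialization we deserialize into today's shape and default what's missing.
interface ItemAdded {
  itemId: string;
  quantity: number;
  discountCode: string | null; // added in version B
}

function deserializeItemAdded(json: string): ItemAdded {
  const raw = JSON.parse(json);
  return {
    itemId: raw.itemId,
    quantity: raw.quantity,
    discountCode: raw.discountCode ?? null, // default for version A events
  };
}
```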
[Voice offscreen] Can you, like, snapshot a new version?
Ah, so, snapshots, as I mentioned, are evil because they have versioning problems, just like all the versioning problems you're used to. If I go through and I ever change something associated with a snapshot, I have to replay all of my snapshots, and it's really a pain. My rule of thumb is, I would not consider using a snapshot until I had more than a thousand events in a stream, and a thousand events is a lot of events for an aggregate. Think about it: you create some document, some aggregate in your system, and it comes into being, and it lives over some period of time, let's say a mortgage application, and then eventually it's done, and normally it becomes immutable or sits there for a very, very long period of time. How many events would go into a mortgage application? 50? 100? Most aggregates have very, very few events. I would not even consider doing a snapshot till I hit about a thousand, maybe more.
[Voice offscreen, question is inaudible]
It depends. Oh, so the question was "isn't it easier to always treat it as a new event, or a new version of the old event?" It can actually make things more complicated, and it depends how you're using your types, because if I do that, then I basically have to lock that type for history. So I'm on version 17; I've got 17 different versions, and I have to do upcasting of all my types. Whereas if I use weak serialization, I only keep the one that I actually understand right now, and I always come off of, let's say, JSON into the one that I understand.
Any other questions? Well, I'll thank all of you guys for coming out, and I won't keep you from your lunch anymore. I know you want to be the first group out, so get right to the lunch line. But thanks!