Unlimited Advanced Hunting for Microsoft 365 Defender with Azure Data Explorer – Part II

Koos Goossens
16 min read · Jun 7, 2023

Introduction

A month ago I published an article about extending your Microsoft 365 Defender log retention beyond the default 30 days by leveraging Azure Event Hubs and Azure Data Explorer.

I promised a second part to that article, in which I zoom in on sizing, performance and cost considerations.

If you haven't read the first part, I'd advise you to start there and come back afterwards.

If you're mainly interested in getting a better feel for a real-world scenario and the respective costs, click here to go straight to the conclusion.

  • Part I | Introduction and automated deployment
    – Architectural overview
    – Configuring Microsoft 365 Defender
    – Preparations for automated deployment
    – Running the DefenderArchiveR PowerShell Script for automated deployment
  • Part II | In-depth & design choices [📍you are here ]
    – Calculating Defender log size
    – Choosing the right Event Hub namespace(s)
    – Deciding on Azure Data Explorer tiers and options
    – ADX caching and compression

Please visit my Github repository to get started with DefenderArchiveR.ps1!

There is no 'I' in 'team'

Before I dive straight into the technical bits and pieces, I'd like to express my gratitude towards both Mark Harburn and Ondřej Spáčil. Mark reached out to me after my previous post because he knew his colleague Ondřej had already tried a similar approach and had run into (performance) issues.

Over the course of a few weeks Ondřej and I kept in touch and worked together on this, and we were able to help each other resolve the issues on both sides. (Issues I wasn't even aware I had at the time!)

It was great to see that we simply wanted to help each other out, without any financial motivation, just because we wanted the solution to succeed so that others can benefit from it in the future as well.

During these weeks I learned a lot about Azure Event Hub, Azure Data Explorer and all the intricate details that go into monitoring performance and optimizing cost. So it was actually perfect that my article was set up as a two-part story, the second part of which can now benefit from the improvements I picked up along the way…

Thanks guys! 🇳🇱 🇨🇿 🇩🇪 🙏🏻 #community

Event Hub sizing

As detailed in my previous article, the solution depends on multiple Azure Event Hub Namespaces, each with a number of Event Hubs depending on how many tables you want to archive. This makes sure each table in Defender gets its own stream to Azure Data Explorer (ADX), with its own schema and independent performance optimization. (more on this later)

Overview of the solution where multiple Event Hubs span multiple Event Hub Namespaces

Tiers

Azure Event Hub Namespaces are available in different tiers: Basic, Standard, Premium and Dedicated. I currently have this setup running at a couple of large global enterprises, for all of which the Standard tier was sufficient. So, that's currently the default tier used in my code.

Premium and Dedicated might have some advantages for the largest of enterprise organizations. But these tiers are much more expensive, and I think that defeats the purpose of saving money by not storing this data in Microsoft Sentinel.

You might be able to use the Basic tier as well. But due to its limited retention capabilities you risk losing data if something goes wrong on the Azure Data Explorer ingestion side. Besides, Dev doesn't come with an SLA.

Partition count

Within each Event Hub Namespace you can create a maximum of ten Event Hubs, each with its own partition count.

For Premium and Dedicated tiers the maximum amount of Event Hubs per Namespace is 100 and 1000 respectively.

Partitions organize the sequences of events sent to an Event Hub. Each Event Hub can have between 1 and 32 partitions (for the Basic and Standard tiers).

When Streaming API is enabled from Defender, and the Event Hub does not exist yet, it will be created automatically with a partition count of 4.

I assumed this was okay at first, but for larger enterprises this quickly turned out to be another bottleneck. That is why DefenderArchiveR.ps1 will now create Event Hubs with 32 partitions by default. This is not a problem at all, since partition counts don't incur any additional costs.

Keep in mind that you cannot adjust the partition count once the Event Hub is created! So if you happen to delete and re-deploy Event Hubs while the Defender Streaming API is still configured to point towards that Event Hub, it will quickly be re-created with the default partition count of 4. If you need to redeploy or redistribute your Event Hubs across your Namespaces, don't forget to temporarily remove your Streaming settings in Defender.

Throughput Units

The setting with the biggest impact on cost for Azure Event Hubs is the number of throughput units enabled on the Event Hub Namespace. A Standard tier Namespace supports 1–40 throughput units (TPUs). This value determines the capacity and performance of the whole Event Hub Namespace.

A single throughput unit (TPU) lets you:

  • Ingress | up to 1 MB per second or 1000 events per second (whichever comes first)
  • Egress | up to 2 MB per second or 4096 events per second (whichever comes first)
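
To give a feel for what that means in practice, here's a quick back-of-the-envelope calculation expressed as a KQL print statement. The 500 GB/day figure is purely hypothetical:

// Hypothetical table streaming 500 GB per day, averaged over 86,400 seconds
print avgMBPerSec = 500.0 * 1024 / 86400, estimatedTPU = ceiling(500.0 * 1024 / 86400)
// avgMBPerSec is roughly 5.9, so about 6 TPUs on average, before accounting for peaks

Keep in mind this is an average; as you'll see below, it's the peaks that actually determine how many TPUs you need.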

An Event Hub Namespace is able to auto-inflate its TPUs based on the load. It is, however, not able to deflate automatically! You might be able to script an automated solution for this, but be aware that if the Event Hub Namespace lacks capacity during peak loads, you might lose some logs when auto-inflation is not quick enough to increase TPUs.

For this particular solution our ingress and egress will be the same, since we won't be retrieving the Event Hub messages from multiple solutions. So we only need to look at the ingress performance thresholds.

In my experience with large organizations, I never ran into the message count limitations but always ran into the throughput limitations instead. In those cases I also quickly ran out of available throughput units when multiple tables responsible for creating massive amounts of logs were coming in on the same Namespace.

A few examples of log-intensive tables are:

  • DeviceProcessEvents
  • DeviceEvents
  • DeviceNetworkEvents

That's why dividing the Event Hubs for the respective tables evenly across the Namespaces is very important.

The latest version of DefenderArchiveR.ps1 now performs a performance check to determine which tables create the most logs, and distributes these across multiple Namespaces to limit bottlenecks as much as possible.

Optimize cost and performance

Calculating exactly the right number of Event Hub Namespaces, and the necessary throughput units, is unfortunately harder than it sounds. Microsoft provides a KQL query which should determine the required Event Hub capacity, but I'm sorry to say that this query is only partially useful.

let bytes_ = 500;
union withsource=MDTables*
| where Timestamp > startofday(ago(6h))
| summarize count() by bin(Timestamp, 1m), MDTables
| extend EPS = count_ /60
| summarize avg(EPS), estimatedMBPerSec = (avg(EPS) * bytes_ ) / (1024*1024) by MDTables
| sort by toint(estimatedMBPerSec) desc

Microsoft assumes each log entry has a size of about 500 bytes. But in my experience this was waaaay off. Some tables produce log events of up to 2,5MB each!
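
If you want to spot-check per-event sizes for a specific table yourself, a rough estimate can be taken straight from advanced hunting. This is only a sketch: pack_all() serializes the whole row (including column names), so it will slightly overestimate the raw payload.

DeviceEvents
| where Timestamp > ago(1h)
| extend eventSizeBytes = strlen(tostring(pack_all())) // rough per-row size in characters
| summarize avgBytes = avg(eventSizeBytes), p95Bytes = percentile(eventSizeBytes, 95), maxBytes = max(eventSizeBytes)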

In practice, I quickly noticed that the amount of throughput (in megabytes) coming through my Event Hub Namespaces was much higher than I had anticipated.

Calculate table sizes more precisely

To do an actual calculation for a table, we need to concatenate all columns of a single row and summarize the total character count to get anywhere near a useful average estimate. But since each table has different column names, this isn't as straightforward as it sounds.

First we need to use a KQL query to construct the actual KQL we need to do this calculation. Let me demonstrate:

let TimeFilter = "where Timestamp between (startofday(ago(1d)) .. endofday(ago(1d)))";
DeviceEvents
| getschema
| summarize CalculateStringLengths = array_strcat(make_list(strcat("strlen(tostring(", ColumnName, "))")), " + ")
| project strcat("CloudAppEvents", " | ", TimeFilter, " | project totalLengthBytes = ", CalculateStringLengths, " | summarize totalThroughputMB = sum(totalLengthBytes) / (1024 * 1024) * 2")

This query will retrieve the schema of the DeviceEvents table and output a concatenated list of strlen() expressions as CalculateStringLengths. Lastly, a project statement strings everything together, and the result of this query will look something like this:

As you can see, a whole bunch of strlen() functions are used to determine the total string length of all rows combined within a given timeframe.
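
To give you an idea of what that looks like, below is a truncated, hypothetical version of the generated query with only three columns shown; the real output contains a strlen() for every column in the table.

DeviceEvents
| where Timestamp between (startofday(ago(1d)) .. endofday(ago(1d)))
| project totalLengthBytes = strlen(tostring(Timestamp)) + strlen(tostring(DeviceId)) + strlen(tostring(DeviceName)) // ... and so on for every column
| summarize totalThroughputMB = sum(totalLengthBytes) / (1024 * 1024) * 2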

Copy the result, paste it as a new query and run it. It will look back at the previous day (from 00:00 to 00:00), add everything up into totalLengthBytes and, after some additional calculations, end up with the number of megabytes for that given day.

The query above resulted in a totalThroughputMB of 533.544 in my case. That's 521 Gigabytes of data in the previous day on this single table.

If you look closely, you'll see the result is multiplied by 2. Apparently there is some significant overhead involved when streaming logs to Event Hubs. During my investigations this factor gave the best match with the numbers I saw on the Event Hub throughput metrics across all tables.

I tried implementing these queries as part of DefenderArchiveR.ps1 but unfortunately I quickly ran into the API performance constraints for these resource-intensive queries. In my Github repository you'll find a new script named CalculateTableSizing.ps1 which will loop through all tables, gather the total throughput, and output this together with an Event Hub TPU estimation in a file called tableStatistics.json. Feel free to use it to get a better feeling for what you're getting into beforehand. But keep in mind that you might end up having to run it multiple times on a smaller selection of tables.

DefenderArchiveR.ps1 optimizations

While throughput might be a more important indicator for performance and pricing than messages per second, calculating the latter can still help us get an idea of the activity within each table. And this is also more easily retrieved over the API without constantly running into throttling issues.

let m365defenderTables = datatable (tableName:string)[  
<TABLES>
];
union withsource=MDETables*
| where Timestamp between (startofday(ago(8d)) .. endofday(ago(1d))) // Look back at last 7 full days
| where MDETables in (m365defenderTables) // Only look at tables we want to archive
| summarize EventsPerMin = count() by bin(Timestamp, 1m), MDETables // Count events per minute per table
| summarize MaxEventsPerMin = arg_max(EventsPerMin, *) by MDETables // Find max events per minute per table
| extend TPU = MaxEventsPerMin / 60 / 1000 // Estimate throughput units (1 TPU ~= 1,000 events/sec)
| project MDETable = MDETables, MaxEventsPerMin = round(tolong(MaxEventsPerMin), 2), TPU // Round to 2 decimals

<TABLES> is replaced in-line by DefenderArchiveR.ps1 with your own selection of tables you want to archive.

With a single query we can look at all tables at once, count the logs in 1-minute bins across the last seven days, and come up with a ballpark figure for events per second (EPS), or messages per second as Event Hub will process them.

The query adds a TPU column with the estimated number of throughput units required per table. This is used by DefenderArchiveR.ps1 to distribute the tables as evenly as possible and try to avoid any throttling issues.

Visualization of how ArchiveR distributed high-load Event Hubs across Namespaces

In practice you'll end up seeing the TPUs inflate beyond these numbers due to spikes which cannot be calculated precisely from this query and this dataset. More on this below…

Based on the EPS/TPU calculations DefenderArchiveR will distribute the tables evenly across Namespaces

Keep an eye on Event Hub metrics

There are three main metrics you should monitor closely to check whether everything is in order on the Event Hub side:

  • Messages
  • Throughput
  • Throttled Requests

Messages
This metric shows incoming and outgoing messages. These two lines should match up almost perfectly, indicating that Azure Data Explorer is able to keep up with pulling the data out of Event Hubs.

An example of both ingress and egress lining up perfectly

Throughput
Keep an eye on the throughput at the Event Hub Namespace level and zoom in on particular Event Hubs to see whether spikes are hitting your throughput unit limits. Remember: 1 TPU ~= 1 MB/sec, so no spikes should surpass 20 MB/sec when you're running with 20 TPUs on the Namespace. This is also why I've decided to enable auto-inflation by default within the code.

Set metrics interval at one minute and divide by 60 to get a better feel for those specific spikes:

While a full day might display 356 GB of data, which will average out at about 4 MB/sec, some spikes might reach the ceiling of your throughput units like this spike of 22 MB/sec.
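
If you happen to stream the Namespace metrics to a Log Analytics workspace through diagnostic settings (an optional setup, and an assumption on my part), you can chart those per-minute peaks with KQL instead of clicking through the portal:

// Peak ingress per minute, converted to MB/sec, per Event Hub Namespace
AzureMetrics
| where ResourceProvider == "MICROSOFT.EVENTHUB" and MetricName == "IncomingBytes"
| where TimeGrain == "PT1M"
| extend MBPerSec = Total / 60 / (1024 * 1024)
| summarize PeakMBPerSec = max(MBPerSec) by bin(TimeGenerated, 1h), Resource
| render timechart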

Throttling
If your Event Hub Namespace fails to keep up with incoming data, throttled requests will occur. Keep an eye on this metric and make sure that you aren't seeing many:

Crickets! 🦗 Exactly as we like it!

Sometimes you might see a few of them, as illustrated below. This might indicate (but doesn't guarantee) that you're actually facing data loss.

You can only determine how many throttled requests are acceptable by checking your data set. You can double-check your data consistency by comparing Azure Data Explorer table data with Defender's, running the following KQL query on both sides:

union withsource=tables*
| where Timestamp between (startofday(datetime("05-22-2023")) .. endofday(datetime("05-28-2023")))
| summarize count() by tables
| sort by tables asc

This will output an exact row count per table within a given time period. By running the same query on both ends, you'll be able to compare the results.

Azure Data Explorer sizing

Azure Data Explorer (ADX) comes with its own set of choices, and I'd also like to share a couple of tips for making sure your setup runs smoothly and you don't run into any trouble.

Tiers

ADX also comes in a bunch of different tiers, ranging from a very affordable single-instance Developer tier all the way up to dedicated storage- and compute-optimized tiers with blazing fast hot caches and massively sized scaling options.

I believe that for this particular use case you won't need the fastest of clusters, because you'll be using Microsoft 365 Defender for your short-term high-performance queries anyway. My advice would be to start as low as possible, but make sure your cluster can keep up with the data ingestion.

Once you really need to perform a lot of data querying, you can always dynamically scale-up and/or scale-out your cluster temporarily for more performance.

DefenderArchiveR.ps1 will deploy a Developer-tiered cluster with two Standard_E2a_v4 instances by default. I suggest you keep an eye on those metrics after deployment to check whether performance is sufficient. If you want the solution to run in production, I'd suggest upgrading to a tier that also comes with an SLA.

To keep things as affordable as possible, I'd also advise not to implement any hot caching, for the same reasons as stated above. If you really want to cache more than 30 days (already available in Defender) it will probably cost you a pretty penny, since the amount of data flowing into ADX will not be small by any means. And those SSDs simply don't come cheap…
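
For reference, this is roughly what those knobs look like as ADX management commands. Consider this a sketch; the table name and the 365-day retention below are just examples, and you'd run each command separately:

// Serve this table entirely from (cheap) blob storage instead of the SSD hot cache
.alter table DeviceEvents policy caching hot = 0d

// Keep the data itself for 365 days before it is soft-deleted
.alter-merge table DeviceEvents policy retention softdelete = 365d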

Data compression

ADX uses Azure Blob storage as its underlying storage technology. You won't have any flexibility regarding storage tiers (i.e. hot, cool, archive), but you do get data compression as a bonus!

Once data is flowing into ADX, you can retrieve table statistics by running: .show table <TABLENAME> data statistics.

If you want some totals and a comparison of compressed vs. uncompressed size, you can run the query below:

.show table DeviceEvents data statistics
| as hint.materialized=true temp
| extend TotalOriginalSize = toscalar(temp | summarize sum(OriginalSize))
| extend TotalDataCompressedSize = toscalar(temp | summarize sum(DataCompressedSize))
| take 1
| project TotalOriginalSize, TotalDataCompressedSize
| extend diff = TotalOriginalSize / TotalDataCompressedSize

Most tables I looked at achieved up to 11 times compression! That's great for storage costs, which are already way lower than storing this data in Microsoft Sentinel.

More on pricing and examples below…

Important metrics

There are a couple of metrics that indicate a healthy running ADX cluster without any data ingestion issues:

  • CPU | An obvious one; it should not look like the picture below. 😉
  • Ingestion Utilization | Should not max out, otherwise the cluster isn't able to keep up with the data coming in.
  • Ingestion Latency | Indicates how far the ADX cluster is behind on data ingestion.

An example of a cluster which wasn't too happy a few hours earlier. CPU utilization was maxed out and data ingestion latency kept increasing. No wonder my most recent logs were a couple of hours old!
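
Besides the portal metrics, you can also ask the cluster directly whether ingestions are failing. A minimal sketch; an empty result is exactly what you want to see:

// Show ingestion failures of the past day, grouped per table
.show ingestion failures
| where FailedOn > ago(1d)
| summarize failures = count() by Table, FailureKind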

A real-world example

And now let's get to the gist of things! The meat and potatoes! The nuts and bolts! The bottom line! The very essence of why we went to so much trouble setting up a relatively complex solution for archiving our data!

(I believe most of you came for this part… 😉)

I'd like to share some numbers of a real-world example at one of my customers. I hope this will help you get a better understanding of why this can be such a big deal, and maybe help you take an informed decision to (not) look into this any further.

This environment runs 157.075 devices according to the DeviceInfo table, so this is a very large environment. Mileage may vary for your own environment. But even for this particular customer, the solution still provides enormous cost savings compared to storing these logs in Microsoft Sentinel with the native Data Connector. (cost comparison below)

They currently have all nineteen tables enabled, running nineteen Event Hubs across three separate Namespaces and ingesting over 26 million messages a day, amounting to an average of 2,29 TB each day. Each Namespace is currently running at 40 TPUs, so 120 TPUs in total. Average hourly operation will not hit these thresholds, but peaks sometimes do.
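
As a rough sanity check on those numbers: the average load is nowhere near the combined 120 TPUs, it really is the short peaks that hurt.

// 2,29 TB per day spread over 86,400 seconds
print avgMBPerSec = 2.29 * 1024 * 1024 / 86400 // roughly 28 MB/sec on average, or about 28 TPUs across three Namespaces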

Because of this, it's unfortunately not the most optimal setup at the moment. We're sometimes hitting the limits and have seen a tiny amount of data loss in some tables because of this (only a couple of hundredths of a percent). But considering the sheer size of the total amount of logs within this particular large organization, we've decided not to chase these ghosts within tens of millions of events.

Especially because we're currently still running some query performance tests. Afterwards we'll be redeploying the solution. With the recent updates to DefenderArchiveR.ps1 this setup will be split across four separate Namespaces instead, giving it more room for throughput peaks and eliminating throttled requests and data loss.

The Azure Data Explorer Cluster is a Standard_E8ads_v5. We had to scale up from the default Developer tier to keep up with data ingestion as seen in the alarming metrics above.

The current ADX cluster has no trouble keeping up with data ingestion

Costs

If we put all of these parts into the Azure Pricing Calculator, we can check how much all of this is going to cost us.

  • Four Event Hub Namespaces with 35 TPUs each (a 16% increase over the current setup), ingesting about 195 million messages a month.
  • A single ADX cluster running two compute-optimized Standard_E8ads_v5 nodes, ingesting almost 70 TB a month without hot cache, keeping data retention at 365 days and estimating data compression at 8x (which I think is still conservative).
    Let's also apply the 3-year Reserved Instances benefit to get up to 54% savings on compute.

This sums up to a grand total of € 7.418,72 a month.
(West Europe, without additional Enterprise Agreement discounts)

Link to this calculation on Azure Calculator

And while this is obviously a huge sum, let's take a close look at how much storing the same data into Microsoft Sentinel would've cost.

According to the Microsoft Pricing sheet for Microsoft Sentinel and Log Analytics, ingesting 2,3 TB a day will already cost you € 2667,21 a day (!!) on ingestion alone. That's over € 80.000,- a month!
(West Europe, without additional Enterprise Agreement discounts)

Ingesting as Basic Logs is not comparable to the archive solution with ADX since we cannot query Basic Logs with full KQL and data is only retained for eight days.

This is even without taking additional data retention into account. You could opt for Log Data Archive, but then you won't be able to query the data anytime you want, which you can do within ADX.

Included retention is 90 days. Additional data is billed at € 0,122 per GB per month.

  • 275 days of extra retention x 2300 GB x 0,122 = € 77.165,- a month.

So, once you have waited until your data becomes one year old, you're spending about € 157.165,- a month compared to € 7.418,72!

Sure, I know. This is a little bit comparing apples 🍏 and oranges 🍊. Log Analytics is arguably not built for storing 820 TB of data as Analytics logs. And it's also MUCH faster to query than ADX! (hence the price difference)

You could also argue that keeping data for 365 days is a bit overkill.

Conclusion

Once your data is ingested in Azure Data Explorer, you can enjoy endless data retention and endless KQL queries, because your results are no longer bound by the query limits that apply to Sentinel / Defender. ❤️
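
For example, a hunt like the one below, looking a full year back, is something Defender's 30-day retention simply cannot answer. (The domain is just a placeholder, of course.)

// Hunt a full year of network events for a specific indicator
DeviceNetworkEvents
| where Timestamp > ago(365d)
| where RemoteUrl has "suspicious-domain.example"
| summarize hits = count() by DeviceName, bin(Timestamp, 1d)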

I'm also currently looking into connecting Logstash directly to Azure Data Explorer. I think there is huge potential here, especially as an alternative to Sentinel Basic Logs. It's cheaper and even more usable.

But you also need to keep in mind that running Event Hub Namespaces and an ADX cluster also adds operational overhead and potential troubleshooting along their lifespans. So, I'm curious to hear what you think; are you going to look into Azure Data Explorer as well?

I'm proud that you made it to the end of this Part II! 😉 It was quite a ride for me, digging into new Azure services like Event Hub and Data Explorer and figuring out how everything worked, especially from an Infrastructure-as-Code perspective. I hope everything was clearly explained and it sparked enough interest for you to check it out yourself.

If you have any follow-up questions don't hesitate to reach out to me. Also follow me here on Medium or keep an eye on my Twitter and LinkedIn feeds to get notified about new articles here on Medium.

I still wouldn’t call myself an expert on PowerShell. So if you have feedback on any of my approaches, please let me know! Also never hesitate to fork my repository and submit a pull request. They always make me smile because I learn from them and it will help out others using these tools. 👌🏻

I hope you like this tool and that it will make your environment safer as well!


— Koos

Please visit my Github repository to get started with DefenderArchiveR.ps1!
