7 Metrics to Watch When Monitoring FeatureBase Clusters

When running FeatureBase, you can keep an eye on a handful of key metrics to ensure you're getting optimal performance. Watch the video to see Distinguished Engineer Matt Jaffee walk through what he monitors in our FeatureBase Cloud product and why he pays closer attention to certain metrics. In the video, Jaffee is looking at a DataDog dashboard, but the same monitoring could be done with other tools. If you don't have time to watch the full video, we've included the highlights below. Some of these metrics are FeatureBase-specific, but many are not and could be helpful for monitoring any database cluster.

7 Key Metrics to Watch

User CPU Usage by Host: This is specifically user CPU time (system time is not included). If you're running dedicated hosts, this will give you a view into how much CPU FeatureBase is using.
Load Average: This is a fairly complex metric, so we won't dive deep into it here, but we keep it on the dashboard for reference.
Used Memory per Host + Usable Memory per Host: FeatureBase keeps a lot of files mmapped, which means that used memory per host and usable memory per host are not inverses of each other. When you mmap a file, that memory counts as used until it gets paged out. So if you have plenty of memory and a bunch of mmapped files, it'll look like your memory is used, but it's all still usable. If you're looking at your FeatureBase metrics and notice that usable memory takes a dive, it could be that you're running a huge query, but it could also mean there's a memory leak or a bug that we can work with you to fix. In a steady state, your usable memory with FeatureBase should look almost as big as your used memory.
Disk and I/O Metrics: These include things like Disk Latency and I/O Wait (how long I/O operations are taking). The main thing to keep in mind with this group of metrics is that during ingest, you'll know you've hit the hardware limits when your I/O Wait starts spiking. You can also keep an eye on Average I/O Across Hosts to get a feel for how much bandwidth your particular system has and when you're saturating it. It's important to watch both bandwidth and the number of individual I/O operations (Avg Ops Across All Volumes).
Disk Usage: If you're ingesting data, you'll see disk usage climbing, and it should correlate with the Bulk Inserts per Second per Host graph, which shows how many records per second are coming through bulk insert queries. (Note: it can be helpful to include a moving average in this graph.)
Network Traffic: For FeatureBase Cloud, these are EC2-specific metrics, but you can pull similar host metrics on other networks. Keep in mind that if you're using EC2, EBS volumes count towards network traffic as well, so if you're doing a lot of writes to EBS, you'll see those on your EBS-specific charts and they'll also be reflected in the network traffic charts.
SQL Requests per Second: Every SQL query that comes in, not just inserts, shows up in this chart.
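
To make the used vs. usable memory point concrete, here is a minimal Python sketch of how the two numbers can both be high on a host with lots of mmapped files. It assumes a Linux host and assumes "used" is computed as MemTotal minus MemFree and "usable" is read from MemAvailable in /proc/meminfo; your monitoring agent may define these differently, so check its documentation. The function names and the sample figures are made up for illustration and do not come from FeatureBase or the dashboard in the video.

```python
from __future__ import annotations

# Hypothetical /proc/meminfo excerpt for a host whose page cache is
# mostly filled with mmapped file pages (numbers are illustrative only).
SAMPLE = """\
MemTotal:       65536000 kB
MemFree:         6553600 kB
MemAvailable:   58982400 kB
Cached:         50000000 kB
"""


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_summary(fields: dict[str, int]) -> dict[str, float]:
    """Compute used and usable memory under the assumptions stated above."""
    total = fields["MemTotal"]
    # Assumption: "used" is everything that is not free, so cached
    # (including mmapped) file pages count as used.
    used = total - fields["MemFree"]
    # Assumption: "usable" is the kernel's estimate of memory that can be
    # handed to applications without swapping, which includes reclaimable cache.
    usable = fields["MemAvailable"]
    return {
        "used_kib": used,
        "usable_kib": usable,
        "used_pct": 100.0 * used / total,
        "usable_pct": 100.0 * usable / total,
    }


if __name__ == "__main__":
    summary = memory_summary(parse_meminfo(SAMPLE))
    print(f"used:   {summary['used_pct']:.1f}% of total")
    print(f"usable: {summary['usable_pct']:.1f}% of total")
```

With the sample above, used and usable memory add up to well over 100% of total, which is why the two graphs are not mirror images of each other. On a real Linux host you could pass the contents of /proc/meminfo to parse_meminfo instead of SAMPLE.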

If you have any questions or are interested in the JSON configuration for this DataDog dashboard, please reach out or send us a note in Discord!