Neuron Monitor Addon¶
Neuron Monitor collects metrics and stats from the Neuron Applications running on the system and streams the collected data to stdout in JSON format.
These metrics and stats are organized into metric groups, which can be configured by providing a configuration file as described in Using neuron-monitor.
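Because the output is a JSON stream on stdout, a downstream tool mostly needs to parse it defensively. The following is a minimal sketch of such a consumer, assuming each report arrives as a single JSON object on its own line. That line-per-report framing is an assumption here, not something this page states. The function names `parseReportLine` and `topLevelKeys` are hypothetical helpers, not part of neuron-monitor.

```typescript
// Hypothetical helper: parse one line of monitor output into a report object.
// Returns null for blank lines, invalid JSON, or JSON that is not an object.
// Assumption: one JSON document per line.
function parseReportLine(line: string): Record<string, unknown> | null {
  const trimmed = line.trim();
  if (trimmed === "") {
    return null;
  }
  try {
    const parsed: unknown = JSON.parse(trimmed);
    // Accept only plain JSON objects; arrays and scalars are not reports.
    if (parsed !== null && typeof parsed === "object" && !Array.isArray(parsed)) {
      return parsed as Record<string, unknown>;
    }
    return null;
  } catch {
    return null;
  }
}

// Hypothetical helper: list the top-level sections of a report in sorted order,
// which is a cheap way to see which metric groups were included.
function topLevelKeys(report: Record<string, unknown>): string[] {
  return Object.keys(report).sort();
}
```

In a real consumer, these helpers would typically be called from a line-oriented reader attached to the process's standard input, with malformed lines skipped rather than treated as fatal.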
When running, neuron-monitor will:
- Collect the data for the metric groups which, based on the elapsed time since their last update, need to be updated
- Take the newly collected data and consolidate it into a large report
- Serialize that report to JSON and stream it to stdout, where it can be consumed by other tools such as the sample neuron-monitor-cloudwatch.py and neuron-monitor-prometheus.py scripts
- Wait until at least one metric group needs to be collected and repeat this flow
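The flow above can be sketched as a small period-based scheduler. This is an illustration of the described behavior, not neuron-monitor's actual implementation. The `MetricGroup` type, its fields, and the function names are hypothetical, and the `collect` callback stands in for whatever gathers the real data.

```typescript
// Hypothetical model of a metric group; not a neuron-monitor type.
interface MetricGroup {
  name: string;
  periodMs: number;     // how often the group should be refreshed
  lastUpdateMs: number; // timestamp of the most recent collection
}

// Step 1: select the groups whose period has elapsed since their last update.
function dueGroups(groups: MetricGroup[], nowMs: number): MetricGroup[] {
  return groups.filter((g) => nowMs - g.lastUpdateMs >= g.periodMs);
}

// Steps 2 and 3: collect the due groups, consolidate them into one report,
// and serialize it to JSON. Returns null when nothing is due yet.
function collectReport(
  groups: MetricGroup[],
  nowMs: number,
  collect: (group: MetricGroup) => unknown,
): string | null {
  const due = dueGroups(groups, nowMs);
  if (due.length === 0) {
    return null;
  }
  const report: Record<string, unknown> = {};
  for (const group of due) {
    report[group.name] = collect(group);
    group.lastUpdateMs = nowMs; // mark the group as freshly updated
  }
  return JSON.stringify(report);
}

// Step 4: how long to wait until at least one group needs collecting.
// Returns 0 if a group is already due or if there are no groups.
function msUntilNextDue(groups: MetricGroup[], nowMs: number): number {
  if (groups.length === 0) {
    return 0;
  }
  const waits = groups.map((g) => g.lastUpdateMs + g.periodMs - nowMs);
  return Math.max(0, Math.min(...waits));
}
```

A driver loop would call `collectReport`, write any non-null result to stdout, sleep for `msUntilNextDue` milliseconds, and repeat. Keeping the time as an explicit parameter makes the scheduling logic easy to test without real timers.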
Usage¶
index.ts
import 'source-map-support/register';
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { KubernetesVersion } from 'aws-cdk-lib/aws-eks';
import * as blueprints from '@aws-quickstart/eks-blueprints';

const app = new cdk.App();

// Add-on that deploys neuron-monitor to the cluster.
const neuronMonitorAddon = new blueprints.addons.NeuronMonitorAddOn();

// Cluster with a single managed node group of Inferentia instances.
const clusterProvider = new blueprints.GenericClusterProvider({
    version: KubernetesVersion.V1_27,
    managedNodeGroups: [
        inferentiaNodeGroup()
    ]
});

// Node group definition; the function declaration is hoisted, so it can be called above.
function inferentiaNodeGroup(): blueprints.ManagedNodeGroup {
    return {
        id: "mng1",
        instanceTypes: [new ec2.InstanceType('inf1.2xlarge')],
        desiredSize: 1,
        maxSize: 2,
        nodeGroupSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
    };
}

const blueprint = blueprints.EksBlueprint.builder()
    .clusterProvider(clusterProvider)
    .addOns(neuronMonitorAddon)
    .build(app, 'my-stack-name');
Once deployed, you can see the monitor and device plugin daemonsets in the kube-system namespace.
$ kubectl get daemonset -n kube-system
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
neuron-monitor   1         1         1       1            1           <none>          3m12s