How to Monitor System Status with Node.js Logs

A practical approach to monitoring system status with Node.js logs

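I. Core Architecture and Workflow

Log collection: inside the application, use a structured logging library (such as Winston, Pino, or Bunyan) to emit logs in one consistent format, so they are easy to search and analyze.
Shipping and centralization: send logs to a centralized system (such as the ELK Stack: Elasticsearch + Logstash + Kibana, or Graylog, or Fluentd) to aggregate and search across services.
Live viewing and alerting: developers and operators watch logs in real time with tail -f or pm2 logs; threshold alerts (for example on error rate or abnormal response time) are configured in Kibana or Grafana.
Metrics and visualization: in addition to logs, expose a /metrics endpoint and use Prometheus + Grafana for time-series metrics and dashboards.
Runtime and rotation: use PM2 to supervise the process and manage its logs; use logrotate or winston-daily-rotate-file to split and archive log files.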

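II. Implementation Steps and Key Code Examples

Step 1: Structured logging
Use Winston to emit JSON logs into two files: error.log receives error-level entries only, and combined.log receives everything at info level and above. The split makes later searching and alerting easier.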

// logger.js
const winston = require('winston');
const { combine, timestamp, json } = winston.format;

const logger = winston.createLogger({
  level: 'info',
  format: combine(timestamp(), json()),
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
    new winston.transports.Console({ format: winston.format.simple() })
  ]
});

module.exports = logger;

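Logging from business code:

const logger = require('./logger');
logger.info('user login', { userId: 'u123', ip: '1.2.3.4' });

// `err` stands for whatever error your code caught; a placeholder is
// created here so the snippet runs on its own.
const err = new Error('connection refused');
logger.error('db connect failed', { err: err.message, retry: true });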

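Step 2: Health checks and system status
Expose /health and /status endpoints and use the os module to report key runtime information. This supports liveness probes and quick troubleshooting.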

// health.js
const express = require('express');
const os = require('os');
const logger = require('./logger');

const app = express();
app.get('/health', (req, res) => {
  const health = { status: 'UP', uptime: process.uptime() };
  logger.info('health check', health);
  res.json(health);
});

app.get('/status', (req, res) => {
  const mem = os.freemem() / os.totalmem();
  const status = {
    freeMemPct: (mem * 100).toFixed(2) + '%',
    totalMem: (os.totalmem() / 1024 / 1024 / 1024).toFixed(2) + ' GB',
    cpuCount: os.cpus().length,
    systemUptime: os.uptime()
  };
  logger.info('status snapshot', status);
  res.json(status);
});

app.listen(3000, () => logger.info('Server listening on 3000'));
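
The /status endpoint above reports host-level figures from the os module. For troubleshooting a Node.js service, process-level figures (heap usage, event loop delay) are often just as useful. The following is a minimal sketch using only built-in modules; the file name, the collectProcessStatus function, and the field names are hypothetical and not part of the original example.

```javascript
// process-status.js (hypothetical helper)
const { monitorEventLoopDelay } = require('perf_hooks');

// Histogram of event loop delay, sampled every 20 ms.
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

// Convert bytes to megabytes, rounded to two decimals.
const toMB = (bytes) => Number((bytes / 1024 / 1024).toFixed(2));

// Histogram values are in nanoseconds; report milliseconds.
// Before the first sample is recorded the mean is NaN, so fall back to 0.
const nsToMs = (ns) => (Number.isFinite(ns) ? Number((ns / 1e6).toFixed(2)) : 0);

function collectProcessStatus() {
  const mem = process.memoryUsage();
  return {
    pid: process.pid,
    nodeVersion: process.version,
    processUptimeSec: Math.round(process.uptime()),
    rssMB: toMB(mem.rss),
    heapUsedMB: toMB(mem.heapUsed),
    heapTotalMB: toMB(mem.heapTotal),
    eventLoopDelayMeanMs: nsToMs(loopDelay.mean),
    eventLoopDelayMaxMs: nsToMs(loopDelay.max)
  };
}

module.exports = { collectProcessStatus };
```

One way to use it would be to merge the returned object into the /status response, for example res.json({ ...status, ...collectProcessStatus() }). A steadily growing heapUsedMB or a high eventLoopDelayMaxMs is usually a more direct signal of trouble in the Node.js process than free host memory.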

Step 3: Metrics and visualization
Use prom-client to expose /metrics, and pair it with Prometheus + Grafana to chart request rate, latency, active requests, and similar metrics.

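// metrics.js
const express = require('express');
const client = require('prom-client');

// Assumption: in a real project this is the same Express app that serves
// /health and /status; it is created here only so that `app` is defined.
const app = express();

// Request latency histogram, labeled by HTTP method and status code
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'status']
});
// Number of requests currently being handled
const activeRequests = new client.Gauge({
  name: 'active_requests',
  help: 'Number of active HTTP requests'
});

function metricsMiddleware(req, res, next) {
  const end = httpRequestDuration.startTimer();
  activeRequests.inc();
  res.on('finish', () => {
    end({ method: req.method, status: res.statusCode });
    activeRequests.dec();
  });
  next();
}

// Register the middleware before any routes so every request is measured
app.use(metricsMiddleware);

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});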

Step 4: Running the app and managing logs
Start and supervise the app with PM2, and view its logs in real time:

npm i -g pm2
pm2 start app.js --name my-app
pm2 logs my-app        # follow logs in real time
pm2 monit             # resource monitoring

Log rotation (Linux):

# /etc/logrotate.d/myapp
/path/to/logs/*.log {
  daily
  missingok
  rotate 7
  compress
  notifempty
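  # Assumption: the Node.js process keeps its log files open (as a long-lived
  # file write stream does), so truncate in place instead of rename + create.
  copytruncate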
}
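
Where logrotate is not available, the same "keep N generations" behavior can be approximated inside the application. The following is a minimal sketch of size-based rotation using only the fs module; rotateIfNeeded, its parameters, and the .1/.2 naming scheme are hypothetical and chosen for illustration, not taken from Winston or logrotate.

```javascript
const fs = require('fs');

// Hypothetical helper: rotate `file` once it reaches `maxBytes`, keeping at
// most `keep` older generations (`file.1` is the newest). `keep` must be >= 1.
function rotateIfNeeded(file, maxBytes, keep) {
  if (!fs.existsSync(file) || fs.statSync(file).size < maxBytes) {
    return false; // nothing to do
  }
  // Drop the oldest generation, then shift the remaining ones up by one.
  const oldest = `${file}.${keep}`;
  if (fs.existsSync(oldest)) fs.unlinkSync(oldest);
  for (let i = keep - 1; i >= 1; i--) {
    const from = `${file}.${i}`;
    if (fs.existsSync(from)) fs.renameSync(from, `${file}.${i + 1}`);
  }
  fs.renameSync(file, `${file}.1`);
  // Recreate an empty current file for writers that reopen it by path.
  fs.writeFileSync(file, '');
  return true;
}

module.exports = { rotateIfNeeded };
```

This sketch only works cleanly when the writer opens the file by path on each write (for example with fs.appendFile). A writer that holds a long-lived stream keeps writing to the renamed file until it reopens the path, which is why rotation tools offer either a reopen signal or a truncate-in-place mode.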

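III. Alerting and Visualization Setup

Log alerts: configure rules in Kibana that send a notification (for example by email, WeCom, DingTalk, PagerDuty, or Opsgenie) when the number of error.log entries within a set time window exceeds a threshold, or when an entry matches a key error pattern.
Metric alerts: configure rules in Prometheus (for example on 5xx ratio, P95 latency, or an abnormal number of active requests) and route them to notification channels through Alertmanager.
Visualization: build Grafana dashboards whose core panels cover request rate, error rate, P50/P95/P99 latency, memory and CPU usage, and active requests.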
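
The log-alert rule described above (too many errors inside a time window) can be sketched in a few lines. The example assumes JSON log lines with the timestamp and level fields that the Winston configuration in Step 1 produces; the function names and the threshold options are hypothetical, and a real deployment would express the same rule in Kibana or Alertmanager rather than in application code.

```javascript
// Count entries and errors whose timestamp falls inside the last `windowMs`
// milliseconds before `nowMs`. `lines` is an array of JSON log lines.
function errorRateInWindow(lines, nowMs, windowMs) {
  let total = 0;
  let errors = 0;
  for (const line of lines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    const ts = Date.parse(entry.timestamp);
    if (Number.isNaN(ts) || ts > nowMs || nowMs - ts > windowMs) continue;
    total += 1;
    if (entry.level === 'error') errors += 1;
  }
  return { total, errors, rate: total === 0 ? 0 : errors / total };
}

// Fire only when there are enough events to make the ratio meaningful.
function shouldAlert(stats, { minEvents, maxErrorRate }) {
  return stats.total >= minEvents && stats.rate > maxErrorRate;
}

module.exports = { errorRateInWindow, shouldAlert };
```

The minEvents guard is a design choice worth keeping in any real rule: without it, a single failure during a quiet period produces a 100% error rate and a noisy page.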

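IV. Production Best Practices

Structure and levels: use JSON logs everywhere, set debug/info/warn/error levels sensibly, and avoid excessive logging in production.
Sampling and redaction: sample high-frequency debug logs; mask sensitive fields such as password, token, and phone before they are written.
Async and performance: write logs asynchronously or in batches so that logging does not block the main thread; avoid expensive serialization on hot paths.
Context and tracing: include trace_id, span_id, and request_id in log entries so that a request can be followed across the whole call chain.
Reliable delivery: the channel to the centralized log system should buffer and retry so that logs are not lost.
Capacity and retention: split logs by day or by size, set a reasonable retention period, and use hot/cold tiering to control cost.
Security and compliance: restrict access to logs, and do not write sensitive data to storage that unauthorized parties can reach.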
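
Three of the practices above (redaction, sampling, and request context) can be illustrated together. The following is a minimal sketch using only built-in modules and assuming Node.js 14.17 or later; every name in it (requestContext, redact, shouldLog, formatEntry, withRequestContext) and the '[REDACTED]' marker are hypothetical. With Winston or Pino the same ideas would normally be implemented through the library's own format or redaction options.

```javascript
const { AsyncLocalStorage } = require('async_hooks');
const { randomUUID } = require('crypto');

const requestContext = new AsyncLocalStorage();
const SENSITIVE_KEYS = new Set(['password', 'token', 'phone']);

// Return a copy of `value` with sensitive fields masked, recursing into
// nested objects and arrays. Assumes plain JSON-like data.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value === null || typeof value !== 'object') return value;
  const out = {};
  for (const [key, val] of Object.entries(value)) {
    out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redact(val);
  }
  return out;
}

// Keep every entry above debug; keep debug entries with probability
// `debugSampleRate`. `random` is injectable so the behavior can be tested.
function shouldLog(level, debugSampleRate, random = Math.random) {
  return level !== 'debug' || random() < debugSampleRate;
}

// Build one JSON log line, attaching request_id from the async context
// when the call happens inside withRequestContext.
function formatEntry(level, message, meta = {}) {
  const ctx = requestContext.getStore() || {};
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...(ctx.requestId ? { request_id: ctx.requestId } : {}),
    ...redact(meta)
  });
}

// Run `fn` so that every log line produced inside it shares one request_id.
function withRequestContext(fn, requestId = randomUUID()) {
  return requestContext.run({ requestId }, fn);
}

module.exports = { redact, shouldLog, formatEntry, withRequestContext };
```

In an Express app, withRequestContext would typically wrap next() in a middleware, so that handlers and the code they call do not need to pass the request_id around explicitly.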
