How to Troubleshoot a Marathon/Mesos Cluster

Published: 2021-12-02 16:29:57  Source: 亿速云  Views: 127  Author: iii  Category: Big Data

This article walks through two troubleshooting cases on a Marathon/Mesos cluster. In both, an app deployed through Marathon stayed in the Waiting state, but for different reasons.

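Problem 1

Problem description

An image deployed to one particular Agent in the Mesos cluster stayed in the Waiting state, even though the Mesos UI showed that the Agent had enough resources (4 CPUs/14 GB of memory, with only 1 CPU/256 MB of memory in use). To reproduce the problem, I deployed the 2048 image to that Agent with the following Marathon JSON file: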

{
  "id": "/2048-test",
  "cmd": null,
  "cpus": 0.01,
  "mem": 32,
  "disk": 0,
  "instances": 1,
  "constraints": [
    ["hostname", "CLUSTER", "10.140.0.15"]
  ],
  "container": {
    "type": "DOCKER",
    "volumes": [],
    "docker": {
      "image": "alexwhen/docker-2048",
      "network": "BRIDGE",
      "privileged": false,
      "parameters": [],
      "forcePullImage": false
    }
  },
  "portDefinitions": [
    {
      "port": 10008,
      "protocol": "tcp",
      "labels": {}
    }
  ]
}
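
An app definition like this one is typically submitted to Marathon's REST API with a POST to /v2/apps. The following is only a minimal sketch of that step, not the procedure actually used in this case: the Marathon address (including port 8080) and the helper names build_deploy_request and deploy are assumptions made for illustration, and the app definition is trimmed to its essential fields.

```python
import json
import urllib.request

# Assumed Marathon address; replace with the address of your own instance.
MARATHON_URL = "http://10.140.0.14:8080"

# Trimmed copy of the app definition (same id, resources and constraint).
APP = {
    "id": "/2048-test",
    "cpus": 0.01,
    "mem": 32,
    "instances": 1,
    "constraints": [["hostname", "CLUSTER", "10.140.0.15"]],
    "container": {
        "type": "DOCKER",
        "docker": {"image": "alexwhen/docker-2048", "network": "BRIDGE"},
    },
}


def build_deploy_request(marathon_url, app):
    """Build (but do not send) a POST /v2/apps request for the given app."""
    return urllib.request.Request(
        url=marathon_url.rstrip("/") + "/v2/apps",
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def deploy(marathon_url, app):
    """Send the request; this needs a reachable Marathon instance."""
    with urllib.request.urlopen(build_deploy_request(marathon_url, app), timeout=10) as resp:
        return resp.status, resp.read().decode("utf-8")
```

Calling deploy(MARATHON_URL, APP) would submit the app. Building the request separately makes it possible to inspect the URL and body without contacting the cluster.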

Environment

5 Mesos slaves / 3 Mesos masters
CentOS 64-bit
Marathon 1.0
Mesos 0.28.1

Troubleshooting process

Check the Marathon log

docker logs marathon_container
...
run_jar --task_launch_timeout 900000 --zk zk://10.140.0.14:2181/marathon --event_subscriber http_callback --https_address 10.140.0.14 --http_address 10.140.0.14 --hostname 10.140.0.14 --master zk://10.140.0.14:2181/mesos --logging_level warn
run_jar --task_launch_timeout 900000 --zk zk://10.140.0.14:2181/marathon --event_subscriber http_callback --https_address 10.140.0.14 --http_address 10.140.0.14 --hostname 10.140.0.14 --master zk://10.140.0.14:2181/mesos --logging_level warn
...

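Nothing abnormal showed up.

Check the Marathon documentation

Up to this point I still believed the problem was on the Marathon side, so I went through the Marathon docs to see whether they had a troubleshooting section for common problems.

Sure enough, they do: An app Does Not Leave “Waiting”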

This means that Marathon does not receive “Resource Offers” from Mesos that allow it to start tasks of this application. The simplest failure is that there are not sufficient resources available in the cluster or another framework hords all these resources. You can check the Mesos UI for available resources. Note that the required resources (such as CPU, Mem, Disk) have to be all available on a single host.
If you do not find the solution yourself and you create a github issue, please append the output of Mesos /state endpoint to the bug report so that we can inspect available cluster resources.

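Following that hint, I went to look at the Mesos /state output.

The Mesos state API returns a JSON document describing the current state of the whole Mesos cluster.

I formatted it in an online JSON editor and looked at the resource allocation for the Agent in question: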

 "resources": {
        "cpus": 4,
        "disk": 97267,
        "mem": 14016,
        "ports": "[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]"
      },
      "used_resources": {
        "cpus": 1,
        "disk": 0,
        "mem": 128,
        "ports": "[16957-16957]"
      },
      "offered_resources": {
        "cpus": 0,
        "disk": 0,
        "mem": 0
      },
      "reserved_resources": {
        "foo": {
          "cpus": 3,
          "disk": 0,
          "mem": 10000
        }
      },
      "unreserved_resources": {
        "cpus": 1,
        "disk": 97267,
        "mem": 4016,
        "ports": "[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]"
      }
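
The arithmetic behind this excerpt can be checked with a short script. This is a rough sketch under stated assumptions rather than part of the original investigation: the helper names are made up, the master address (port 5050, pointing at the leading master) is an assumption, and the calculation assumes that everything listed in used_resources was taken from the unreserved pool.

```python
import json
import urllib.request


def load_agents(master_url):
    """Fetch /state from the (assumed leading) master and return its agents."""
    with urllib.request.urlopen(master_url.rstrip("/") + "/state", timeout=10) as resp:
        return json.load(resp)["slaves"]


def free_unreserved(agent):
    """Estimate the unreserved resources that are still free on one agent.

    Assumption: all used resources come out of the unreserved pool.
    """
    used = agent["used_resources"]
    unreserved = agent["unreserved_resources"]
    return {name: unreserved[name] - used[name] for name in ("cpus", "mem", "disk")}


def fits(agent, cpus, mem):
    """Check whether a task with the given cpus/mem fits the free unreserved pool."""
    free = free_unreserved(agent)
    return free["cpus"] >= cpus and free["mem"] >= mem


# Values copied from the /state excerpt (ports omitted).
AGENT = {
    "resources": {"cpus": 4, "disk": 97267, "mem": 14016},
    "used_resources": {"cpus": 1, "disk": 0, "mem": 128},
    "reserved_resources": {"foo": {"cpus": 3, "disk": 0, "mem": 10000}},
    "unreserved_resources": {"cpus": 1, "disk": 97267, "mem": 4016},
}

print(free_unreserved(AGENT))           # {'cpus': 0, 'mem': 3888, 'disk': 97267}
print(fits(AGENT, cpus=0.01, mem=32))   # False
```

Even the tiny 0.01 CPU request of the 2048 test app does not fit, because the free unreserved CPU count is 0.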

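The output shows that although only 1 CPU and 128 MB of memory are in use, 3 CPUs and 10000 MB of memory are reserved for foo. Only 1 CPU is left unreserved and 1 CPU is already in use, so no CPU is available to offer. This is the root cause of Marathon being unable to deploy the container to this Mesos Agent.

Solution

The fix is simply to free up resources on this Agent:

Change the Marathon JSON files so that the apps running on this Agent are deployed to other Agents.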
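
The article does not show how the JSON files were changed. One possible way, sketched here purely as an assumption, is to replace an app's hostname constraint with an UNLIKE constraint so that Marathon places it on any Agent except this one. The helper name avoid_host and the app id /some-app are hypothetical.

```python
import copy
import re


def avoid_host(app, hostname):
    """Return a copy of app whose hostname constraints exclude hostname."""
    updated = copy.deepcopy(app)
    # Drop existing hostname constraints and keep all others.
    kept = [c for c in updated.get("constraints", []) if c[0] != "hostname"]
    # UNLIKE takes a regular expression, so escape the dots in the address.
    kept.append(["hostname", "UNLIKE", re.escape(hostname)])
    updated["constraints"] = kept
    return updated


# Hypothetical app that is currently pinned to the crowded Agent.
app = {"id": "/some-app", "constraints": [["hostname", "CLUSTER", "10.140.0.15"]]}
print(avoid_host(app, "10.140.0.15")["constraints"])
```

The function returns a modified copy, so the original definition stays available if the change needs to be rolled back.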

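Summary

When something goes wrong, check the logs first.
These are open source projects, so if the logs show nothing, browse the project documentation. Open source projects such as Marathon and Spark usually provide a troubleshooting guide.
Although Mesos and Marathon are open source, they involve a lot of moving parts. Breaking a big problem into smaller ones, or working through it in a notebook and marking the important points, both help.
The Mesos /state API is a great aid for analyzing the cluster.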

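Problem 2

Problem description

In short, a container deployed by Marathon stayed in Waiting, but this time the cause was not resources. It was the Docker image.

A colleague developed the open source project linkerConnector, whose main purpose is to read the Linux /proc directory and collect process information. To make deployment easier, I containerized this Golang project and documented how to use the container. However, deploying it to the Mesos cluster kept failing, and Marathon kept showing Waiting.

Environment

Same as Problem 1.

Troubleshooting process

Inspect the failed container

Log in to the Mesos Agent and run docker ps -a: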

b13e79caca0a        linkerrepository/linker_connector        "/bin/sh -c '/linkerC"   17 minutes ago      Created                                    mesos-c64aa327-a803-40bb-9239-91fbd

docker inspect container:

"State": {
            "Status": "Created",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2016-08-26T08:22:40.713934966Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        }
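
Checking for containers that were created but never started can also be scripted. The sketch below is an illustration, not something from the original investigation: the helper names are made up, and it assumes the JSON array that docker inspect prints, comparing State.Status case-insensitively.

```python
import json
import subprocess


def never_started(inspect_json):
    """Return the names of containers whose State.Status is "created"."""
    stuck = []
    for container in json.loads(inspect_json):
        state = container.get("State", {})
        if state.get("Status", "").lower() == "created" and not state.get("Running"):
            stuck.append(container.get("Name", container.get("Id", "")))
    return stuck


def inspect(container_ids):
    """Run docker inspect for the given IDs and return its JSON output."""
    return subprocess.run(
        ["docker", "inspect", *container_ids],
        check=True, capture_output=True, text=True,
    ).stdout
```

On an Agent with Docker installed, never_started(inspect(["b13e79caca0a"])) would report whether that container is stuck in the created state.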

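Because I had already deleted the containers that failed, the docker ps and docker inspect output in this section was adapted from an existing container, but the information matches what I saw at the time.

My own analysis

The problem went away after the project was updated and the image was rebuilt, but I did work out the cause:

The container needs to mount the host's /proc directory.
I mounted it directly with -v /proc:/proc.
The service in the container writes process information to the container's /proc directory, while the host writes to its own /proc directory. Because the container's /proc and the host's /proc were mounted onto the same path, the reads and writes conflicted, so the container kept failing to start.

Solution

Mount the host's /proc to a directory other than /proc inside the container, and pass a parameter that tells the service in the container where to read the /proc information.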
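
The idea can be sketched as follows. This is not linkerConnector's actual code or configuration: the container path /host/proc, the environment variable PROC_ROOT and both helper names are hypothetical, chosen only to show a Marathon volume entry paired with a reader that takes the proc directory as a parameter.

```python
import os

# Assumed Marathon volume entry: the host's /proc appears at /host/proc, read-only.
HOST_PROC_VOLUME = {"containerPath": "/host/proc", "hostPath": "/proc", "mode": "RO"}


def proc_root_from_env(environ=os.environ):
    """Pick the proc directory from the hypothetical PROC_ROOT variable."""
    return environ.get("PROC_ROOT", "/proc")


def list_pids(proc_root):
    """Return the PIDs found as numeric directory names under proc_root."""
    return sorted(
        int(name)
        for name in os.listdir(proc_root)
        if name.isdecimal() and os.path.isdir(os.path.join(proc_root, name))
    )
```

With PROC_ROOT=/host/proc set in the container, list_pids(proc_root_from_env()) would read the host's processes without touching the container's own /proc.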
