[Docker Basics] How Libcontainer Works

拼搏现实的明天。 2022-09-30 07:28

I. Libcontainer Overview

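  1. Libcontainer is a package for container management. It controls containers by managing `namespaces`, `cgroups`, `capabilities`, and the filesystem, and it can be used both to create containers and to manage them afterwards. `pivot_root` changes a process's root filesystem and so confines the process to the `rootfs`. If the `rootfs` lives on `ramfs`, which does not support `pivot_root`, Libcontainer instead calls `mount` with the `MS_MOVE` flag and then `chroot`.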
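The choice between `pivot_root` and the `MS_MOVE` plus `chroot` fallback can be sketched as below. This is a minimal illustration, not Libcontainer's actual code: `mountOps`, `switchRoot`, `recorder`, and `trace` are hypothetical names, the paths are made up, and the system calls are replaced by a fake that only records the call order, so the program runs without root privileges.

```go
package main

import (
	"fmt"
	"strings"
)

// Linux flag values, copied here so the sketch builds on any platform.
const (
	msMove    = 0x2000 // MS_MOVE
	mntDetach = 0x2    // MNT_DETACH
)

// mountOps is a hypothetical seam over the system calls involved. A real
// implementation would forward to syscall.PivotRoot, syscall.Mount, and so on.
type mountOps interface {
	PivotRoot(newRoot, putOld string) error
	Mount(source, target, fstype string, flags uintptr, data string) error
	Unmount(target string, flags int) error
	Chroot(path string) error
	Chdir(path string) error
}

// switchRoot confines the caller to rootfs. pivotDir is assumed to be an
// existing, writable directory inside rootfs.
func switchRoot(ops mountOps, rootfs, pivotDir string, noPivotRoot bool) error {
	if err := ops.Chdir(rootfs); err != nil {
		return err
	}
	if noPivotRoot {
		// Fallback for ramfs: move the rootfs mount onto / and chroot into it.
		if err := ops.Mount(rootfs, "/", "", msMove, ""); err != nil {
			return err
		}
		if err := ops.Chroot("."); err != nil {
			return err
		}
		return ops.Chdir("/")
	}
	// Usual path: pivot into rootfs, then detach the old root parked in pivotDir.
	if err := ops.PivotRoot(rootfs, rootfs+"/"+pivotDir); err != nil {
		return err
	}
	if err := ops.Chdir("/"); err != nil {
		return err
	}
	return ops.Unmount("/"+pivotDir, mntDetach)
}

// recorder is a fake mountOps that logs calls instead of touching the system.
type recorder struct{ calls []string }

func (r *recorder) log(format string, args ...interface{}) error {
	r.calls = append(r.calls, fmt.Sprintf(format, args...))
	return nil
}
func (r *recorder) PivotRoot(n, p string) error { return r.log("pivot_root %s %s", n, p) }
func (r *recorder) Mount(s, t, _ string, f uintptr, _ string) error {
	return r.log("mount %s %s flags=%#x", s, t, f)
}
func (r *recorder) Unmount(t string, f int) error { return r.log("umount %s flags=%#x", t, f) }
func (r *recorder) Chroot(p string) error         { return r.log("chroot %s", p) }
func (r *recorder) Chdir(p string) error          { return r.log("chdir %s", p) }

// trace runs switchRoot against the recorder and returns the call sequence.
func trace(noPivotRoot bool) string {
	r := &recorder{}
	if err := switchRoot(r, "/rootfs", ".pivot", noPivotRoot); err != nil {
		return "error: " + err.Error()
	}
	return strings.Join(r.calls, "; ")
}

func main() {
	fmt.Println(trace(false))
	fmt.Println(trace(true))
}
```

Hiding the system calls behind an interface is only a convenience for the sketch; it lets the two branches be compared side by side without mounting anything.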

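Libcontainer defines its container-management operations as a set of interfaces, covering container creation (Factory), container lifecycle management (Container), and process lifecycle management (Process).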

II. Container Startup Process

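In Libcontainer, startup proceeds as follows:

  1. `p.cmd.Start` creates the child process in new namespaces, so the child is inside them from the moment it exists. The child immediately blocks on a pipe, waiting for the parent to write.
  2. The daemon thread runs `p.manager.Apply`, which creates a new cgroup and places the child process in it.
  3. The daemon thread performs some network configuration, sends the container's configuration to the child through the pipe, and lets the child continue.
  4. The daemon thread then waits on the pipe itself; the child completes the rest of the initialization.
  5. The rootfs switch happens in `setupRootfs`. Following the config, the child first mounts the relevant host directories into the container's rootfs and mounts the required virtual filesystems. These mounts include volumes specified with `-v`, the container's cgroup information, the proc filesystem, and so on.
  6. Once the filesystem work is done, the child calls `syscall.PivotRoot` to switch the container's root filesystem to the rootfs.
  7. After setting the hostname and applying the security configuration, the child calls `syscall.Exec` to run the container's init process.
  8. The container has now been created and is running, and the parent has been notified. The daemon thread returns to Docker's function and waits for the container process to exit, which completes the sequence.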
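The pipe handshake between parent and child can be illustrated with a small simulation. This is a sketch under loud assumptions: the child is a goroutine rather than a real process in new namespaces, `initConfig`, `childInit`, and `startContainer` are hypothetical names, and the cgroup, network, and rootfs work is reduced to comments. Only the ordering (the child blocks until the parent writes the config, then the parent blocks until the child reports back) reflects the steps above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// initConfig is a hypothetical, trimmed-down stand-in for the configuration
// that the parent sends to the container's init process.
type initConfig struct {
	Hostname string   `json:"hostname"`
	Args     []string `json:"args"`
}

// childInit plays the child: it blocks until the parent has written the
// configuration, "initializes", and then reports back on the second pipe.
func childInit(fromParent, toParent *os.File) {
	defer fromParent.Close()
	defer toParent.Close()

	var cfg initConfig
	if err := json.NewDecoder(fromParent).Decode(&cfg); err != nil {
		fmt.Fprintf(toParent, "error: %v", err)
		return
	}
	// A real init would set up the rootfs, hostname, and security options
	// here and finish by exec'ing cfg.Args.
	fmt.Fprintf(toParent, "ready: hostname=%s args=%v", cfg.Hostname, cfg.Args)
}

// startContainer plays the parent (the daemon thread in the text).
func startContainer(cfg initConfig) (string, error) {
	cfgR, cfgW, err := os.Pipe()
	if err != nil {
		return "", err
	}
	ackR, ackW, err := os.Pipe()
	if err != nil {
		cfgR.Close()
		cfgW.Close()
		return "", err
	}
	defer ackR.Close()

	// "Start" the child; it immediately blocks reading from cfgR.
	go childInit(cfgR, ackW)

	// Parent-side setup (cgroups, networking) would happen here.

	// Send the configuration, which lets the child continue.
	err = json.NewEncoder(cfgW).Encode(cfg)
	cfgW.Close()
	if err != nil {
		return "", err
	}

	// Now the parent waits until the child reports and closes its end.
	msg, err := io.ReadAll(ackR)
	return string(msg), err
}

func main() {
	msg, err := startContainer(initConfig{Hostname: "demo", Args: []string{"/bin/sh"}})
	if err != nil {
		fmt.Println("start failed:", err)
		return
	}
	fmt.Println(msg) // ready: hostname=demo args=[/bin/sh]
}
```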

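III. Saving a Container Checkpoint (Checkpoint)

  1. Collect the tree formed by the process and its child processes, and freeze every process in it.
  2. Collect and save all resources used by the tasks (both processes and threads).
  3. Remove the parasite code that was injected to collect those resources, and detach from the processes.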
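Step 1 above amounts to walking a process tree and stopping everything in it. The following sketch shows only that traversal on made-up data; `collectTree` and `freezeAll` are hypothetical helpers, the PID map is invented, and "freezing" is just a flag in a map. A real checkpoint tool reads the tree from the kernel and actually stops the tasks.

```go
package main

import (
	"fmt"
	"sort"
)

// collectTree returns root followed by all of its descendants in
// breadth-first order. children maps a PID to its direct children.
func collectTree(root int, children map[int][]int) []int {
	tree := []int{root}
	for i := 0; i < len(tree); i++ {
		// Copy before sorting so the caller's map is left untouched.
		kids := append([]int(nil), children[tree[i]]...)
		sort.Ints(kids)
		tree = append(tree, kids...)
	}
	return tree
}

// freezeAll marks every PID in the tree as frozen. It stands in for really
// stopping the tasks before their resources are dumped.
func freezeAll(tree []int) map[int]bool {
	frozen := make(map[int]bool, len(tree))
	for _, pid := range tree {
		frozen[pid] = true
	}
	return frozen
}

func main() {
	// PID 99 and its child belong to an unrelated tree and must be ignored.
	children := map[int][]int{1: {12, 10}, 10: {11}, 99: {100}}

	tree := collectTree(1, children)
	frozen := freezeAll(tree)
	fmt.Println(tree, len(frozen)) // [1 10 12 11] 4
}
```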

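IV. Restoring from a Container Checkpoint (Restore)

  1. Read the snapshot files and work out which resources are shared. Resources shared by several processes are restored first; the rest are restored later, when they are needed.
  2. Use fork to recreate the whole process tree. Threads are not restored at this point; that happens in step 4.
  3. Restore the basic resources of all tasks (processes and threads), except memory mappings, timers, credentials, and threads. This step mainly opens files, prepares namespaces, and creates socket connections.
  4. Restore each process's execution context and the remaining resources, then resume the processes.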
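The "shared resources first" rule from step 1 can be pictured as a simple partition of the resources found in the snapshot. The sketch below is an illustration on invented data: `restoreOrder` is a hypothetical helper, the resource names are made up, and a resource counts as shared simply when more than one PID references it.

```go
package main

import (
	"fmt"
	"sort"
)

// restoreOrder splits resources into those used by several processes
// (restored first) and those used by a single process (restored later).
// users maps a resource name to the PIDs that reference it.
func restoreOrder(users map[string][]int) (shared, private []string) {
	for name, pids := range users {
		if len(pids) > 1 {
			shared = append(shared, name)
		} else {
			private = append(private, name)
		}
	}
	// Map iteration order is random, so sort for a stable result.
	sort.Strings(shared)
	sort.Strings(private)
	return shared, private
}

func main() {
	users := map[string][]int{
		"pipe:[42]":   {10, 11},
		"file:/tmp/a": {10},
		"sock:[7]":    {11, 12},
	}
	shared, private := restoreOrder(users)
	fmt.Println("restore first:", shared)
	fmt.Println("restore later:", private)
}
```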

V. The Config Struct

    // Config defines configuration options for executing a process inside a contained environment.
    type Config struct {
        // NoPivotRoot will use MS_MOVE and a chroot to jail the process into the container's rootfs.
        // This is a common option when the container is running in ramdisk.
        NoPivotRoot bool `json:"no_pivot_root"`

        // ParentDeathSignal specifies the signal that is sent to the container's process in the case
        // that the parent process dies.
        ParentDeathSignal int `json:"parent_death_signal"`

        // PivotDir allows a custom directory inside the container's root filesystem to be used as
        // pivot, when NoPivotRoot is not set. When a custom PivotDir is not set, a temporary dir
        // inside the root filesystem will be used. The pivot dir needs to be writable.
        // This is required when using read-only root filesystems. In these cases, a read/writable
        // path can be (bind) mounted somewhere inside the root filesystem to act as pivot.
        PivotDir string `json:"pivot_dir"`

        // Path to a directory containing the container's root filesystem.
        Rootfs string `json:"rootfs"`

        // Readonlyfs will remount the container's rootfs as readonly where only externally mounted
        // bind mounts are writable.
        Readonlyfs bool `json:"readonlyfs"`

        // Specifies the mount propagation flags to be applied to /.
        RootPropagation int `json:"rootPropagation"`

        // Mounts specify additional source and destination paths that will be mounted inside the
        // container's rootfs and mount namespace if specified.
        Mounts []*Mount `json:"mounts"`

        // The device nodes that should be automatically created within the container upon container
        // start. Note, make sure that the node is marked as allowed in the cgroup as well!
        Devices []*Device `json:"devices"`

        MountLabel string `json:"mount_label"`

        // Hostname optionally sets the container's hostname if provided.
        Hostname string `json:"hostname"`

        // Namespaces specifies the container's namespaces that it should set up when cloning the
        // init process. If a namespace is not provided, that namespace is shared from the
        // container's parent process.
        Namespaces Namespaces `json:"namespaces"`

        // Capabilities specify the capabilities to keep when executing the process inside the
        // container. All capabilities not specified will be dropped from the process's capability mask.
        Capabilities []string `json:"capabilities"`

        // Networks specifies the container's network setup to be created.
        Networks []*Network `json:"networks"`

        // Routes can be specified to create entries in the route table as the container is started.
        Routes []*Route `json:"routes"`

        // Cgroups specifies specific cgroup settings for the various subsystems that the container
        // is placed into to limit the resources the container has available.
        Cgroups *Cgroup `json:"cgroups"`

        // AppArmorProfile specifies the profile to apply to the process running in the container
        // and is changed at the time the process is execed.
        AppArmorProfile string `json:"apparmor_profile,omitempty"`

        // ProcessLabel specifies the label to apply to the process running in the container. It is
        // commonly used by SELinux.
        ProcessLabel string `json:"process_label,omitempty"`

        // Rlimits specifies the resource limits, such as max open files, to set in the container.
        // If Rlimits are not set, the container will inherit rlimits from the parent process.
        Rlimits []Rlimit `json:"rlimits,omitempty"`

        // OomScoreAdj specifies the adjustment to be made by the kernel when calculating oom scores
        // for a process. Valid values are in the range [-1000, 1000], where processes with
        // higher scores are preferred for being killed.
        // More information about kernel oom score calculation here: https://lwn.net/Articles/317814/
        OomScoreAdj int `json:"oom_score_adj"`

        // UidMappings is an array of User ID mappings for User Namespaces.
        UidMappings []IDMap `json:"uid_mappings"`

        // GidMappings is an array of Group ID mappings for User Namespaces.
        GidMappings []IDMap `json:"gid_mappings"`

        // MaskPaths specifies paths within the container's rootfs to mask over with a bind
        // mount pointing to /dev/null so as to prevent reads of the file.
        MaskPaths []string `json:"mask_paths"`

        // ReadonlyPaths specifies paths within the container's rootfs to remount as read-only
        // so that these files prevent any writes.
        ReadonlyPaths []string `json:"readonly_paths"`

        // Sysctl is a map of properties and their values. It is the equivalent of using
        // sysctl -w my.property.name value in Linux.
        Sysctl map[string]string `json:"sysctl"`

        // Seccomp allows actions to be taken whenever a syscall is made within the container.
        // A number of rules are given, each having an action to be taken if a syscall matches it.
        // A default action to be taken if no rules match is also given.
        Seccomp *Seccomp `json:"seccomp"`

        // NoNewPrivileges controls whether processes in the container can gain additional privileges.
        NoNewPrivileges bool `json:"no_new_privileges,omitempty"`

        // Hooks are a collection of actions to perform at various container lifecycle events.
        // CommandHooks are serialized to JSON, but other hooks are not.
        Hooks *Hooks

        // Version is the version of the opencontainer specification that is supported.
        Version string `json:"version"`

        // Labels are user-defined metadata that is stored in the config and populated on the state.
        Labels []string `json:"labels"`

        // NoNewKeyring will not allocate a new session keyring for the container. It will use the
        // caller's keyring in this case.
        NoNewKeyring bool `json:"no_new_keyring"`
    }
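To see how the JSON tags shape the serialized form, the sketch below copies a handful of fields from the struct above into a trimmed `Config` and encodes it. The trimmed struct, the `validate` and `encode` helpers, and the example rootfs path are assumptions made for illustration; `validate` in particular is not a Libcontainer function, it only checks the `OomScoreAdj` range stated in the field comment.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Config is a trimmed copy of the struct shown above: only a few fields,
// with the same names and JSON tags.
type Config struct {
	NoPivotRoot     bool   `json:"no_pivot_root"`
	Rootfs          string `json:"rootfs"`
	Readonlyfs      bool   `json:"readonlyfs"`
	Hostname        string `json:"hostname"`
	AppArmorProfile string `json:"apparmor_profile,omitempty"`
	OomScoreAdj     int    `json:"oom_score_adj"`
}

// validate is a hypothetical helper that applies two simple sanity checks.
func validate(c Config) error {
	if c.Rootfs == "" {
		return errors.New("rootfs must be set")
	}
	if c.OomScoreAdj < -1000 || c.OomScoreAdj > 1000 {
		return fmt.Errorf("oom_score_adj %d is outside [-1000, 1000]", c.OomScoreAdj)
	}
	return nil
}

// encode validates the config and returns its JSON form.
func encode(c Config) (string, error) {
	if err := validate(c); err != nil {
		return "", err
	}
	b, err := json.Marshal(c)
	return string(b), err
}

func main() {
	out, err := encode(Config{
		Rootfs:     "/var/lib/demo/rootfs",
		Readonlyfs: true,
		Hostname:   "demo",
	})
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println(out)
}
```

Fields tagged `omitempty`, such as `AppArmorProfile`, disappear from the output when left at their zero value, while the others are always written.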

VI. The BaseContainer Interface

    // BaseContainer is a libcontainer container object.
    //
    // Each container is thread-safe within the same process. Since a container can
    // be destroyed by a separate process, any function may return that the container
    // was not found. BaseContainer includes methods that are platform agnostic.
    type BaseContainer interface {
        // Returns the ID of the container.
        ID() string

        // Returns the current status of the container.
        //
        // errors:
        // ContainerNotExists - Container no longer exists,
        // SystemError - System error.
        Status() (Status, error)

        // State returns the current container's state information.
        //
        // errors:
        // SystemError - System error.
        State() (*State, error)

        // Returns the current config of the container.
        Config() configs.Config

        // Returns the PIDs inside this container. The PIDs are in the namespace of the calling process.
        //
        // errors:
        // ContainerNotExists - Container no longer exists,
        // SystemError - System error.
        //
        // Some of the returned PIDs may no longer refer to processes in the Container, unless
        // the Container state is PAUSED in which case every PID in the slice is valid.
        Processes() ([]int, error)

        // Returns statistics for the container.
        //
        // errors:
        // ContainerNotExists - Container no longer exists,
        // SystemError - System error.
        Stats() (*Stats, error)

        // Set resources of container as configured.
        //
        // We can use this to change resources when containers are running.
        //
        // errors:
        // SystemError - System error.
        Set(config configs.Config) error

        // Start a process inside the container. Returns error if process fails to
        // start. You can track process lifecycle with passed Process structure.
        //
        // errors:
        // ContainerNotExists - Container no longer exists,
        // ConfigInvalid - config is invalid,
        // ContainerPaused - Container is paused,
        // SystemError - System error.
        Start(process *Process) (err error)

        // Run immediately starts the process inside the container. Returns error if process
        // fails to start. It does not block waiting for the exec fifo after start returns but
        // opens the fifo after start returns.
        //
        // errors:
        // ContainerNotExists - Container no longer exists,
        // ConfigInvalid - config is invalid,
        // ContainerPaused - Container is paused,
        // SystemError - System error.
        Run(process *Process) (err error)

        // Destroys the container after killing all running processes.
        //
        // Any event registrations are removed before the container is destroyed.
        // No error is returned if the container is already destroyed.
        //
        // errors:
        // SystemError - System error.
        Destroy() error

        // Signal sends the provided signal code to the container's initial process.
        //
        // errors:
        // SystemError - System error.
        Signal(s os.Signal) error

        // Exec signals the container to exec the user's process at the end of the init.
        //
        // errors:
        // SystemError - System error.
        Exec() error
    }
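The error contract documented in the interface comments (a paused container refuses `Start`, a destroyed container reports that it no longer exists, and `Destroy` is safe to call twice) can be exercised with an in-memory fake. Everything in this sketch is hypothetical: `miniContainer` is a cut-down subset of the interface, the `Status` values and sentinel errors are invented for the example, `Start` takes plain arguments instead of a `*Process`, and `Pause` exists only to drive the fake into the paused state.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Status and its values are hypothetical, chosen for this sketch only.
type Status int

const (
	Created Status = iota
	Running
	Paused
	Destroyed
)

// Sentinel errors mirroring the error names in the interface comments.
var (
	ErrContainerNotExists = errors.New("container no longer exists")
	ErrContainerPaused    = errors.New("container is paused")
	ErrConfigInvalid      = errors.New("config is invalid")
)

// miniContainer is a cut-down, hypothetical subset of BaseContainer.
type miniContainer interface {
	ID() string
	Status() (Status, error)
	Start(args []string) error
	Destroy() error
}

// fakeContainer keeps its state in memory and guards it with a mutex,
// following the "thread-safe within the same process" note.
type fakeContainer struct {
	mu     sync.Mutex
	id     string
	status Status
}

var _ miniContainer = (*fakeContainer)(nil)

func (c *fakeContainer) ID() string { return c.id }

func (c *fakeContainer) Status() (Status, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.status == Destroyed {
		return Destroyed, ErrContainerNotExists
	}
	return c.status, nil
}

func (c *fakeContainer) Start(args []string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	switch {
	case c.status == Destroyed:
		return ErrContainerNotExists
	case c.status == Paused:
		return ErrContainerPaused
	case len(args) == 0:
		return ErrConfigInvalid
	}
	c.status = Running
	return nil
}

// Pause is a helper for the sketch; it is not part of miniContainer.
func (c *fakeContainer) Pause() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.status == Running {
		c.status = Paused
	}
}

// Destroy is idempotent: destroying twice returns no error.
func (c *fakeContainer) Destroy() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.status = Destroyed
	return nil
}

func main() {
	c := &fakeContainer{id: "demo"}
	fmt.Println(c.Start([]string{"/bin/sh"})) // <nil>
	c.Pause()
	fmt.Println(c.Start([]string{"/bin/ls"})) // container is paused
	fmt.Println(c.Destroy(), c.Destroy())     // <nil> <nil>
	fmt.Println(c.Start([]string{"/bin/ls"})) // container no longer exists
}
```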

VII. The Factory Interface

The Factory object provides a set of abstract interfaces for creating and initializing containers.

    type Factory interface {
        // Creates a new container with the given id and starts the initial process inside it.
        // id must be a string containing only letters, digits and underscores and must contain
        // between 1 and 1024 characters, inclusive.
        //
        // The id must not already be in use by an existing container. Containers created using
        // a factory with the same path (and file system) must have distinct ids.
        //
        // Returns the new container with a running process.
        //
        // errors:
        // IdInUse - id is already in use by a container
        // InvalidIdFormat - id has incorrect format
        // ConfigInvalid - config is invalid
        // SystemError - System error
        //
        // On error, any partially created container parts are cleaned up (the operation is atomic).
        Create(id string, config *configs.Config) (Container, error)

        // Load takes an ID for an existing container and returns the container information
        // from the state. This presents a read only view of the container.
        //
        // errors:
        // Path does not exist
        // Container is stopped
        // System error
        Load(id string) (Container, error)

        // StartInitialization is an internal API to libcontainer used during the reexec of the
        // container.
        //
        // Errors:
        // Pipe connection error
        // System error
        StartInitialization() error

        // Type returns info string about factory type (e.g. lxc, libcontainer...)
        Type() string
    }
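The id rules in the `Create` comment (letters, digits, and underscores only, 1 to 1024 characters, unique per factory) can be checked as in the sketch below. It is an in-memory illustration, not Libcontainer's factory: `memFactory`, `validateID`, and the two sentinel errors are hypothetical names, `Create` and `Load` are reduced to bookkeeping, and "letters" is assumed to mean ASCII letters.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// maxIDLen is the upper bound stated in the Create comment.
const maxIDLen = 1024

// idPattern accepts ASCII letters, digits, and underscores. The length is
// checked separately because Go's regexp caps repeat counts at 1000.
var idPattern = regexp.MustCompile(`^[A-Za-z0-9_]+$`)

// Sentinel errors mirroring the error names in the interface comments.
var (
	ErrInvalidIDFormat = errors.New("id has incorrect format")
	ErrIDInUse         = errors.New("id is already in use by a container")
	ErrNotFound        = errors.New("container does not exist")
)

// validateID applies the format rule only; uniqueness is the factory's job.
func validateID(id string) error {
	if len(id) > maxIDLen || !idPattern.MatchString(id) {
		return ErrInvalidIDFormat
	}
	return nil
}

// memFactory is a hypothetical factory that only tracks which ids exist.
type memFactory struct {
	ids map[string]bool
}

func newMemFactory() *memFactory {
	return &memFactory{ids: make(map[string]bool)}
}

// Create registers a new id, rejecting malformed or duplicate ones.
func (f *memFactory) Create(id string) error {
	if err := validateID(id); err != nil {
		return err
	}
	if f.ids[id] {
		return ErrIDInUse
	}
	f.ids[id] = true
	return nil
}

// Load reports whether a container with this id was created earlier.
func (f *memFactory) Load(id string) error {
	if !f.ids[id] {
		return ErrNotFound
	}
	return nil
}

func main() {
	f := newMemFactory()
	fmt.Println(f.Create("web_1")) // <nil>
	fmt.Println(f.Create("web_1")) // id is already in use by a container
	fmt.Println(f.Create("web-1")) // id has incorrect format
	fmt.Println(f.Load("db_1"))    // container does not exist
}
```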
