input/howto/install-ganeti.creole

== How To Install Ganeti Clusters and Instances ==

=== suppositions ===

Suppose that there are two identical hosts: foo.debian.org and bar.debian.org.

They are running stable and have been integrated into Debian infrastructure.

They will serve as nodes in a ganeti cluster named foobar.debian.org.

They have a RAID1 array exposing three partitions: c0d0p1 for /, c0d0p2 for
swap and c0d0p3 for lvm volume groups to be used by ganeti via drbd.

They have two network interfaces: eth0 (public) and eth1 (private).

The public network is A.B.C.0/24 with gateway A.B.C.254.

The private network is E.F.G.0/24 with no gateway.

Suppose that the first instance to be hosted on foobar.debian.org is
qux.debian.org.

The following DNS records exist:

{{{
    foobar.debian.org.                  IN A   A.B.C.1
    foo.debian.org.                     IN A   A.B.C.2
    bar.debian.org.                     IN A   A.B.C.3
    qux.debian.org.                     IN A   A.B.C.4
    foo.debprivate-hoster.debian.org.   IN A   E.F.G.2
    bar.debprivate-hoster.debian.org.   IN A   E.F.G.3
}}}

=== install required packages ===

On each node, install the required packages:

{{{
    # maybe: apt-get install drbd-utils
    # maybe: apt-get install ganeti-instance-debootstrap
    apt-get install ganeti2 ganeti-htools qemu-kvm
}}}

=== configure kernel modules ===

On each node, ensure that the required kernel modules are loaded at boot:

{{{
    ainsl /etc/modules 'drbd minor_count=255 usermode_helper=/bin/true'
    ainsl /etc/modules 'hmac'
    ainsl /etc/modules 'tun'
    ainsl /etc/modules 'ext3'
    ainsl /etc/modules 'ext4'
}}}
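Where {{{ainsl}}} is not available, the same "append if no such line" behaviour can be approximated in plain shell. The sketch below is only a stand-in: the helper name {{{append_if_missing}}} is made up for this example, it matches fixed strings rather than patterns, and it works on a scratch file instead of the real /etc/modules.

```shell
#!/bin/sh
# append_if_missing FILE LINE
# Hypothetical helper: append LINE to FILE unless an identical line is
# already present. A rough stand-in for ainsl, not a reimplementation.
append_if_missing() {
    file="$1"
    line="$2"
    # -q: quiet, -x: match whole lines only, -F: fixed string, not a regex
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Demonstration on a scratch file rather than /etc/modules.
modules_file=$(mktemp)
append_if_missing "$modules_file" 'tun'
append_if_missing "$modules_file" 'tun'    # second call changes nothing
append_if_missing "$modules_file" 'drbd minor_count=255 usermode_helper=/bin/true'
cat "$modules_file"
```

Running the helper repeatedly is harmless, which is the property that makes it safe to use when re-running the node setup.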

=== configure networking ===

On each node, ensure that br0 (not eth0) and eth1 are configured.

The bridge interface, br0, is used by the guest virtual machines to reach the
public network.

If the guest virtual machines need to access the private network, then br1
should be configured rather than eth1.

To prevent the bridge's link address from changing when virtual machines
start up or shut down, set it explicitly (the {{{up ip link set addr}}} line below).

This is the interfaces file for foo.debian.org:

{{{
    auto br0
    iface br0 inet static
      bridge_ports eth0
      bridge_maxwait 0
      bridge_fd 0
      address A.B.C.2
      netmask 255.255.255.0
      gateway A.B.C.254
      up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

    auto eth1
    iface eth1 inet static
      address E.F.G.2
      netmask 255.255.255.0
}}}

This is the interfaces file for bar.debian.org:

{{{
    auto br0
    iface br0 inet static
      bridge_ports eth0
      bridge_maxwait 0
      bridge_fd 0
      address A.B.C.3
      netmask 255.255.255.0
      gateway A.B.C.254
      up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

    auto eth1
    iface eth1 inet static
      address E.F.G.3
      netmask 255.255.255.0
}}}

=== configure lvm ===

On each node, configure lvm to ignore drbd devices and to prefer
{{{/dev/cciss}}} device names over {{{/dev/block}}} device names
([[https://code.google.com/p/ganeti/issues/detail?id=93|why?]]).
The reject pattern for drbd devices is listed first because lvm applies the
first filter pattern that matches a device:

{{{
    ssed -i \
      -e 's#^\(\s*filter\s\).*#\1= [ "r|/dev/drbd[0-9]+|", "a|.*|" ]#' \
      -e 's#^\(\s*preferred_names\s\).*#\1= [ "^/dev/dm-*/", "^/dev/cciss/" ]#' \
      /etc/lvm/lvm.conf
    service lvm2 restart
}}}
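To see what such an edit does before touching the real configuration, the two substitutions can be tried on a scratch file. This is a minimal sketch under several assumptions: it uses GNU sed (needed for {{{\s}}}), the sample file content is invented, and the reject pattern is placed before the accept-all pattern on the understanding that lvm uses the first filter pattern that matches a device.

```shell
#!/bin/sh
# Try the filter/preferred_names substitutions on a scratch copy,
# not on /etc/lvm/lvm.conf. The sample content below is made up.
conf=$(mktemp)
cat > "$conf" <<'EOF'
devices {
    # filter = [ "a|.*|" ]
    filter = [ "a/.*/" ]
    preferred_names = [ ]
}
EOF

# Only uncommented "filter" and "preferred_names" lines are rewritten,
# because the patterns allow nothing but whitespace before the keyword.
sed -i \
  -e 's#^\(\s*filter\s\).*#\1= [ "r|/dev/drbd[0-9]+|", "a|.*|" ]#' \
  -e 's#^\(\s*preferred_names\s\).*#\1= [ "^/dev/dm-*/", "^/dev/cciss/" ]#' \
  "$conf"

cat "$conf"
```

The commented-out {{{# filter}}} line is left alone, so a stock lvm.conf keeps its documentation comments after the edit.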

=== create lvm volume groups ===

On each node, create a volume group:

{{{
    vgcreate vg_ganeti /dev/cciss/c0d0p3
}}}

=== exchange ssh keys ===

On each node, make the host's RSA key available as root's ssh identity:

{{{
    mkdir -m 0700 -p /root/.ssh &&
    ln -s /etc/ssh/ssh_host_rsa_key /root/.ssh/id_rsa
}}}

=== configure iptables (via ferm) ===

The nodes must connect to each other over the public and private networks for a number of reasons; see the ganeti2 module in puppet.

=== instantiate the cluster ===

On the master node (foo) only:

{{{
    gnt-cluster init \
      --master-netdev br0 \
      --vg-name vg_ganeti \
      --secondary-ip E.F.G.2 \
      --enabled-hypervisors kvm \
      --nic-parameters link=br0 \
      --mac-prefix 00:16:37 \
      --no-ssh-init \
      --no-etc-hosts \
      --hypervisor-parameters kvm:initrd_path=,kernel_path= \
      foobar.debian.org
}}}

Note the following:

* the master network device is set to br0, matching the public network bridge interface created above
* the volume group is set to vg_ganeti, matching the volume group created above
* the secondary IP address is set to the value of the master node's interface on the private network
* the nic parameters for instances are set to use br0 as the default bridge
* the MAC prefix is registered in the dsa-kvm git repo

=== add slave nodes ===

For each slave node (only bar for this example):

On the slave, append the master's /etc/ssh/ssh_host_rsa_key.pub to
/etc/ssh/userkeys/root.  This is only required temporarily: once
everything works, puppet will put it there and keep it there.

On the master node (foo):

{{{
    gnt-node add \
      --secondary-ip E.F.G.3 \
      --no-ssh-key-check \
      --no-node-setup \
      bar.debian.org
}}}

Note the following:

* the secondary IP address is set to the value of the slave node's interface on the private network

Further cluster settings:

{{{
    gnt-cluster modify --reserved-lvs='vg0/local-swap.*'
    # maybe: gnt-cluster modify --nic-parameters mode=openvswitch
}}}

=== verify cluster ===

On the master node (foo):

{{{
    gnt-cluster verify
}}}

If everything has been configured correctly, no errors should be reported.

=== create the 'noop' variant ===

Ensure that the ganeti-os-noop package is installed.

----

== How To Install Ganeti Instances ==

Suppose that qux.debian.org will be an instance (a virtual machine) hosted on
the foobar.debian.org ganeti cluster.

Before adding the instance, an LDAP entry must be created so that an A record
for the instance (A.B.C.4) exists.

=== create the instance ===

On the master node (foo):

{{{
    gnt-instance add \
      --node foo:bar \
      --disk-template drbd \
      --os-size 4GiB \
      --os-type debootstrap+dsa \
      --hypervisor-parameters kvm:initrd_path=,kernel_path= \
      --net 0:ip=A.B.C.4 \
      qux.debian.org
}}}

Note the following:

* the primary and secondary nodes have been explicitly set
* the operating system type is 'debootstrap+dsa'
* the network interface 0 (eth0 on the system) is set to the instance's interface on the public network
* If qux.debian.org does not yet exist in DNS/LDAP, you may need --no-ip-check --no-name-check.  Be careful that the hostname and IP address are not taken already!

----

== Variations ==

If the instances require access to the private network, then there are two modifications necessary.

=== re-configure networking ===

On the nodes, ensure that br1 is configured (rather than eth1).

This is the interfaces file for foo.debian.org:

{{{
    auto br0
    iface br0 inet static
      bridge_ports eth0
      bridge_maxwait 0
      bridge_fd 0
      address A.B.C.2
      netmask 255.255.255.0
      gateway A.B.C.254
      up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

    auto br1
    iface br1 inet static
      bridge_ports eth1
      bridge_maxwait 0
      bridge_fd 0
      address E.F.G.2
      netmask 255.255.255.0
      up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE
}}}

This is the interfaces file for bar.debian.org:

{{{
    auto br0
    iface br0 inet static
      bridge_ports eth0
      bridge_maxwait 0
      bridge_fd 0
      address A.B.C.3
      netmask 255.255.255.0
      gateway A.B.C.254
      up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

    auto br1
    iface br1 inet static
      bridge_ports eth1
      bridge_maxwait 0
      bridge_fd 0
      address E.F.G.3
      netmask 255.255.255.0
      up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE
}}}

=== create or update the instance ===

When creating the instance, indicate both networks:

{{{
    gnt-instance add \
      --node foo:bar \
      --disk-template drbd \
      --os-size 4GiB \
      --os-type debootstrap+dsa \
      --hypervisor-parameters kvm:initrd_path=,kernel_path= \
      --net 0:ip=A.B.C.4 \
      --net 1:link=br1,ip=E.F.G.4 \
      qux.debian.org
}}}

* If qux.debian.org does not yet exist in DNS/LDAP, you may need --no-ip-check --no-name-check.  Be careful that the hostname and IP address are not taken already!

When updating an existing instance, add the interface:

{{{
    gnt-instance shutdown qux.debian.org
    gnt-instance modify \
      --net add:link=br1,ip=E.F.G.4 \
      qux.debian.org
    gnt-instance startup qux.debian.org
}}}

Please note that the hook scripts are run only at instance instantiation.  When
adding interfaces to an instance, the guest operating system must be updated
manually.

* If you are importing an instance from libvirt with an LVM setup, you can adopt its LVs:

{{{
    gnt-instance add -t plain --os-type debootstrap+dsa-wheezy \
      --disk 0:adopt=lully-boot \
      --disk 1:adopt=lully-root \
      --disk 2:adopt=lully-swap \
      --disk 3:adopt=lully-log \
      --hypervisor-parameters kvm:initrd_path=,kernel_path= \
      --net 0:ip=82.195.75.99 -n clementi.debian.org lully.debian.org
}}}

If you then want to convert it to DRBD, do so and start it on the other cluster node, so that we can ensure DRBD is working correctly:

{{{
    gnt-instance shutdown lully.debian.org
    gnt-instance modify -t drbd -n czerny.debian.org lully.debian.org
    gnt-instance failover lully.debian.org
    gnt-instance startup lully.debian.org
}}}

* Some instances need ide instead of virtio:

{{{
    gnt-instance modify --hypervisor-parameters disk_type=ide fils.debian.org
}}}

* To import instances with SAN volumes:

{{{
    gnt-instance add -t blockdev --os-type debootstrap+dsa \
      --disk 0:adopt=/dev/disk/by-id/scsi-reger-boot \
      --disk 1:adopt=/dev/disk/by-id/scsi-reger-root \
      --hypervisor-parameters kvm:initrd_path=,kernel_path= \
      --net 0:ip=206.12.19.124 -n rossini.debian.org reger.debian.org
}}}

* How to add new LUNs to Bytemark Cluster

** Add new LUN to MSA and export to all blades

{{{
Log into the MSA controller.

Choose which vdisk to use; use "show vdisks" to list them.

Add the volume:

  # create volume vdisk msa2k-2-500gr10 size 5G donizetti

Find a free LUN:

  # show lun-maps
  or (if we assume they are all the same)
  # show host-maps 3001438001287090

Make a note of the next free LUN.

Generate map commands for all blades, all ports, run locally:

  $ for bl in 1 2 3 4 5 6 ; do for p in 1 2 3 4; do echo "map volume donizetti lun 27 host bm-bl$bl-p$p" ; done ; done

Paste the output into the MSA shell.

Find the WWN by doing show host-maps and looking for the volume name.
Transform it using the sed run at the top of /etc/multipath.conf:

  echo "$WWN" | sed -re 's#(.{6})(.{6})0000(.{2})(.*)#36\1000\2\3\4#'
}}}

{{{
Then rescan on all nodes:

  gnt-cluster command "echo 1 > /sys/bus/pci/devices/0000:0e:00.0/cciss0/rescan"

Then reload multipath-tools on the gnt-master (normally bm-bl1):

  service multipath-tools reload

Add the WWNs to dsa-puppet/modules/multipath/files/multipath-bm.conf,
define the alias, and commit that file to git.

Then:

  gnt-cluster command "puppet agent -t"

This will update the multipath config on all cluster nodes. Without doing
this, you can't migrate VMs between nodes.
}}}
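The WWN transformation in the procedure above can be wrapped in a small function so that it is easy to try on a value before editing the multipath configuration. This is only a sketch: the function name {{{wwn_to_multipath_id}}} and the sample WWN are made up for illustration, and {{{sed -E}}} is used in place of {{{sed -r}}} for the same extended-regex behaviour.

```shell
#!/bin/sh
# wwn_to_multipath_id WWN
# Hypothetical wrapper around the sed expression quoted above: it keeps the
# first two 6-character groups, drops the "0000" that follows them, and
# prefixes the result with "36" plus a "000" separator.
wwn_to_multipath_id() {
    printf '%s\n' "$1" |
        sed -E 's#(.{6})(.{6})0000(.{2})(.*)#36\1000\2\3\4#'
}

# Made-up WWN, for illustration only.
wwn_to_multipath_id 00c0ffd7e8a30000a1b2c3d4e5f6
```

For the invented input above this prints {{{3600c0ff000d7e8a3a1b2c3d4e5f6}}}; a real WWN taken from {{{show host-maps}}} would be passed in the same way.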

** Remove LUNs.

Order is important, or else things get very, very confused and the world needs a reboot.

*** Make sure nothing uses the volume anymore.

*** Make sure we do not have any partitions lying around for it:
{{{
  gnt-cluster command "ls -l /dev/mapper/backuphost*"
  # and maybe:
  gnt-cluster command "kpartx -v -p -part -d /dev/mapper/backuphost"
}}}

*** Flush the device, remove the multipath mapping, flush all backing devices:
{{{
  root@bm-bl1:~# cat flush-mp-device
  #!/bin/sh

  dev="$1"

  if [ -z "$dev" ] || ! [ -e "$dev" ]; then
    echo >&2 "Device $dev does not exist."
    exit 1
  fi

  devs=$(multipath -ll "$dev" | grep cciss | sed -e 's/.*cciss!//; s/ .*//;')

  if [ -z "$devs" ]; then
    echo >&2 "No backends found for $dev."
    exit 1
  fi

  set -e

  blockdev --flushbufs "$dev"
  multipath -f "$dev"
  for d in $devs; do
    blockdev --flushbufs "/dev/cciss/$d"
  done
  echo done.
}}}
{{{
  gnt-cluster command "/root/flush-mp-device /dev/mapper/backuphost"
}}}

*** Immediately afterwards, paste the output of the following into the MSA console.  Prepare it beforehand so that you can do it quickly, before anything else rescans, reloads or restarts multipathd and the devices become used again.
{{{
  for bl in 1 2 3 4 5 6 ; do for p in 1 2 3 4; do echo "unmap volume DECOMMISSION-backuph host bm-bl$bl-p$p" ; done ; done
}}}

*** Lastly, rescan the scsi bus on all hosts.  Do not forget this: hpacucli and the monitoring tools might lock up the machine if they try to check the status of a device that no longer exists but that the system still thinks is around.
{{{
  gnt-cluster command "echo 1 > /sys/bus/pci/devices/0000:0e:00.0/cciss0/rescan"
}}}
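The map and unmap command lists used in the two LUN procedures differ only in the verb and the LUN number, so they can be produced by one helper and prepared ahead of time. A minimal sketch, assuming the same six blades with four ports each as in the loops above; the function name {{{msa_host_commands}}} is made up for this example.

```shell
#!/bin/sh
# msa_host_commands ACTION VOLUME [LUN]
# Hypothetical helper: print one MSA command per blade port (6 blades x 4
# ports, as in the loops above). ACTION is "map" (needs LUN) or "unmap".
msa_host_commands() {
    action="$1"
    volume="$2"
    lun="$3"
    for bl in 1 2 3 4 5 6; do
        for p in 1 2 3 4; do
            if [ "$action" = map ]; then
                echo "map volume $volume lun $lun host bm-bl$bl-p$p"
            else
                echo "unmap volume $volume host bm-bl$bl-p$p"
            fi
        done
    done
}

# Prepare both lists before starting, then paste them when needed.
msa_host_commands map donizetti 27
msa_host_commands unmap DECOMMISSION-backuph
```

Generating the unmap list before flushing the device keeps the time between the flush and the unmap as short as possible.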


=== DRBD optimization ===

The default DRBD parameters are not tuned for speed, which makes (re)syncing very slow. The
following commands might help to make it faster. The maximum rate can be increased further if
both the network and disk speed allow it.

{{{
    gnt-cluster modify -D drbd:net-custom="--max-buffers 36k --sndbuf-size 1024k --rcvbuf-size 2048k"
    gnt-cluster modify -D drbd:c-min-rate=32768
    gnt-cluster modify -D drbd:c-max-rate=98304
    gnt-cluster modify -D drbd:resync-rate=98304
}}}
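When picking other rates it can help to think in MiB/s and convert. The sketch below assumes that these rate parameters are expressed in KiB/s, which is how the values above read (32768 = 32 MiB/s, 98304 = 96 MiB/s); the helper name {{{mib_to_kib}}} is made up for this example.

```shell
#!/bin/sh
# mib_to_kib N
# Hypothetical helper: convert a rate in MiB/s to KiB/s, on the assumption
# that the drbd rate parameters are given in KiB/s.
mib_to_kib() {
    echo $(( $1 * 1024 ))
}

# Reproduce the values used above from their MiB/s equivalents.
echo "c-min-rate=$(mib_to_kib 32)"
echo "c-max-rate=$(mib_to_kib 96)"
echo "resync-rate=$(mib_to_kib 96)"
```

A link that sustains roughly 96 MiB/s is about what a dedicated gigabit connection can carry, so faster replication networks leave room for larger values.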


=== Change the disk cache ===

When using raw volumes or partitions, it is best to avoid the host cache completely to reduce data copies
and bus traffic. This can be done using:

{{{
    gnt-cluster modify -H kvm:disk_cache=none
}}}


=== Change the CPU type ===

Modern processors come with a wide variety of additional instruction sets (SSE, AES-NI, etc.) which vary from processor to processor, but can greatly improve performance depending on the workload. Ganeti and QEMU default to a compatible subset of CPU features called qemu64, so that if the host processor is changed, or a live migration is performed, the guest will see its CPU features unchanged. This is great for compatibility but comes at a performance cost.

The CPU presented to the guests can easily be changed, using the cpu_type option in the Ganeti hypervisor options. However, to still be able to live-migrate VMs from one host to another, the CPU presented to the guest should be the common denominator of all hosts in the cluster. Otherwise a live migration between two different CPU types could crash the instance.

For homogeneous clusters it is possible to use the host cpu type:

{{{
    gnt-cluster modify -H kvm:cpu_type='host'
}}}

Otherwise, QEMU provides a set of generic CPU models for each generation, which can be listed as follows:

{{{
$ qemu-system-x86_64 -cpu ?

x86           qemu64  QEMU Virtual CPU version 2.1.2
x86           phenom  AMD Phenom(tm) 9550 Quad-Core Processor
x86         core2duo  Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz
x86            kvm64  Common KVM processor
x86           qemu32  QEMU Virtual CPU version 2.1.2
x86            kvm32  Common 32-bit KVM processor
x86          coreduo  Genuine Intel(R) CPU           T2600  @ 2.16GHz
x86              486
x86          pentium
x86         pentium2
x86         pentium3
x86           athlon  QEMU Virtual CPU version 2.1.2
x86             n270  Intel(R) Atom(TM) CPU N270   @ 1.60GHz
x86           Conroe  Intel Celeron_4x0 (Conroe/Merom Class Core 2)
x86           Penryn  Intel Core 2 Duo P9xxx (Penryn Class Core 2)
x86          Nehalem  Intel Core i7 9xx (Nehalem Class Core i7)
x86         Westmere  Westmere E56xx/L56xx/X56xx (Nehalem-C)
x86      SandyBridge  Intel Xeon E312xx (Sandy Bridge)
x86          Haswell  Intel Core Processor (Haswell)
x86        Broadwell  Intel Core Processor (Broadwell)
x86       Opteron_G1  AMD Opteron 240 (Gen 1 Class Opteron)
x86       Opteron_G2  AMD Opteron 22xx (Gen 2 Class Opteron)
x86       Opteron_G3  AMD Opteron 23xx (Gen 3 Class Opteron)
x86       Opteron_G4  AMD Opteron 62xx class CPU
x86       Opteron_G5  AMD Opteron 63xx class CPU
x86             host  KVM processor with all supported host features (only available in KVM mode)

Recognized CPUID flags:
  pbe ia64 tm ht ss sse2 sse fxsr mmx acpi ds clflush pn pse36 pat cmov mca pge mtrr sep apic cx8 mce pae msr tsc pse de vme fpu
  hypervisor rdrand f16c avx osxsave xsave aes tsc-deadline popcnt movbe x2apic sse4.2|sse4_2 sse4.1|sse4_1 dca pcid pdcm xtpr cx16 fma cid ssse3 tm2 est smx vmx ds_cpl monitor dtes64 pclmulqdq|pclmuldq pni|sse3
  smap adx rdseed rtm invpcid erms bmi2 smep avx2 hle bmi1 fsgsbase
  3dnow 3dnowext lm|i64 rdtscp pdpe1gb fxsr_opt|ffxsr mmxext nx|xd syscall
  perfctr_nb perfctr_core topoext tbm nodeid_msr tce fma4 lwp wdt skinit xop ibs osvw 3dnowprefetch misalignsse sse4a abm cr8legacy extapic svm cmp_legacy lahf_lm
  invtsc
  pmm-en pmm phe-en phe ace2-en ace2 xcrypt-en xcrypt xstore-en xstore
  kvmclock-stable-bit kvm_pv_unhalt kvm_pv_eoi kvm_steal_time kvm_asyncpf kvmclock kvm_mmu kvm_nopiodelay kvmclock
  pfthreshold pause_filter decodeassists flushbyasid vmcb_clean tsc_scale nrip_save svm_lock lbrv npt
}}}

For example, on a cluster using both Sandy Bridge and Haswell CPUs, the following command can be used:
{{{
    gnt-cluster modify -H kvm:cpu_type='SandyBridge'
}}}
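The "common denominator" idea can be illustrated by intersecting the CPU flag lists of two hosts. The sketch below is only an illustration: the function name {{{common_flags}}} and the two shortened flag lists are made up, and on real nodes the lists would come from the {{{flags}}} line of /proc/cpuinfo.

```shell
#!/bin/sh
# common_flags FLAGS_A FLAGS_B
# Hypothetical helper: print, one per line, the flags present in both
# space-separated lists.
common_flags() {
    list_a=$(mktemp)
    list_b=$(mktemp)
    # comm needs sorted input; LC_ALL=C keeps sort and comm in agreement.
    printf '%s\n' "$1" | tr -s ' ' '\n' | LC_ALL=C sort -u > "$list_a"
    printf '%s\n' "$2" | tr -s ' ' '\n' | LC_ALL=C sort -u > "$list_b"
    # -12 suppresses lines unique to either file, leaving the intersection.
    LC_ALL=C comm -12 "$list_a" "$list_b"
    rm -f "$list_a" "$list_b"
}

# Made-up, heavily shortened flag lists for two hosts.
older_host='sse4_2 aes avx'
newer_host='sse4_2 aes avx avx2 bmi2'
common_flags "$older_host" "$newer_host"
```

Flags that appear only on the newer host (here avx2 and bmi2) are exactly the ones a guest must not rely on if it is to migrate to the older host.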

Here is a typical improvement one can get on the AES openssl benchmarks.

With the default qemu64 CPU type:
{{{
  type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
  aes-128-cbc     175481.21k   195151.55k   199307.09k   201209.51k   201359.36k
  aes-128-gcm      49971.64k    57688.17k   135092.14k   144172.37k   146511.19k
  aes-256-cbc     130209.34k   141268.76k   142547.54k   144185.00k   144777.22k
  aes-256-gcm      39249.19k    44492.61k   114492.76k   123000.83k   125501.44k
}}}

With the SandyBridge CPU type:
{{{
  type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
  aes-128-cbc     376040.16k   477377.32k   484083.37k   391323.31k   389589.67k
  aes-128-gcm     215921.26k   592407.87k   777246.21k   836795.39k   835971.75k
  aes-256-cbc     309840.39k   328612.18k   330784.68k   324245.16k   328116.91k
  aes-256-gcm     160820.14k   424322.20k   557212.50k   599435.61k   610459.65k
}}}
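To put a number on the improvement, the 8192-byte columns of the two tables can be divided row by row. This is a small worked example using the figures above; the helper name {{{speedup}}} is made up for it.

```shell
#!/bin/sh
# speedup BASELINE IMPROVED
# Hypothetical helper: print IMPROVED / BASELINE rounded to two decimals.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", new / base }'
}

# 8192-byte throughput: qemu64 value first, SandyBridge value second.
echo "aes-128-cbc $(speedup 201359.36 389589.67)"
echo "aes-128-gcm $(speedup 146511.19 835971.75)"
echo "aes-256-cbc $(speedup 144777.22 328116.91)"
echo "aes-256-gcm $(speedup 125501.44 610459.65)"
```

On these figures the CBC modes roughly double in throughput, while the GCM modes gain a factor of about five to six.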

=== Add a virtio-rng device ===

VirtIO RNG (random number generator) is a paravirtualized device that is exposed to the guest as a regular hardware RNG device, which the kernel reads from to fill its entropy pool. Unfortunately Ganeti does not support it natively, so the kvm_extra option should be used. Ganeti forces the allocation of PCI devices to specific slots, which means that the QEMU autoallocation cannot be used and an explicit PCI slot has to be provided. There are 32 possible slots on the default QEMU machine, so one of the last ones can be used, for example 0x1e.

The final command to add a virtio-rng device cluster-wide is therefore:
{{{
    gnt-cluster modify -H kvm:kvm_extra="-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000"
}}}

The max-bytes and period options limit the entropy rate a guest can get to 1kB/s (1024 bytes per 1000 ms period).
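
If a different slot or rate limit is wanted, the kvm_extra value can be assembled from its parts so that the escaped commas are not mistyped. A minimal sketch, assuming the period is given in milliseconds; the function names {{{virtio_rng_extra}}} and {{{rng_bytes_per_second}}} are made up for this example.

```shell
#!/bin/sh
# virtio_rng_extra ADDR MAX_BYTES PERIOD_MS
# Hypothetical helper: print a kvm_extra value for a virtio-rng device with
# the commas backslash-escaped, as in the command above.
virtio_rng_extra() {
    printf '%s\n' "-device virtio-rng-pci\\,bus=pci.0\\,addr=$1\\,max-bytes=$2\\,period=$3"
}

# rng_bytes_per_second MAX_BYTES PERIOD_MS
# Resulting rate limit in bytes per second (integer arithmetic).
rng_bytes_per_second() {
    echo $(( $1 * 1000 / $2 ))
}

virtio_rng_extra 0x1e 1024 1000
rng_bytes_per_second 1024 1000
```

The first line of output can be passed as the value of {{{kvm:kvm_extra}}}; the second confirms the limit that the chosen max-bytes and period imply.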