== How To Install Ganeti Clusters and Instances ==
Suppose that there are two identical hosts: foo.debian.org and bar.debian.org.

They are running stable and have been integrated into Debian infrastructure.

They will serve as nodes in a ganeti cluster named foobar.debian.org.

They have a RAID1 array exposing three partitions: c0d0p1 for /, c0d0p2 for
swap and c0d0p3 for lvm volume groups to be used by ganeti via drbd.

They have two network interfaces: eth0 (public) and eth1 (private).

The public network is A.B.C.0/24 with gateway A.B.C.254.

The private network is E.F.G.0/24 with no gateway.
Suppose that the first instance to be hosted on foobar.debian.org is
The following DNS records exist:

foobar.debian.org.                IN A A.B.C.1
foo.debian.org.                   IN A A.B.C.2
bar.debian.org.                   IN A A.B.C.3
qux.debian.org.                   IN A A.B.C.4
foo.debprivate-hoster.debian.org. IN A E.F.G.2
bar.debprivate-hoster.debian.org. IN A E.F.G.3
=== install required packages ===

On each node, install the required packages:

# maybe: apt-get install drbd-utils
# maybe: apt-get install ganeti-instance-debootstrap
apt-get install ganeti2 ganeti-htools qemu-kvm
=== configure kernel modules ===

On each node, ensure that the required kernel modules are loaded at boot:

ainsl /etc/modules 'drbd minor_count=255 usermode_helper=/bin/true'
ainsl /etc/modules 'hmac'
ainsl /etc/modules 'tun'
ainsl /etc/modules 'ext3'
ainsl /etc/modules 'ext4'
=== configure networking ===

On each node, ensure that br0 (not eth0) and eth1 are configured.

The bridge interface, br0, is used by the guest virtual machines to reach the
public network.

If the guest virtual machines need to access the private network, then br1
should be configured rather than eth1.

To prevent the bridge's link address from changing as virtual machines start
and stop, set it explicitly.
This is the interfaces file for foo.debian.org:

up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

iface eth1 inet static

This is the interfaces file for bar.debian.org:

up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

iface eth1 inet static
netmask 255.255.255.0
=== configure lvm ===

On each node, configure lvm to ignore drbd devices and to prefer
{{{/dev/cciss}}} device names over {{{/dev/block}}} device names
([[https://code.google.com/p/ganeti/issues/detail?id=93|why?]]):

-e 's#^\(\s*filter\s\).*#\1= [ "a|.*|", "r|/dev/drbd[0-9]+|" ]#' \
-e 's#^\(\s*preferred_names\s\).*#\1= [ "^/dev/dm-*/", "^/dev/cciss/" ]#' \
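As a sanity check, the same substitutions can be exercised against a throwaway file first. The sample lines below are an assumption, not a real lvm.conf:

```shell
# Demonstrate the two sed substitutions on a minimal, hypothetical
# lvm.conf fragment (not the real config file).
cat > /tmp/lvm-sample.conf <<'EOF'
    filter = [ "a/.*/" ]
    preferred_names = [ ]
EOF
sed -i \
 -e 's#^\(\s*filter\s\).*#\1= [ "a|.*|", "r|/dev/drbd[0-9]+|" ]#' \
 -e 's#^\(\s*preferred_names\s\).*#\1= [ "^/dev/dm-*/", "^/dev/cciss/" ]#' \
 /tmp/lvm-sample.conf
# The filter line should now reject drbd devices, and preferred_names
# should favour /dev/cciss paths.
cat /tmp/lvm-sample.conf
```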
=== create lvm volume groups ===

On each node, create a volume group:

vgcreate vg_ganeti /dev/cciss/c0d0p3

=== exchange ssh keys ===

mkdir -m 0700 -p /root/.ssh &&
ln -s /etc/ssh/ssh_host_rsa_key /root/.ssh/id_rsa
=== configure iptables (via ferm) ===

The nodes must connect to each other over the public and private networks for a number of reasons; see the ganeti2 module in puppet.
=== instantiate the cluster ===

On the master node (foo) only:

--master-netdev br0 \
--vg-name vg_ganeti \
--secondary-ip E.F.G.2 \
--enabled-hypervisors kvm \
--nic-parameters link=br0 \
--mac-prefix 00:16:37 \
--hypervisor-parameters kvm:initrd_path=,kernel_path= \
* the master network device is set to br0, matching the public network bridge interface created above
* the volume group is set to vg_ganeti, matching the volume group created above
* the secondary IP address is set to the value of the master node's interface on the private network
* the nic parameters for instances are set to use br0 as the default bridge
* the MAC prefix is registered in the dsa-kvm git repo
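Put together, the initialisation looks roughly like this. This is only a sketch: the exact invocation is partially elided above, so verify the flags against your own setup before running it.

```shell
# Hedged sketch of the full cluster initialisation; flag values are taken
# from the fragments above, the cluster name from the setup description.
gnt-cluster init \
    --master-netdev br0 \
    --vg-name vg_ganeti \
    --secondary-ip E.F.G.2 \
    --enabled-hypervisors kvm \
    --nic-parameters link=br0 \
    --mac-prefix 00:16:37 \
    --hypervisor-parameters kvm:initrd_path=,kernel_path= \
    foobar.debian.org
```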
=== add slave nodes ===

For each slave node (only bar in this example):

On the slave, append the master's /etc/ssh/ssh_host_rsa_key.pub to
/etc/ssh/userkeys/root. This is only required temporarily; once everything
works, puppet will put it there and keep it there.

On the master node (foo):

--secondary-ip E.F.G.3 \

gnt-cluster modify --reserved-lvs='vg0/local-swap.*'
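One possible way to do that temporary step from the master (a sketch; the paths are the ones named above, the remote-shell approach is an assumption):

```shell
# On foo: append foo's host public key to bar's root userkeys file.
# Assumes root ssh from foo to bar is already possible by some other means.
ssh bar.debian.org 'cat >> /etc/ssh/userkeys/root' \
    < /etc/ssh/ssh_host_rsa_key.pub
```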
maybe: gnt-cluster modify --nic-parameters mode=openvswitch
* the secondary IP address is set to the value of the slave node's interface on the private network
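The node-addition command itself is roughly the following (a sketch; the exact invocation is elided above):

```shell
# Hedged sketch: add bar as a slave node, with its private-network address
# as the secondary IP (used for DRBD replication traffic).
gnt-node add \
    --secondary-ip E.F.G.3 \
    bar.debian.org
```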
=== verify cluster ===

On the master node (foo):

If everything has been configured correctly, no errors should be reported.

=== create the 'noop' variant ===

Ensure that ganeti-os-noop is installed.
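The check itself is a single command ({{{gnt-cluster verify}}} is the standard Ganeti verification command; the exact invocation used here is elided above):

```shell
# Run the cluster-wide consistency checks (nodes, DRBD pairs, instances).
gnt-cluster verify
```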
== How To Install Ganeti Instances ==

Suppose that qux.debian.org will be an instance (a virtual machine) hosted on
the foobar.debian.org ganeti cluster.

Before adding the instance, an LDAP entry must be created so that an A record
for the instance (A.B.C.4) exists.
=== create the instance ===

On the master node (foo):

--disk-template drbd \
--os-type debootstrap+dsa \
--hypervisor-parameters kvm:initrd_path=,kernel_path= \
* the primary and secondary nodes have been explicitly set
* the operating system type is 'debootstrap+dsa'
* the network interface 0 (eth0 on the system) is set to the instance's interface on the public network
* If qux.d.o does not yet exist in DNS/LDAP, you may need --no-ip-check --no-name-check. Be careful that the hostname and IP address are not already taken!
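Put together, the creation command looks roughly like this. This is a sketch only: the exact invocation is partially elided above, and the disk size is an illustrative assumption.

```shell
# Hedged sketch: node names and networks from this example; the disk size
# is an assumption - check gnt-instance(8) before running.
gnt-instance add \
    -n foo.debian.org:bar.debian.org \
    --disk-template drbd \
    --disk 0:size=10G \
    --os-type debootstrap+dsa \
    --hypervisor-parameters kvm:initrd_path=,kernel_path= \
    --net 0:link=br0,ip=A.B.C.4 \
    qux.debian.org
```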
If the instances require access to the private network, then two modifications are necessary.

=== re-configure networking ===

On the nodes, ensure that br1 is configured (rather than eth1).
This is the interfaces file for foo.debian.org:

iface br0 inet static
netmask 255.255.255.0
up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

iface br1 inet static
netmask 255.255.255.0
up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

This is the interfaces file for bar.debian.org:

iface br0 inet static
netmask 255.255.255.0
up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

iface br1 inet static
netmask 255.255.255.0
up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE
=== create or update the instance ===

When creating the instance, indicate both networks:

--disk-template drbd \
--os-type debootstrap+dsa \
--hypervisor-parameters kvm:initrd_path=,kernel_path= \
--net 1:link=br1,ip=E.F.G.4 \

* If qux.d.o does not yet exist in DNS/LDAP, you may need --no-ip-check --no-name-check. Be careful that the hostname and IP address are not already taken!

When updating an existing instance, add the interface:

gnt-instance shutdown qux.debian.org
gnt-instance modify \
--net add:link=br1,ip=E.F.G.4 \
gnt-instance startup qux.debian.org
Please note that the hook scripts are run only at instance instantiation. When
adding interfaces to an instance, the guest operating system must be updated
* If you are importing an instance from libvirt with an LVM setup, you can adopt its LVs:

gnt-instance add -t plain --os-type debootstrap+dsa-wheezy \
  --disk 0:adopt=lully-boot \
  --disk 1:adopt=lully-root \
  --disk 2:adopt=lully-swap \
  --disk 3:adopt=lully-log \
  --hypervisor-parameters kvm:initrd_path=,kernel_path= \
  --net 0:ip=82.195.75.99 -n clementi.debian.org lully.debian.org
Afterwards, you may want to convert it to use DRBD and start it on the other cluster node, to ensure that DRBD is working correctly:
gnt-instance shutdown lully.debian.org
gnt-instance modify -t drbd -n czerny.debian.org lully.debian.org
gnt-instance failover lully.debian.org
gnt-instance startup lully.debian.org
* Some instances NEED ide instead of virtio

gnt-instance modify --hypervisor-parameters disk_type=ide fils.debian.org
* To import instances with SAN volumes

gnt-instance add -t blockdev --os-type debootstrap+dsa \
  --disk 0:adopt=/dev/disk/by-id/scsi-reger-boot \
  --disk 1:adopt=/dev/disk/by-id/scsi-reger-root \
  --hypervisor-parameters kvm:initrd_path=,kernel_path= \
  --net 0:ip=206.12.19.124 -n rossini.debian.org reger.debian.org
* How to add new LUNs to the Bytemark cluster

** Add a new LUN to the MSA and export it to all blades

Log into the MSA controller.

Choose which vdisk to use; use "show vdisks" to list them.

# create volume vdisk msa2k-2-500gr10 size 5G donizetti

or (if we assume they are all the same)
# show host-maps 3001438001287090

Make a note of the next free LUN.

Generate the map commands for all blades, all ports, run locally:

$ for bl in 1 2 3 4 5 6 ; do for p in 1 2 3 4; do echo "map volume donizetti lun 27 host bm-bl$bl-p$p" ; done ; done

Paste the output into the MSA shell.

Find the WWN by doing show host-maps and looking for the volume name.
Transform it using the sed run at the top of /etc/multipath.conf:

echo "$WWN" | sed -re 's#(.{6})(.{6})0000(.{2})(.*)#36\1000\2\3\4#'
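For example, with a made-up WWN (the value below is purely illustrative, not a real controller's):

```shell
# Transform a (hypothetical) MSA WWN into the multipath WWID form used in
# /etc/multipath.conf: prepend 36, keep the first 6 hex digits, insert 000,
# keep the next 6 digits, and drop the literal 0000 that follows them.
WWN="600143801287000009abcdef"
echo "$WWN" | sed -re 's#(.{6})(.{6})0000(.{2})(.*)#36\1000\2\3\4#'
# prints: 3660014300080128709abcdef
```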
gnt-cluster command "echo 1 > /sys/bus/pci/devices/0000:0e:00.0/cciss0/rescan"

Reload multipath-tools on the gnt master (normally bm-bl1):
service multipath-tools reload

Add the WWNs to dsa-puppet/modules/multipath/files/multipath-bm.conf, define the alias, and commit that file to git.

gnt-cluster command "puppet agent -t"

This will update the multipath config on all cluster nodes. WITHOUT doing this, you can't migrate VMs between nodes.
Order is important, or else things get very, very confused and the world needs a reboot.

*** Make sure nothing uses the volume anymore.

*** Make sure we do not have any partitions lying around for it:

gnt-cluster command "ls -l /dev/mapper/backuphost*"

gnt-cluster command "kpartx -v -p -part -d /dev/mapper/backuphost"

*** Flush the device, remove the multipath mapping, flush all backing devices:
root@bm-bl1:~# cat flush-mp-device
dev="$1"
if [ -z "$dev" ] || ! [ -e "$dev" ]; then
	echo >&2 "Device $dev does not exist."
	exit 1
fi
devs=$(multipath -ll "$dev" | grep cciss | sed -e 's/.*cciss!//; s/ .*//;')
if [ -z "$devs" ]; then
	echo >&2 "No backends found for $dev."
	exit 1
fi
blockdev --flushbufs "$dev"
multipath -f "$dev"
for d in $devs; do
	blockdev --flushbufs "/dev/cciss/$d"
done
gnt-cluster command "/root/flush-mp-device /dev/mapper/backuphost"

*** Immediately afterwards, paste the output of the following to the MSA console. Best prepare this beforehand, so you do it quickly, before anything else rescans stuff, reloads or restarts multipathd, and the devices become used again.

for bl in 1 2 3 4 5 6 ; do for p in 1 2 3 4; do echo "unmap volume DECOMMISSION-backuph host bm-bl$bl-p$p" ; done ; done

*** Lastly, rescan the scsi bus on all hosts. Do not forget this: hpacucli and the monitoring tools might lock up the machine if they try to check the status of a device that no longer exists but that the system still thinks is around.

gnt-cluster command "echo 1 > /sys/bus/pci/devices/0000:0e:00.0/cciss0/rescan"
=== DRBD optimization ===

The default DRBD parameters are not really optimized, which results in very slow (re)syncing.
The following commands might help to make it faster. Of course the maximum speed can be raised
further if both the network and the disks allow it.

gnt-cluster modify -D drbd:net-custom="--max-buffers 36k --sndbuf-size 1024k --rcvbuf-size 2048k"
gnt-cluster modify -D drbd:c-min-rate=32768
gnt-cluster modify -D drbd:c-max-rate=98304
gnt-cluster modify -D drbd:resync-rate=98304
=== Change the disk cache ===

When using raw volumes or partitions, it is best to avoid the host cache completely to reduce data copies
and bus traffic. This can be done using:

gnt-cluster modify -H kvm:disk_cache=none
=== Change the CPU type ===

Modern processors come with a wide variety of additional instruction sets (SSE, AES-NI, etc.) which vary from processor to processor, but can greatly improve performance depending on the workload. Ganeti and QEMU default to a compatible subset of CPU features called qemu64, so that if the host processor is changed, or a live migration is performed, the guest sees its CPU features unchanged. This is great for compatibility but comes at a performance cost.

The CPU presented to guests can easily be changed, using the cpu_type option in the Ganeti hypervisor options. However, to still be able to live-migrate VMs from one host to another, the CPU presented to the guest should be the common denominator of all hosts in the cluster; otherwise a live migration between two different CPU types could crash the instance.

For homogeneous clusters it is possible to use the host CPU type:

gnt-cluster modify -H kvm:cpu_type='host'
Otherwise QEMU provides a set of generic CPUs, one per generation, which can be queried this way:
$ qemu-system-x86_64 -cpu ?
x86           qemu64  QEMU Virtual CPU version 2.1.2
x86           phenom  AMD Phenom(tm) 9550 Quad-Core Processor
x86         core2duo  Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz
x86            kvm64  Common KVM processor
x86           qemu32  QEMU Virtual CPU version 2.1.2
x86            kvm32  Common 32-bit KVM processor
x86          coreduo  Genuine Intel(R) CPU T2600 @ 2.16GHz
x86           athlon  QEMU Virtual CPU version 2.1.2
x86             n270  Intel(R) Atom(TM) CPU N270 @ 1.60GHz
x86           Conroe  Intel Celeron_4x0 (Conroe/Merom Class Core 2)
x86           Penryn  Intel Core 2 Duo P9xxx (Penryn Class Core 2)
x86          Nehalem  Intel Core i7 9xx (Nehalem Class Core i7)
x86         Westmere  Westmere E56xx/L56xx/X56xx (Nehalem-C)
x86      SandyBridge  Intel Xeon E312xx (Sandy Bridge)
x86          Haswell  Intel Core Processor (Haswell)
x86        Broadwell  Intel Core Processor (Broadwell)
x86       Opteron_G1  AMD Opteron 240 (Gen 1 Class Opteron)
x86       Opteron_G2  AMD Opteron 22xx (Gen 2 Class Opteron)
x86       Opteron_G3  AMD Opteron 23xx (Gen 3 Class Opteron)
x86       Opteron_G4  AMD Opteron 62xx class CPU
x86       Opteron_G5  AMD Opteron 63xx class CPU
x86             host  KVM processor with all supported host features (only available in KVM mode)
Recognized CPUID flags:
  pbe ia64 tm ht ss sse2 sse fxsr mmx acpi ds clflush pn pse36 pat cmov mca pge mtrr sep apic cx8 mce pae msr tsc pse de vme fpu
  hypervisor rdrand f16c avx osxsave xsave aes tsc-deadline popcnt movbe x2apic sse4.2|sse4_2 sse4.1|sse4_1 dca pcid pdcm xtpr cx16 fma cid ssse3 tm2 est smx vmx ds_cpl monitor dtes64 pclmulqdq|pclmuldq pni|sse3
  smap adx rdseed rtm invpcid erms bmi2 smep avx2 hle bmi1 fsgsbase
  3dnow 3dnowext lm|i64 rdtscp pdpe1gb fxsr_opt|ffxsr mmxext nx|xd syscall
  perfctr_nb perfctr_core topoext tbm nodeid_msr tce fma4 lwp wdt skinit xop ibs osvw 3dnowprefetch misalignsse sse4a abm cr8legacy extapic svm cmp_legacy lahf_lm
  pmm-en pmm phe-en phe ace2-en ace2 xcrypt-en xcrypt xstore-en xstore
  kvmclock-stable-bit kvm_pv_unhalt kvm_pv_eoi kvm_steal_time kvm_asyncpf kvmclock kvm_mmu kvm_nopiodelay kvmclock
  pfthreshold pause_filter decodeassists flushbyasid vmcb_clean tsc_scale nrip_save svm_lock lbrv npt
For example, on a cluster using both Sandy Bridge and Haswell CPUs, the following command can be used:

gnt-cluster modify -H kvm:cpu_type='SandyBridge'
Here is a typical improvement one can get on the openssl AES benchmarks.

With the default qemu64 CPU type:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     175481.21k   195151.55k   199307.09k   201209.51k   201359.36k
aes-128-gcm      49971.64k    57688.17k   135092.14k   144172.37k   146511.19k
aes-256-cbc     130209.34k   141268.76k   142547.54k   144185.00k   144777.22k
aes-256-gcm      39249.19k    44492.61k   114492.76k   123000.83k   125501.44k

With the SandyBridge CPU type:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     376040.16k   477377.32k   484083.37k   391323.31k   389589.67k
aes-128-gcm     215921.26k   592407.87k   777246.21k   836795.39k   835971.75k
aes-256-cbc     309840.39k   328612.18k   330784.68k   324245.16k   328116.91k
aes-256-gcm     160820.14k   424322.20k   557212.50k   599435.61k   610459.65k
There are two KVM implementations on POWER: KVM-PR (kvm-pr.ko), which uses the "PRoblem state" of the ppc CPUs to run the guests, and KVM-HV (kvm-hv.ko), which uses the hardware virtualization support of the POWER CPU. In the latter case the guest CPU has to be of the same type as the host CPU. However, it is at least possible to run the guest in a backward-compatibility mode of the previous CPU generation by using the compat parameter:

gnt-cluster modify -H kvm:cpu_type='host\,compat=power8'
=== Add a virtio-rng device ===

VirtIO RNG (random number generator) is a paravirtualized device that is exposed as a hardware RNG device to the guest. Virtio RNG appears to the guest as a regular hardware RNG, which the kernel reads from to fill its entropy pool. Unfortunately Ganeti does not support it natively, so the kvm_extra option has to be used. Ganeti forces the allocation of PCI devices to specific slots, which means it is not possible to use the QEMU autoallocation and an explicit PCI slot has to be provided. There are 32 possible slots on the default QEMU machine, so we can use one of the last ones, for example 0x1e.

The final command to add a virtio-rng device cluster-wide is therefore:

gnt-cluster modify -H kvm:kvm_extra="-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000"

The max-bytes and period options limit the entropy rate a guest can get to 1 kB/s (1024 bytes per 1000 ms).
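Inside a guest, one can check that the device was picked up. This is a sketch: the sysfs paths assume a Linux guest with the kernel hw_random framework.

```shell
# In the guest: list the available hardware RNGs and the one currently in
# use; a working virtio-rng shows up as "virtio_rng.0" or similar.
cat /sys/class/misc/hw_random/rng_available
cat /sys/class/misc/hw_random/rng_current
```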
=== POWER specific settings ===

On POWER, Ganeti doesn't enable the KVM module by default, so the -enable-kvm option has to be passed.

In addition, disabling the video card makes the guest (firmware, grub, kernel) use the serial console automatically. This can be done with the -vga none option.

The command to set up KVM cluster-wide on POWER is therefore the following (possibly combined with the virtio-rng one):

gnt-cluster modify -H kvm:kvm_extra="-enable-kvm -vga none"