== How To Install Ganeti Clusters and Instances ==

Suppose that there are two identical hosts: foo.debian.org and bar.debian.org.

They are running stable and have been integrated into Debian infrastructure.

They will serve as nodes in a ganeti cluster named foobar.debian.org.

They have a RAID1 array exposing three partitions: c0d0p1 for /, c0d0p2 for
swap and c0d0p3 for LVM volume groups to be used by ganeti via DRBD.

They have two network interfaces: eth0 (public) and eth1 (private).

The public network is A.B.C.0/24 with gateway A.B.C.254.

The private network is E.F.G.0/24 with no gateway.
Suppose that the first instance to be hosted on foobar.debian.org is
qux.debian.org.
The following DNS records exist:

{{{
foobar.debian.org.                IN A A.B.C.1
foo.debian.org.                   IN A A.B.C.2
bar.debian.org.                   IN A A.B.C.3
qux.debian.org.                   IN A A.B.C.4
foo.debprivate-hoster.debian.org. IN A E.F.G.2
bar.debprivate-hoster.debian.org. IN A E.F.G.3
}}}
=== install required packages ===

On each node, install the required packages:

{{{
# maybe: apt-get install drbd-utils
# maybe: apt-get install ganeti-instance-debootstrap
apt-get install ganeti2 ganeti-htools qemu-kvm
}}}
=== configure kernel modules ===

On each node, ensure that the required kernel modules are loaded at boot:

{{{
ainsl /etc/modules 'drbd minor_count=255 usermode_helper=/bin/true'
ainsl /etc/modules 'hmac'
ainsl /etc/modules 'tun'
ainsl /etc/modules 'ext3'
ainsl /etc/modules 'ext4'
}}}
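ainsl (append-if-no-such-line) comes from FAI. On a host without it, a minimal
shell equivalent is the following sketch (the helper name is ours, not part of
the original procedure):

{{{
append_line() {
    # append $2 to file $1 unless an identical line is already present
    grep -qxF "$2" "$1" || echo "$2" >> "$1"
}
append_line /etc/modules 'drbd minor_count=255 usermode_helper=/bin/true'
}}}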
=== configure networking ===

On each node, ensure that br0 (not eth0) and eth1 are configured.

The bridge interface, br0, is used by the guest virtual machines to reach the
public network.

If the guest virtual machines need to access the private network, then br1
should be configured rather than eth1.

To prevent the bridge's link address from changing as virtual machines start
and stop, set the value explicitly.
This is the interfaces file for foo.debian.org:

{{{
auto br0
iface br0 inet static
  bridge_ports eth0
  bridge_maxwait 0
  bridge_fd 0
  address A.B.C.2
  netmask 255.255.255.0
  gateway A.B.C.254
  up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

auto eth1
iface eth1 inet static
  address E.F.G.2
  netmask 255.255.255.0
}}}

This is the interfaces file for bar.debian.org:

{{{
auto br0
iface br0 inet static
  bridge_ports eth0
  bridge_maxwait 0
  bridge_fd 0
  address A.B.C.3
  netmask 255.255.255.0
  gateway A.B.C.254
  up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

auto eth1
iface eth1 inet static
  address E.F.G.3
  netmask 255.255.255.0
}}}
=== configure lvm ===

On each node, configure lvm to ignore drbd devices and to prefer
{{{/dev/cciss}}} device names over {{{/dev/block}}} device names
([[https://code.google.com/p/ganeti/issues/detail?id=93|why?]]):

{{{
sed -i \
-e 's#^\(\s*filter\s\).*#\1= [ "r|/dev/drbd[0-9]+|", "a|.*|" ]#' \
-e 's#^\(\s*preferred_names\s\).*#\1= [ "^/dev/dm-*/", "^/dev/cciss/" ]#' \
/etc/lvm/lvm.conf
}}}

Note that the reject pattern for drbd devices must come before the catch-all
accept pattern: lvm applies the first pattern that matches.
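After the edit, the relevant lines in {{{/etc/lvm/lvm.conf}}} should read:

{{{
    filter = [ "r|/dev/drbd[0-9]+|", "a|.*|" ]
    preferred_names = [ "^/dev/dm-*/", "^/dev/cciss/" ]
}}}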
=== create lvm volume groups ===

On each node, create a volume group:

{{{
vgcreate vg_ganeti /dev/cciss/c0d0p3
}}}
=== exchange ssh keys ===

On each node:

{{{
mkdir -m 0700 -p /root/.ssh &&
ln -s /etc/ssh/ssh_host_rsa_key /root/.ssh/id_rsa
}}}
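To confirm that the symlinked key is usable as root's ssh identity, you can
print its public half (a suggested check, not part of the original procedure):

{{{
ssh-keygen -y -f /root/.ssh/id_rsa
}}}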
=== configure iptables (via ferm) ===

The nodes must connect to each other over the public and private networks for
a number of reasons; see the ganeti2 module in puppet.
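The authoritative rules live in the ganeti2 puppet module. Purely as an
illustrative sketch, a ferm fragment permitting inter-node traffic might look
like the following (the port list and the DRBD port range are assumptions, not
from the original; check the puppet module for the real rules):

{{{
# illustrative only -- the real rules come from the ganeti2 puppet module
domain ip {
    table filter {
        chain INPUT {
            # ssh and ganeti noded from the other node's addresses
            saddr (A.B.C.2 A.B.C.3 E.F.G.2 E.F.G.3) proto tcp dport (22 1811) ACCEPT;
            # DRBD replication over the private network
            saddr (E.F.G.2 E.F.G.3) proto tcp dport 7788:7799 ACCEPT;
        }
    }
}
}}}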
=== instantiate the cluster ===

On the master node (foo) only:

{{{
gnt-cluster init \
  --master-netdev br0 \
  --vg-name vg_ganeti \
  --secondary-ip E.F.G.2 \
  --enabled-hypervisors kvm \
  --nic-parameters link=br0 \
  --mac-prefix 00:16:37 \
  --no-ssh-init \
  --no-etc-hosts \
  --hypervisor-parameters kvm:initrd_path=,kernel_path= \
  foobar.debian.org
}}}
* the master network device is set to br0, matching the public network bridge interface created above
* the volume group is set to vg_ganeti, matching the volume group created above
* the secondary IP address is set to the value of the master node's interface on the private network
* the nic parameters for instances are set to use br0 as the default bridge
* the MAC prefix is registered in the dsa-kvm git repo
=== add slave nodes ===

For each slave node (only bar in this example):

On the slave, append the master's /etc/ssh/ssh_host_rsa_key.pub to
/etc/ssh/userkeys/root. This is only required temporarily; once everything
works, puppet will put it there and keep it there.
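For example, run on bar (the key material shown is a placeholder for the
actual contents of foo's /etc/ssh/ssh_host_rsa_key.pub):

{{{
echo 'ssh-rsa AAAA...placeholder... root@foo' >> /etc/ssh/userkeys/root
}}}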
On the master node (foo):

{{{
gnt-node add \
  --secondary-ip E.F.G.3 \
  --no-ssh-key-check \
  bar.debian.org
}}}

Then reserve any local LVs that ganeti must not touch (and, optionally, switch
the nic mode to openvswitch):

{{{
gnt-cluster modify --reserved-lvs='vg0/local-swap.*'
# maybe: gnt-cluster modify --nic-parameters mode=openvswitch
}}}
* the secondary IP address is set to the value of the slave node's interface on the private network
=== verify cluster ===

On the master node (foo):

{{{
gnt-cluster verify
}}}

If everything has been configured correctly, no errors should be reported.
=== create the 'noop' variant ===

Ensure that the ganeti-os-noop package is installed.
== How To Install Ganeti Instances ==

Suppose that qux.debian.org will be an instance (a virtual machine) hosted on
the foobar.debian.org ganeti cluster.

Before adding the instance, an LDAP entry must be created so that an A record
for the instance (A.B.C.4) exists.
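You can confirm that the record resolves before continuing (a suggested check,
not part of the original procedure):

{{{
dig +short qux.debian.org
# expected output: A.B.C.4
}}}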
=== create the instance ===

On the master node (foo):

{{{
gnt-instance add \
  --node foo:bar \
  --disk-template drbd \
  --os-size 4GiB \
  --os-type debootstrap+dsa \
  --hypervisor-parameters kvm:initrd_path=,kernel_path= \
  --net 0:ip=A.B.C.4 \
  qux.debian.org
}}}

* the primary and secondary nodes have been explicitly set
* the operating system type is 'debootstrap+dsa'
* the network interface 0 (eth0 on the system) is set to the instance's interface on the public network
* If qux.d.o does not yet exist in DNS/LDAP, you may need --no-ip-check --no-name-check. Be careful that the hostname and IP address are not already taken!
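Once the add completes, you can watch the instance boot on its serial console
(a suggested check, not part of the original procedure):

{{{
gnt-instance console qux.debian.org
}}}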
If the instances require access to the private network, then two modifications
are necessary.
=== re-configure networking ===

On the nodes, ensure that br1 is configured (rather than eth1).

This is the interfaces file for foo.debian.org:

{{{
auto br0
iface br0 inet static
  bridge_ports eth0
  bridge_maxwait 0
  bridge_fd 0
  address A.B.C.2
  netmask 255.255.255.0
  gateway A.B.C.254
  up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

auto br1
iface br1 inet static
  bridge_ports eth1
  bridge_maxwait 0
  bridge_fd 0
  address E.F.G.2
  netmask 255.255.255.0
  up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE
}}}

This is the interfaces file for bar.debian.org:

{{{
auto br0
iface br0 inet static
  bridge_ports eth0
  bridge_maxwait 0
  bridge_fd 0
  address A.B.C.3
  netmask 255.255.255.0
  gateway A.B.C.254
  up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

auto br1
iface br1 inet static
  bridge_ports eth1
  bridge_maxwait 0
  bridge_fd 0
  address E.F.G.3
  netmask 255.255.255.0
  up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE
}}}
=== create or update the instance ===

When creating the instance, indicate both networks:

{{{
gnt-instance add \
  --node foo:bar \
  --disk-template drbd \
  --os-size 4GiB \
  --os-type debootstrap+dsa \
  --hypervisor-parameters kvm:initrd_path=,kernel_path= \
  --net 0:ip=A.B.C.4 \
  --net 1:link=br1,ip=E.F.G.4 \
  qux.debian.org
}}}

* If qux.d.o does not yet exist in DNS/LDAP, you may need --no-ip-check --no-name-check. Be careful that the hostname and IP address are not already taken!
When updating an existing instance, add the interface:

{{{
gnt-instance shutdown qux.debian.org
gnt-instance modify \
  --net add:link=br1,ip=E.F.G.4 \
  qux.debian.org
gnt-instance startup qux.debian.org
}}}
Please note that the hook scripts are run only at instance instantiation. When
adding interfaces to an instance, the guest operating system must be updated
manually.
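For a Debian guest, that means configuring the new interface in the guest's
own /etc/network/interfaces, along the lines of (addresses taken from the
example above):

{{{
auto eth1
iface eth1 inet static
  address E.F.G.4
  netmask 255.255.255.0
}}}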
* If you are importing an instance from libvirt with an LVM setup, you can adopt its LVs:

{{{
gnt-instance add -t plain --os-type debootstrap+dsa-wheezy \
  --disk 0:adopt=lully-boot \
  --disk 1:adopt=lully-root \
  --disk 2:adopt=lully-swap \
  --disk 3:adopt=lully-log \
  --hypervisor-parameters kvm:initrd_path=,kernel_path= \
  --net 0:ip=82.195.75.99 -n clementi.debian.org lully.debian.org
}}}
Afterwards, convert it to DRBD and start it on the other cluster node, to
ensure that DRBD is working correctly:

{{{
gnt-instance shutdown lully.debian.org
gnt-instance modify -t drbd -n czerny.debian.org lully.debian.org
gnt-instance failover lully.debian.org
gnt-instance startup lully.debian.org
}}}
* Some instances NEED ide instead of virtio:

{{{
gnt-instance modify --hypervisor-parameters disk_type=ide fils.debian.org
}}}
* To import instances with SAN volumes:

{{{
gnt-instance add -t blockdev --os-type debootstrap+dsa \
  --disk 0:adopt=/dev/disk/by-id/scsi-reger-boot \
  --disk 1:adopt=/dev/disk/by-id/scsi-reger-root \
  --hypervisor-parameters kvm:initrd_path=,kernel_path= \
  --net 0:ip=206.12.19.124 -n rossini.debian.org reger.debian.org
}}}
* How to add new LUNs to the Bytemark cluster:

** Add the new LUN to the MSA and export it to all blades

Log into the MSA controller.

Choose which vdisk to use ("show vdisks" lists them), then create the volume:

{{{
# create volume vdisk msa2k-2-500gr10 size 5G donizetti
}}}

List the current host mappings:

{{{
# show host-maps
}}}

or (if we assume they are all the same):

{{{
# show host-maps 3001438001287090
}}}

Make a note of the next free LUN.

Generate map commands for all blades, all ports; run this locally:

{{{
$ for bl in 1 2 3 4 5 6 ; do for p in 1 2 3 4; do echo "map volume donizetti lun 27 host bm-bl$bl-p$p" ; done ; done
}}}

Paste the output into the MSA shell.

Find the WWN by doing "show host-maps" and looking for the volume name.
Transform it using the sed run at the top of /etc/multipath.conf:

{{{
echo "$WWN" | sed -re 's#(.{6})(.{6})0000(.{2})(.*)#36\1000\2\3\4#'
}}}
Rescan the SCSI bus on all cluster nodes:

{{{
gnt-cluster command "echo 1 > /sys/bus/pci/devices/0000:0e:00.0/cciss0/rescan"
}}}

Reload multipath-tools on the gnt master (normally bm-bl1):

{{{
service multipath-tools reload
}}}

Add the WWNs to dsa-puppet/modules/multipath/files/multipath-bm.conf, define
the alias, and commit that file to git.

Run puppet on all cluster nodes:

{{{
gnt-cluster command "puppet agent -t"
}}}

This will update the multipath config on all cluster nodes. WITHOUT doing
this, you can't migrate VMs between nodes.
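To check that the new alias is visible on every node (a suggested check;
donizetti is the example volume from above):

{{{
gnt-cluster command "multipath -ll donizetti"
}}}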
** Remove a LUN

Order is important here, or else things get very, very confused and the world
needs a reboot.

*** Make sure nothing uses the volume anymore.

*** Make sure we do not have any partitions lying around for it:

{{{
gnt-cluster command "ls -l /dev/mapper/backuphost*"
# if any partition mappings exist, remove them:
gnt-cluster command "kpartx -v -p -part -d /dev/mapper/backuphost"
}}}
*** Flush the device, remove the multipath mapping, and flush all backing devices:

{{{
root@bm-bl1:~# cat flush-mp-device
#!/bin/sh

dev="$1"

if [ -z "$dev" ] || ! [ -e "$dev" ]; then
    echo >&2 "Device $dev does not exist."
    exit 1
fi

devs=$(multipath -ll "$dev" | grep cciss | sed -e 's/.*cciss!//; s/ .*//;')

if [ -z "$devs" ]; then
    echo >&2 "No backends found for $dev."
    exit 1
fi

set -e

# flush the multipath device, then remove its mapping
blockdev --flushbufs "$dev"
multipath -f "$dev"

# flush each backing cciss device
for d in $devs; do
    blockdev --flushbufs "/dev/cciss/$d"
done
}}}

Run it on all cluster nodes:

{{{
gnt-cluster command "/root/flush-mp-device /dev/mapper/backuphost"
}}}
*** Immediately afterwards, paste the output of the following to the MSA console. Prepare this beforehand, so you can do it quickly, before anything else rescans stuff, reloads or restarts multipathd, and the devices become used again.

{{{
for bl in 1 2 3 4 5 6 ; do for p in 1 2 3 4; do echo "unmap volume DECOMMISSION-backuph host bm-bl$bl-p$p" ; done ; done
}}}
*** Lastly, rescan the scsi bus on all hosts. Do not forget this: hpacucli and the monitoring tools might lock up the machine if they try to check the status of a device that no longer exists but that the system still thinks is around.

{{{
gnt-cluster command "echo 1 > /sys/bus/pci/devices/0000:0e:00.0/cciss0/rescan"
}}}
=== Change the CPU type ===

Modern processors come with a wide variety of additional instruction sets (SSE, AES-NI, etc.) which vary from processor to processor, but can greatly improve performance depending on the workload. Ganeti and QEMU default to a compatible subset of CPU features called qemu64, so that if the host processor is changed, or a live migration is performed, the guest will see its CPU features unchanged. This is great for compatibility but comes at a performance cost.

The CPU presented to the guests can easily be changed, using the cpu_type option in the Ganeti hypervisor options. However, to still be able to live-migrate VMs from one host to another, the CPU presented to the guest should be the lowest common denominator of all hosts in the cluster. Otherwise a live migration between two different CPU types could crash the instance.

For homogeneous clusters it is possible to use the host CPU type:

{{{
gnt-cluster modify -H kvm:cpu_type='host'
}}}
Otherwise QEMU provides a set of generic CPU models for each generation, which can be queried this way:

{{{
$ qemu-system-x86_64 -cpu ?
x86           qemu64  QEMU Virtual CPU version 2.1.2
x86           phenom  AMD Phenom(tm) 9550 Quad-Core Processor
x86         core2duo  Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz
x86            kvm64  Common KVM processor
x86           qemu32  QEMU Virtual CPU version 2.1.2
x86            kvm32  Common 32-bit KVM processor
x86          coreduo  Genuine Intel(R) CPU T2600 @ 2.16GHz
[...]
x86           athlon  QEMU Virtual CPU version 2.1.2
x86             n270  Intel(R) Atom(TM) CPU N270 @ 1.60GHz
x86           Conroe  Intel Celeron_4x0 (Conroe/Merom Class Core 2)
x86           Penryn  Intel Core 2 Duo P9xxx (Penryn Class Core 2)
x86          Nehalem  Intel Core i7 9xx (Nehalem Class Core i7)
x86         Westmere  Westmere E56xx/L56xx/X56xx (Nehalem-C)
x86      SandyBridge  Intel Xeon E312xx (Sandy Bridge)
x86          Haswell  Intel Core Processor (Haswell)
x86        Broadwell  Intel Core Processor (Broadwell)
x86       Opteron_G1  AMD Opteron 240 (Gen 1 Class Opteron)
x86       Opteron_G2  AMD Opteron 22xx (Gen 2 Class Opteron)
x86       Opteron_G3  AMD Opteron 23xx (Gen 3 Class Opteron)
x86       Opteron_G4  AMD Opteron 62xx class CPU
x86       Opteron_G5  AMD Opteron 63xx class CPU
x86             host  KVM processor with all supported host features (only available in KVM mode)

Recognized CPUID flags:
  pbe ia64 tm ht ss sse2 sse fxsr mmx acpi ds clflush pn pse36 pat cmov mca pge mtrr sep apic cx8 mce pae msr tsc pse de vme fpu
  hypervisor rdrand f16c avx osxsave xsave aes tsc-deadline popcnt movbe x2apic sse4.2|sse4_2 sse4.1|sse4_1 dca pcid pdcm xtpr cx16 fma cid ssse3 tm2 est smx vmx ds_cpl monitor dtes64 pclmulqdq|pclmuldq pni|sse3
  smap adx rdseed rtm invpcid erms bmi2 smep avx2 hle bmi1 fsgsbase
  3dnow 3dnowext lm|i64 rdtscp pdpe1gb fxsr_opt|ffxsr mmxext nx|xd syscall
  perfctr_nb perfctr_core topoext tbm nodeid_msr tce fma4 lwp wdt skinit xop ibs osvw 3dnowprefetch misalignsse sse4a abm cr8legacy extapic svm cmp_legacy lahf_lm
  pmm-en pmm phe-en phe ace2-en ace2 xcrypt-en xcrypt xstore-en xstore
  kvmclock-stable-bit kvm_pv_unhalt kvm_pv_eoi kvm_steal_time kvm_asyncpf kvmclock kvm_mmu kvm_nopiodelay kvmclock
  pfthreshold pause_filter decodeassists flushbyasid vmcb_clean tsc_scale nrip_save svm_lock lbrv npt
}}}
For example, on a cluster using both Sandy Bridge and Haswell CPUs, the following command can be used:

{{{
gnt-cluster modify -H kvm:cpu_type='SandyBridge'
}}}
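After the instances have been stopped and started again, the new model is
visible from inside a guest (a suggested check; output abridged):

{{{
$ grep -m1 'model name' /proc/cpuinfo
model name	: Intel Xeon E312xx (Sandy Bridge)
}}}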
Here is a typical improvement one can get on the OpenSSL AES benchmarks.

With the default qemu64 CPU type:

{{{
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     175481.21k   195151.55k   199307.09k   201209.51k   201359.36k
aes-128-gcm      49971.64k    57688.17k   135092.14k   144172.37k   146511.19k
aes-256-cbc     130209.34k   141268.76k   142547.54k   144185.00k   144777.22k
aes-256-gcm      39249.19k    44492.61k   114492.76k   123000.83k   125501.44k
}}}

With the SandyBridge CPU type:

{{{
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     376040.16k   477377.32k   484083.37k   391323.31k   389589.67k
aes-128-gcm     215921.26k   592407.87k   777246.21k   836795.39k   835971.75k
aes-256-cbc     309840.39k   328612.18k   330784.68k   324245.16k   328116.91k
aes-256-gcm     160820.14k   424322.20k   557212.50k   599435.61k   610459.65k
}}}
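These tables are in the format produced by openssl speed; to reproduce them
inside a guest, something like the following can be used:

{{{
for alg in aes-128-cbc aes-128-gcm aes-256-cbc aes-256-gcm; do
    openssl speed -evp "$alg"
done
}}}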
=== Add a virtio-rng device ===

VirtIO RNG (random number generator) is a paravirtualized device that appears as a regular hardware RNG to the guest, which the kernel reads from to fill its entropy pool. Unfortunately Ganeti does not support it natively, therefore the kvm_extra option has to be used. Ganeti forces the allocation of PCI devices to specific slots, which means it is not possible to use the QEMU auto-allocation and an explicit PCI slot has to be provided. There are 32 possible slots on the default QEMU machine, so we can use one of the last ones, for example 0x1e.

The final command to add a virtio-rng device cluster-wide is therefore:

{{{
gnt-cluster modify -H kvm:kvm_extra="-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000"
}}}

The max-bytes and period options limit the entropy rate a guest can get to 1 kB/s.
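Inside a guest, you can confirm that the device was picked up (a suggested
check; the guest kernel needs the virtio_rng driver, and the exact device name
may vary by kernel version):

{{{
$ cat /sys/class/misc/hw_random/rng_current
virtio_rng.0
}}}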