== How To Install Ganeti Clusters and Instances ==

=== suppositions ===

Suppose that there are two identical hosts: foo.debian.org and bar.debian.org.

They are running stable and have been integrated into Debian infrastructure. They will serve as nodes in a ganeti cluster named foobar.debian.org.

They have a RAID1 array exposing three partitions: c0d0p1 for /, c0d0p2 for swap and c0d0p3 for lvm volume groups to be used by ganeti via drbd.

They have two network interfaces: eth0 (public) and eth1 (private). The public network is A.B.C.0/24 with gateway A.B.C.254. The private network is E.F.G.0/24 with no gateway.

Suppose that the first instance to be hosted on foobar.debian.org is qux.debian.org.

The following DNS records exist:

{{{
foobar.debian.org.                  IN A   A.B.C.1
foo.debian.org.                     IN A   A.B.C.2
bar.debian.org.                     IN A   A.B.C.3
qux.debian.org.                     IN A   A.B.C.4
foo.debprivate-hoster.debian.org.   IN A   E.F.G.2
bar.debprivate-hoster.debian.org.   IN A   E.F.G.3
}}}

=== install required packages ===

On each node, install the required packages:

{{{
# maybe: apt-get install drbd-utils
# maybe: apt-get install ganeti-instance-debootstrap
apt-get install ganeti2 ganeti-htools qemu-kvm
}}}

=== configure kernel modules ===

On each node, ensure that the required kernel modules are loaded at boot:

{{{
ainsl /etc/modules 'drbd minor_count=255 usermode_helper=/bin/true'
ainsl /etc/modules 'hmac'
ainsl /etc/modules 'tun'
ainsl /etc/modules 'ext3'
ainsl /etc/modules 'ext4'
}}}

=== configure networking ===

On each node, ensure that br0 (not eth0) and eth1 are configured.

The bridge interface, br0, is used by the guest virtual machines to reach the public network. If the guest virtual machines need to access the private network, then br1 should be configured rather than eth1.

To prevent the bridge's link (MAC) address from changing as virtual machines are started and stopped, set it explicitly.
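The interfaces files below do this with an 'up ip link set addr ...' line. A quick way to confirm the result on a node (a sketch; brctl comes from the bridge-utils package) is:

{{{
# list bridges and the interfaces enslaved to them
brctl show
# confirm br0 kept the fixed link address
ip link show br0
}}}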
This is the interfaces file for foo.debian.org:

{{{
auto br0
iface br0 inet static
  bridge_ports eth0
  bridge_maxwait 0
  bridge_fd 0
  address A.B.C.2
  netmask 255.255.255.0
  gateway A.B.C.254
  up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

auto eth1
iface eth1 inet static
  address E.F.G.2
  netmask 255.255.255.0
}}}

This is the interfaces file for bar.debian.org:

{{{
auto br0
iface br0 inet static
  bridge_ports eth0
  bridge_maxwait 0
  bridge_fd 0
  address A.B.C.3
  netmask 255.255.255.0
  gateway A.B.C.254
  up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

auto eth1
iface eth1 inet static
  address E.F.G.3
  netmask 255.255.255.0
}}}

=== configure lvm ===

On each node, configure lvm to ignore drbd devices and to prefer {{{/dev/cciss}}} device names over {{{/dev/block}}} device names ([[https://code.google.com/p/ganeti/issues/detail?id=93|why?]]):

{{{
ssed -i \
  -e 's#^\(\s*filter\s\).*#\1= [ "a|.*|", "r|/dev/drbd[0-9]+|" ]#' \
  -e 's#^\(\s*preferred_names\s\).*#\1= [ "^/dev/dm-*/", "^/dev/cciss/" ]#' \
  /etc/lvm/lvm.conf
service lvm2 restart
}}}

=== create lvm volume groups ===

On each node, create a volume group:

{{{
vgcreate vg_ganeti /dev/cciss/c0d0p3
}}}

=== exchange ssh keys ===

On each node:

{{{
mkdir -m 0700 -p /root/.ssh && ln -s /etc/ssh/ssh_host_rsa_key /root/.ssh/id_rsa
}}}

=== configure iptables (via ferm) ===

The nodes must connect to each other over the public and private networks for a number of reasons; see the ganeti2 module in puppet.

=== instantiate the cluster ===

On the master node (foo) only:

{{{
gnt-cluster init \
  --master-netdev br0 \
  --vg-name vg_ganeti \
  --secondary-ip E.F.G.2 \
  --enabled-hypervisors kvm \
  --nic-parameters link=br0 \
  --mac-prefix 00:16:37 \
  --no-ssh-init \
  --no-etc-hosts \
  --hypervisor-parameters kvm:initrd_path=,kernel_path= \
  foobar.debian.org
}}}

Note the following:

* the master network device is set to br0, matching the public network bridge interface created above
* the volume group is set to vg_ganeti, matching the volume group created above
* the secondary IP address is set to the value of the master node's interface on the private network
* the nic parameters for instances are set to use br0 as the default bridge
* the MAC prefix is registered in the dsa-kvm git repo

=== add slave nodes ===

For each slave node (only bar for this example):

On the slave, append the master's /etc/ssh/ssh_host_rsa_key.pub to /etc/ssh/userkeys/root. This is only required temporarily - once everything works, puppet will put it/keep it there.

On the master node (foo):

{{{
gnt-node add \
  --secondary-ip E.F.G.3 \
  --no-ssh-key-check \
  --no-node-setup \
  bar.debian.org
}}}

Additional cluster configuration:

{{{
gnt-cluster modify --reserved-lvs='vg0/local-swap.*'
maybe: gnt-cluster modify --nic-parameters mode=openvswitch
}}}

Note the following:

* the secondary IP address is set to the value of the slave node's interface on the private network

=== verify cluster ===

On the master node (foo):

{{{
gnt-cluster verify
}}}

If everything has been configured correctly, no errors should be reported.

=== create the 'noop' variant ===

Ensure that ganeti-os-noop is installed.

----

== How To Install Ganeti Instances ==

Suppose that qux.debian.org will be an instance (a virtual machine) hosted on the foobar.debian.org ganeti cluster.

Before adding the instance, an LDAP entry must be created so that an A record for the instance (A.B.C.4) exists.
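It is worth confirming that the record already resolves before proceeding; a minimal check (any resolver query will do, dig is shown here as an example):

{{{
dig +short qux.debian.org
# expected output: A.B.C.4
}}}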
=== create the instance ===

On the master node (foo):

{{{
gnt-instance add \
  --node foo:bar \
  --disk-template drbd \
  --os-size 4GiB \
  --os-type debootstrap+dsa \
  --hypervisor-parameters kvm:initrd_path=,kernel_path= \
  --net 0:ip=A.B.C.4 \
  qux.debian.org
}}}

Note the following:

* the primary and secondary nodes have been explicitly set
* the operating system type is 'debootstrap+dsa'
* network interface 0 (eth0 on the system) is set to the instance's interface on the public network
* If qux.d.o does not yet exist in DNS/LDAP, you may need --no-ip-check --no-name-check. Be careful that the hostname and IP address are not taken already!

----

== Variations ==

If the instances require access to the private network, then there are two modifications necessary.

=== re-configure networking ===

On the nodes, ensure that br1 is configured (rather than eth1).

This is the interfaces file for foo.debian.org:

{{{
auto br0
iface br0 inet static
  bridge_ports eth0
  bridge_maxwait 0
  bridge_fd 0
  address A.B.C.2
  netmask 255.255.255.0
  gateway A.B.C.254
  up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

auto br1
iface br1 inet static
  bridge_ports eth1
  bridge_maxwait 0
  bridge_fd 0
  address E.F.G.2
  netmask 255.255.255.0
  up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE
}}}

This is the interfaces file for bar.debian.org:

{{{
auto br0
iface br0 inet static
  bridge_ports eth0
  bridge_maxwait 0
  bridge_fd 0
  address A.B.C.3
  netmask 255.255.255.0
  gateway A.B.C.254
  up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE

auto br1
iface br1 inet static
  bridge_ports eth1
  bridge_maxwait 0
  bridge_fd 0
  address E.F.G.3
  netmask 255.255.255.0
  up ip link set addr $(cat /sys/class/net/$IFACE/address) dev $IFACE
}}}

=== create or update the instance ===

When creating the instance, indicate both networks:

{{{
gnt-instance add \
  --node foo:bar \
  --disk-template drbd \
  --os-size 4GiB \
  --os-type debootstrap+dsa \
  --hypervisor-parameters kvm:initrd_path=,kernel_path= \
  --net 0:ip=A.B.C.4 \
  --net 1:link=br1,ip=E.F.G.4 \
  qux.debian.org
}}}

* If qux.d.o does not yet exist in DNS/LDAP, you may need --no-ip-check --no-name-check. Be careful that the hostname and IP address are not taken already!

When updating an existing instance, add the interface:

{{{
gnt-instance shutdown qux.debian.org
gnt-instance modify \
  --net add:link=br1,ip=E.F.G.4 \
  qux.debian.org
gnt-instance startup qux.debian.org
}}}

Please note that the hook scripts are run only at instance instantiation. When adding interfaces to an instance, the guest operating system must be updated manually.

* If you are importing an instance from libvirt with an LVM setup, you can adopt its LVs:

{{{
gnt-instance add -t plain --os-type debootstrap+dsa-wheezy \
  --disk 0:adopt=lully-boot \
  --disk 1:adopt=lully-root \
  --disk 2:adopt=lully-swap \
  --disk 3:adopt=lully-log \
  --hypervisor-parameters kvm:initrd_path=,kernel_path= \
  --net 0:ip=82.195.75.99 -n clementi.debian.org lully.debian.org
}}}

Afterwards, to convert it to DRBD and start it on the other cluster node (so that DRBD can be confirmed to work correctly):
{{{
gnt-instance shutdown lully.debian.org
gnt-instance modify -t drbd -n czerny.debian.org lully.debian.org
gnt-instance failover lully.debian.org
gnt-instance startup lully.debian.org
}}}

* Some instances NEED ide instead of virtio:

{{{
gnt-instance modify --hypervisor-parameters disk_type=ide fils.debian.org
}}}

* To import instances with SAN volumes:

{{{
gnt-instance add -t blockdev --os-type debootstrap+dsa \
  --disk 0:adopt=/dev/disk/by-id/scsi-reger-boot \
  --disk 1:adopt=/dev/disk/by-id/scsi-reger-root \
  --hypervisor-parameters kvm:initrd_path=,kernel_path= \
  --net 0:ip=206.12.19.124 -n rossini.debian.org reger.debian.org
}}}

* How to add new LUNs to the Bytemark cluster

** Add the new LUN to the MSA and export it to all blades:

{{{
Log into the MSA controller.

Choose which vdisk to use; use "show vdisks" to list them.

Add the volume:
# create volume vdisk msa2k-2-500gr10 size 5G donizetti

Find a free LUN:
# show lun-maps
or (if we assume they are all the same)
# show host-maps 3001438001287090
Make a note of the next free LUN.

Generate map commands for all blades, all ports; run locally:
$ for bl in 1 2 3 4 5 6 ; do for p in 1 2 3 4; do echo "map volume donizetti lun 27 host bm-bl$bl-p$p" ; done ; done
Paste the output into the MSA shell.

Find the WWN by doing "show host-maps" and looking for the volume name.
Transform it using the sed run at the top of /etc/multipath.conf:
echo "$WWN" | sed -re 's#(.{6})(.{6})0000(.{2})(.*)#36\1000\2\3\4#'
}}}

{{{
then: gnt-cluster command "echo 1 > /sys/bus/pci/devices/0000:0e:00.0/cciss0/rescan"

then: reload multipath-tools on the gnt-master (normally bm-bl1): service multipath-tools reload

add the WWNs to dsa-puppet/modules/multipath/files/multipath-bm.conf, define the alias, and commit that file to git.

then: gnt-cluster command "puppet agent -t"

This will update the multipath config on all cluster nodes. WITHOUT doing this, you can't migrate VMs between nodes.
}}}

** Remove LUNs. Order is important, or else things get very, very confused and the world needs a reboot.

*** Make sure nothing uses the volume anymore.

*** Make sure we do not have any partitions lying around for it:

{{{
gnt-cluster command "ls -l /dev/mapper/backuphost*"
# and maybe:
gnt-cluster command "kpartx -v -p -part -d /dev/mapper/backuphost"
}}}

*** Flush the device, remove the multipath mapping, and flush all backing devices:

{{{
root@bm-bl1:~# cat flush-mp-device
#!/bin/sh

# Flush and remove a multipath device together with its cciss backing devices.
dev="$1"

if [ -z "$dev" ] || ! [ -e "$dev" ]; then
  echo >&2 "Device $dev does not exist."
  exit 1
fi

devs=$(multipath -ll "$dev" | grep cciss | sed -e 's/.*cciss!//; s/ .*//;')
if [ -z "$devs" ]; then
  echo >&2 "No backends found for $dev."
  exit 1
fi

set -e
blockdev --flushbufs "$dev"
multipath -f "$dev"
for d in $devs; do
  blockdev --flushbufs "/dev/cciss/$d"
done
echo done.
}}}

{{{
gnt-cluster command "/root/flush-mp-device /dev/mapper/backuphost"
}}}

*** Immediately afterwards, paste the output of the following into the MSA console. Best prepare this beforehand, so you do it quickly, before anything else rescans stuff, reloads or restarts multipathd, and the devices become used again.

{{{
for bl in 1 2 3 4 5 6 ; do for p in 1 2 3 4; do echo "unmap volume DECOMMISSION-backuph host bm-bl$bl-p$p" ; done ; done
}}}

*** Lastly, rescan the scsi bus on all hosts. Do not forget this step: hpacucli and the monitoring tools might lock up the machine if they try to check the status of a device that no longer exists but that the system still thinks is around.
{{{
gnt-cluster command "echo 1 > /sys/bus/pci/devices/0000:0e:00.0/cciss0/rescan"
}}}

=== DRBD optimization ===

The default DRBD parameters are not optimized, which results in very slow (re)syncing. The following commands might help to make it faster. Of course the maximum speed can be increased further if both the network and the disks allow it.

{{{
gnt-cluster modify -D drbd:net-custom="--max-buffers 36k --sndbuf-size 1024k --rcvbuf-size 2048k"
gnt-cluster modify -D drbd:c-min-rate=32768
gnt-cluster modify -D drbd:c-max-rate=98304
gnt-cluster modify -D drbd:resync-rate=98304
}}}

=== Change the disk cache ===

When using raw volumes or partitions, it is best to avoid the host cache completely to reduce data copies and bus traffic. This can be done using:

{{{
gnt-cluster modify -H kvm:disk_cache=none
}}}

=== Change the CPU type ===

Modern processors come with a wide variety of additional instruction sets (SSE, AES-NI, etc.) which vary from processor to processor, but can greatly improve performance depending on the workload.

Ganeti and QEMU default to a compatible subset of CPU features called qemu64, so that if the host processor is changed, or a live migration is performed, the guest will see its CPU features unchanged. This is great for compatibility but comes at a performance cost.

The CPU presented to the guests can easily be changed, using the cpu_type option in the Ganeti hypervisor options. However, to still be able to live-migrate VMs from one host to another, the CPU presented to the guest should be the common denominator of all hosts in the cluster. Otherwise a live migration between two different CPU types could crash the instance.

For homogeneous clusters it is possible to use the host CPU type:

{{{
gnt-cluster modify -H kvm:cpu_type='host'
}}}

Otherwise QEMU provides a set of generic CPU models for each generation, which can be queried this way:

{{{
$ qemu-system-x86_64 -cpu ?
x86           qemu64  QEMU Virtual CPU version 2.1.2
x86           phenom  AMD Phenom(tm) 9550 Quad-Core Processor
x86         core2duo  Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz
x86            kvm64  Common KVM processor
x86           qemu32  QEMU Virtual CPU version 2.1.2
x86            kvm32  Common 32-bit KVM processor
x86          coreduo  Genuine Intel(R) CPU T2600 @ 2.16GHz
x86              486
x86          pentium
x86         pentium2
x86         pentium3
x86           athlon  QEMU Virtual CPU version 2.1.2
x86             n270  Intel(R) Atom(TM) CPU N270 @ 1.60GHz
x86           Conroe  Intel Celeron_4x0 (Conroe/Merom Class Core 2)
x86           Penryn  Intel Core 2 Duo P9xxx (Penryn Class Core 2)
x86          Nehalem  Intel Core i7 9xx (Nehalem Class Core i7)
x86         Westmere  Westmere E56xx/L56xx/X56xx (Nehalem-C)
x86      SandyBridge  Intel Xeon E312xx (Sandy Bridge)
x86          Haswell  Intel Core Processor (Haswell)
x86        Broadwell  Intel Core Processor (Broadwell)
x86       Opteron_G1  AMD Opteron 240 (Gen 1 Class Opteron)
x86       Opteron_G2  AMD Opteron 22xx (Gen 2 Class Opteron)
x86       Opteron_G3  AMD Opteron 23xx (Gen 3 Class Opteron)
x86       Opteron_G4  AMD Opteron 62xx class CPU
x86       Opteron_G5  AMD Opteron 63xx class CPU
x86             host  KVM processor with all supported host features (only available in KVM mode)

Recognized CPUID flags:
  pbe ia64 tm ht ss sse2 sse fxsr mmx acpi ds clflush pn pse36 pat cmov mca pge mtrr sep apic cx8 mce pae msr tsc pse de vme fpu
  hypervisor rdrand f16c avx osxsave xsave aes tsc-deadline popcnt movbe x2apic sse4.2|sse4_2 sse4.1|sse4_1 dca pcid pdcm xtpr cx16 fma cid ssse3 tm2 est smx vmx ds_cpl monitor dtes64 pclmulqdq|pclmuldq pni|sse3
  smap adx rdseed rtm invpcid erms bmi2 smep avx2 hle bmi1 fsgsbase
  3dnow 3dnowext lm|i64 rdtscp pdpe1gb fxsr_opt|ffxsr mmxext nx|xd syscall
  perfctr_nb perfctr_core topoext tbm nodeid_msr tce fma4 lwp wdt skinit xop ibs osvw 3dnowprefetch misalignsse sse4a abm cr8legacy extapic svm cmp_legacy lahf_lm
  invtsc
  pmm-en pmm phe-en phe ace2-en ace2 xcrypt-en xcrypt xstore-en xstore
  kvmclock-stable-bit kvm_pv_unhalt kvm_pv_eoi kvm_steal_time kvm_asyncpf kvmclock kvm_mmu kvm_nopiodelay kvmclock
  pfthreshold pause_filter decodeassists flushbyasid vmcb_clean tsc_scale nrip_save svm_lock lbrv npt
}}}

For example, on a cluster using both Sandy Bridge and Haswell CPUs, the following command can be used:

{{{
gnt-cluster modify -H kvm:cpu_type='SandyBridge'
}}}

Here is a typical improvement one can get on the OpenSSL AES benchmarks.

With the default qemu64 CPU type:

{{{
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     175481.21k   195151.55k   199307.09k   201209.51k   201359.36k
aes-128-gcm      49971.64k    57688.17k   135092.14k   144172.37k   146511.19k
aes-256-cbc     130209.34k   141268.76k   142547.54k   144185.00k   144777.22k
aes-256-gcm      39249.19k    44492.61k   114492.76k   123000.83k   125501.44k
}}}

With the SandyBridge CPU type:

{{{
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     376040.16k   477377.32k   484083.37k   391323.31k   389589.67k
aes-128-gcm     215921.26k   592407.87k   777246.21k   836795.39k   835971.75k
aes-256-cbc     309840.39k   328612.18k   330784.68k   324245.16k   328116.91k
aes-256-gcm     160820.14k   424322.20k   557212.50k   599435.61k   610459.65k
}}}

=== Add a virtio-rng device ===

VirtIO RNG (random number generator) is a paravirtualized device that appears to the guest as a regular hardware RNG, from which the kernel reads to fill its entropy pool. Unfortunately Ganeti does not support it natively, therefore the kvm_extra option should be used.
Ganeti forces the allocation of the PCI devices to specific slots, which means it is not possible to use the QEMU auto-allocation and that an explicit PCI slot has to be provided. There are 32 possible slots on the default QEMU machine, so we can use one of the last ones, for example 0x1e.

The final command to add a virtio-rng device cluster-wide is therefore:

{{{
gnt-cluster modify -H kvm:kvm_extra="-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000"
}}}

The max-bytes and period options limit the entropy rate a guest can get to 1 kB/s.
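Inside a guest, the device should then show up as the active hardware RNG. A quick check (a sketch, assuming a Linux guest whose kernel includes the virtio_rng driver, as standard Debian kernels do):

{{{
cat /sys/class/misc/hw_random/rng_current
# expected output: virtio_rng.0
}}}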