MySQL Cluster 7.6 includes new Cluster Configurator

After dealing with Windows performance, I switched to MySQL Cluster Configurator (MCC for short) project. This was quite a change for me having to deal with Python back-end and JavaScript front-end, languages I was not so familiar with.

For the history of the project, please refer to Andrew's blog post for I will concentrate on changes made in new version.

There are many exciting new features in MySQL Cluster 7.6.4DMR including new MCC. To download MySQL Cluster 7.6.4DMR, please go to Development Releases tab. To see what's new in 7.6.4DMR, please follow this link.

MySQL Cluster Configurator in short

With a single command launch the web-based wizard which then steps you through configuring, deploying configuration and starting the cluster; to keep things even simpler, it will automatically detect the resources on your target machines and use these results together with the type of workload you specify in order to determine values for the key configuration parameters.
The software is part of the MySQL Cluster package (NDB). To run on Windows, just double click setup.bat – note that if you installed from the MSI and didn’t change the install directory then this will be located somewhere like C:\Program Files...\MySQL\MySQL Cluster 7.6. On Linux, just run ndb_setup from the bin folder.
On Windows, we provide all necessary libraries and the Python while on Linux, you should have Python 2.7 (latest) installed as well as Paramiko 2 or newer. Paramiko will install Cryptography and other required libraries.
If you launch the installer from a desktop environment shell then the first page of the wizard will automatically open in your web browser, if not then just browse to the URL that is displayed on the command line.
Recap: Start the Python web server from shell using setup.bat (Windows) or bin/ndb_setup (*nix), JavaScript front-end will be opened for you in default browser. If it's not, copy the link from console to your favorite browser. Make sure you have administrator rights on all the hosts you will use in cluster. Also, make sure to have MySQL Cluster installed on all the boxes making up cluster.
There is an extensive help file in /hlp/html directory, please read if you're new to product.

Who is this software for

Typically, MCC would be used by developers working on Cluster software and people looking for quick evaluation of MySQL Cluster on various hardware configurations including cloud instances.

New version highlights

  • Multiple reusable configurations. Take your configuration with you to work on different boxes.
  • Much safer configuration files allowing for saving passwords/passphrases resulting in full restoration of NDB Cluster from configuration file.
  • Many new options to configure both for NDB Cluster processes as well as for mysqld. Multiple choice options have proper drop-down lists associated.
  • Many new members in Cluster and Host objects. GUI changes to present them.
  • Nodes can have external IP address (to gain access from MCC for example) as well as internal IP address (for faster and safer communication inside cloud deployment or other VPN environment).
  • Extended ways of authentication such as using different username, private key on per host basis, using keys with passphrases and so on.
  • Better detection of running cluster processes and option to stop them even if startup fails.

Language choice

JavaScript was obvious choice since it's common to have GUI run in browser. Due to it's limitations (none likes browser sniffing around file system), we have back-end web server in Python handling various requests related to files, remote hosts etc. Front and back end communicate via POST messages so it is really important that you protect this communication. For a start, we provide self signed certificates for securing it. If you are testing things in sand-box, you may switch https for http to speed up things.
Recap: Python back end manipulates encryption/decryption, files, connections to local/remote hosts and so on while JavaScript front-end provides GUI in your browser for convenient presentation of configuration options.


Changes in configuration files and authentication methods:

First change tackled is the configuration. Versions shipped with previous Cluster releases had two major drawbacks; they kept configuration in cookies and could handle just one Cluster configuration which was locked to box where you created it. Keeping Cluster configuration in cookies comes with two major drawbacks; cookies are size-limited and very insecure.
To remedy this situation, we opted for external configuration file saved in current user's HOME directory (this is done through Python web server that's running MCC); one configuration per file. The file itself is AES encrypted using passphrase provided in front-end and passed down to Python web server via POST message. Passphrase is then kept in memory for the duration of session. With provided passphrase and file name, Python server sends decrypted configuration back to front-end. Configuration is kept in global.windowname variable for the duration of the session.

This has solved several issues:
  • Configuration now stores all your passwords, passphrases for keys and such allowing for quick and full recreation of Cluster.
  • Extending configuration size allows for per-host credentials. I.e. each host in Cluster can have it's own way of authentication and set of credentials.
Essentially, when looking for quick and dirty deployment over hosts that use same credentials, it is enough to provide them on Page 2, Cluster Configuration. When working with hosts that have their own separate credentials, you can define them on Host level (Page 3, Add Hosts). We prefer adding hosts on Page 3 via Add Host button to defining them on Page 2 "Host list" box.
Credentials can be:
  • Username/password.
  • Username/Key name.
  • Username/Key name/Passphrase.
  • Key name/Passphrase.
  • Key name.
  • Nothing.
If PATH is not provided along with Key name, ~/.ssh is assumed. If using keys and no key name is provided, ~/.ssh/id_rsa is assumed. So, for authentication using standard private key without passphrase stored in default place, you just need to check "Key based SSH" (Page 2, Cluster level) or "Key-based auth" for particular host (Page 3, Add/Edit Host) checkbox and nothing more. If there are absolutely no credentials at host level, program behaves as in old versions meaning whatever was provided on Cluster level is used.

WARNING: Although asynchronous, call to save configuration changes does take some time to complete. The saving of configuration takes place after pressing "Save&Next" button:

No changes are saved if you close the tab or browser abruptly! After making extensive changes and pressing Save&Next I like to allow for some time for configuration to be saved. If you open debug console, you will see the notification.
If you Cluster is all set up and ready to go and you just want to take one more look at various configuration options, you can use breadcrumb buttons as they do not trigger save method:

Since all credentials are saved, you can skip looking into loaded configuration altogether and go directly to "Deploy configuration" page.
Gotcha: In order to make configurations portable, we had to limit the usage of "localhost". If you include localhost in your Cluster, you will not be able to add any more remote hosts.
Gotcha: List with available configurations is provided to front-end upon welcome page load so if you add more configurations to your HOME using external tools they will not be shown (unless you reload page or restart entire program). However, if you choose "New Configuration" and provide the name of existing one, the existing configuration will be loaded.
Gotcha: "New Configuration" requires you to provide a passphrase and a confirmation. Reading existing configuration just requires passphrase. Please keep your passphrase(s) safe as there is no way to reverse engineer contents of configuration file without it! Each file/configuration can have it's own passphrase.
Recap: There can be any number of portable, encrypted configurations now which you can find in your HOME on box running MCC. Each host can have it's own way of authentication and a set of credentials. We did our best to guess which auth method and credentials to use based on input provided.

Changes in Cluster object (Page 2):

Cluster object has two new attributes:
  • Install MySQL Cluster: Option for installing Cluster on hosts
    NONE: No installation of Cluster will take place (default).
    BOTH: Both DOCKER and installation from REPO will be considered depending on OS and available images. Should both installation mechanisms be available on certain host, we will prefer REPO over DOCKER.
    REPO: For every host in Cluster, the attempt will be made to install Cluster SW from the repository URL.
    DOCKER: For every host in Cluster, the attempt will be made to install Cluster SW DOCKER image.
  • Open FW ports: Check if you need the tool to try to open necessary ports for Cluster processes on every hosts.
Gotcha: In this version "Install MySQL Cluster" is not functional so you need Cluster installed on all hosts beforehand.

Changes in Host object and GUI (Page 3):

Host object has undergone major rework. Host name is now used for external host IP address i.e. IP address at which MCC web server instance can reach that particular host. Host Internal IP refers to that particular host IP address inside VPN. If there is no VPN in play, both IP addresses are the same. We strongly encourage using IP addresses here to skip potential problems with resolving host names.
Authentication, as per above, could be a) using keys or b) ordinary username/password. If you check Key-based auth when Adding/Editing host and provide nothing, ~/.ssh/id_rsa key will be used without passphrase. You can also define alternate user (to one starting MCC) which comes handy when logging in to domains. Each key can have passphrase and you can provide the path to and name of specific key to use for that host.
If you check Configure installation, you will be presented with additional fields relating to repository and Docker image. We do our best to provide you with default values based on OS running on particular host you're Adding/Editing.

We have also added Show/Hide extended host info toggle button switching between single and double line host info representation.

Changes to Define parameters (Page 5):

We are constantly adding more configurable parameters to keep up with MySQL server and Cluster evolution but this is a moving target so we ask for patience if certain parameter you'd like to configure is missing.
In addition to many new configurable parameters, we have extended a GUI so that, for appropriate options, you get drop-down list of allowed values.

Changes to Deploy configuration (Page 6, last):

The Start and Stop Cluster buttons now behave more intelligently in terms of determining if the Cluster or any of its processes is running and enabling/disabling appropriate buttons.
Previously, if Cluster was stuck in any of the startup phases, you had to terminate MCC and kill all the processes on all of the hosts manually. Now, if you think something is wrong, you can close the progress window and it will give control back to MCC enabling Stop Cluster button which you can then press to stop the stray processes properly. It will also provide you with list of log files which you can then check for problems.
Stopping mysqld process(es) might not work if you changed the credentials from command line.

These changes are available in version 7.6.4DMR and up. We encourage you to try MCC and MySQL Cluster in your environment!


Working with more than 64 CPUs in Powershell

Wrote this several months ago but was too busy to publish :-/

As noted in one of the previous blog post, I will use following terminology:
  • "Processor" is a piece of hardware you connect to a socket on the motherboard.
  • "Physical Core" is a physical computing unit built into the "Processor".
  • "Virtual Core" is a virtual computing unit built on top of "Physical Core" (i.e. HT is ON).
  • "CPU" is a computing unit inside the "Processor", either physical or virtual.

After a series of blogs on Windows performance counters and after releasing sysb.ps1 testing/benchmarking framework version 0.9RC (dbt2- I set out to eliminate some unknowns from the testing. First to tackle was Kernel scheduler in an effort to run processes, from inside the Powershell script, on controlled subset of CPUs much like TASKSET does on Linux. Also worth noting is that proximity rocks, on occasion, meaning you can get up to 20% better results when the workload is distributed perfectly. However, this is hard to achieve thus I'm more going after consistency in test environment.
This posed quite a bit of challenges; knowing the details of hardware, NUMA node assignments, finding out and evaluating various ways of controlling the CPU pinning to calculating CPU affinity mask for more than 64 CPUs.
One interesting challenge was to calculate the CPU indexes for MySQL Cluster thread config.
As a first step, I had to find out as much as possible about my hardware.

Know your hardware:

PS> Get-CimInstance Win32_BIOS
SMBIOSBIOSVersion : 11018100   
Manufacturer      : American Megatrends Inc.
Name              : Default System BIOS
SerialNumber      : 1207FMA00C            
Version           : SUN    - 20151001

PS> Get-CimInstance Win32_ComputerSystem | FL *
Status                      : OK
Name                        : HEL01
Roles                       : {LM_Workstation, LM_Server, NT, Server_NT}
AutomaticManagedPagefile    : False
DomainRole                  : 3
HypervisorPresent           : False
Manufacturer                : Oracle Corporation 
Model                       : Sun Fire X4800
NetworkServerModeEnabled    : True
NumberOfLogicalProcessors   : 96
NumberOfProcessors          : 8
PartOfDomain                : True
SystemType                  : x64-based PC
TotalPhysicalMemory         : 549746266112

PS> Get-CimInstance Win32_ComputerSystemProcessor | FL *
GroupComponent        : Win32_ComputerSystem (Name = "HEL01")
PartComponent         : Win32_Processor (DeviceID = "CPU0")
CimClass              : root/cimv2:Win32_ComputerSystemProcessor
CimInstanceProperties : {GroupComponent, PartComponent}
PartComponent         : Win32_Processor (DeviceID = "CPU1")
PartComponent         : Win32_Processor (DeviceID = "CPU2")
PartComponent         : Win32_Processor (DeviceID = "CPU3")
PartComponent         : Win32_Processor (DeviceID = "CPU4")
PartComponent         : Win32_Processor (DeviceID = "CPU5")
PartComponent         : Win32_Processor (DeviceID = "CPU6")
PartComponent         : Win32_Processor (DeviceID = "CPU7")

PS> Get-CimInstance Win32_PerfFormattedData_PerfOS_NUMANodeMemory
Name                      : 0
AvailableMBytes           : 64530
FreeAndZeroPageListMBytes : 63989
StandbyListMBytes         : 541
TotalMBytes               : 65526
Name                      : 7
AvailableMBytes           : 64600
FreeAndZeroPageListMBytes : 64387
StandbyListMBytes         : 213
TotalMBytes               : 65536

PS> Get-CimInstance Win32_SystemSlot
SlotDesignation : EM00 PCIExp
Tag             : System Slot 0
SupportsHotPlug : True
Status          : OK
Shared          : True
PMESignal       : True
MaxDataWidth    : 8
SlotDesignation : EM01 PCIExp
Tag             : System Slot 1
SlotDesignation : EM30 PCIExp
Tag             : System Slot 2
SlotDesignation : EM31 PCIExp
Tag             : System Slot 3
SlotDesignation : EM10 PCIExp
Tag             : System Slot 4
SlotDesignation : EM11 PCIExp
Tag             : System Slot 5
SlotDesignation : EM20 PCIExp
Tag             : System Slot 6
SlotDesignation : EM21 PCIExp
Tag             : System Slot 7

PS> Get-CimInstance Win32_PerfFormattedData_Counters_ProcessorInformation
Name                        : 0,0
PercentofMaximumFrequency   : 100
PercentPerformanceLimit     : 100
PercentProcessorPerformance : 69
ProcessorFrequency          : 2001
Name                        : 0,11
Name                        : 7,0
PercentofMaximumFrequency   : 100
PercentPerformanceLimit     : 100
PercentProcessorPerformance : 72
ProcessorFrequency          : 2001
Name                        : 7,11
Or, in short, my test box has 2 Processor groups with 48 CPUs each. This makes for Max. CPU affinity mask of 281474976710655d (or 111111111111111111111111111111111111111111111111b). The total number of CPUs is 96, total number of sockets and NUMA nodes is 8.

Note: Notice there are exactly 48 "1" in Max CPU Affinity mask which is the number of CPUs in each Processor group. This implies you can only set process affinity mask on per Processor group basis, not machine-wide! This limitation is caused by CPUs affinity mask being 64 bits long.
Groups, NUMA nodes etc. assignments are not chiseled in stone. Please see MSDN for details on how to manipulate these settings.

Once done playing with WMI, you can turn to coreinfo from Sysinternals suite as it's extremely informative:
Intel(R) Xeon(R) CPU           E7540  @ 2.00GHz
Intel64 Family 6 Model 46 Stepping 6, GenuineIntel
Microcode signature: 00000009
HTT        * Hyperthreading enabled
HYPERVISOR - Hypervisor is present
VMX        * Supports Intel hardware-assisted virtualization
SVM        - Supports AMD hardware-assisted virtualization
X64        * Supports 64-bit mode

SMX        - Supports Intel trusted execution
SKINIT     - Supports AMD SKINIT
Important to notice is that, in my configuration, Sockets map to NUMA nodes 1-1:
Logical Processor to Socket Map:                  Logical Processor to NUMA Node Map:
Socket 0:                                         NUMA Node 0:
************------------------------------------  ************------------------------------------
------------------------------------------------  ------------------------------------------------  
Socket 1:                                         NUMA Node 1:
------------------------------------------------  ------------------------------------------------
************------------------------------------  ************------------------------------------
Socket 2:                                         NUMA Node 2:
------------************------------------------  ------------************------------------------
------------------------------------------------  ------------------------------------------------
Socket 3:                                         NUMA Node 3:
------------------------------------------------  ------------------------------------------------
------------************------------------------  ------------************------------------------
Socket 4:                                         NUMA Node 4:
------------------------************------------  ------------------------************------------
------------------------------------------------  ------------------------------------------------
Socket 5:                                         NUMA Node 5:
------------------------------------------------  ------------------------------------------------
------------------------************------------  ------------------------************------------
Socket 6:                                         NUMA Node 6:
------------------------------------************  ------------------------------------************
------------------------------------------------  ------------------------------------------------
Socket 7:                                         NUMA Node 7:
------------------------------------------------  ------------------------------------------------
------------------------------------************  ------------------------------------************
so I can use Processor/Socket/NUMA node as though they are synonyms. Also, notice that NUMA node/Socket 0 and even ones are in Processor group 0 while odd sockets are in Processor group 1. Here is how CPU utilization looks like in Task manager/Performance tab when just ProcessorGroup 0 is used:

Logical Processor to Group Map:
Group 0:                                          Group 1:
************************************************  ------------------------------------------------
------------------------------------------------  ************************************************
Note: Coreinfo provides NUMA nodes latency too:
Approximate Cross-NUMA Node Access Cost (relative to fastest):
     00  01  02  03  04  05  06  07
00: 1.4 1.7 2.1 1.7 1.7 2.1 2.2 2.1
01: 1.7 1.4 1.7 2.1 2.1 1.7 2.0 1.3
02: 2.1 1.7 1.4 1.7 2.1 2.1 1.6 1.2
03: 1.8 2.1 1.7 1.4 2.1 2.1 2.0 1.1
04: 1.7 2.1 2.1 2.1 1.4 1.7 1.7 1.4
05: 2.1 1.7 2.1 2.1 1.7 1.4 2.0 1.0
06: 2.1 2.1 1.7 2.1 1.7 2.1 1.4 1.3
07: 2.1 2.1 2.1 1.7 2.1 1.7 1.6 1.0

The software:

Primary tool used is sysb.ps1 Powershell script version 1.0 (not available for download atm). Version 0.9x RC is available for download and placed in dbt2-\dbt2-\dbt2-\windows_scripts\sysb-script\ directory.

OS details:
PS:518 [HEL01]> Get-CimInstance Win32_OperatingSystem | FL *
Status                                    : OK
Name                                      : Microsoft Windows Server 2012 R2 Standard
FreePhysicalMemory                        : 528660256
FreeSpaceInPagingFiles                    : 8388608
FreeVirtualMemory                         : 537242324
Distributed                               : False
MaxNumberOfProcesses                      : 4294967295
MaxProcessMemorySize                      : 137438953344
OSType                                    : 18
SizeStoredInPagingFiles                   : 8388608
TotalSwapSpaceSize                        : 
TotalVirtualMemorySize                    : 545250196
TotalVisibleMemorySize                    : 536861588
Version                                   : 6.3.9600
BootDevice                                : \Device\HarddiskVolume1
BuildNumber                               : 9600
BuildType                                 : Multiprocessor Free
CodeSet                                   : 1252
DataExecutionPrevention_32BitApplications : True
DataExecutionPrevention_Available         : True
DataExecutionPrevention_Drivers           : True
DataExecutionPrevention_SupportPolicy     : 3
Debug                                     : False
ForegroundApplicationBoost                : 2
LargeSystemCache                          : 
Manufacturer                              : Microsoft Corporation
OperatingSystemSKU                        : 7
OSArchitecture                            : 64-bit
PAEEnabled                                : 
ServicePackMajorVersion                   : 0
ServicePackMinorVersion                   : 0

So how do the Windows work?

Process is just a container for threads doing the work providing you with fancy name, PID etc. This effectively means you can not calculate "System load" like on Linux. This also explains why there is no ProcessorGroup member attached to Process class while there is one for Threads. This also makes all sorts of problems regarding CPU utilization as described in previous blogs here and here.
Processor group is a collection of up to 64 CPUs as explained here and here.
Thread is a basic unit of execution. Setting the Thread affinity will influence the Process class and dictate what you can do with it. There is a great paper on this you can download from MSDN to figure it out. The focus of this blog is on scripting.

Know the OS pitfalls:

The setup: I have a script acting as testing/benchmarking framework. Script controls the way processes are launched, collects data from running processes and generally helps me do part of my job of identifying performance issues and testing solutions.
The problem: Windows is thread based OS and I can not control the threads in binary from within the script.
Next, .NET System.Diagnostics.Process class does not expose Processor group bit. This means there is no way to control Processor group and thus no way to guarantee the kernel scheduler will start all of your processes inside the Processor group you want :-/ I consider this a bug and not deficiency in Windows because of the following scenario:
   "ProcessA" is pinned, by scheduler, to Processor group 0 with ability to run on all CPUs within that group.
   "ProcessB" is pinned, by scheduler, to Processor group 1 with ability to run on all CPUs within that group.
   ProcessorAffinity member of System.Diagnostics.Process class is the same in both cases!
  $procA = Get-Process -Name ProcessA
  281474976710655 #For my 48 CPUs in each Processor group.

  $procB = Get-Process -Name ProcessB
  281474976710655 #For my 48 CPUs in each Processor group.
This leads you to believe that both processes run in the same Processor group, which might not be true as the information is ambiguous. I have set up mysqld to run on 1st NUMA node and part of second (12 + 8 CPUs). At the same time, Sysbench is pinned to NUMA node 0, last 4 CPUs. When scheduler decides to run mysqld on Processor group 1, the CPU load distribution is like this:
NUMA #0, last 4 CPUs lit up by Sysbench. NUMA #1 and part of 3, lit up by mysqld.

Using the same(!) Process.ProcessorAffinity for mysqld for subsequent run but this time the scheduler decides it will run mysqld on Processor group 0:
NUMA #0, last 4 CPUs lit up by Sysbench and mysqld.
NUMA #2 in part lit up by mysqld.

It is obvious how later case will most likely produce much lower results since mysqld is competing with Sysbench (on last 4 CPUs of the NUMA node 0) and Windows (first 2 CPUs of NUMA node 0). This is indicative of 2 things:
  a) Microsoft rushed solution for big boxes (> 64 CPUs) and it is not mature nor will it scale.
  b) You can not trust Kernel scheduler to do the right thing on its own as it has no clue as to what will be your next move.
I might add here that even the display in Task manager lacks the ability to display CPU load per ProcessorGroup...

Before you send me to RTFM and do this the "proper" way, please notice that the CPU usage pattern for NUMA nodes 5 and 7 is the same in both runs. This is because our Cluster knows how to pin threads to CPUs "properly". Alas, I do not think this is possible from the Powershell.
Also notice the lack of ProcessorGroup member in System.Diagnostic.Process class. I expected at least ProcessorGroup with getter function (if not complete getter/setter) so I can break the run if scheduler makes the choice I'm not happy with.
The last problem to mention is late binding of Affinity mask :-/. The code might look like this:

    $sb_psi = New-object System.Diagnostics.ProcessStartInfo 
    $sb_psi.CreateNoWindow = $true 
    $sb_psi.UseShellExecute = $false 
    $sb_psi.RedirectStandardOutput = $true
    $sb_psi.RedirectStandardError = $true
    $sb_psi.FileName = "$PathToSB" + '\sysbench.exe '
    $sb_psi.Arguments = @("$sbArgList") 

    $sb_process = $null
    $sb_process = New-Object System.Diagnostics.Process 
    $sb_process.StartInfo = $sb_psi
    [void]$sb_process.Start() <<<<
    #Now you can set the Affinity mask:
    $sb_process.ProcessorAffinity = $SCRIPT:SP_BENCHMARK_CPU
IMO, process.ProcessorAffinity should go to System.Diagnostics.ProcessStartInfo.
I can't help but to wonder what will happen if Intel decides to release single processor with 64+ CPUs?

What are our options in Powershell then?

Essentially, you can use 3 techniques to start the process in Powershell and bind it to CPUs but you have to bear in mind that this is not what Microsoft expects you to do so each approach has its pro's and con's:
1) Using START in cmd.exe (start /HIGH /NODE 2 /AFFINITY 0x4096 /B /WAIT E:\test\...\sysbench.exe --test=oltp...)
  BENCHMARK_CPU="111100000000" # Xeon E7540 has 12 CPUs per socket so I'm running on LAST 4 (9,10,11 and 12).
These options allow user to run Sysbench on certain NUMA node as well as certain CPUs within that NUMA node.

  SERVER_CPU="111111111" #(Or, 000111111111) Running on first 9 CPUs.
 It is not necessary to set CPUs to run on if you're running on entire dedicated NUMA node.

Pros: Works.
Cons: The process you're starting is not the expected one (say, benchmark) but rather cmd.exe START.
      Not really "Powershell-way".
      Process is bound to just one NUMA node which is fine if it's not hungry for more CPU power.

2) Using .NET System.Diagnostics.Process (PS, C#):
 $process = Start-Process E:\test\mysql-cluster-7.5.0-winx64\bin\mysqld.exe -ArgumentList "--standalone --console
 --initialize-insecure" -WindowStyle Hidden -PassThru -Wait -RedirectStandardOutput e:\test\stdout.txt
 -RedirectStandardError e:\test\stderr.txt
 $process.ProcessorAffinity = 70368739983360

 Affinity mask means mysqld runs on NUMA node 7, 5 and part of 3 (0-based index)
 IF ProcessorGroup is set to 1 by Kernel scheduler:
 001111111111111111111111110000000000000000000000 = 70368739983360
 |___________________48 CPUs____________________|
   NUMA #7      NUMA #5    NUMA #3      NUMA #1


  #Run on NUMA node 7, last 2 CPUs, 110000000000000000000000000000000000000000000000b

Pros: Real "Powershell-way" of doing things.
      Process can span over more than 1 NUMA node.
      Good control of the process (HasExited, ExitTime, Kill, ID (PID) ...).
Cons: Late binding; i.e. process has to be up and running for you to pin it to CPUs. This presents a problem with processes
      that start running immediately.
      No way to control Processor group meaning there is no way to guarantee the kernel scheduler will start all of your
      processes inside the desired Processor group.
Note: Using -PassThru ensures you will get Process object. Otherwise, Start-Process cmdlet has no output. Also, you can start the process and then use Get-Process -Name... to accomplish the same.

Not available in Powershell AFAIK but important to understand if using MySQL Cluster:
3) Hook the threads to CPUs. Since this is not available from the "outside", I will use the Cluster code to do the work for me:

#NUMA node to run sysbench on.
#Zero based index.
#CPUs inside selected NUMA node to run sysbench on.
 |_12 CPUs__|
   NUMA #0
CPU0   CPU11

 Affinity mask means mysqld runs on NUMA node 7, 5 and part of 3 (0-based index) IF ProcessorGroup is 1:
 001111111111111111111111110000000000000000000000 = 70368739983360d
 |___________________48 CPUs____________________|
   NUMA #7      NUMA #5    NUMA #3      NUMA #1

Test image shows
 000000000000000000000000000011111111111111111111 = 1048575d
 |___________________48 CPUs____________________|
   NUMA #7      NUMA #5    NUMA #3      NUMA #1
Sysbench is running on NUMA #0, last 4 CPUs.
MySQLd is running on NUMA #1 and last 8 CPUs of NUMA #3.
LDM threads are running on first 4 CPUS node #5 together with 2 TC, SEND and RCV threads.
LDM threads are running on first 6 CPUS node #7 together with 2 TC and 1 MAIN and REPL
threads with CPUs 107 and 110(Last one) not being used.

Calculating ProcessorAffinity mask for process is different depending on the function accepting the input.
1) For cmd.exe START, the actual number passed is in HEX notation. The binary mask is composed so that the highest index CPU comes first:
 |_12 CPUs__|
   NUMA #0
CPU0     CPU11
It is more convenient to provide the mask in binary so I convert setting to Hex value inside the script.
The NUMA node to run on is specified as decimal integer.
If you have provided the NUMA node # for the process to run on, not specifying ProcessorAffinity mask means "run on all CPUs within specified node".
If you provide the wrong mask, process will fail to start. For example, I have 12 CPUs per NUMA node (socket) so providing the mask like "11111111111000" will fail.
The approach works only on one NUMA node.

2) Start Process expects decimal integer for mask. The rightmost "1" indicates usage of CPU #0 within Processor group assigned by Kernel scheduler in Round-Robin manner.
 000000000000000000000000000011111111111111111111 = 1048575d
 |___________________48 CPUs____________________|
   NUMA #7      NUMA #5    NUMA #3      NUMA #1
or, should the scheduler pick Processor group 0:
   NUMA #6      NUMA #4    NUMA #2      NUMA #0
Start process takes (and returns) decimal value for ProcessorAffinity.
It uses late binding so Process has to be up and running before assigning Affinity mask to it.
You have no control over ProcessorGroup meaning Kernel scheduler is free to pick any NUMA node in Round-Robin fashion.

3) Doing things "properly" (binding threads to CPUs). Or, how to calculate ThreadConfig for MySQL Cluster:
ThreadConfig=ldm={count=10,cpubind=88-91,100-105},tc={count=4,cpubind=94-95,106-107},send={count=2,cpubind=92-93},recv={count=2,cpubind=98,99},main={count=1,cpubind=109},rep={count=1,cpubind=109} shows CPU indexes above total number of CPUs available on my test system (2x48=96). This has to do with the maximum capacity of Processor group which is 64. The designer of this functionality treats each Processor group found on system as full meaning it occupies 64 places for CPU index. This makes sense if you are going from the box with 48 CPUs in group (like mine) to a box with 64 CPUs in group as your ThreadConfig line will continue to work as expected. However, it requires some math to come to CPU indexes:

Processor group 0                                              |Processor group 1
CPU#0                                    CPU#47          CPU#63CPU#64                                    CPU#110       CPU#127
|                  AVAILABLE                  |     RESERV    ||                   AVAILABLE                 |     RESERV    |
Now my ThreadConfig line makes sense:
LDM threads are running on first 4 CPUS node #5 (88-91) together with 2 TC (94,95), SEND (92,93) and RCV (98,99) threads.
LDM threads are running on first 6 CPUS node #7 (100-105) together with 2 TC (106,107) and 1 MAIN and REPL (109)
threads with CPUs 107 and 110(Last one) not being used.


o Windows use notion of Processor group. Machines with less than 64 CPUs have 1 Processor group thus your application runs exactly as before.
  o Bug 1: Affinity mask is only 64-bit wide so there is no way to have continuous index of CPUs inside the big box such as mine.
o .NET System.Diagnostics.Process has no get/set of Processor group. At least a getter function was expected and a member of System.Diagnostics.Process disclosing this information.
  o Bug 2: Information on CPU Affinity mask obtained from .NET System.Diagnostics.Process is ambiguous.
o 1 + 2, bug 3: There is no way I found to script pinning to individual CPUs that is complete.
  o Feature request 1: .NET System.Diagnostics.Process allows only for late binding of Affinity mask. Move Affinity to .NET System.Diagnostics.ProcessStartInfo.
o Feature request 2, consolidate: The various approaches taken by Microsoft seem uncoordinated and incomplete. Even using START command requires decimal number for NUMA node index and hexadecimal number for Affinity mask. cmd.exe START and creation of thread objects allow for early binding of CPU Affinity mask while .NET System.Diagnostics.Process allows only late binding. And so on.
o Feature request 3, give us TASKSET complement: Given all of the above, it is impossible to script the replacement for Linux TASKSET.
o What will happen once single processors with more than 64 CPUs are available?
o Mysql Cluster counts CPUs as if every existing Processor group is complete (has 64 CPUs).


Linux top command on Windows, further investigations

In my previous post, I spoke of "Normalized" and "Non-Normalized" CPU utilization values:


Foreword: Windows is not "process" based OS (like Linux) but rather "thread" based so all of the numbers relating to CPU usage are approximations. I did made a "proper" CPU per Process looping and summing up Threads counter (https://msdn.microsoft.com/en-us/library/aa394279%28v=vs.85%29.aspx) based on PID but that proved too slow given I have ~1 sec to deal with everything. CPU utilization using RAW counters with 1s delay between samples proved to produce a bit more reliable result than just reading Formatted counters but, again, too slow for my 1s ticks (collect sample, wait 1s, collect sample, do the math takes longer than 1s). Thus I use PerfFormatted counters in version 0.9RC.
    Win32_PerfRawData_PerfProc_Process; Win32_PerfFormattedData_PerfProc_Process.
    _PID_     Unique identified of the process.
    PPID      Unique identifier of the process that started this one.
    PrioB     Base priority.
    Name      Name of the process.
    CPUpt_(N) % of CPU used by process. On machines with multiple CPUs,
        this number can be over 100% unless you see _CPUpt_N caption which
        means "Normalized" (i.e. CPUutilization / # of CPUs).
        Toggle Normal/Normalized display by pressing the "p" key.
    Thds      # of threads spawned by the process.
    Hndl      # of handles opened by the process.
    WS(MB)    Total RAM used by the process. Working Set is, basically,
        the set of memory pages touched recently by the threads belonging to
        the process. 
    VM(MB)    Size of the virtual address space in use by the process.
    PM(MB)    The current amount of VM that this process has reserved
        for use in the paging files.
However, my approach for displaying "Non-Normalized" CPU utilization didn't work :-/

Proper functionality of this feature is rather important for my job. Looking at "Normalized" CPU utilization values for a process does not tell you much. Say a process has CPU utilization of 100%. This just tells you there is at least 1 CPU that's fully utilized by the process but it does not tell you the overall utilization. "Non-Normalized" value sums CPU utilization over all CPUs that process uses. In my case, the test box has 8 Xeon processors with 6 physical and 6 virtual cores each totaling at 96 CPUs. The system is configured as such that NUMA node corresponds to 1 Xeon processor (socket). Thus, when my process utilizes entire NUMA node (socket) to the fullest, the CPU utilization for that process should be number of CPUs per Numa/Socket (12) x 100% which is 1200%:

If the process scales correctly, I will see more NUMA nodes/Sockets light up while increasing the load:

However, this does not tell me it is my process of interest that is using the CPUs. To confirm it, I need TOP script showing CPU utilization of above 1200%:

This guarantees me mysqld process is running on more than 2 sockets (sysbench is taking up ~7 CPUs and I bet mydesktopservice is the one lighting up 3rd CPU in 2nd row).

How to make it work:

Heavy rework of #region Tasks job which is starting the "Processes" monitoring job was in order. First, I had to remove all of the below code:
        if ($CPUDSw) {
            Get-CimInstance Win32_PerfFormattedData_PerfProc_Process | 
                select @{Name='_PID_'; Expression={$_.IDProcess}},
                @{Name='PPID'; Expression={$_.CreatingProcessID}},
                @{Name='PrioB'; Expression={$_.PriorityBase}},
                @{Name='Name                  '; Expression={(($_.Name).PadRight(22)).substring
                    (0, [System.Math]::Min(22, ($_.Name).Length))}}, 
                @{Name='_CPUpt__'; Expression={($_.PercentProcessorTime).ToString("0.00").PadLeft(8)}},
                @{Name='Thds'; Expression={$_.ThreadCount}},
                @{Name='Hndl'; Expression={$_.HandleCount}},
                @{Name='WS(MB)'; Expression={[math]::Truncate($_.WorkingSet/1MB)}},
                @{Name='VM(MB)'; Expression = {[math]::Truncate($_.VirtualBytes/1MB)}},
                @{Name='PM(MB)'; Expression={[math]::Truncate($_.PageFileBytes/1MB)}} |
                where { $_._PID_ -gt 0} | &$sb | 
                Select-Object -First $procToDisp | FT * -Auto 1> $pth
        } else {
            Get-CimInstance Win32_PerfFormattedData_PerfProc_Process | 
                select @{Name='_PID_'; Expression={$_.IDProcess}},
                @{Name='PPID'; Expression={$_.CreatingProcessID}},
                @{Name='PrioB'; Expression={$_.PriorityBase}},
                @{Name='Name                  '; Expression={(($_.Name).PadRight(22)).substring
                    (0, [System.Math]::Min(22, ($_.Name).Length))}}, 
                @{Name='_CPUpt_N'; Expression={"{0,8:N2}" -f ($_.PercentProcessorTime / $TotProc)}},
                @{Name='Thds'; Expression={$_.ThreadCount}},
                @{Name='Hndl'; Expression={$_.HandleCount}},
                @{Name='WS(MB)'; Expression={[math]::Truncate($_.WorkingSet/1MB)}},
                @{Name='VM(MB)'; Expression = {[math]::Truncate($_.VirtualBytes/1MB)}},
                @{Name='PM(MB)'; Expression={[math]::Truncate($_.PageFileBytes/1MB)}} |
                where { $_._PID_ -gt 0} | &$sb | 
                Select-Object -First $procToDisp | FT * -Auto 1> $pth
and replace it with Get-Counter version:
        $processes = Get-CimInstance Win32_PerfFormattedData_PerfProc_Process | 
            Select @{Name='_PID_'; Expression={$_.IDProcess}},
            @{Name='PPID'; Expression={$_.CreatingProcessID}},
            @{Name='PrioB'; Expression={$_.PriorityBase}},
            @{Name='Name'; Expression={($_.Name).ToLower()}},
            @{Name='Thds'; Expression={$_.ThreadCount}},
            @{Name='Hndl'; Expression={$_.HandleCount}}, 
            @{Name='WS(MB)'; Expression={[math]::Truncate($_.WorkingSet/1MB)}},
            @{Name='VM(MB)'; Expression = {[math]::Truncate($_.VirtualBytes/1MB)}},
            @{Name='PM(MB)'; Expression={[math]::Truncate($_.PageFileBytes/1MB)}},
            PoolNonpagedBytes, PoolPagedBytes, PercentProcessorTime |
            Where { $_._PID_ -gt 0}

        $Samples = (Get-Counter “\Process(*)\% Processor Time”).CounterSamples

Just noting Get-Counter example:
PS:511 [HEL01]> (Get-Counter “\Process(*)\% Processor Time”).CounterSamples | FL *
Path             : \\hel01\process(system)\% processor time
InstanceName     : system
CookedValue      : 0
RawValue         : 3434062500
SecondValue      : 131012088253272040
MultipleCount    : 1
CounterType      : Timer100Ns
Timestamp        : 29.02.16 09:40:25
Timestamp100NSec : 131012124253270000
Status           : 0
DefaultScale     : 0
TimeBase         : 10000000
Then I had to change the way of putting it all together:
        if ($CPUDSw) { 
            $pcts = $Samples | Select @{Name=”IName"; Expression={($_.InstanceName).ToLower()}}, 
              @{Name=”CPUU”;Expression={[Decimal]::Round(($_.CookedValue), 2)}}
            $processes | select '_PID_', 'PPID', 'PrioB',
                            @{Name='Name                  '; Expression=
                                    (($_.Name).PadRight(22)).substring(0, [System.Math]::Min(22, ($_.Name).Length))
                            @{Name='_CPUpt__'; Expression=
                                    if ($pcts.IName.IndexOf($_.Name) -ge 0) {
                                        ($pcts.CPUU[[array]::IndexOf($pcts.IName, $_.Name)]).ToString("0.00").PadLeft(8)
            'Thds', 'Hndl', 'WS(MB)', 'VM(MB)', 'PM(MB)' | &$sb | Select-Object -First $procToDisp | FT * -Auto 1> $pth
        } else {
            $pcts = $Samples | Select @{Name=”IName"; Expression={($_.InstanceName).ToLower()}}, 
              @{Name=”CPUU”;Expression={[Decimal]::Round(($_.CookedValue / $TotProc), 2)}}
            $processes | select '_PID_', 'PPID', 'PrioB',
                            @{Name='Name                  '; Expression=
                                    (($_.Name).PadRight(22)).substring(0, [System.Math]::Min(22, ($_.Name).Length))
                            @{Name='_CPUpt_N'; Expression=
                                    if ($pcts.IName.IndexOf($_.Name) -ge 0) {
                                        ($pcts.CPUU[[array]::IndexOf($pcts.IName, $_.Name)]).ToString("0.00").PadLeft(8)
                                    } else {
                                        #Not found (yet). Take what you have :-/
            'Thds', 'Hndl', 'WS(MB)', 'VM(MB)', 'PM(MB)' | &$sb | Select-Object -First $procToDisp | FT * -Auto 1> $pth
Since Get-Counter, by default, takes samples 1 second apart:
PS:507 [HEL01]> Measure-Command{(Get-Counter “\Process(*)\% Processor Time”).CounterSamples}

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 1
Milliseconds      : 18
Ticks             : 10189709
TotalDays         : 1.17936446759259E-05
TotalHours        : 0.000283047472222222
TotalMinutes      : 0.0169828483333333
TotalSeconds      : 1.0189709
TotalMilliseconds : 1018.9709
I also abandoned all of the code relating to Timer:
    #$sw = New-Object Diagnostics.Stopwatch
    do {
        #if ($sw.ElapsedMilliseconds -lt 1000) {
        #    Start-Sleep -Milliseconds (1000-$sw.ElapsedMilliseconds)

    } while ($true)
    #$sw = $null

So now it works! I do not know right now when I will be able to release the new version so stay tuned.

Final thoughts:

I have hit many many problems in Windows during this testing. Just note, for example, the use of ToLower() in ($_.InstanceName).ToLower() but this is something for the new blog post. This one is about TOP script.