17 Matching Annotations
  1. Sep 2017
    1. Singularity containers can be used to package entire scientific workflows, software and libraries, and even data.

      Very interesting: Singularity basically allows containers to run in HPC environments, so that code running in the container can take advantage of HPC capabilities like massive scale and message passing, while the software stack inside the container stays isolated and portable.

  2. Aug 2017
    1. If zone reclaim is switched on, the kernel still attempts to keep the reclaim pass as lightweight as possible. By default, reclaim will be restricted to unmapped page-cache pages. The frequency of reclaim passes can be further reduced by setting /proc/sys/vm/min_unmapped_ratio to the percentage of memory that must contain unmapped pages for the system to run a reclaim pass. The default is 1 percent.

      This is a percentage of the total pages in each zone. Zone reclaim will only occur if more than this percentage of pages are in a state that zone_reclaim_mode allows to be reclaimed.

      If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared against all file-backed unmapped pages including swapcache pages and tmpfs files. Otherwise, only unmapped pages backed by normal files but not tmpfs files and similar are considered.

      Source
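
      To make the percentage concrete, a minimal C sketch (the 4 GiB zone size is just an assumed example) that reads the sysctl and computes the corresponding per-zone page threshold:

      ```c
      /* Minimal sketch: read vm.min_unmapped_ratio and show roughly what it
       * means for one zone. The 4 GiB zone size is an assumption; the real
       * per-zone accounting lives in the kernel's zone reclaim path. */
      #include <stdio.h>
      #include <unistd.h>

      int main(void)
      {
          FILE *f = fopen("/proc/sys/vm/min_unmapped_ratio", "r");
          if (!f) { perror("min_unmapped_ratio"); return 1; }

          int ratio = 0;
          if (fscanf(f, "%d", &ratio) != 1) { fclose(f); return 1; }
          fclose(f);

          long page_size = sysconf(_SC_PAGESIZE);            /* usually 4096 */
          long long zone_pages = (4LL << 30) / page_size;    /* assumed 4 GiB zone */
          long long threshold  = zone_pages * ratio / 100;

          printf("vm.min_unmapped_ratio = %d%%\n", ratio);
          printf("a 4 GiB zone needs > %lld unmapped file pages before a reclaim pass\n",
                 threshold);
          return 0;
      }
      ```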

    2. There is a knob in the kernel that determines how the situation is to be treated in /proc/sys/vm/zone_reclaim. A value of 0 means that no local reclaim should take place. A value of 1 tells the kernel that a reclaim pass should be run in order to avoid allocations from the other node. On boot-up a mode is chosen based on the largest NUMA distance in the system.

      This appears to be /proc/sys/vm/zone_reclaim_mode now.
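
      A minimal sketch to read and decode the current setting (bit meanings per the kernel's Documentation/sysctl/vm.txt):

      ```c
      /* Minimal sketch: read /proc/sys/vm/zone_reclaim_mode and decode its bits.
       * Bit meanings (from Documentation/sysctl/vm.txt):
       *   1 = zone reclaim on, 2 = may write out dirty pages, 4 = may swap pages. */
      #include <stdio.h>

      int main(void)
      {
          FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");
          if (!f) { perror("zone_reclaim_mode"); return 1; }

          int mode = 0;
          if (fscanf(f, "%d", &mode) != 1) { fclose(f); return 1; }
          fclose(f);

          printf("zone_reclaim_mode = %d\n", mode);
          printf("  zone reclaim enabled:      %s\n", (mode & 1) ? "yes" : "no");
          printf("  may write out dirty pages: %s\n", (mode & 2) ? "yes" : "no");
          printf("  may swap (unmap) pages:    %s\n", (mode & 4) ? "yes" : "no");
          return 0;
      }
      ```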

    3. There has been some recent work in making the scheduler NUMA-aware to ensure that the pages of a process can be moved back to the local node, but that work is available only in Linux 3.8 and later, and is not considered mature.

      Stampede2 KNL nodes are already running kernel 3.10, so this is likely available.

    4. The active memory allocation policies for all memory segments of a process (and information that shows how much memory was actually allocated from which node) can be seen by determining the process id and then looking at the contents of /proc/<pid>/numa_maps.
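
      For a quick look, a minimal sketch that just dumps the calling process's own map from /proc/self/numa_maps (same format as /proc/<pid>/numa_maps):

      ```c
      /* Minimal sketch: print this process's own NUMA policy per mapping. */
      #include <stdio.h>

      int main(void)
      {
          FILE *f = fopen("/proc/self/numa_maps", "r");
          if (!f) { perror("numa_maps"); return 1; }

          char line[1024];
          while (fgets(line, sizeof line, f))
              fputs(line, stdout);   /* e.g. "7f2a... default anon=12 N0=12 ..." */

          fclose(f);
          return 0;
      }
      ```
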
    5. How memory is allocated under NUMA is determined by a memory policy. Policies can be specified for memory ranges in a process's address space, or for a process or the system as a whole. Policies for a process override the system policy, and policies for a specific memory range override a process's policy.
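
      A minimal sketch of that hierarchy using libnuma's numaif.h (build with -lnuma; the node numbers 0 and 1 are assumptions for a two-node box): set a process-wide policy with set_mempolicy(), then override it for one mmap'd range with mbind().

      ```c
      /* Minimal sketch: process-wide policy, overridden for one memory range. */
      #include <numaif.h>     /* set_mempolicy, mbind, MPOL_* (link with -lnuma) */
      #include <sys/mman.h>
      #include <stdio.h>

      int main(void)
      {
          unsigned long node0 = 1UL << 0, node1 = 1UL << 1;  /* assumed node ids */
          size_t len = 64UL << 20;                           /* 64 MiB range */

          /* Process policy: prefer allocations from node 0. */
          if (set_mempolicy(MPOL_PREFERRED, &node0, sizeof(node0) * 8) != 0)
              perror("set_mempolicy");

          void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (buf == MAP_FAILED) { perror("mmap"); return 1; }

          /* Range policy overrides the process policy: bind this range to node 1. */
          if (mbind(buf, len, MPOL_BIND, &node1, sizeof(node1) * 8, 0) != 0)
              perror("mbind");

          return 0;
      }
      ```
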
    6. The main performance issues typically involve large structures that are accessed frequently by the threads of the application from all memory nodes and that often contain information that needs to be shared among all threads. These are best placed using interleaving so that the objects are distributed over all available nodes.
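
      With libnuma this is essentially one call; a hedged sketch (the 1 GiB size is just a placeholder, build with -lnuma):

      ```c
      /* Minimal sketch: place a large, frequently shared structure with
       * interleaved pages so accesses are spread across all NUMA nodes. */
      #include <numa.h>
      #include <stdio.h>

      int main(void)
      {
          if (numa_available() < 0) {
              fprintf(stderr, "NUMA is not available on this system\n");
              return 1;
          }

          size_t len = 1UL << 30;                       /* assumed 1 GiB shared table */
          double *table = numa_alloc_interleaved(len);  /* pages striped over all nodes */
          if (!table) {
              fprintf(stderr, "numa_alloc_interleaved failed\n");
              return 1;
          }

          /* ... threads on every node would read/write the shared table here ... */

          numa_free(table, len);
          return 0;
      }
      ```
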
    7. In general, small Unix tools and small applications work very well with this approach. Large applications that make use of a significant percentage of total system memory and of a majority of the processors on the system will often benefit from explicit tuning or software modifications that take advantage of NUMA.
    8. Modern processors have multiple memory ports, and the latency of access to memory varies depending even on the position of the core on the die relative to the controller. Future generations of processors will have increasing differences in performance as more cores on chip necessitate more sophisticated caching.
    9. A memory access from one socket to memory from another has additional latency overhead to accessing local memory—it requires the traversal of the memory interconnect first.
    1. Some people think that these system calls are a good way to improve the performance of a high-performance process on a system. A common use case I’ve seen in the real world is to try to call mlockall() on a program that’s supposed to be running with very high performance. The reasoning is that if the program is paged out to disk, that will reduce performance; therefore mlockall() will improve things. If you try to actually use mlockall() in this way you might run into some difficulties because most systems have a very low default ulimit on the number of pages a process can lock. With some twiddling of the default ulimits you can get this working, but perhaps it’s worth considering why the default ulimits are so low in the first place.
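
      A minimal sketch of the pattern described above; with the default RLIMIT_MEMLOCK (often only ~64 KiB) the mlockall() call will usually fail with ENOMEM or EPERM, which is exactly the author's point:

      ```c
      /* Minimal sketch: report RLIMIT_MEMLOCK, then try to lock everything. */
      #include <sys/mman.h>
      #include <sys/resource.h>
      #include <stdio.h>

      int main(void)
      {
          struct rlimit rl;
          if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
              printf("RLIMIT_MEMLOCK soft=%llu hard=%llu bytes\n",
                     (unsigned long long)rl.rlim_cur,
                     (unsigned long long)rl.rlim_max);

          /* Lock everything mapped now and everything mapped in the future. */
          if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
              perror("mlockall");   /* usually ENOMEM or EPERM with default ulimits */
              return 1;
          }

          puts("all pages locked in RAM");
          return 0;
      }
      ```
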
  3. Jul 2017
    1. evolution from PCI 1.0 through PCI-Express 5.0

      While the evolution of PCIe speed is definitely of interest, especially as it keeps pace with network speeds, the total number of PCIe lanes is also a significant barrier to I/O for many systems, especially in HPDA.

      We can effectively double network throughput by dropping in another x16 NIC. This becomes less possible if there are not enough slots (or, perhaps more importantly, if the available PCIe lanes are oversubscribed), and it becomes even more of an issue, as the author points out, with the advent of NVMe; see the rough bandwidth arithmetic below.

      Intel has a vested interest in keeping the number of PCIe lanes at 40 with Xeon and holding back implementation of PCIe 4.0. They provide proprietary high speed I/O to their Xeon Phi coprocessor and Optane memory products. This doesn't allow GPUs, FPGAs and competing NV memory products to compete on equal footing.

      AMD is somewhat breaking the stalemate with Zen Naples offering 128 PCIe 3.0 lanes. Will have to see if OEMs build systems that expose all of that I/O.
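
      Rough back-of-the-envelope numbers (nominal PCIe 3.0 figures, not measurements):

      ```c
      /* Back-of-the-envelope: nominal PCIe 3.0 x16 bandwidth vs. a 100 Gb/s NIC. */
      #include <stdio.h>

      int main(void)
      {
          double gt_per_lane   = 8.0;             /* PCIe 3.0: 8 GT/s per lane */
          double encoding      = 128.0 / 130.0;   /* 128b/130b line-code overhead */
          double gbit_per_lane = gt_per_lane * encoding;       /* ~7.9 Gb/s */
          double x16_gbyte_s   = 16.0 * gbit_per_lane / 8.0;   /* ~15.8 GB/s */

          printf("PCIe 3.0 x16: ~%.1f GB/s (~%.0f Gb/s)\n",
                 x16_gbyte_s, 16.0 * gbit_per_lane);
          printf("a 100 Gb/s NIC needs ~12.5 GB/s, i.e. most of one x16 slot\n");
          return 0;
      }
      ```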

  4. Jun 2016
    1. Docker is a type of virtual machine

      How does it compare to installing the packages directly? Could be useful for development, but maybe not practical for HPC applications. Maybe just create a CD ISO with all the correct programs and their dependencies.

  5. May 2014
    1. Specifically, we explore three key usage modes (see Figure 1): • HPC in the Cloud, in which researchers outsource entire applications to current public and/or private Cloud platforms; • HPC plus Cloud, focused on exploring scenarios in which clouds can complement HPC/grid resources with cloud services to support science and engineering application workflows—for example, to support heterogeneous requirements or unexpected spikes in demand; and • HPC as a Service, focused on exposing HPC/grid resources using elastic on-demand cloud abstractions, aiming to combine the flexibility of cloud models with the performance of HPC systems

      Three key usage modes for HPC & Cloud:

      • HPC in the Cloud
      • HPC plus Cloud
      • HPC as a Service
  6. Apr 2014
    1. Over the last twenty years, the open source community has provided more and more software on which the world’s High Performance Computing (HPC) systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. But although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual PetaScale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and GPUs. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.