
270 Intelligent Systems

The need for basic remote management of each cluster node individually, with operations such as node power-on and shutdown and a console showing the node's boot output, demanded the installation of ServNET [11]. In future it is planned to use only nodes that support the IPMI interface [12] above version 1.5, which provides remote power-on and shutdown when nothing but the Ethernet cable and power are connected to the node; the Serial-over-LAN function in IPMI 2.0+ even allows the node BIOS to be configured remotely.

IP network. The cluster systems use an IP network as the supporting network, with several ranges of private IP addresses: [...]; the cluster nodes use the most spacious of these private ranges. Within it, the following scheme for allocating IP subnets is applied:

A computing node has the IP address 10.N.M.X, where N is the cluster number, M is the switch number, and X is the port number on the switch. Thus, [...] is the first node of cluster #1, and [...] is the twenty-fourth node of cluster #3.

The IP address mask is [...], i.e. the whole network infrastructure is fully reachable from any point, while the different clusters are separated by VLANs. As a result, the nodes of different clusters cannot see each other, but the storage systems remain accessible even when the storage and cluster management functions are assigned to a single device. The switch of the cluster management network has a fixed IP address: 10.N.M.250, where N is the cluster number and M is the switch number.

The frontend (the managing node of the cluster) has a fixed IP address: 10.N.M.254, where N is the cluster number and M is the switch number.
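The addressing scheme can be illustrated with a minimal Python sketch. The function names, the octet range checks and the upper limit of 249 for node ports are assumptions made for illustration; only the 10.N.M.X pattern and the fixed final octets 250 and 254 come from the text. The cluster, switch and port numbers in the usage lines are arbitrary.

```python
import ipaddress

SWITCH_HOST = 250    # last octet of the management switch, as stated in the text
FRONTEND_HOST = 254  # last octet of the frontend, as stated in the text
MAX_NODE_PORT = 249  # assumption: node ports stay below the reserved octets


def _address(cluster: int, switch: int, host: int) -> ipaddress.IPv4Address:
    """Build 10.<cluster>.<switch>.<host>, checking that each octet fits in a byte."""
    for name, value in (("cluster", cluster), ("switch", switch), ("host", host)):
        if not 0 <= value <= 255:
            raise ValueError(f"{name} octet out of range: {value}")
    return ipaddress.IPv4Address(f"10.{cluster}.{switch}.{host}")


def node_address(cluster: int, switch: int, port: int) -> ipaddress.IPv4Address:
    """Address of the computing node plugged into the given switch port."""
    if not 1 <= port <= MAX_NODE_PORT:
        raise ValueError(f"port must be in 1..{MAX_NODE_PORT}, got {port}")
    return _address(cluster, switch, port)


def switch_address(cluster: int, switch: int) -> ipaddress.IPv4Address:
    """Fixed address of the management switch."""
    return _address(cluster, switch, SWITCH_HOST)


def frontend_address(cluster: int, switch: int) -> ipaddress.IPv4Address:
    """Fixed address of the frontend (managing node)."""
    return _address(cluster, switch, FRONTEND_HOST)


if __name__ == "__main__":
    print(node_address(2, 1, 7))     # 10.2.1.7
    print(switch_address(2, 1))      # 10.2.1.250
    print(frontend_address(2, 1))    # 10.2.1.254
```

Because the cluster and switch numbers are encoded in the address itself, the location of a node can be read directly from its IP address without a lookup table.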

The subnet [...] is reserved for services: thus [...] is the address of the access server, [...] are the UPS devices, and so on.

File service. As a rule, parallel tasks are oriented towards computations on huge files of initial, intermediate or final data. For example, the analysis of nuclear research results involves terabytes of initial data, and tasks in the quantum chemistry package Gamess [8] can create hundreds of temporary files of several gigabytes per process, with the intermediate results routinely read and written in small packets. Therefore it is extremely important to give the nodes high-speed access to storage systems of huge size.

To increase the throughput of the storage system, port trunking is used: 2-4 network interfaces are aggregated into one to raise the overall throughput, although a linear gain is not achievable.

Until recently, NFS was used as the distributed file system of the SCIT cluster system. Since cluster nodes have no disks of their own, each node mounts its root file system over NFS during initial boot. In addition, the partitions holding the working data of tasks were also mounted over NFS. NFS was chosen for several reasons: it is a standard network file system, it is present in any UNIX system, and it is very easy to set up and configure.

A year of operating NFS as the main file system has shown that NFS is an excellent choice only for small clusters (4-8 nodes). For clusters of the SCIT level, with 16-32 nodes, NFS can be quite a good choice provided that the tasks being run perform few input-output operations on disk files. However, NFS becomes an unacceptable choice for tasks with intensive input-output.

Therefore the file service was modified, and the primary goals of the modification were:

the choice of an optimal distributed file system that can be scaled both in volume, with the possibility of uniting the existing storage of various clusters into one common data store, and in maximal throughput;

a transition from NFS to partial or full use of the chosen file system.

The following candidates for the role of distributed file system were examined:

GFS, produced by RedHat (formerly SISTINA.COM) [13]; the latest version today is 6.1. As distributed storage, GFS uses a GNBD (global network block device) mounted simultaneously on all nodes, on top of which GFS itself runs together with a lock manager.

GFS has many advantages: it is a completely free solution, it is developed by RedHat, the largest Linux vendor, it works out of the box with RedHat Enterprise Linux 4 and higher or Fedora Core 4 and higher, it scales quite well in volume, and it is easy to install and configure.

At the same time there are also drawbacks. Scalability in overall throughput is poor, which means that increasing capacity requires expensive hardware solutions such as Fibre Channel. In addition, since all nodes share one block device, the failure of one of the nodes can lead to some damage to the file system.

XII-th International Conference "Knowledge - Dialogue - Solution"

GFS is quite a good choice for a finished solution, when no increase in the cluster's computing capacity or storage volume is planned, i.e. for turn-key delivery. In our case, which assumes further growth of both computing resources and disk space, it is a poor fit.

OCFS2, produced by ORACLE [14], is the open-source successor of OCFS. There is no stable version of the system yet, but it is already in the Linux kernel and is supported by various Linux distributions. In effect it is a RAID5 array distributed over all cluster nodes, which gives both high read-write speed and some fault tolerance of the whole array. However, if several nodes fail, the array can collapse, up to the loss of all data, i.e. the system demands a highly reliable disk subsystem on each cluster node, which greatly increases the total price of the solution. OCFS2 is optimal for processing large databases, which is what it was actually created for.

Lustre, produced by CLUSTERFS.COM [1], is a commercial product; the free version is released with some lag, one year at most. It is very hard to install but very simple to configure. It scales perfectly in both volume and throughput.

The system uses a set of kernel patches, and consequently building it can be problematic, especially when various kinds of non-standard equipment are used. The problems are mostly connected with the interconnect drivers and the Lustre driver being tied to specific, and not always the same, versions of the Linux kernel. Especially large discrepancies arise when the IA64 architecture, the proprietary SCI drivers from SCALI and the Lustre file system are used simultaneously.

At the same time it is very simple to configure: suffice it to say that, after the large amount of work on building Lustre and the initial setup of the cluster, bringing up a node takes literally a few minutes.

From the system's point of view Lustre looks like an ordinary local file system with all its advantages, such as aggressive caching of inodes and dentries. Full POSIX compatibility is expected in the third quarter of 2006 (the flock/lockf call is not implemented yet). Volume and throughput are increased by simply adding one or several nodes with disks (OSS) to the system. Since each file can be striped across several OSS, so that access to it proceeds in parallel, throughput grows practically in arithmetic progression, i.e. the more OSS are installed, the higher the read-write speed. A very high utilization of the input-output devices is achieved: file I/O exceeds 90% of the raw disk bandwidth, and single GigE end-to-end throughput reaches 118 MB/s against a physical interface maximum of 125 MB/s. An additional advantage is that practically any interface supporting the IP protocol can serve as the network interface in Lustre, and in some cases a lower-level protocol as well (for example Infiniband).
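To make the striping idea concrete, here is a small Python sketch of round-robin (RAID0-style) placement of a file's bytes across several OSS. It is an illustration of the general technique, not Lustre's actual code; the function names, the 1 MiB stripe size and the count of three OSS are assumptions chosen for the example.

```python
def locate(offset: int, stripe_size: int, oss_count: int) -> tuple[int, int]:
    """Map a byte offset in the file to (OSS index, offset inside that OSS's object)."""
    stripe_no = offset // stripe_size              # which stripe the byte falls into
    oss_index = stripe_no % oss_count              # stripes are dealt out round-robin
    local_offset = (stripe_no // oss_count) * stripe_size + offset % stripe_size
    return oss_index, local_offset


def split_request(offset: int, length: int, stripe_size: int, oss_count: int) -> list[int]:
    """Return how many bytes of the request [offset, offset + length) each OSS serves."""
    per_oss = [0] * oss_count
    end = offset + length
    while offset < end:
        # Take bytes up to the end of the current stripe or the end of the request.
        chunk = min(stripe_size - offset % stripe_size, end - offset)
        per_oss[(offset // stripe_size) % oss_count] += chunk
        offset += chunk
    return per_oss


if __name__ == "__main__":
    MIB = 1024 * 1024  # illustrative stripe size
    # A 6 MiB read over 3 OSS is served as 2 MiB from each server in parallel.
    print(split_request(0, 6 * MIB, MIB, 3))
    # Network utilization quoted in the text: 118 MB/s out of 125 MB/s.
    print(f"{118 / 125:.1%}")  # 94.4%
```

A large sequential request is divided evenly among the servers, which is why adding OSS raises the aggregate throughput; a request smaller than one stripe touches a single OSS and gains nothing from striping.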

Rather high reliability requirements are imposed on the storage, which is understandable, since the integrity of the file system now depends on the serviceability of all its components. Although, on the failure of an OSS node, cluster nodes can continue working if their data were not located on the damaged node, part of the data can still be lost (in coming versions of Lustre, besides the stripe or RAID0 file storage mode, RAID1 modes for small files and RAID5 for large files will be implemented, so the probability of complete data loss will be sharply reduced). Therefore the use of redundant RAID arrays on OSS nodes is not a recommendation but a requirement.

The performance requirements for OSS nodes are very low. Since the node is occupied with a single task, servicing input-output, any modern processor copes with it successfully at a low final cost. Moreover, the idle computing power is quite sufficient to implement a software RAID array, i.e. money is saved on expensive controllers. (Modern RAID controllers lose to software RAID arrays in speed for a banal reason: controllers use a rather weak CPU. Over the last five years RAID controllers have gained support for RAID 6, 10, 50 and 60 and for the PCI-X and PCI-E buses, but their computing power has remained at the level of five years ago, and in a race of clock frequencies the faster processor wins.)

There is one gap in Lustre performance: looking up files in a directory. 5000 op/s is a very low figure, and in some cases it leads to a drop in performance (for example, one user's program created about a hundred thousand files in its working directory, and the slowdown was noticeable). However, this is easily worked around by placing the working files not in one directory but in tens or hundreds of directories.
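One way to spread working files over many directories is to derive a subdirectory from a hash of the file name. The Python sketch below is only an illustration of that workaround under stated assumptions: the function name, the choice of SHA-256, the three-digit bucket names and the default of 100 buckets are not from the paper.

```python
import hashlib
from pathlib import Path


def sharded_path(root: str, filename: str, buckets: int = 100) -> Path:
    """Return root/<bucket>/<filename>, where the bucket is derived from the name.

    The same name always maps to the same bucket, so a file can be found again
    without scanning every subdirectory.
    """
    if buckets < 1:
        raise ValueError("buckets must be positive")
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return Path(root) / f"{bucket:03d}" / filename


if __name__ == "__main__":
    # Hypothetical working directory and file name.
    path = sharded_path("/work/job42", "step_000123.tmp")
    print(path)
    # The caller would create the bucket before writing:
    # path.parent.mkdir(parents=True, exist_ok=True)
```

With a hundred thousand files and 100 buckets, each directory holds on the order of a thousand entries, so every lookup searches a far smaller directory.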

After analysing the existing parallel file systems (those named above and some others) and assessing our technical capabilities, we determined that our equipment fully met the requirements of Lustre, and this file system was chosen as the main candidate for the role of distributed file system for our supercomputers.

In early 2006 the SCIT clusters were switched to the Lustre distributed file system. This made it possible to unite all storage into one common file system with a volume of 1.7 Tbyte. Physically, the common file system resides on three data servers (OSS) with four disk arrays (OSD) and one metadata server (MDS). Since, when configuring Lustre, we specified that a file be distributed over all OSD (in effect, classical RAID-0 applied to a file), we were able to spread the file input-output load over all servers simultaneously.

The results of testing the two file systems, Lustre and NFS, on a file 8 Gbyte in size are given in Table 2 (the tests measure throughput in Kb/s, processor usage, and seek rate).

Table 2.

         Sequential Output                            Sequential Input              Operation
         Per Char       Block          Rewrite        Per Char       Block          Random seek
         Kbps   %CPU    Kbps   %CPU    Kbps   %CPU    Kbps   %CPU    Kbps   %CPU    k/sec  %CPU
NFS      26665  88.1    27907  6.3     3134   95.2    29215  91.1    84975  15.3    460.7  2.
Lustre   27791  99.3    69991  41.3    39668  58.0    28066  98.9    98254  86.1    121.6  17.

Further, the volume and throughput of the file storage can be increased simply by connecting additional servers with disk arrays to the switch and slightly reconfiguring the system.

3. The Selection of Architecture Features to a Supercomputer Project

Which characteristics should be selected for a new supercomputer project? Proceeding from the specific properties of cluster tasks mentioned earlier, the common requirements for a cluster node that solves parallel tasks effectively can be formulated as follows:

the performance of a node depends linearly on the power of the processor, and the performance of the processor depends on the frequency characteristics of the main memory bus used and on the amount of main memory available in the node (up to a reasonable limit);

interprocessor data exchange is always faster than exchange over the interconnect, i.e. it is preferable to use multiprocessor nodes (with 2-4 processors) and multicore processors;

the performance of a node depends on the interconnect used; two characteristics are important here: latency, i.e. the delay arising when a minimum-size packet is transmitted between nodes, and maximal bandwidth;

the performance of a node depends on the intensity of input-output operations with storage devices.
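The interplay of the two interconnect characteristics can be sketched with the usual first-order cost model, in which sending a message takes the latency plus the message size divided by the bandwidth. This model and the function names are assumptions brought in for illustration; the 50 microsecond latency is a hypothetical value, and 125 MB/s is the GigE physical maximum quoted earlier in the text.

```python
def transfer_time(message_bytes: float, latency_s: float, bandwidth_bps: float) -> float:
    """Time to deliver one message: fixed latency plus size over bandwidth (bytes/s)."""
    return latency_s + message_bytes / bandwidth_bps


def effective_bandwidth(message_bytes: float, latency_s: float, bandwidth_bps: float) -> float:
    """Bytes per second actually achieved for messages of the given size."""
    return message_bytes / transfer_time(message_bytes, latency_s, bandwidth_bps)


if __name__ == "__main__":
    LATENCY = 50e-6      # hypothetical latency, seconds
    BANDWIDTH = 125e6    # bytes per second
    for size in (64, 6250, 1_000_000):
        rate = effective_bandwidth(size, LATENCY, BANDWIDTH)
        print(f"{size:>9} bytes: {rate / 1e6:7.2f} MB/s")
```

Under this model, messages smaller than latency times bandwidth (6250 bytes for the values above) achieve less than half of the peak bandwidth, which is why latency dominates for tasks that exchange many small messages.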

Pipeline and system calls. As a rule, parallel tasks executed on a computing node do not use cyclic algorithms, therefore the classic short-pipeline architecture used in AMD processors is much preferable to the P4 architecture of INTEL processors. Every reference to the data of a neighbouring process is accompanied by several transitions into the kernel mode of the processor. The cost of this transition is 120-240 clock cycles on AMD processors and 1100-1300 clock cycles on P4-architecture processors. However, with the appearance of the recently presented Intel Conroe architecture, which is in effect Intel's return to the P-III architecture and a short pipeline, the balance of forces will be completely different in the second half of 2006 and the first half of 2007, i.e. until the appearance of the AMD K8L architecture.

HyperThreading. When one of the pipelines idles because of a mispredicted branch, or simply because instructions cannot be executed in parallel, the P4 architecture makes it possible to use the idle resources as a virtual processor (HyperThreading), but in parallel tasks this only reduces performance. The reason is simple: data exchange between nodes aligns the performance of all processes to the speed of the slowest one, and since a virtual processor amounts to no more than 40% of a real processor, overall performance falls 2-3 times, i.e. this feature is practically useless for clusters.
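The argument about the slowest process can be checked with a toy model in Python: every process receives the same amount of work per step, and a step ends only when the slowest process finishes. The model, the function names and the process counts are simplifying assumptions for illustration; the 40% figure is taken from the text at face value.

```python
def step_time(work_per_process: float, speeds: list[float]) -> float:
    """Duration of one synchronized step: all processes wait for the slowest one."""
    return max(work_per_process / speed for speed in speeds)


def slowdown(speeds: list[float], reference_speed: float = 1.0) -> float:
    """How many times slower a step is than on processors of the reference speed."""
    return step_time(1.0, speeds) / step_time(1.0, [reference_speed])


if __name__ == "__main__":
    real_only = [1.0, 1.0, 1.0, 1.0]        # four processes on real processors
    with_virtual = [1.0, 1.0, 1.0, 0.4]     # one process lands on a virtual processor
    print(slowdown(real_only))     # 1.0
    print(slowdown(with_virtual))  # 2.5
```

A single process running at 40% speed stretches every step by a factor of 1 / 0.4 = 2.5, which falls inside the 2-3 times range given above.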

64 bits versus 32 bits. Today all modern processors either support 64-bit extensions (AMD64, EM64T) or are pure 64-bit processors. Unfortunately, at present only programs that need 64-bit arithmetic for their calculations gain from being compiled for a 64-bit word length, and even they not always; the others only lose. There are several reasons for this (all due to the doubling of data and address size):

The processor cache has to be doubled; otherwise performance falls because the cache is flushed frequently.
