Reference: How to Calculate Hadoop Cluster Size

I came across a post that neatly summarizes Hadoop storage and node-count sizing, so I am reposting it here.

**1. This is the formula to estimate Hadoop storage (H):**

H = crS / (1 - i)

where:

  • c = average compression ratio. It depends on the type of compression used (Snappy, LZOP, ...) and the size of the data. When no compression is used, c = 1.

  • r = replication factor. It is usually 3 in a production cluster.

  • S = size of the data to be moved to Hadoop. This could be a combination of historical data and incremental data. The incremental data can be, for example, daily and projected over a period of time (say, 3 years).

  • i = intermediate factor, usually 1/3 or 1/4. It represents Hadoop's working space dedicated to storing the intermediate results of Map phases.

Example: With no compression (c = 1), a replication factor of 3, and an intermediate factor of 1/4: H = 1 × 3 × S / (1 - 1/4) = 3S / (3/4) = 4S. With these assumptions, the Hadoop storage is estimated to be 4 times the size of the initial data.
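To make the estimate easy to reproduce, here is a minimal Python sketch of the storage formula; the function name `hadoop_storage` and its defaults are illustrative, not from the original article:

```python
def hadoop_storage(S, c=1.0, r=3, i=0.25):
    """Estimate Hadoop storage H = c * r * S / (1 - i).

    S: size of the initial data (e.g., in TB)
    c: average compression ratio (1.0 means no compression)
    r: replication factor (usually 3 in production)
    i: intermediate factor (working space for Map-phase results)
    """
    return c * r * S / (1 - i)

# Reproduces the example above: c = 1, r = 3, i = 1/4 gives H = 4S
print(hadoop_storage(S=100))  # 400.0 TB for 100 TB of initial data
```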

**2. This is the formula to estimate the number of data nodes (n):**

n = H/d = crS / ((1 - i) × d)

where d = disk space available per node. All other parameters remain the same as in 1.

Example: Suppose 8 TB is the disk space available per node (10 disks of 1 TB each, with 2 disks excluded for the operating system, etc.) and the Hadoop storage requirement H is 600 TB. Then n = 600/8 = 75 data nodes are needed.
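Extending the same sketch, the node count follows directly and should be rounded up to a whole node; the name `data_nodes` is again illustrative:

```python
import math

def data_nodes(H, d):
    """Estimate the number of data nodes n = H / d, rounded up.

    H: total Hadoop storage required (e.g., in TB)
    d: usable disk space per node (e.g., in TB)
    """
    return math.ceil(H / d)

# Reproduces the example above: H = 600 TB, d = 8 TB per node
print(data_nodes(H=600, d=8))  # 75
```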

