The Design and Implementation of a Log-Structured File System

The Design and Implementation of a Log-Structured File System • Mendel Rosenblum and John K. Ousterhout

Contents • Overview • Motivation • Design and implementation of LFS • Cleaning policy • Evaluation of real implementation • Concluding comments

Overview • Goal • Improve the overall efficiency of disk use • Method • Small random writes -> one large sequential write • Log structure on disk: all writes are "appended" • Key issue • Efficiency of the cleaning policy

Motivation • Technology trends • The speed gap between CPU and disk keeps widening • Transfer rate has improved only modestly • Access time has improved little • Main memory size is growing exponentially • Large file cache: absorbs many read requests • File system workload • Office and engineering applications: small-file workload • Causes small random disk I/O • LFS focuses on small-file workloads

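Motivation • Problems of other file systems • Information is spread out across the disk • Causes a very large number of small accesses • Ex. dir entry, inode, data block • Synchronous write problem • Metadata must be written synchronously for consistency • For workloads with many small files, disk traffic is dominated by synchronous metadata writes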

Contents • Overview • Motivation • Design and implementation of LFS • Cleaning policy • Evaluation of real implementation

Logical structure of file • Indexed structure: same as Unix FFS • [Figure: dir entry (file name, inode number) -> inode (metadata, block ptrs) -> index block (block ptrs) -> data blocks] • In Unix FFS, the inode location is fixed

Physical layout on disk • Example: creating 2 files in different directories • [Figure labels: data blocks, inodes, cylinder group] • In LFS, the inode location is not fixed

Segments • Segment: unit of writing and cleaning • 512 KB to 1024 KB • Disk: consists of segments + checkpoint region • [Figure: segment 0, segment 1, …, segment n, checkpoint region] • Segment summary block • Contains each block's identity: <inode number, offset> • Used to check the validity of each block • Modified times for each block
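The validity check that the segment summary block enables can be sketched as follows. This is a minimal illustration, not Sprite LFS code: `SummaryEntry`, `find_live_slots`, and the dictionary standing in for inode block pointers are all hypothetical names chosen for this example. A block counts as live only if the file's current block pointer for that offset still points at this exact slot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummaryEntry:
    """Identity of one block in a segment: <inode number, offset>."""
    inode_number: int
    offset: int  # block index within the file


def find_live_slots(segment_id, summary, inodes):
    """Return the slots of `segment_id` whose blocks are still live.

    summary: list of SummaryEntry, one per block slot in the segment.
    inodes:  dict inode_number -> dict offset -> (segment_id, slot),
             a stand-in for following the inode's block pointers.
    """
    live = []
    for slot, entry in enumerate(summary):
        pointers = inodes.get(entry.inode_number)  # None if the file is gone
        if pointers is not None and pointers.get(entry.offset) == (segment_id, slot):
            live.append(slot)
    return live


# Segment 3 holds two blocks of file 7 and one block of file 9.
summary = [SummaryEntry(7, 0), SummaryEntry(7, 1), SummaryEntry(9, 0)]
# File 7's block 1 was rewritten into segment 5; file 9 was deleted.
inodes = {7: {0: (3, 0), 1: (5, 2)}}
print(find_live_slots(3, summary, inodes))
```

Here only slot 0 is reported as live, so a cleaner would need to copy a single block out of the segment.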

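Free space management • Threading and copying • Sprite LFS uses threading and copying together • Segments -> in place: the log is threaded from segment to segment • Live data -> out of place: copied out of a segment before the segment is reused • Copying: copy live data out of the log • Threading: leave the live data in place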

Additional structures • Inode map: <physical location> for each inode (inode0, inode1, …) • Segment usage table: <bytes of valid data, last modified time> for each segment (segment0, segment1, …) • Checkpoint region: holds the inode map block ptrs and segment usage table ptrs • The location of the checkpoint region is fixed

Operations • Read a block • Inode map block ptr -> inode map block -> inode -> data block • Write a block • Data block, inode, inode map block, segment usage table block • Update inode map table ptr, segment summary block, segment usage table • [Figure annotations: "In memory", "Same as FFS"; the current segment in memory, with used and not-used regions]
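The read and write paths above can be illustrated with a toy append-only log. This is a deliberately simplified sketch: `ToyLFS` and its methods are hypothetical, the inode map is kept as a plain in-memory dictionary, and segment buffering, inode map blocks, and the segment usage table are left out.

```python
class ToyLFS:
    """Toy model: the 'disk' is an append-only list of blocks."""

    def __init__(self):
        self.log = []        # every write is appended here
        self.inode_map = {}  # inode number -> log address of the latest inode

    def _append(self, block):
        self.log.append(block)
        return len(self.log) - 1  # address of the block just written

    def write_block(self, inode_number, offset, data):
        # Start from the current inode's block pointers, if the file exists.
        old_addr = self.inode_map.get(inode_number)
        pointers = dict(self.log[old_addr]) if old_addr is not None else {}
        pointers[offset] = self._append(data)                   # new data block
        self.inode_map[inode_number] = self._append(pointers)   # new inode copy

    def read_block(self, inode_number, offset):
        inode = self.log[self.inode_map[inode_number]]  # inode map -> inode
        return inode and self.log[inode[offset]]        # inode -> data block


fs = ToyLFS()
fs.write_block(1, 0, b"old")
fs.write_block(1, 0, b"new")  # overwrite: appended, never updated in place
print(fs.read_block(1, 0), len(fs.log))
```

After the overwrite the log holds four blocks, two of which (the old data block and the old inode copy) are dead. Reclaiming that space is what cleaning is for.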

Crash recovery • Checkpoint • Periodically or on user request, the inode map table ptrs and segment usage table ptrs are written out • Consistent state: no modified data is left in memory • Roll-forward • If a crash occurs, • Scan the log written after the most recent checkpoint • [Figure: timeline showing checkpoints, crash, and roll-forward]
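A minimal sketch of checkpointing and roll-forward follows. The record layout (`("data" | "inode", inode_number, payload)` tuples) and the function names are assumptions made for this illustration. Actual recovery also has to repair directory entries and segment usage information, which is omitted here.

```python
def checkpoint(inode_map, log):
    """Snapshot the inode map together with the log position it reflects."""
    return {"inode_map": dict(inode_map), "log_end": len(log)}


def recover(checkpoint_region, log):
    """Rebuild the inode map: start at the checkpoint, then roll forward."""
    inode_map = dict(checkpoint_region["inode_map"])
    for addr in range(checkpoint_region["log_end"], len(log)):
        kind, inode_number = log[addr][:2]
        if kind == "inode":  # a newer inode copy supersedes the old one
            inode_map[inode_number] = addr
    return inode_map


log = [("data", 1, b"a"), ("inode", 1, {0: 0})]
cp = checkpoint({1: 1}, log)
# Written after the checkpoint, followed by a crash:
log += [("data", 2, b"b"), ("inode", 2, {0: 2})]
print(recover(cp, log))
```

Without roll-forward, recovery would stop at the checkpoint and file 2 would be lost. Scanning the tail of the log recovers its inode at address 3.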

Cleaning policy • Cleaning • Read segments -> collect valid data -> write segments: produces clean segments • 4 problems • When? • How many segments? • Segment selection policy: most fragmented? • Block redistribution policy • Files in the same directory • Age sort: sort by last modified time • Major concern: the last two (segment selection and block redistribution)

Measurement: write cost • Write cost • Average time the disk is busy per byte of new data written, expressed as a multiple of the time a full-bandwidth write would take • UNIX FFS: seek/rotational time • LFS: cleaning overhead • Ex. write cost 10.0: 90% of the time is wasted • Ideal case: 1.0 (the full bandwidth is used for new data)

Write cost of LFS • No seek/rotational time in LFS • Write cost is determined by the total amount of data copied during cleaning • Goal: reduce the amount of valid data in cleaned segments • u: utilization of the cleaned segments
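The formula for this slide did not survive extraction. The sketch below uses the expression given in the Rosenblum and Ousterhout paper: cleaning a segment with utilization u means reading the whole segment (1), writing back its live data (u), and gaining room for (1 - u) of new data, so write cost = (1 + u + (1 - u)) / (1 - u) = 2 / (1 - u). The function name and the range check are choices made for this example.

```python
def write_cost(u):
    """Write cost when the cleaned segments have utilization u (0 <= u < 1)."""
    if not 0.0 <= u < 1.0:
        raise ValueError("u must satisfy 0 <= u < 1")
    if u == 0.0:
        return 1.0  # an empty segment need not be read at all
    # read 1 segment + write back u live data + write (1 - u) new data,
    # divided by the (1 - u) of new data actually written
    return (1 + u + (1 - u)) / (1 - u)


for u in (0.0, 0.2, 0.5, 0.8):
    print(f"u={u:.1f}  write cost={write_cost(u):.2f}")
```

At u = 0.5 the write cost is 4.0, and at u = 0.8 it is about 10, which is why the goal is to clean segments that contain as little valid data as possible.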

Tradeoff: cost & utilization • In LFS, cost-performance trades off against disk utilization • Bimodal segment distribution

Simulation-based study • Simulator • The disk is filled with 4-KB files • Generated access patterns • Uniform: random • Hot-and-cold: 90% of writes go to the 10% "hot" files, 10% of writes go to the 90% "cold" files
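The two access patterns can be sketched as a small generator. This is not the authors' simulator: `pick_file`, the choice of the first 10% of file indices as the "hot" set, and the fixed seed are assumptions for this illustration.

```python
import random


def pick_file(num_files, pattern, rng):
    """Pick the index of the next file to overwrite (assumes num_files >= 2)."""
    if pattern == "uniform":
        return rng.randrange(num_files)
    if pattern == "hot-and-cold":
        num_hot = max(1, num_files // 10)         # first 10% of files are "hot"
        if rng.random() < 0.9:                    # 90% of writes
            return rng.randrange(num_hot)
        return rng.randrange(num_hot, num_files)  # remaining 10% of writes
    raise ValueError(f"unknown pattern: {pattern}")


rng = random.Random(0)
writes = [pick_file(1000, "hot-and-cold", rng) for _ in range(10_000)]
hot_share = sum(1 for f in writes if f < 100) / len(writes)
print(f"share of writes to hot files: {hot_share:.2f}")
```

Within each group the files are chosen uniformly, so the printed share should land close to 0.9.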

Simulated policy • Segment selection • Greedy: select the least-utilized segments • Block redistribution • No redistribution: used in the uniform (random) workload • Age sorting: used in the hot-and-cold workload • Age: last modified time of the file; all blocks of a file have the same age
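Both simulated policies reduce to a sort. In the sketch below, the function names and data shapes are hypothetical, and writing the oldest blocks first is an assumption. The slide only says that blocks are sorted by age.

```python
def greedy_select(utilization, count):
    """Return the ids of the `count` least-utilized segments.

    utilization: dict segment_id -> fraction of the segment that is live.
    """
    return sorted(utilization, key=utilization.get)[:count]


def age_sort(blocks):
    """Order live blocks by age (oldest first) before writing them back.

    blocks: list of (file_id, last_modified_time) pairs.
    """
    return sorted(blocks, key=lambda block: block[1])


utilization = {0: 0.90, 1: 0.15, 2: 0.55, 3: 0.30}
print(greedy_select(utilization, 2))
print(age_sort([("a", 50), ("b", 10), ("c", 30)]))
```

Greedy picks segments 1 and 3 here. Age sorting then groups blocks of similar age together in the output segments, which is intended to separate hot data from cold data.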

First result • Locality or "better" redistribution resulted in "worse" performance • [Figure legend: dashed line = hot-and-cold (age sorting); solid line = uniform; reference levels: FFS, and FFS improved (logging, delayed write, disk request sorting)]

Analysis • Hot segments are cleaned more frequently • Under hot-and-cold, the utilization of cleaned segments is higher than under uniform

Cost-benefit selection policy • Segment selection • Rationale • Cold segments accumulate invalid blocks more slowly • Freed blocks in a cold segment have more "value" • Cost terms • 1: cost to read the segment • u: cost to write back the live data
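The selection formula itself is missing from this slide. The sketch below uses the ratio from the Rosenblum and Ousterhout paper, benefit / cost = (1 - u) * age / (1 + u), where the cost 1 + u matches the two terms listed above. The function names, the segment labels, and the age units are illustrative assumptions.

```python
def benefit_to_cost(u, age):
    """Ratio (1 - u) * age / (1 + u) for a segment with utilization u."""
    return (1 - u) * age / (1 + u)


def cost_benefit_select(segments, count):
    """Return the ids of the `count` segments with the highest ratio.

    segments: dict segment_id -> (u, age).
    """
    ranked = sorted(segments, key=lambda s: benefit_to_cost(*segments[s]), reverse=True)
    return ranked[:count]


segments = {
    "hot": (0.30, 1),     # emptier, but modified very recently
    "cold": (0.75, 100),  # fuller, but stable for a long time
}
print(cost_benefit_select(segments, 1))
```

A greedy policy would clean the "hot" segment because it has lower utilization. The cost-benefit ratio picks the "cold" segment instead, since the space freed there is expected to stay free for longer.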

Result

Implementation study • Implementation complexity • Mostly the same as FFS • But FFS can reuse existing code • Implemented in the Sprite network operating system • Installed on 5 different disk partitions used by about 30 users

Micro-benchmarks • Small-file workload, no cleaning happened (best-case performance) • Create/delete is roughly 10 times faster than FFS • Performance is expected to improve further with faster processors • FFS is disk-bound: disk 85% utilized (cf. LFS: 17%)

Micro-benchmarks • Large-file workload, no cleaning happened • 100-MB file, write & read performance (5 phases are run in sequence) • [Figure annotations: new write creating the file; overwrite of the existing file]

Long-term usage statistics • Collected over a 4-month period • About 70% of bandwidth utilized (write cost 1.2 to 1.6: bandwidth 63% to 83%) • Segment utilization of the /user6 partition • Large number of fully utilized and totally empty segments

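Criticism of LFS • Is the performance gain of LFS the best achievable? • Excels on metadata-intensive workloads • General read/write I/O performance is similar to or lower than Sun-FFS • LFS read performance is generally lower than FFS • Cleaning overhead degrades performance • The implementation cost of Sun-FFS is much lower than that of LFS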
