Changes to Sizing Spreadsheet for Documentum 5.3
Documentum Performance Group
Agenda
• Changes to the Customer Input Page
• Changes to the Output Page
• Some Sizing Examples
Changes to the Customer Input Page
• App server cluster support in WDK/Webtop
• Fulltext query rate
• Fulltext space
App Server Cluster support overhead
• Will factor in the CPU cost associated with session serialization in a clustered HA environment
5.3 Sizing changes for WDK
• 5.3 Webtop consumes 40% more CPU than 5.2.5
  • Due partly to the inclusion of new features (drag & drop) and infrastructure changes
  • This overhead is being reduced for SP1; the sizing spreadsheet for SP1 will reflect this
• 5.3 App Server cluster support has an additional 50% overhead
  • This is due to the cost of replicating state
  • This is the worst case: memory-based replication (between two App servers)
  • To be reduced in 5.3 SP1; will be reflected in the SP1 spreadsheet
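The two overheads above can be sketched as a simple scaling of a 5.2.5 CPU baseline. This is only an illustration: the slide does not say whether the 50% cluster overhead compounds with the 40% Webtop increase or is added to it, so the code assumes it compounds. The function name and the 4-CPU baseline are hypothetical, not taken from the sizing spreadsheet.

```python
# Hypothetical helper: scale a measured 5.2.5 Webtop CPU figure to 5.3 FCS.
# Assumption (not stated on the slide): the cluster overhead compounds
# with the 5.3 Webtop overhead rather than being added to it.

WEBTOP_53_OVERHEAD = 0.40  # 5.3 Webtop vs. 5.2.5, from the slide
CLUSTER_OVERHEAD = 0.50    # worst case: memory-based replication


def webtop_53_cpus(baseline_525_cpus: float, clustered: bool) -> float:
    """Return the estimated 5.3 CPU need for a given 5.2.5 baseline."""
    cpus = baseline_525_cpus * (1 + WEBTOP_53_OVERHEAD)
    if clustered:
        cpus *= 1 + CLUSTER_OVERHEAD
    return cpus


if __name__ == "__main__":
    # Hypothetical baseline of 4 CPUs on 5.2.5.
    print(f"standalone: {webtop_53_cpus(4, clustered=False):.1f} CPUs")
    print(f"clustered:  {webtop_53_cpus(4, clustered=True):.1f} CPUs")
```

Under the compounding assumption, a clustered 5.3 FCS deployment needs roughly 2.1 times the 5.2.5 CPU; under an additive reading it would be 1.9 times.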
Fulltext query rate
• Will factor in the CPU cost associated with large numbers of fulltext queries
Fulltext indexing Characteristics
• Most sizing requests specify docs/day, but that load normally does not occur on all 365 days of the year
Fulltext indexing Characteristics
• Will factor in the CPU cost and disk I/O associated with the indexing portion of fulltext
Fulltext indexing Characteristics: Options
• None = No fulltext indexing enabled
• Immediate Indexing = Attempt to minimize the time from 'save' to 'searchable'
  • Default for 5.3
  • Expensive in disk space, CPU utilization, and I/O
• Delayed Indexing = Attempt to reduce disk space, memory, or CPU utilization at the cost of higher 'save to searchable' latency
  • Initial focus: transient disk space tuning
  • Requires some detailed Index Server tuning
Transient Fulltext Index Space Tuning
[Figure: transient space needs for building one large partition with all documents vs. transient space needs for building four small, equal-sized partitions within the index]
• More information on this tuning to be provided in an FAQ
Fulltext space consumption
• Will factor in content information for fulltext disk space and CPU calculations
Output Page changes
• Hardware resources needed for Index Agent and Index Server
Option #2 is changed to reflect the likely scenario of “Content Server and Indexing Servers” on the same host
Example Option #2
• Content Server & Indexing software on same host
Pros:
- Easy to install and administer
- Grow capacity by adding more CPUs, disk, and memory
Cons:
- Resource contention risks
- Footprint of the Indexing subsystem could exceed the excess capacity of a pre-5.3 production system
[Diagram labels: Index Agent, Dftxml msg, Index Server (FAST), Staging Area, Meta data & content, Query & results, Index, Content Server, Content]
Option #3 is changed to add the Index Agent/Index Server on separate host scenario
Note: The initial release will not cover multi-node configurations of the Index Agent/Server
Additional Supported Scenarios for FCS
• All Full Text Components on a Separate host
Pros:
- Separates the resource consumption of the “new” 5.3 full text from the rest of Content Server
- Likely to arise in upgrade scenarios from 5.2.X
Cons:
- Additional server required
[Diagram labels: Index Agent, Dftxml msg, Index Server (FAST), Staging Area, Meta data & content, Query & results, Index, Content Server, Content]
Sizing Exercises
• Generic document repository (< 2 million docs)
• Large system: 100,000 docs/day
Generic Document repository
• Provided system characteristics:
  • Upgrade from 5.2.5 (repository already existing)
  • Total size of system < 1 million objects
  • Total content size = 240 GB
  • Ingest: ½ GB/day
  • Approximately 1,000 objects/day
  • Average file size ½ MB
  • Fewer than 1,000 users (20 active at any one time)
Questions: How much of the content might be fulltext indexable?
• Check size and number of objects by format
• Example:
  • 40 GB of the 240 GB of content is of a format that can be indexed
  • Fewer than 500,000 objects have content that can be indexed
  • About 360,000 objects have content that can’t be indexed
  • At least 102 separate formats!
  • However, Word and PDF dominate the content space that can be fulltext indexed (90%)
  • All objects have at least their meta-data indexed
Enter average size, number of docs, and whether content can be indexed for the 4 rows below:
• Word: 106 KB average, 160,000 docs, content indexed=Y
• PDF: 352 KB average, 56,000 docs, content indexed=Y
• Other: 20 KB average, 275,000 docs, content indexed=Y
• Images: 550 KB average, 360,000 docs, content indexed=N
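As a cross-check of the four input rows, the sketch below totals the raw content per row and compares the indexable share with the whole repository. It assumes "K" means 1,000 bytes and 1 GB means 10**9 bytes; the names are illustrative and are not spreadsheet fields.

```python
# Document profile from the slide: (average size in KB, doc count, content indexed?)
# Assumption: 1 KB = 1,000 bytes and 1 GB = 10**9 bytes.
PROFILE = {
    "Word":   (106, 160_000, True),
    "PDF":    (352, 56_000, True),
    "Other":  (20, 275_000, True),
    "Images": (550, 360_000, False),
}


def content_gb(avg_kb: float, docs: int) -> float:
    """Raw content size in GB for one row of the profile."""
    return avg_kb * 1_000 * docs / 1e9


def totals(profile):
    """Return (indexable GB, total GB) across all rows."""
    indexable = sum(content_gb(kb, n) for kb, n, idx in profile.values() if idx)
    total = sum(content_gb(kb, n) for kb, n, _ in profile.values())
    return indexable, total


if __name__ == "__main__":
    indexable_gb, total_gb = totals(PROFILE)
    print(f"indexable content: {indexable_gb:.1f} GB of {total_gb:.1f} GB total")
```

The rows add up to roughly 42 GB of indexable content out of about 240 GB, which is in line with the "40 GB of 240 GB" figure quoted for this repository.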
What would that imply for the hardware needed to do the upgrade?
• So far we haven’t calculated growth
• Estimate of space needed for fulltext: 19 GB
What about growth?
• Assume 260 busy days in the year and 1,000 docs per busy day
• Assume document proportions remain the same:
  • Word (19%): 19% of 1,000 = 190/day; 190 x 260 = 49,400/yr
  • PDF (7%): 7% of 1,000 = 70/day; 70 x 260 = 18,200/yr
  • Other (32%): 32% of 1,000 = 320/day; 320 x 260 ≈ 83,000/yr
  • Image (42%): 42% of 1,000 = 420/day; 420 x 260 ≈ 109,000/yr
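The growth arithmetic on this slide can be reproduced with a few lines. This is a sketch using only the slide's inputs (260 busy days, 1,000 docs per busy day, and the four percentage shares); the variable and function names are illustrative.

```python
BUSY_DAYS_PER_YEAR = 260
DOCS_PER_BUSY_DAY = 1_000

# Percentage share of daily ingest per format, as given on the slide.
SHARE_PCT = {"Word": 19, "PDF": 7, "Other": 32, "Image": 42}


def docs_per_year(share_pct: int) -> int:
    """Documents added per year for a format with the given percentage share."""
    docs_per_day = DOCS_PER_BUSY_DAY * share_pct // 100
    return docs_per_day * BUSY_DAYS_PER_YEAR


if __name__ == "__main__":
    for fmt, pct in SHARE_PCT.items():
        print(f"{fmt:6s} {docs_per_year(pct):>8,} docs/yr")
    print(f"{'Total':6s} {sum(docs_per_year(p) for p in SHARE_PCT.values()):>8,} docs/yr")
```

The exact products for Other and Image are 83,200 and 109,200, which the slide rounds to 83,000 and 109,000; the four rows sum to 260,000 documents per year.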
Index Size after 3 years
• Around 30 GB needed
Could I size fulltext as a simple 40% of total content size?
• Old, tried and true(?) method
• It can overestimate, especially if “non-indexable” content dominates!
• In this example [system without growth]:
  • 40% of 240 GB = 96 GB vs. 19 GB
• Example with system including growth:
  • 40% of 385 GB = 154 GB vs. 30 GB
• For small systems, the cost of overestimating is small
Other notes
• Index Subsystem can co-reside with Content Server
  • Existing system must have spare CPU and memory capacity
• New fulltext index should reside on a high-capacity disk array or SAN, not on a NAS device or a single disk
  • At 1+ million docs the indexing side could bottleneck on the disk
• Spreadsheet shows minimal disk I/O requirements, but these are averages spread over 24-hour periods
  • Actual requirements will be higher during the indexing process
Large system (100,000 docs per day)
• Provided system characteristics:
  • Ingest: 110 GB/day
  • The data is primarily static once submitted
  • Approximately 100,000 objects/day
  • Average file size: 1 MB
  • Average metadata size per file: 10 KB
  • Estimated total: 4 TB in 3 years on Tier 1, 120 TB on Tier 2, 1 TB database
  • Tier 1 storage: Symmetrix for 30 days
  • Tier 2 storage: Centera
  • Initial pilot: 50 users
  • 10% of objects/capacity applying text search
Initial observations on provided information
• How many days in a year will see 100,000 docs/day?
  • Let’s assume 260 busy days a year
  • If the weekend load rate is significant, it should be factored into the average per day
• 100,000 docs/day x 260 days/yr x 3 yrs = 78 million docs
  • This is more than can be handled by 5.3 FCS!
  • 5.3 SP1 features are needed
• 5.3 SP1 features needed for large systems:
  • Ability for a single repository to have multiple “collections”
  • Multi-node Index Server support
5.3 Large Full Text support: FCS vs. SP1 • In 5.3 FCS each Content Server repository is mapped to a single Index Server “collection” • In 5.3 SP1: • Collections can be mapped to a single index search “column” • Content Server will be able to have multiple collections per repository • Index Agent to provide mapping of “a_storage_type” to Index Server collection • This can be used to “range partition” the fulltext data • Once a collection reaches a certain size ( < 10 million) data can be routed to • Older “static” data can be put in older collections • CPU burn no longer needed to rebuild older collections
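To see why a single collection is not enough for this system, the sketch below combines the 78 million document total with the "< 10 million" per-collection size mentioned for SP1. Treating 10 million as a strict upper bound per collection is an assumption made here for illustration; the names are hypothetical and do not correspond to Index Server settings.

```python
DOCS_PER_BUSY_DAY = 100_000
BUSY_DAYS_PER_YEAR = 260
YEARS = 3

# Assumption: the slide's "< 10 million" is a strict per-collection limit.
COLLECTION_LIMIT = 10_000_000


def total_docs(per_day: int, busy_days: int, years: int) -> int:
    """Documents accumulated over the sizing horizon."""
    return per_day * busy_days * years


def collections_needed(docs: int, limit: int) -> int:
    """Smallest number of collections that keeps each one strictly below the limit."""
    return docs // limit + 1


if __name__ == "__main__":
    docs = total_docs(DOCS_PER_BUSY_DAY, BUSY_DAYS_PER_YEAR, YEARS)
    print(f"{docs:,} docs over {YEARS} years")
    print(f"{collections_needed(docs, COLLECTION_LIMIT)} collections needed")
```

With these inputs the repository would need at least 8 collections over three years, which is why the multiple-collections feature of SP1 matters for this exercise.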
In which area of the spreadsheet should I enter the document profile?
Normally, this input area could be used exclusively
• This assumes that fulltext takes about 40% of the original content size
• Probably not a big deal for small repositories, but it could lead to a large overestimate for ones like this (with 78 million docs)
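A rough sketch of what the 40% assumption implies at this scale, using the 78 million documents and the 1 MB average file size given for this system. It assumes 1 MB = 10**6 bytes and 1 TB = 10**12 bytes, and it is not the spreadsheet's actual formula.

```python
TOTAL_DOCS = 78_000_000     # 100,000 docs/day x 260 days/yr x 3 yrs
AVG_FILE_BYTES = 1_000_000  # 1 MB average file size (decimal units assumed)
FULLTEXT_FRACTION = 0.40    # the "40% of original content size" rule


def fulltext_tb(docs: int, avg_bytes: int, fraction: float) -> float:
    """Fulltext index size in TB if the index is a fixed fraction of content."""
    content_tb = docs * avg_bytes / 1e12
    return content_tb * fraction


if __name__ == "__main__":
    estimate = fulltext_tb(TOTAL_DOCS, AVG_FILE_BYTES, FULLTEXT_FRACTION)
    print(f"40% rule estimate: {estimate:.1f} TB of fulltext index")
```

Under these assumptions the rule gives about 31 TB of index for 78 TB of content, the same order of magnitude as the roughly 30 TB the deck reports for this model.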
What does “10% of objects/capacity applying text search” mean?
• Does it mean: “10% of objects will have their content fulltext indexed”?
• Does it mean: “10% of the objects will be fulltext indexed”?
• Does it mean: “10% of the searches will be fulltext (as opposed to just the attributes)”?
• Assume the first one.
• Note that the “Content Loading” area does not allow you to model this!
Alternate model
• 10% Word docs to have content fulltext indexed
  • Space consumption based on meta-data + content
• 90% images to have only meta-data fulltext indexed
  • Space consumption based only on the attributes that are fulltext indexed
Other model (cont’d)
• Uses an alternate space calculation
• Reflects that most documents will have just a small amount of meta-data to fulltext index
• Total fulltext index size is now 4 TB vs. 30 TB in the previous model
Other model (cont’d): note CPU
• Note that the CPUs have not changed between models
• This is incorrect (the initial model should have at least twice the CPUs as stated)
• To be fixed in an upcoming version of the spreadsheet
Other items to worry about
• Disk I/O needs (I/Os per second) reported in the spreadsheet reflect the average over the loading period, not the peak values needed
• To reach high throughput, the fulltext disk I/O subsystem must always be able to achieve several hundred I/Os per second
• Do not put the fulltext index on a single drive (except in the case of a tiny repository)