Metadata Extractors, Content Transformers & Renditions

Metadata Extractors, Content Transformers & Renditions • Neil Mc Erlean

Who am I? • Lead Engineer in the Services Team • 4 years at Alfresco (since 3.2) • Previously worked on • Hybrid Sync • Alfresco in the Cloud • Various services/components • Transformers & Extractors • REST APIs • Actions & Behaviours and more… • Ex-astrophysicist (of which more later)

Talk content • What data is in your content? • How does Alfresco get at it? • What does Alfresco do with it? • How can you use these features? • Introductory material • no prior knowledge assumed

Talk content - Breaking it down • Your content & its metadata • Alternative renditions of your content • Overviews of the 3 services • Java Foundation APIs. JavaScript. • Configuring & extending Alfresco. • All code samples available as runnable tests - download from the website.

#1 Metadata Extraction

#2 Content Transformation • Alfresco uses them to produce • images (thumbnails) • plain text (indexing) • inter-Office transforms • Also generally useful

#3 Rendition Service • Very similar to transformations • More general service • More than just content to content

How do these components work? • Mostly by leveraging existing OSS Java libs • Notably Apache Tika • Some external OS processes too • OpenOffice.org (OOo), LibreOffice • ImageMagick • pdf2swf (swftools) • Some bespoke impls, e.g. zip to txt • ‘Embedded’ thumbnails/previews: iWork, Office

General Considerations • CPU, memory • In process vs. out of process vs. Remote CPU • Selection of ‘best’ extractor/transformer • Stay for Andy Hunt’s talk for Support’s troubleshooting tips

Metadata Extraction

#1 Metadata Extraction • Triggered on content creation or update. • or on demand • ‘Best’ available extractor obtained from MetadataExtracterRegistry. • This Extractor pulls out the metadata. • Format depends on the extractor lib/impl. • key/value pairs • These data are mapped onto the Alfresco content model • configurable mapping. <ExtractorClass>.properties

Metadata extraction - Java
    MetadataExtracterRegistry registry =
        appContext.getBean("metadataExtracterRegistry", MetadataExtracterRegistry.class);
    ContentReader reader =
        contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
    MetadataExtracter extractor = registry.getExtracter(reader.getMimetype());
    Map<QName, Serializable> props = new HashMap<QName, Serializable>();
    // Extracted values are written into props according to the overwrite policy
    extractor.extract(reader, OverwritePolicy.EAGER, props);

Overwrite Policy – when re-extracting • EAGER • overwrite if the extracted value is not null • PRUDENT • as EAGER, and only if the db property doesn’t exist or is null or “” • CAUTIOUS • overwrite only if the existing property is undefined
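
The three policies can be read as predicates over the extracted value and the stored property. The sketch below is plain Java for illustration only: the class, enum and method names are hypothetical, and it is not Alfresco's own OverwritePolicy code. It assumes that "+ above" on the PRUDENT line means the EAGER condition also applies, and that CAUTIOUS looks only at whether the property is defined.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: hypothetical names, not Alfresco's OverwritePolicy class. */
final class OverwritePolicySketch {

    enum Policy { EAGER, PRUDENT, CAUTIOUS }

    /** Decides whether an extracted value should replace what is already stored. */
    static boolean shouldOverwrite(Policy policy, Map<String, String> existing,
                                   String key, String extracted) {
        switch (policy) {
            case EAGER:
                // Overwrite whenever the extractor produced a value.
                return extracted != null;
            case PRUDENT: {
                // EAGER's condition, plus: the stored value is absent, null or "".
                String current = existing.get(key);
                return extracted != null && (current == null || current.isEmpty());
            }
            case CAUTIOUS:
                // Assumption: only write when the property is not defined at all.
                return !existing.containsKey(key);
            default:
                throw new IllegalArgumentException("Unknown policy: " + policy);
        }
    }

    public static void main(String[] args) {
        Map<String, String> stored = new HashMap<>();
        stored.put("cm:title", "");          // defined but empty
        stored.put("cm:author", "Existing"); // defined with a value

        for (Policy p : Policy.values()) {
            System.out.println(p
                    + " title=" + shouldOverwrite(p, stored, "cm:title", "New")
                    + " author=" + shouldOverwrite(p, stored, "cm:author", "New")
                    + " latitude=" + shouldOverwrite(p, stored, "cm:latitude", "51.5"));
        }
    }
}
```

In this reading, only EAGER replaces an existing non-empty author, while all three policies fill in a property that was never set.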

<ExtractorClass>.properties mapping
    namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
    author=cm:author
    title=cm:title
    # Note: need to escape ':' in key name
    geo\:lat=cm:latitude
    geo\:long=cm:longitude

Mapping properties • Can map extracted key-value onto multiple content properties • Can ignore extracted key-values i.e. not map.
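
To make the mapping step concrete, here is a minimal sketch in plain Java that loads a mapping in the format shown on the previous slide and applies it to raw extractor output. The class and method names are hypothetical, and it assumes that mapping one key onto several properties is written as a comma-separated list of targets. It only illustrates the idea; Alfresco's own mapping code lives in AbstractMappingMetadataExtracter.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Illustrative only: hypothetical helper, not Alfresco code. */
final class MappingSketch {

    static final String PREFIX_KEY = "namespace.prefix.";

    /** Maps raw extractor key/value pairs onto fully qualified property names. */
    static Map<String, String> apply(String mappingText, Map<String, String> raw) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(mappingText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String targets = mapping.getProperty(entry.getKey());
            if (targets == null) {
                continue; // no mapping defined: the extracted value is ignored
            }
            // Assumption: several targets are given as a comma-separated list.
            for (String target : targets.split(",")) {
                result.put(qualify(target.trim(), mapping), entry.getValue());
            }
        }
        return result;
    }

    /** Turns "cm:title" into "{uri}title" using the namespace.prefix.* entries. */
    static String qualify(String prefixed, Properties mapping) {
        int colon = prefixed.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Missing prefix: " + prefixed);
        }
        String uri = mapping.getProperty(PREFIX_KEY + prefixed.substring(0, colon));
        return "{" + uri + "}" + prefixed.substring(colon + 1);
    }

    public static void main(String[] args) {
        String mappingText = String.join("\n",
                "namespace.prefix.cm=http://www.alfresco.org/model/content/1.0",
                "author=cm:author",
                "title=cm:title, cm:description",
                "geo\\:lat=cm:latitude");

        Map<String, String> raw = new HashMap<>();
        raw.put("title", "Hello");
        raw.put("geo:lat", "51.5");
        raw.put("producer", "SomeTool"); // unmapped, so dropped

        apply(mappingText, raw).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Note that the key is written as geo\:lat in the file but looked up as geo:lat, because the properties parser removes the escape.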

Metadata extraction - JavaScript
    var action = actions.create('extract-metadata');
    action.execute(nodeRef);

Ways to customise & extend • Customisation of existing extractors • Define new mappings – to an existing or a new content model. • Adding new extractors • Identify 3rd party lib that can read the binary file • Or write your own code to do this • Extend AbstractMappingMetadataExtracter • Or write a Tika plugin • Define metadata mappings • org.alfresco.repo.content.metadata

Recap • Metadata extraction harvests ‘hidden’ data and maps it into Alfresco content model. • Support for many MIME types • Metadata insertion coming • it’s on HEAD but currently disabled • also maps metadata tags to cm:taggable • “Best” extractor selection covered below

Content Transformers

Out of the box transformers • text, html, xml • Microsoft Office (doc & docx formats) • OpenDocument Format • iWork (Keynote, Pages, Numbers) • Images • Shockwave Flash (SWF) • RFC822 email, Outlook .msg email • Adobe PDF, Illustrator, PSD • Electronic publication (epub) • Rich Text (RTF) • MP3 • Archives (ZIP, tar) • Many more

Available transformers • No ‘graph’ of transform paths/mime types • Spring beans extend “baseContentTransformer” • They implement isTransformable(from, to) • They can be • simple (A to B) • ‘complex’ (A to C, via B) • failover (A to B, A to B…) • overlapping (multiple beans for same path) • dynamically un/available (e.g. OOo)
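
The simple, complex and failover shapes listed above can be sketched with a tiny interface. This is plain Java for illustration: the Transformer interface and the factory methods are hypothetical stand-ins that work on strings, not Alfresco's ContentTransformer API, and the failover variant simply tries its candidates in order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Illustrative only: hypothetical types, not Alfresco's ContentTransformer API. */
final class TransformerSketch {

    interface Transformer {
        boolean isTransformable(String from, String to);
        String transform(String content);
    }

    /** Simple: A to B in one step. */
    static Transformer simple(String from, String to, UnaryOperator<String> fn) {
        return new Transformer() {
            @Override public boolean isTransformable(String f, String t) {
                return from.equals(f) && to.equals(t);
            }
            @Override public String transform(String content) {
                return fn.apply(content);
            }
        };
    }

    /** Complex: A to C via B, usable only while both legs are usable. */
    static Transformer complex(String from, String via, String to,
                               Transformer first, Transformer second) {
        return new Transformer() {
            @Override public boolean isTransformable(String f, String t) {
                return from.equals(f) && to.equals(t)
                        && first.isTransformable(from, via)
                        && second.isTransformable(via, to);
            }
            @Override public String transform(String content) {
                return second.transform(first.transform(content));
            }
        };
    }

    /** Failover: several transformers for the same path, tried in order. */
    static Transformer failover(List<Transformer> candidates) {
        return new Transformer() {
            @Override public boolean isTransformable(String f, String t) {
                return candidates.stream().anyMatch(c -> c.isTransformable(f, t));
            }
            @Override public String transform(String content) {
                RuntimeException last = new IllegalStateException("no candidates");
                for (Transformer c : candidates) {
                    try {
                        return c.transform(content);
                    } catch (RuntimeException e) {
                        last = e; // remember the failure and try the next one
                    }
                }
                throw last;
            }
        };
    }

    public static void main(String[] args) {
        Transformer docToPdf = simple("doc", "pdf", s -> "pdf(" + s + ")");
        Transformer pdfToPng = simple("pdf", "png", s -> "png(" + s + ")");
        Transformer docToPng = complex("doc", "pdf", "png", docToPdf, pdfToPng);
        Transformer broken = simple("doc", "pdf", s -> {
            throw new IllegalStateException("service down");
        });
        Transformer resilient = failover(Arrays.asList(broken, docToPdf));

        System.out.println(docToPng.isTransformable("doc", "png"));
        System.out.println(docToPng.transform("report"));
        System.out.println(resilient.transform("report"));
    }
}
```

Because each bean answers isTransformable for itself, the set of possible paths is only discovered by asking, which is why there is no precomputed graph.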

/api/service/mimetypes webscript • http://localhost:8080/alfresco/service/mimetypes • MIME types • Metadata Extractors • Content Transformers • As services come and go (OOo), entries may disappear

/api/service/mimetypes webscript
application/vnd.openxmlformats-officedocument.presentationml.presentation - pptx
    Extractors: org.alfresco.repo.content.metadata.PoiMetadataExtracter
    Transformable To:
        application/pdf = Using a Direct Open Office Connection
        application/vnd.ms-powerpoint = Using a Direct Open Office Connection
        application/vnd.oasis.opendocument.presentation = Using a Direct Open Office Connection
        application/x-shockwave-flash = Complex via: application/pdf
        image/jpeg = Complex via: application/pdf
        image/png = Complex via: application/pdf
        text/html = org.alfresco.repo.content.transform.TikaAutoContentTransformer
        text/plain = org.alfresco.repo.content.transform.TikaAutoContentTransformer
        text/xml = org.alfresco.repo.content.transform.TikaAutoContentTransformer
    Transformable From:
        application/vnd.ms-powerpoint = Using a Direct Open Office Connection
        application/vnd.oasis.opendocument.presentation = Using a Direct Open Office Connection

“Best” transformer selection • Alfresco prefers • available transformers (obviously) • ‘explicit’ transformers • previously fast transformers* • Alfresco doesn’t understand the output quality • pass/fail • fast/slow • * past performance is not a guide to future performance.
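
One way to picture these preferences is as a filter followed by an ordering. The sketch below is plain Java with hypothetical names; it assumes that "explicit" outranks "previously fast", which the slide does not state, and it is not the registry's actual algorithm.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Illustrative only: hypothetical model of the selection preferences. */
final class SelectionSketch {

    static final class Candidate {
        final String name;
        final boolean available;
        final boolean explicit;
        final long averageMillis; // observed average time of past transformations

        Candidate(String name, boolean available, boolean explicit, long averageMillis) {
            this.name = name;
            this.available = available;
            this.explicit = explicit;
            this.averageMillis = averageMillis;
        }
    }

    /** Drops unavailable candidates, then prefers explicit ones, then faster ones. */
    static Optional<Candidate> best(List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> c.available)
                .min(Comparator
                        .comparing((Candidate c) -> !c.explicit) // false sorts first
                        .thenComparingLong(c -> c.averageMillis));
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("officeConnection", false, true, 10),
                new Candidate("genericText", true, false, 50),
                new Candidate("explicitSlow", true, true, 400),
                new Candidate("explicitFast", true, true, 120));

        System.out.println(best(candidates).map(c -> c.name).orElse("none"));
    }
}
```

Nothing in the ordering measures output quality, which matches the point above: the only signals are pass/fail and fast/slow.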

Content Transformation - Java
    ContentTransformerRegistry registry =
        appContext.getBean("contentTransformerRegistry", ContentTransformerRegistry.class);
    ContentReader reader =
        contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
    ContentWriter writer =
        contentService.getWriter(targetNode, ContentModel.PROP_CONTENT, true);
    writer.setEncoding("UTF-8");
    writer.setMimetype(MimetypeMap.MIMETYPE_TEXT_PLAIN);
    // Now have a reader & writer ready to go

Content Transformation – Java ctd.
    // getTransformer returns null if no transformer is available for this path
    ContentTransformer transformer = registry.getTransformer(
        MimetypeMap.MIMETYPE_ZIP,
        reader.getSize(),
        MimetypeMap.MIMETYPE_TEXT_PLAIN,
        null);
    transformer.transform(reader, writer);

Content Transformation - JavaScript
    var action = actions.create('transform');
    action.parameters["destination-folder"] = node.parent;
    action.parameters["assoc-type"] =
        "{http://www.alfresco.org/model/content/1.0}contains";
    action.parameters["assoc-name"] = node.name + "transformed";
    action.parameters["mime-type"] = "text/plain";
    action.execute(testNode);

Config: Transformer Filtering/Debugging • org.alfresco.service.cmr.repository.TransformationOptionLimits • timeouts, size limits, page limits • content.transformer.OpenOffice.mimeTypeLimits.txt.pdf.maxSourceSizeKBytes=5120 • org.alfresco.repo.content.transform.TransformerDebug • contextual logging
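
As a rough illustration of how such a size limit might be evaluated, the plain-Java sketch below builds the property key from a transformer name and a pair of extensions, then compares a source size against it. The helper is hypothetical, and two behaviours are assumptions: a missing property means "no limit", and a source exactly at the limit is still allowed.

```java
import java.util.Properties;

/** Illustrative only: hypothetical helper, not Alfresco's limit handling. */
final class LimitSketch {

    /** Returns true if a source of the given size may be transformed. */
    static boolean withinLimit(Properties config, String transformer,
                               String sourceExt, String targetExt, long sourceBytes) {
        String key = "content.transformer." + transformer
                + ".mimeTypeLimits." + sourceExt + "." + targetExt
                + ".maxSourceSizeKBytes";
        String value = config.getProperty(key);
        if (value == null) {
            return true; // assumption: no entry means no limit
        }
        long maxBytes = Long.parseLong(value.trim()) * 1024L;
        return sourceBytes <= maxBytes; // assumption: the limit is inclusive
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty(
                "content.transformer.OpenOffice.mimeTypeLimits.txt.pdf.maxSourceSizeKBytes",
                "5120");

        long fourMb = 4L * 1024 * 1024;
        long sixMb = 6L * 1024 * 1024;
        System.out.println(withinLimit(config, "OpenOffice", "txt", "pdf", fourMb));
        System.out.println(withinLimit(config, "OpenOffice", "txt", "pdf", sixMb));
        System.out.println(withinLimit(config, "OpenOffice", "doc", "pdf", sixMb));
    }
}
```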

Extending • Follow the Alfresco patterns • org.alfresco.repo.content.transform • Remember the chains • Remember the subsystems • ImageMagick • OpenOffice • Remember the Enterprise variants • JodConverter

Recap • Many transformations & paths possible • No graph • Can be expensive in CPU/memory • Transformation to text = free indexing • No link between source & transformed content • Thumbnails are children of their source nodes • Bespoke behaviours ensure thumbnails are updated

Renditions

Renditions • A more general feature than transformers • Although with a strong overlap • Thumbnails are renditions • Previews are renditions • Not all renditions are thumbnails/previews

Renditions • Flexible location • Always associated to their source node. • Child nodes of their source node. • Child nodes of another folder node. • Updated when their source updates. • Can be disabled with marker aspect • rn:preventRenditions • See ‘preventRenditions’ spring bean to register other ‘unrenditionable’ content classes • Can reflect the content and/or metadata of their source node.

Standard rendition engines • reformat: redirects to vanilla transforms • image: image manipulation parameters • freemarker: run some FTL against source content • xslt: run XSLT on (XML) source node • composite: rendition series [reformat, crop]

Persistence of Rendition Definitions • Create Rendition Definition • Set parameter values on it • Execute it against a source node • Definitions can be persisted • Useful for complex or commonly used • RenditionService.save(), .load() • Saved into Alfresco’s Data Dictionary

Renditions - Java
    NodeRef jpgNodeRef; // the source image node, obtained elsewhere
    QName renditionName = QName.createQName(
        NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");
    RenditionDefinition renditionDef = renditionService.createRenditionDefinition(
        renditionName, "imageRenderingEngine");
    renditionDef.setParameterValue(ImageRenderingEngine.PARAM_RESIZE_WIDTH, 128);
    renditionDef.setParameterValue(ImageRenderingEngine.PARAM_RESIZE_HEIGHT, 512);
    renditionDef.setParameterValue(ImageRenderingEngine.PARAM_MAINTAIN_ASPECT_RATIO, false);
    ChildAssociationRef chAssRef = renditionService.render(jpgNodeRef, renditionDef);

Renditions - JavaScript
    var renditionDef = renditionService
        .createRenditionDefinition("cm:cropResize", "imageRenderingEngine");
    renditionDef.parameters["destination-path-template"]
        = "/Company Home/Cropped Images/${name}.jpg";
    renditionDef.parameters["isAbsolute"] = true;
    renditionDef.parameters["xSize"] = 50;
    renditionDef.parameters["ySize"] = 50;
    renditionService.render(testNode, renditionDef);
    var renditions = renditionService.getRenditions(testNode);

Recap • Renditions == Transformations++ • More complex, more powerful

End
