Building Greenstone Collections from the Command Line
Basic commands
• From the Greenstone installation directory, run "setup.bat" (Windows users) or "setup.sh" (Unix/Linux users)
• To create a collection, type "perl -S mkcol.pl -creator youremail@somewhere.com collection_name"
• To import documents into a collection, type "perl -S import.pl collection_name"
• To build a collection, type "perl -S buildcol.pl collection_name"
• For further details, read pages 9-19 of the developer's guide
Building a Collection in Greenstone
[Diagram: Documents -> Import -> import.pl (plugins) -> Archives (XML documents) -> build.pl (classifiers) -> Index -> browsing and full text on the Web]
Importing documents
• Plugins process source documents in different formats and associate the corresponding metadata with them
• The output of this process is a set of XML documents encoded in the Greenstone Archive format, specified by the following DTD:
    <!DOCTYPE GreenstoneArchive [
      <!ELEMENT Section (Description,Content,Section*)>
      <!ELEMENT Description (Metadata*)>
      <!ELEMENT Content (#PCDATA)>
      <!ELEMENT Metadata (#PCDATA)>
      <!ATTLIST Metadata name CDATA #REQUIRED>
    ]>
Automating collection building tasks
• Batch files can automate many of the tasks
• You can create a batch file to import and rebuild a collection
• Try copying and pasting the following lines into a batch file named "rebuild.bat":
    perl -S import.pl -removeold %1
    perl -S buildcol.pl %1
• Execute the batch file by typing "rebuild.bat collection_name"
• Many other commands can be combined in a batch file in the same way
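The same two steps can also be driven from a script on platforms without batch files. The following is a minimal sketch in Python, assuming perl and the Greenstone scripts are already on the path (that is, the setup script has been run in the same shell). The function names rebuild_commands and rebuild are illustrative and not part of Greenstone.

```python
import subprocess


def rebuild_commands(collection):
    """Return the two command lines from rebuild.bat as argument lists."""
    return [
        ["perl", "-S", "import.pl", "-removeold", collection],
        ["perl", "-S", "buildcol.pl", collection],
    ]


def rebuild(collection):
    """Run the import step, then the build step, stopping at the first failure."""
    for command in rebuild_commands(collection):
        # check=True raises CalledProcessError if a step exits with an error
        subprocess.run(command, check=True)
```

Calling rebuild("collection_name") has the same effect as "rebuild.bat collection_name", except that the build step is skipped when the import step fails.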
Importing documents (cont.)
• An example:
    <Section>
      <Description>
        <Metadata name="gsdlsourcefilename">ec158e.txt</Metadata>
        <Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>
        <Metadata name="Identifier">HASH0158f56086efffe592636058</Metadata>
        <Metadata name="gsdlassocfile">cover.jpg:image/jpeg:</Metadata>
        <Metadata name="gsdlassocfile">p07a.png:image/png:</Metadata>
      </Description>
    </Section>
• Note: gsdlsourcefilename is the original file from which the Greenstone archive file was generated, and gsdlassocfile is a file associated with the document (e.g. an image file)
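To inspect what a plugin produced, the metadata in an archive section can be read back with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module on a shortened copy of the example above. The Content element is a placeholder added so that the sample matches the DTD, and section_metadata is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example section; the Content text is a placeholder.
ARCHIVE_SECTION = """
<Section>
  <Description>
    <Metadata name="gsdlsourcefilename">ec158e.txt</Metadata>
    <Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>
    <Metadata name="gsdlassocfile">cover.jpg:image/jpeg:</Metadata>
    <Metadata name="gsdlassocfile">p07a.png:image/png:</Metadata>
  </Description>
  <Content>(document text)</Content>
</Section>
"""


def section_metadata(xml_text):
    """Map each metadata name to the list of values given in the Description."""
    section = ET.fromstring(xml_text.strip())
    result = {}
    for element in section.findall("./Description/Metadata"):
        # A name such as gsdlassocfile may repeat, so values are kept in a list
        result.setdefault(element.get("name"), []).append(element.text or "")
    return result


metadata = section_metadata(ARCHIVE_SECTION)
```

Values are collected in lists because the same metadata name, such as gsdlassocfile, can appear more than once in a section.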
Document Metadata
• Greenstone plugins recognize only a small set of metadata tags
• There are three ways to assign metadata to documents in a collection: 1) an index.txt file, 2) a metadata.xml file and 3) a modified Greenstone plugin
• An index.txt file is a space-separated file that assigns a list of metadata values to documents in a collection. It should be placed in the collection's import directory
Document Metadata (cont.)
• To inform Greenstone about the existence of this file, include the IndexPlug plugin in your collect.cfg file or add this plugin to your plugin list in GLI
• An example of the index.txt file is as follows:
    key: Title Date Cast Director
    "analyze.html" "Analyze That" "2002" "Robert De Niro, Billy Crystal, Lisa Kudrow" "Harold Ramis"
    "majestic.html" "Majestic, The" "2001" "Jim Carrey, Bob Balaban, Jeffrey DeMunn" "Frank Darabont"
• The fields in this file are separated by spaces and enclosed in double quotes. Their positions are matched against the list of field names in the first line of the file
• Note that the first field of each entry must be the filename of a document
• The trailers collection uses this approach to assign metadata to documents in a collection
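The way field positions line up with the names on the first line can be sketched in a few lines of Python. This is only an illustration of the file layout shown above, not Greenstone's IndexPlug code. It assumes the first line starts with "key:" and that every later line begins with the filename. parse_index is an illustrative name.

```python
import shlex

INDEX_TXT = '''key: Title Date Cast Director
"analyze.html" "Analyze That" "2002" "Robert De Niro, Billy Crystal, Lisa Kudrow" "Harold Ramis"
"majestic.html" "Majestic, The" "2001" "Jim Carrey, Bob Balaban, Jeffrey DeMunn" "Frank Darabont"
'''


def parse_index(text):
    """Return {filename: {field name: value}} for an index.txt-style listing."""
    lines = [line for line in text.splitlines() if line.strip()]
    names = lines[0].split()[1:]  # drop the leading "key:" marker
    records = {}
    for line in lines[1:]:
        # shlex.split keeps quoted fields together, including spaces and commas
        fields = shlex.split(line)
        records[fields[0]] = dict(zip(names, fields[1:]))
    return records


records = parse_index(INDEX_TXT)
```

Quoting matters here: without the double quotes, a value such as "Majestic, The" would be split into two fields and every later field would shift by one position.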
Document Metadata (cont.)
• The second approach uses an XML file to assign metadata to documents in a collection
• To inform Greenstone that you would like to use the metadata.xml file, include the string "plugin RecPlug -use_metadata_files" in your collect.cfg file, or check the use_metadata_files flag after clicking on the configure plugin button in the GLI
• The benefit of using an XML file over the previous approach is that a web browser can check the tags for you when you open the file
Document Metadata (cont.)
    <?xml version="1.0" ?>
    <DirectoryMetadata>
      <FileSet>
        <FileName>MARTYN_DR_02002066.html</FileName>
        <Description>
          <Metadata name="PlayerID">MARTYN_DR_02002066</Metadata>
          <Metadata name="PlayerProfile"></Metadata>
          <Metadata name="PlayerName">Damien Richard Martyn</Metadata>
          <Metadata name="FullSizeImage">http://www-usa.cricket.org//perl/picture.cgi/030730</Metadata>
          <Metadata name="ThumbnailImage">http://www-usa.cricket.org//perl/picture.cgi/030730/inline?alt=1</Metadata>
          <Metadata name="CoverImage">MARTYN_DR_02002066.jpg</Metadata>
          <Metadata name="Country">Australia</Metadata>
          <Metadata name="BattingStyle">Right Hand Bat</Metadata>
          <Metadata name="BowlingStyle">Right Arm Medium</Metadata>
        </Description>
      </FileSet>
      <FileSet>
        <FileName>POTHECARY_JE_03001137.html</FileName>
        <Description>
          <Metadata name="PlayerID">POTHECARY_JE_03001137</Metadata>
          <Metadata name="PlayerProfile"></Metadata>
          <Metadata name="PlayerName">James Edward Pothecary</Metadata>
          <Metadata name="Country">South Africa</Metadata>
          <Metadata name="BattingStyle">Right Hand Bat</Metadata>
          <Metadata name="BowlingStyle">Right Arm Medium</Metadata>
        </Description>
      </FileSet>
    </DirectoryMetadata>
• Can you recognize the XML structure this uses?
Document Metadata (cont.)
• Here's the answer:
    <DirectoryMetadata>
      <FileSet>
        <FileName>text</FileName>
        <Description>
          <Metadata name="name1">some text</Metadata>
          <Metadata name="name2">some text</Metadata>
          other Metadata tags…
        </Description>
      </FileSet>
      other FileSet tags…
    </DirectoryMetadata>
• Note that XML is case sensitive
• The cricket collection uses metadata.xml to assign metadata to its documents
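When there are many documents, a metadata.xml file with this structure can be generated rather than typed by hand. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name build_directory_metadata and the sample values are illustrative, and the XML declaration is left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_directory_metadata(files):
    """Build DirectoryMetadata XML from {filename: {metadata name: value}}."""
    root = ET.Element("DirectoryMetadata")
    for filename, metadata in files.items():
        fileset = ET.SubElement(root, "FileSet")
        ET.SubElement(fileset, "FileName").text = filename
        description = ET.SubElement(fileset, "Description")
        for name, value in metadata.items():
            # Element and attribute names are written exactly as given,
            # which matters because XML is case sensitive
            ET.SubElement(description, "Metadata", {"name": name}).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = build_directory_metadata({
    "MARTYN_DR_02002066.html": {
        "PlayerName": "Damien Richard Martyn",
        "Country": "Australia",
    },
})
```

Generating the file through an XML library also guarantees that every tag is closed and that special characters in values are escaped.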
Document Metadata (cont.)
• We can also customize a plugin to extract metadata from a document
• We will look at modifying the TextPlug to extract Ratings, Genre and Subject from a few documents in the trailers collection
Structuring Documents into Sections
• Sometimes source documents have to be structured into sections and subsections
• This can be done easily by adding the following HTML comments to your documents:
    <!--
    <Section>
      <Description>
        <Metadata name="Title">Realizing human rights for poor people: Strategies for achieving the international development targets</Metadata>
      </Description>
    -->
    (text of section goes here)
    <!--
    </Section>
    -->
• You can also nest subsections within a section by placing another level of <Section> before the </Section> tag
• Look at one of the HTML files in the demo collection for an example
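If the section boundaries are already known, the comment markers can be added by a small script. The sketch below is one possible way to do it in Python. The function name wrap_section is illustrative, and it assumes titles do not contain "--", which would end the HTML comment early.

```python
import html


def wrap_section(title, body, subsections=""):
    """Wrap body text in Greenstone section markers written as HTML comments."""
    safe_title = html.escape(title, quote=False)
    return (
        "<!--\n<Section>\n  <Description>\n"
        f'    <Metadata name="Title">{safe_title}</Metadata>\n'
        "  </Description>\n-->\n"
        f"{body}\n"
        f"{subsections}"  # nested sections go before the closing marker
        "<!--\n</Section>\n-->\n"
    )


inner = wrap_section("Background", "<p>Subsection text</p>")
outer = wrap_section("Introduction", "<p>Section text</p>", inner)
```

Passing an already wrapped section as the subsections argument places it before the closing </Section> marker, which is how nesting is expressed.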
Types of Browsing Indexes
• SectionList
• AZList
• AZSectionList
• DateList
• Hierarchy
Creating Browsing Indexes
• Certain classifiers generate browsing structures that are hierarchical
• They are useful for subject classifications and organization hierarchies
• The hierarchy itself must be supplied with the flag -hfile <filename> when the classifier is defined in the collect.cfg file
• For example:
    classify Hierarchy -hfile sub.txt -metadata Subject -sort Title
Creating Browsing Indexes (cont.)
• Note that sub.txt has to reside in the collection's etc directory
• Certain classifiers don't require an explicit hierarchy to be defined. For instance, the AZList, DateList and List classifiers generate a selection list from the values of the corresponding metadata:
    classify List -metadata Howto
    classify AZList -metadata Title
Creating Browsing Indexes (cont.)
• Explicit hierarchies have to be defined according to the following format:
    <identifier> <position in hierarchy> <name>
• For example:
    1 1 "General reference"
    1.2 1.2 "Something else"
    2 2 "…."
• This means that a document is placed under the first classification if its value for the classifier's metadata element is 1
• Look at the demo collections for examples
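The three-column layout can be checked with a short script before a collection is built. The sketch below is an illustration in Python, not Greenstone's own parser. It assumes one entry per line with the name in double quotes, and it treats the number of dot-separated parts in the position as the depth in the hierarchy. parse_hierarchy is an illustrative name.

```python
import shlex

HIERARCHY = '''1 1 "General reference"
1.2 1.2 "Something else"
'''


def parse_hierarchy(text):
    """Parse '<identifier> <position in hierarchy> <name>' lines into dicts."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # shlex.split keeps the quoted name together as a single field
        identifier, position, name = shlex.split(line)
        entries.append({
            "identifier": identifier,
            "position": position,
            "name": name,
            "depth": position.count(".") + 1,  # assumption: dots mark levels
        })
    return entries


entries = parse_hierarchy(HIERARCHY)
```

A line with too few or too many fields raises a ValueError when it is unpacked, which makes malformed entries easy to spot.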
Creating Browsing Indexes (cont.)
• Documents are treated internally as tree nodes by Greenstone
• There are three types of nodes: VList, HList and DateList
• For example, an AZList consists of a collection of VList nodes that represent documents
• The arguments accepted by the various classifiers are listed on page 48 of the developer's guide
Formatting Browsing Indexes
• Each classifier has an implicit name derived from its position in the collect.cfg file. For example, the third classifier specified in the file is called CL3
• Tags in the formatting strings:
    [Text]: document text
    [link] … [/link]: link to the document itself
    [icon]: icon representing the resource
    [metadata-name]: value of the metadata associated with this document
Formatting Browsing Indexes (cont.)
• For example:
    format CL4VList "<br>[link][Howto][/link]"
• Conditional statements are supported in the formatting string. They are enclosed by the '{' and '}' characters in these formats:
    {If}{[metadata], then clause, else clause}
    {Or}{action, another-action, another-action, etc}
• The {If} statement works like the if statement in most programming languages
• The {Or} statement evaluates the items in the list in order and stops at the first one that is non-null. That value is sent to the output and evaluation ends
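The behaviour of {If} and {Or} described above can be imitated with a small evaluator. The sketch below is a simplified illustration in Python and is not how Greenstone itself processes format strings. It assumes clauses contain no commas or nested braces, treats missing metadata as empty, and leaves the [link], [/link], [icon] and [Text] tags untouched. The names expand and RESERVED are illustrative.

```python
import re

RESERVED = {"link", "icon", "Text"}  # tags left for Greenstone to fill in


def expand(fmt, metadata):
    """Expand [Name], {If}{...} and {Or}{...} in a simplified format string."""

    def sub_meta(text):
        def replace(match):
            name = match.group(1)
            if name in RESERVED:
                return match.group(0)
            return metadata.get(name, "")  # missing metadata counts as empty
        return re.sub(r"\[([A-Za-z0-9_]+)\]", replace, text)

    def do_if(match):
        parts = [part.strip() for part in match.group(1).split(",")]
        condition = sub_meta(parts[0])
        then_clause = parts[1] if len(parts) > 1 else ""
        else_clause = parts[2] if len(parts) > 2 else ""
        return sub_meta(then_clause if condition else else_clause)

    def do_or(match):
        for item in match.group(1).split(","):
            value = sub_meta(item.strip())
            if value:  # stop at the first non-empty item
                return value
        return ""

    fmt = re.sub(r"\{If\}\{([^{}]*)\}", do_if, fmt)
    fmt = re.sub(r"\{Or\}\{([^{}]*)\}", do_or, fmt)
    return sub_meta(fmt)


line = expand("<br>[link]{Or}{[Howto],[Title]}[/link]", {"Title": "Untitled"})
```

Here the document has no Howto value, so {Or} falls through to the Title and the result is a link labelled "Untitled".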
Formatting Browsing Indexes (cont.)
• For example:
    format VList "<td valign=top>[link]<img src=_httpprefix_/collect/cricket/images/[PlayerID].jpg border=0>[/link]</td><td>[link][Title][/link]</td><td>{If}{[HasAudio],<a href=[audioURL]><img src=_httpprefix_/collect/cricket/images/wav.jpg border=0></a>}</td>"
Customizing the look and feel of Greenstone
• The relevant files are in the gsdl/macros directory:
    Base.dm: global macros, such as custom buttons
    English.dm: text for the corresponding language
    Home.dm: the main GSDL page
    Gsdl.dm: the About Greenstone page
    Style.dm: page layout
    Query.dm: query form layout
Customizing the look and feel of Greenstone (cont.)
• Background image (chalk.gif), in Base.dm:
    _httpiconchalk_ {_httpimg_/chalk.gif}
    _widthchalk_ {2000}
    _heightchalk_ {10}
• Custom button, in Base.dm:
    _Genrewidth_ {_widthtGenrex_}
    _imageGenre_ {_gsimage_(_httpbrowseGenre_,_httpicontGenreof_,_httpicontGenreon_,Genre,_textimageGenre_)}
    _icontabGenregreen_ {<img src="_httpicontGenregr_" width=_widthtGenrex_ border=0>}
    _icontabGenregreen_[v=1] {_texticontabGenregreen_}
Customizing the look and feel of Greenstone (cont.)
• Document.dm:
    _textGenrepage_ {_texticonhGenre_}
    _iconGenrepage_ {<img src="_httpiconhGenre_" width="_widthhGenre_" height="_heighthGenre_">}
    _iconGenrepage_ [v=1] {<h2>_texticonhGenre_</h2>}
Customizing the look and feel of Greenstone (cont.)
• English.dm:
    _textimageGenre_ {Browse by Genre}
    _texticontabGenregreen_{Genre}
    _httpicontGenregr_{_httpimg_/tGenregr.gif}
    _httpicontGenreon_{_httpimg_/tGenreon.gif}
    _httpicontGenreof_{_httpimg_/tGenreof.gif}
    _widthtGenrex_ {114}
    _texticonhGenre_ {Genre}
    _httpiconhGenre_ {_httpimg_/h\_Genre.gif}
    _widthhGenre_ {250}
    _heighthGenre_ {57}
    _textGenreshort_ {access publications by Genre}
    _textGenrelong_ { <p>You can <i>access my documents by whatever I have defined</i> by pressing the <i>Genre</i> button. This brings up a list of documents. }