410 likes | 836 Views
Chapter-7: Case Study www.AhsanAbdullah.com. Background . Data being recorded for decades by several organization, mostly never digitized and never used for decision making. Under-utilization.
E N D
Chapter-7: Case Study www.AhsanAbdullah.com (c) 2008 Dr. Ahsan Abdullah
Background • Data being recorded for decades by several organization, mostly never digitized and never used for decision making. Under-utilization. • Data is horizontally wide i.e. 100+ attributes and vertically deep i.e. tens of thousands of rows. • Huge potential for long-term and short-term decision making. • Decision making not data driven, but based on “expert” judgment, sometimes with tragic results. (c) 2008 Dr. Ahsan Abdullah
Area under study • Punjab is the bread-basket of Pakistan, and administratively divided into eight divisions. • Multan division is the cotton hub of Punjab, which consists of district Multan. • District Multan has three tehsils, which are furhter divided into central points or Markaz. • This study is about 10 tehsils of Multan using data for years 2001 and 2002. (c) 2008 Dr. Ahsan Abdullah
Area under study: Map • Key Markaz • 1 Bosan • 2 Qadirpurran • 3 Multan • Makhdum Rashid • Mumtazabad • 6 Shujabad • 7 Hafizwala • 8 Jalalpur Pirwala • 9 Qasba Marral (c) 2008 Dr. Ahsan Abdullah
Major Players • A pest is an insect that eats the crop. There are two types of cotton pests: • Sucking pests (White Fly, Jassid, Thrips etc.) • Boll Worms (spotted boll-worm, pink boll-worm) • A predator is an insect that eats the pest. • Such as Lady bug Beetle, Spiders, Ants etc. • Cotton is also effected by virus • Cotton Leaf Curl Virus (CLCV) (c) 2008 Dr. Ahsan Abdullah
Economic Threshold Level (ETL_A) • The pest population beyond which it is cost effective to use pesticide. • Pesticide is a poison which is used to kill pests. • Note that eradicating pests is NOT feasible, controlling pest population is feasible. (c) 2008 Dr. Ahsan Abdullah
Pests & Predators Pests Predators Source:United States Department of Agriculture (USDA) (c) 2008 Dr. Ahsan Abdullah
ETL_A: Graph Economic Injury ETL_A Pest Population Time (c) 2008 Dr. Ahsan Abdullah
The need • Extensive losses to cotton crop due to pest attacks in 1983 resulted in the establishment of Directorate of Pest Warning in 1984. • Since 1984 scouts from the Directorate have been sampling fields and recording data and advising farmers. • During 2003-04 season, Boll Worm attack on the cotton crop resulted in a loss of nearly 0.5 M bales. • Weather not the only factor, but points to a multitude of factors, requiring efficient and effective data analysis, for better decision making. (c) 2008 Dr. Ahsan Abdullah
The need: IT in Agriculture • The volume of pest scouting data accumulated todate is enormous both horizontally and vertically. • A typical pest scouting sheet consists of 35 variables or attributes. • Metrological data consists of 50+ variables. • Coarse estimate of pest scouting data recorded for the cotton crop alone stands at 5+ million records, and growing. • Tasking the human brain alone, for synthesis of information from this data is not only impractical but dangerous too. • Need a Data Warehouse, OLAP tools and Data Mining to analyze the data. (c) 2008 Dr. Ahsan Abdullah
Agro-Informatics • “I.T. sector is at the heart of the economic revival of Pakistan”President of Pakistan, Launching of VU, Mar. 23, 2003. • Agriculture is the backbone of our economy, upto 74% of the population is dependent on it. • IT is an enabler, and has the potential to benefit everyone when applied in Agriculture. • IT + Agriculture: A win-win scenario. (c) 2008 Dr. Ahsan Abdullah
Agro-Informatics: Relevant URL To know more about Agro-Informatics, visit www.agroict.org (c) 2008 Dr. Ahsan Abdullah
How to go about? • Discussed several DWH implementation methodologies in lectures 32-35. • Will adopt a pilot project approach, because: • A full-blown DWH requires extensive investment. • Show users the value of DSS. • Establish blue print for full-blown system. • Identify problem areas. • Reveal true data demographics. • Pilot projects are supposed to work with limited data. (c) 2008 Dr. Ahsan Abdullah
The 12-step Approach of Shaku Atre (c) 2008 Dr. Ahsan Abdullah
Step-1: Determine User’s needs • A data warehouse or a data mart? • User not educated, think DWH is a big Excel sheet. • Some confuse DSS with Decision Making Systems. • DWH at Federal level. • Data marts at district level. • What are the business objectives? • Good governance via E-Governance. • Timely and data driven decision making. • Fore warning of farmers. • Timely planning for purchase/stocking of pesticides. (c) 2008 Dr. Ahsan Abdullah
Step-1: Determine Users needs • Cost/benefit analysis, project estimation & risk assessment • Cost of a prototype system fraction of the loss in one year. • Required data either not made available or expensive. • Required data in very raw format. • People hungry for ideas looking for a solution on a plate. • Where do I get the data from? • Data available at Directorate of pest warning. • Raw form • Metrological department asking too much for data collected using public funds.. • Weather forecasts downloaded from web site of www.dawn.com (c) 2008 Dr. Ahsan Abdullah
Steps-2&3: Determine DBMS Sever & Hardware Platform • Hardware and Software along with trained people was already available on Campus. • The pilot DWH was implemented using NCR Teradata Data Warehousing solution for NT running on a dual Intel 950 Mhz Xeon processors server and 1GB of RAM. • Total internal Hard Disk capacity of the server amounted to 36 GB while external RAID control supports 8 additional SCSIs of 18 GB each. (c) 2008 Dr. Ahsan Abdullah
Step-4: Information & Data Modeling Dimensional Modeling (c) 2008 Dr. Ahsan Abdullah
Step-4: Information & Data Modeling Simplified ERD Other field inputs such as irrigation, fertilizer etc. not included as data not available (c) 2008 Dr. Ahsan Abdullah
Step-4: Information & Data Modeling KEY WFN: White Fly Nymph WFA: White Fly Adult W: White B: Brown S: Small Larvae L: Large Larvae (optional) Logical & Physical Design (c) 2008 Dr. Ahsan Abdullah
Step-5: Construct Metadata Repository • What means what was important. • Who access what was not important as single user environment. • Meetings held similar to requirements definition, and questions transformed into SQL queries. • Meta data issues still cropped up, more of business rules. (c) 2008 Dr. Ahsan Abdullah
Step-5: Surprise case Sucking pests Ball Worm Complex SBW: Spotted Ball Worm ABW: Army Ball Worm PBW: Pink Ball Worm If pest population is low, predator population will also be low, because there will be less “food” for predators to live on i.e. pests. (c) 2008 Dr. Ahsan Abdullah
Step-6: Data Acquisition & Cleansing Hand filled pest scouting sheet Typed pest scouting sheet (c) 2008 Dr. Ahsan Abdullah
Step-6: Issues • The pest scouting sheets are larger than A4 size (8.5” x 11”), hence the right end was cropped when scanned on a flat-bed A4 size scanner. • The right part of the scouting sheet is also the most troublesome, because of pesticide names for a single record typed on multiple lines i.e. for multiple farmers. • As a first step, OCR (Optical Character Reader) based image to text transformation of the pest scouting sheets was attempted. But it did not work even for relatively clean sheets with very high scanning resolutions. • Subsequently DEO’s (Data Entry Operators) were employed to digitize the scouting sheets by typing. (c) 2008 Dr. Ahsan Abdullah
Step-6: Why the issues? • Major issues of data cleansing had arisen due to data processing and handling at four levels by different groups of people • Hand recordings by the scouts at the field level. • Typing hand recordings into data sheets at the DPWQCP office. • Photocopying of the typed sheets by DPWQCP personnel. • Data entry or digitization by hired data entry operators. (c) 2008 Dr. Ahsan Abdullah
Step-7: Transform, Transport & Populate (c) 2008 Dr. Ahsan Abdullah
Motivation For Transformation • Trivial queries give wrong results. • Static and dynamic attributes • Static attributes recorded repeatedly. (c) 2008 Dr. Ahsan Abdullah
Step-7: Resolving the issue • Solution: Individualization of cultivated fields. • Technique similar to BSN used to fix names. • Unique ID assigned to farmers. • BSN used again, and unique ID assigned to fields. • Results: • Limitation: Field individualization not perfect. Some cases of farmers with same geography, sowing date, same variety and same area. Such cases were dropped. (c) 2008 Dr. Ahsan Abdullah
Step-8: Middleware Connectivity • Since the source data is maintained in a non digital format, hence connectivity with the data warehouse was irrelevant. • Once digitized, it was rather straightforward to load data into the warehouse. • Furthermore, in the foreseeable future, it was not anticipated that the scouting sheets were going to be maintained in a digitized form. (c) 2008 Dr. Ahsan Abdullah
Step-9-11: Prototyping, Querying & Reporting • Implemented the prototype with user involvement. • Applications developed • 10. A data mining tool was also developed based on an indigenous technique that used crossing minimization paradigm for unsupervised clustering. • 11. A low-cost OLAP tool was indigenously developed; actually it was a Multi dimensional OLAP or MOLAP. • Use querying & reporting tools • The following SQL query was used for validation: SELECT Date_of_Visit, AVG(Predators), …………………………AVG(Dose1+Dose2+Dose3+Dose4) FROM Scouting_Data WHERE Date_of_Visit < #12/31/2001# and predators > 0 GROUP BY Date_of_Visit; (c) 2008 Dr. Ahsan Abdullah
Step-12: Deployment & System Management Since a pilot project, therefore, the traditional deployment methodologies and system management techniques were not followed to the word, and are not discussed here. (c) 2008 Dr. Ahsan Abdullah
Agri-DSS usage: Data Validation • Quality and validity of the underlying data is the key to meaningful and authentic analysis. • After ensuring a satisfactory level of data quality (based on cost-benefit trade-off) extremely important to scientifically validate the data that the DWH will constitute. • Some very natural checks were employed for this purpose. Relationship between the pesticide spraying and predator (beneficial insects) population is a fact well understood by agriculturists. • Predator population decreases as pesticide spray increases and then continually decreases till the end of season. (c) 2008 Dr. Ahsan Abdullah
Agri-DSS usage: Data Validation Graph (c) 2008 Dr. Ahsan Abdullah
Agri-DSS usage: Spray Dates • Working Behaviors at Field Level: Spray dates • As expected, apparently spray dates and spray frequency for 2001 and 2002 do not displayed any correlation; as it is dependent on pest populations, availability of pesticides etc. • For deep analysis, moving average of sprays for five days, and a moving correlation of sprays for five days were calculated and plotted. • For the sake of uniformity, the moving average of spray was normalized using the maximum spray frequency. (c) 2008 Dr. Ahsan Abdullah
Agri-DSS usage: Spray Dates Graph No relationship should have existed for the two years. But note the surprising finding that most sprays occurring on and around 12th Aug. in BOTH years with high correlation, appearing as a spike. Also note the dip in sprays around 11th Sep.! Sowing at predetermined time makes sense, as it is under the control of the farmer, but that is not true for spraying. Pests don’t follow calendars; therefore, whenever, ETL_A is crossed pesticides are sprayed. (c) 2008 Dr. Ahsan Abdullah
Agri-DSS usage: Explaining Findings 14th Aug. is the independence day of Pakistan and a national holiday. In Pakistan, people are in a habit of sandwiching gazetted holidays with casual leaves; consequently businesses are closed for a longer period, including that of pesticide suppliers. 14th Aug. occurred on Tue and Wed in 2001 and 2002, respectively, thus making it ideal to stretch the weekend. During Aug/Sep. humidity is also high, with correspondingly high chances of pest infestations. Therefore, apparently the farmers decided not to take any chances, and started spraying around 11th Aug.; evidently even when it was not required. The weather forecast for 13 Aug. 2001 and 2002 was showers and cloudy, respectively. Therefore, most likely the pesticide sprayed was washed-off. Decline in sprays around 9/11 could not be explained. (c) 2008 Dr. Ahsan Abdullah
Agri-DSS usage: Sowing Dates The results of querying the sowing date based on the day of the week are shown below: Observe least number of sowings done on Thursdays, in each year. This finding was later confirmed by extension personnel. Multan is famous for its shrines. Thursdays are usually related with religious festivals and activities, a mix of devotion and recreation, and usually held at shrines, hence a tendency of doing less work on Thursdays. Similar behavior was observed for spraying too. (c) 2008 Dr. Ahsan Abdullah
Conclusions & Lessons • ETL is a big issue. There are no digitized operational databases, data entry of these sheets is very expensive, slow and prone to errors. • Each farmer is repeatedly visited by agriculture extension people. This results in repetition of information, about land, sowing date, variety etc. Hence, farmer and land individualization are critical. Such an individualization task is hard to implement for multiple reasons. • There is a skewness in the scouting data. Public extension personnel (scouts) are more likely to visit educated or progressive farmers, as it makes their job of data collection easy. Furthermore, large land owners and influential farmers are also more frequently visited by the scouts. Thus the data does not give a true statistical picture of the farmer demographics. • All pests are not present all the time (such as ball worms), most of the time second spray is not done (or not recorded), and hence tables are sparse. Therefore, it is better to split tables so as to decrease header size and table space. • Unlike traditional data warehouse where the end users are decision makers, here the decision-making goes all the way “down” to the extension level. This presents a challenge to the analytical operations’ designer, as the findings must be fairly simple to understand and communicate. (c) 2008 Dr. Ahsan Abdullah