Information Retrieval, Social Network Analysis, and Knowledge Discovery for Homeland Security. Siddharth Kaza (sidd@u.arizona.edu) MIS 596A Mar 21, 2007. Outline. Homeland Security Centers of Excellence Information Retrieval using COPLINK Social Network Analysis of Criminal Networks
Information Retrieval, Social Network Analysis, and Knowledge Discovery for Homeland Security Siddharth Kaza (sidd@u.arizona.edu) MIS 596A Mar 21, 2007
Outline • Homeland Security Centers of Excellence • Information Retrieval using COPLINK • Social Network Analysis of Criminal Networks • Criminal Network Visualizer: System Demo • Mutual Information Analysis to Identify High-risk Vehicles
Homeland Security Centers of Excellence • The Department of Homeland Security (DHS) Science and Technology office funds and establishes Centers of Excellence (COE) to conduct fundamental research and education for HS issues. • Each center is led by a university in collaboration with partners from other institutions, agencies, laboratories, think tanks, and the private sector. • Generate and disseminate knowledge and technology to advance the homeland security mission. • Create and leverage intellectual capital and nurture a homeland security science and engineering workforce.
Eight COEs • The Center for Risk and Economic Analysis of Terrorism Events (CREATE), led by the University of Southern California, evaluates the risks, costs, and consequences of terrorism. • The National Center for Food Protection and Defense (NCFPD), led by the University of Minnesota, defends the safety of the food system from pre-farm inputs through consumption by establishing best practices. • The National Center for Foreign Animal and Zoonotic Disease Defense (FAZD), led by Texas A&M University, protects against the introduction of high-consequence foreign animal and zoonotic diseases. • The National Consortium for the Study of Terrorism and Responses to Terrorism (START), led by the University of Maryland, informs decisions on how to disrupt terrorists and terrorist groups.
Eight COEs (cont.) • The National Center for the Study of Preparedness and Catastrophic Event Response (PACER), led by Johns Hopkins University, optimizes our Nation's preparedness in the event of a high-consequence natural or man-made disaster. • The Center for Advancing Microbial Risk Assessment (CAMRA), led by Michigan State University, fills critical gaps in risk assessments for decontaminating microbiological threats. • The University Affiliate Centers to the Institute for Discrete Sciences (IDS-UACs) are led by Rutgers University, USC, UIUC, and the University of Pittsburgh. • The Regional Visualization and Analytics Centers (RVACs), led by Penn State University, Purdue University, Stanford University and others, conduct research on visually-based analytic techniques that help people gain insight from complex, conflicting, and changing information.
COE for Border Security and Immigration (ninth COE solicitation) • Major research areas: • Surveillance, Screening, Data Fusion, and Situational Awareness • How can border inspection processes be strengthened? • How can high-risk traffic be distinguished from low-risk traffic? • What emerging methods of collecting, fusing, integrating and analyzing information offer promise of increasing border security? • Population Dynamics and Immigration Administration • Command, Control, and Communications • Immigration Policy and Effects • Border Risk Management
Addressing Major Issues • High-risk Vehicle Identification: Identify high-risk vehicles using association techniques like mutual information to incorporate crossing frequency and law-enforcement data. • Analysis of Criminal Networks: Study criminal networks using social network analysis techniques to understand and predict crime patterns. [Diagram labels: Border Crossing Data (AZ, CA, TX); Law-enforcement Data (AZ, CA, TX); Geo-coded Illegal Alien/Vehicle Data; Screening (Vehicles, People); Situational Awareness; Data Fusion and Analysis; Spatio-temporal Analysis; Network Analysis; Information Retrieval]
Outline • Homeland Security Centers of Excellence • Information Retrieval using COPLINK • Social Network Analysis of Criminal Networks • Criminal Network Visualizer: System Demo • Mutual Information Analysis to Identify High-risk Vehicles
Information Retrieval using COPLINK • COPLINK is an information retrieval and analysis system that integrates information from multiple law-enforcement agencies. • It incorporates algorithms for cross-jurisdictional social network analysis, knowledge discovery, and visualization for intelligence, border safety, and national security applications.
Records Management System (RMS) Mug Shots Database Gang Database Tucson Police Department Records System Multiple Isolated Data Sources within a Single Agency
Phoenix Police Department Systems Tucson Police Department Systems Pima County Systems Isolated Agencies Share Limited Information through State and Federal Systems
Records Management System (RMS) Gang Database Mugshots Database Provide Access to Information using One Friendly Interface
Consolidated Information Provides Opportunities for Analytical and Data Analysis Applications
Query Parameters and Filters Running the query with filters.
Person Search Results A search for White males named Mike, aged 20-35, 5’5” to 6’3”, 150 to 250 lbs, returns a generic set of results (24 persons).
Outline • Homeland Security Centers of Excellence • Information Retrieval using COPLINK • Social Network Analysis of Criminal Networks • Criminal Network Visualizer: System Demo • Mutual Information Analysis to Identify High-risk Vehicles
Criminal Activity Networks • Criminal Activity Networks (CANs) are networks of people, vehicles and locations that are linked by law enforcement information. • These networks allow us to understand the complex relationships between people and vehicles. • Analysis of the topological characteristics of these networks helps us better understand their governing mechanisms. • In this study we analyze the topological characteristics of CANs of people and vehicles in a multiple-jurisdiction scenario to support border and transportation security.
Literature Review • Criminal Activity Network extraction • Previous studies of complex networks • Topological characteristics of networks • The theory of growth in networks
Criminal Activity Network Extraction • The extraction of CANs involves analyzing information from many different datasets. • Accessing information from multiple sources poses many challenges that are documented in the literature. [Garcia-Molina, 2002; Rahm, 2001] • This study uses the BorderSafe information sharing and analysis framework. [Marshall et al., 2004] • Using the framework, law enforcement and other datasets are accessed in a form that is amenable to network extraction and analysis.
Complex Networks: Previous Studies • There have been various studies to understand the characteristics of large and complex networks. • The studies have explored the topology, evolution, robustness and other properties of real world networks. • The World Wide Web [Albert, Jeong and Barabasi, 1999; Kumar et al., 2000] • Cellular and metabolism networks [Jeong et al., 2000] • Citation networks [Redner, 1998] • Most real world networks were found to have similar topological and evolutionary characteristics. [Albert and Barabasi, 2002]
Topological Characteristics • Topological characteristics are used to study networks at a macro level. • Three concepts dominate the statistical study of topology: [Albert and Barabasi, 2002] • Small world • Despite the large size of networks, nodes often have relatively short paths between them. • Clustering • The tendency of nodes to cluster together to form cliques, representing circles of friends in which every member knows every other member. • Degree distribution • The distribution of edges among nodes, where different nodes may have different numbers of edges.
Small World • The small world concept is important as it can depict the communications within a network. • Communication can range from the spread of disease in human populations and the spread of viruses on the Internet to the passage of messages and commands in a criminal network. • The small world property of a network is measured by the average path length. [Albert and Barabasi, 2002] • The average shortest path length of many real networks has been measured. • Movie actors were found to be an average distance of 3.65 from each other. [Watts & Strogatz, 1998] • Average paths between co-authors in MEDLINE were 4.6. [Newman, 2001] • Shortest path lengths of social networks are small due to the presence of shortcuts between otherwise distant people. [Watts, 1999; Nishikawa et al., 2002]
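As a rough illustration of the measure above, the sketch below computes the average shortest path length with breadth-first search and shows how a single shortcut lowers it. The function name and the six-node ring are hypothetical toy choices, not data or code from the study.

```python
from collections import deque

def average_path_length(adj):
    """Mean shortest-path length over all connected ordered node pairs.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    total, pairs = 0, 0
    for source in adj:
        # Breadth-first search gives shortest hop counts from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # reachable nodes other than source
    return total / pairs if pairs else 0.0

# Hypothetical toy network: a ring of six people.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}

# The same ring plus one shortcut between the opposite nodes 0 and 3.
shortcut = {node: set(nbrs) for node, nbrs in ring.items()}
shortcut[0].add(3)
shortcut[3].add(0)

print(average_path_length(ring))                # 1.8
print(round(average_path_length(shortcut), 3))  # 1.667
```

Adding one edge between otherwise distant nodes shortens the average path for the whole network, which is the shortcut effect the slide attributes to social networks.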
Clustering • Individuals in social networks often form cliques. • Examples of cliques in social networks include authors collaborating in a co-authorship network and websites pointing to each other on the web. • The tendency to form cliques is measured by the clustering coefficient (CC): for a node, the ratio of the number of edges that exist among its neighbors to the total number of possible edges among them, averaged over all nodes in the network. [Albert and Barabasi, 2002] • Real networks tend to have a high CC compared to random graphs: • Movie actors: 0.79 [Watts & Strogatz, 1998] • MEDLINE co-authorship: 0.066 [Newman, 2001] • The CC in a criminal network points to the tendency of individuals to collaborate and partner in crimes.
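A minimal sketch of the clustering coefficient, assuming the Watts-Strogatz definition: for each node, the fraction of pairs of its neighbors that are themselves linked, averaged over all nodes. Counting nodes with fewer than two neighbors as 0 is a convention chosen here, and the four-node network is a made-up example.

```python
from itertools import combinations

def clustering_coefficient(adj):
    """Average local clustering coefficient of an undirected graph.

    adj maps each node to the set of its neighbors.
    Nodes with fewer than two neighbors contribute 0 (assumed convention).
    """
    coefficients = []
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coefficients.append(0.0)
            continue
        # Count neighbor pairs that are directly linked to each other.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coefficients.append(links / (k * (k - 1) / 2))
    return sum(coefficients) / len(coefficients)

# Hypothetical example: a triangle A-B-C with one extra associate D of C.
network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

print(round(clustering_coefficient(network), 3))  # 0.583
```

A and B each score 1, C scores 1/3 (only one of its three neighbor pairs is linked), and D scores 0, giving an average of 7/12.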
Degree Distribution • Nodes in a network have different numbers of edges connecting them. The number of edges connected to a node is called its degree. • The spread in node degrees is given by a distribution function P(k), which gives the probability that a randomly selected node has exactly ‘k’ edges. [Albert and Barabasi, 2002] • The distribution functions of most real world networks follow power law scaling with varying exponents: • Movie actors: exponent of 2.3. [Watts & Strogatz, 1998] • MEDLINE co-authorship: exponent of 1.2. [Newman, 2001] • In criminal networks, a high degree may indicate that an individual is a leader. [Xu and Chen, 2004] • The degrees of nodes are also used to study the growth and evolution of networks.
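The distribution P(k) defined above can be tabulated directly from an adjacency structure. The sketch below also builds the cumulative form, P(degree >= k), which is commonly plotted on log-log axes. The function names and the star-shaped example are illustrative assumptions.

```python
from collections import Counter

def degree_distribution(adj):
    """P(k): fraction of nodes that have exactly k edges."""
    n = len(adj)
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}

def cumulative_distribution(pk):
    """Cumulative form: fraction of nodes with degree >= k."""
    running, out = 0.0, {}
    for k in sorted(pk, reverse=True):
        running += pk[k]
        out[k] = running
    return dict(sorted(out.items()))

# Hypothetical star network: one hub linked to four others.
star = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"},
}

pk = degree_distribution(star)
print(pk)                           # {1: 0.8, 4: 0.2}
print(cumulative_distribution(pk))  # {1: 1.0, 4: 0.2}
```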
Growth in Networks • Most real world networks (including CANs) are not static and grow due to the addition of nodes and/or edges. • The growth of networks changes their topological characteristics. • Two mechanisms govern evolving networks: [Barabasi and Albert, 1999; Dorogovtsev, Mendes and Samukhin, 2000; Newman, 2001] • Growth: networks expand continuously by adding new nodes and, • Preferential attachment: new nodes attach preferentially to nodes that are already well connected.
Preferential Attachment • Network growth involves adding new nodes (and edges) to the set of current nodes. • Preferential attachment assumes that the probability that a new node will connect to an existing node i depends on the degree of that node. • The higher the degree of the existing node, the higher the probability that new nodes will attach to it. • The functional form of preferential attachment (∏(k)) for a network can be measured by observing the nodes present in the network and their degrees. [Albert and Barabasi, 2002]
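To make the mechanism concrete, here is a small simulation of growth with linear preferential attachment, where a new node picks an existing node i with probability k_i / sum_j k_j. This is a generic textbook-style sketch with hypothetical names and parameters, not the procedure used in the study.

```python
import random

def grow_network(n_nodes, seed=0):
    """Grow a network one node at a time with linear preferential attachment.

    Each new node adds a single edge to an existing node chosen with
    probability proportional to that node's current degree.
    Returns a dict mapping node id to degree.
    """
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}  # start from a single edge between nodes 0 and 1
    endpoints = [0, 1]     # every node appears once per edge it touches
    for new in range(2, n_nodes):
        # Uniform choice over edge endpoints is degree-proportional choice
        # over nodes.
        target = rng.choice(endpoints)
        degree[new] = 1
        degree[target] += 1
        endpoints.extend([new, target])
    return degree

degrees = grow_network(500)
print(max(degrees.values()), sorted(degrees.values())[len(degrees) // 2])
```

The printed pair (largest degree, median degree) typically shows a few well-connected nodes alongside many nodes with a single link, which is the "rich get richer" pattern described above.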
Preferential Attachment: Previous Studies • ∏(k) for co-authorship, citation, actor and Internet networks was found to follow the power law distribution. [Jeong, Neda and Barabasi, 2003; Newman, 2001] • However, in some cases ∏(k) may grow linearly up to a point and then fall off at high degrees. [Newman, 2001] • This implies that the high degree nodes are not able to attract more new nodes. • Constraints on growth are also seen in criminal networks.
Constraints on Growth of a Network • Constraints on the number of links a node can attract may be due to: [Amaral et al., 2000] • Aging: Since networks grow over time, some high degree nodes might become too old to participate in the network (e.g., actors in a movie network). • Cost: It might become costly for a node to attach to a large number of nodes. • Constraints on the growth of networks may be domain specific and have been studied in many domains: • In plant-animal pollination networks, some animals cannot pollinate certain plants; hence a link cannot be established. [Jordano, Bascompte and Olesen, 2003] • In criminal networks, trust may restrict the growth of networks. Criminals and terrorists do not include many people in their inner trust circle. [Klerks, 2001]
Research Questions • What are the topological characteristics of criminal networks? • How does cross-jurisdictional data affect the topological characteristics of criminal networks? • How do criminal networks grow on adding data from more jurisdictions?
Research Testbed • The testbed for this study contains incident reports of all the individuals and vehicles involved in crimes in the jurisdiction of Tucson Police Department (TPD) and Pima County Sheriff’s Department (PCSD) from 1990 to 2002. • A CAN consists of individuals and vehicles represented as nodes and police incidents represented as edges. • Two nodes have an edge between them when they are involved in the same police incident. • Narcotics networks are extracted from the testbed.
Dataset | TPD | PCSD
Incidents | 2.99 million | 2.18 million
Individuals | 1.44 million | 1.31 million
Vehicles | 675,000 | 520,000
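The edge rule above (two nodes are linked when they appear in the same police incident) can be sketched as follows. The incident identifiers, the "person:" and "vehicle:" prefixes, and the function name are hypothetical; the actual BorderSafe schema is not described in the slides.

```python
from itertools import combinations

def build_can(incidents):
    """Build an undirected network from incident records.

    incidents maps an incident id to the list of entities (people or
    vehicles) involved. Returns a dict mapping entity to its neighbor set.
    """
    adj = {}
    for members in incidents.values():
        unique = sorted(set(members))
        for entity in unique:
            adj.setdefault(entity, set())  # keep entities with no links
        # Every pair of entities in the same incident gets an edge.
        for a, b in combinations(unique, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj

# Hypothetical incident records.
incidents = {
    "I-001": ["person:P1", "person:P2", "vehicle:V1"],
    "I-002": ["person:P2", "person:P3"],
    "I-003": ["person:P4"],
}

can = build_can(incidents)
print(sorted(can["person:P2"]))  # ['person:P1', 'person:P3', 'vehicle:V1']
```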
Research Design • The study is divided into three parts: • Characteristics of criminal networks in a single jurisdiction. • Narcotics networks that include individuals and incidents reported in a single jurisdiction are analyzed. • Characteristics of the networks by combining data from multiple jurisdictions. • Narcotics networks including individuals and incidents reported in both TPD and PCSD are analyzed. • The implications of the topological properties of these networks are explained in the law enforcement domain.
Experiment Results: Narcotics Networks in a Single Jurisdiction Basic Statistics
Measure | TPD | PCSD
Nodes | 31,478 individuals | 11,173 individuals
Edges | 82,696 | 67,106
Giant component | 22,393 (70%) | 10,610 (94%)
2nd largest component | 41 | 103
Link density | 0.0002 | 0.0008
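The basic statistics in this table can be computed from an adjacency structure as sketched below. Link density is taken here as 2E / (N(N-1)), the usual definition for an undirected graph, which is an assumption about how the slide's values were obtained. The six-node network is a made-up example.

```python
def connected_components(adj):
    """Return the connected components as sets, largest first."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)

def link_density(adj):
    """Existing edges divided by possible edges, 2E / (N(N-1))."""
    n = len(adj)
    edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2 * edges / (n * (n - 1)) if n > 1 else 0.0

# Hypothetical network: a chain a-b-c, a pair d-e, and an isolated node f.
toy = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}

sizes = [len(c) for c in connected_components(toy)]
print(sizes)              # [3, 2, 1]
print(link_density(toy))  # 0.2
```

The first entry of `sizes` is the giant component and the second is the second largest component, matching the rows of the table.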
Experiment Results: Single Jurisdiction (cont.) Small World Properties
Measure | TPD | PCSD
Clustering Coefficient | 0.39 (1.39 × 10^-4) | 0.53 (4.08 × 10^-4)
Average Shortest Path Length (L) | 5.09 | 4.62
Diameter | 22 | 23
Values in parentheses are values for a random network of the same size and average degree.
Implications of the Small World Property • The narcotics networks in both jurisdictions can be classified as small world networks. • The clustering coefficients of the networks are much larger than those of their random counterparts. • This suggests that criminals tend to form circles of associates whose members commit crimes together. • This is not unusual in narcotics networks, where an individual commits crimes with friends and people in his trust circle. • This property works as an asset to law enforcement in identifying criminal conspiracies. • A short L in a narcotics network has important implications for both crime and law enforcement: • It improves the speed of flow of information and goods in the network. • It also suggests that criminals often commit crimes with individuals outside their group. This creates the shortcuts needed to reduce L. • A short average path length has positive implications for law enforcement too. Short paths between criminals generate better leads in crime investigations.
Degree Distributions Single Jurisdiction (cont.) These diagrams show the log-log plots of the cumulative degree distribution (p(k)) vs. the degree (k). The insets are p(k) vs. k. The solid line is the truncated power law curve.
Implications of the Scale Free Property • The narcotics networks in both jurisdictions can be classified as scale free (SF) networks. • This implies that a large number of individuals do not have many associates, but a few have a large number of associates. • The exponents in both power law decays are very small (0.85 – 1.3). The distribution decays slowly for lower degrees, indicating that there are a large number of nodes with small degrees. • This is not unexpected, as criminals with high degrees attract more attention from law enforcement authorities, so having fewer associates is beneficial. • The truncated power law fits better (R² = 93%) than the power law distribution (R² = 85-87%). • As the number of links (k) grows, the probability of nodes having ‘k’ links decreases. • This might indicate a cost or trust constraint on growth (criminals may not want to attach to many people).
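For reference, the two fits compared above are usually written in the following standard forms. The symbols γ (the exponent) and κ (the cutoff degree) are notation assumed here; the slides do not state the exact parameterization used in the study.

```latex
\[
p(k) \propto k^{-\gamma}
\qquad \text{(power law)}
\]
\[
p(k) \propto k^{-\gamma}\, e^{-k/\kappa}
\qquad \text{(truncated power law)}
\]
```

The exponential factor leaves the distribution close to a power law for k much smaller than κ and suppresses very high degrees beyond it, which is consistent with a cost or trust limit on the number of associates.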
Growth in Multiple Jurisdictions This curve shows the preferential attachment when the narcotics network in TPD is augmented with data from PCSD. The dashed line above the curve shows linear preferential attachment growth; the solid line shows the state of no preferential attachment.
Preferential Attachment: Implications • The curves lie above and grow faster than the solid line, offering visual evidence of the presence of preferential attachment. • Two properties of growth between jurisdictions are worth noting: • The curve maintains linearity at low values of k. The linearity breaks down for higher degrees. • Overall, the lower degree nodes attract more nodes towards themselves than higher degree nodes.
Preferential Attachment: Implications (cont.) • Break in Linearity • The slow growth of nodes with high degree can be attributed to the nature of the networks being studied. • Cost/Trust effect: Criminals may prefer not to be related to a large number of individuals because of the risk of drawing attention. Thus, the cost of acquiring more links is high, which might prevent a node with a large number of links from acquiring more. • External influences: Law enforcement limits the number of crimes an individual can commit.
Preferential Attachment: Implications (cont.) • Lower degree nodes attract more nodes • The data on police incidents is drawn from two different jurisdictions. • A criminal might be committing more crimes in one jurisdiction than in the other. • Thus, one jurisdiction may have incomplete information about the activity of some criminals in the network. • These criminals will have a low degree in one jurisdiction. • On adding the second jurisdiction, the degree of these criminals increases since they commit more crimes in the second jurisdiction. • This will lead to lower degree nodes attracting more nodes than higher degree nodes.
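The effect described above can be illustrated by comparing each node's degree in one jurisdiction with the links it gains once a second jurisdiction's data is merged in. This is a simplified sketch with hypothetical data and names. It groups nodes by their original degree and averages the new links, which is one plausible way to approximate such a measurement rather than the study's exact method.

```python
from collections import defaultdict

def links_gained_by_degree(base, augmented):
    """Average number of new neighbors gained, grouped by degree in base.

    base and augmented map each node to its neighbor set, before and
    after adding data from a second jurisdiction.
    """
    gained = defaultdict(list)
    for node, nbrs in base.items():
        new_links = len(augmented.get(node, set()) - nbrs)
        gained[len(nbrs)].append(new_links)
    return {k: sum(v) / len(v) for k, v in sorted(gained.items())}

# Hypothetical single-jurisdiction network: H is a high degree node.
base = {"H": {"a", "b", "c"}, "a": {"H"}, "b": {"H"}, "c": {"H"}}

# After merging a second jurisdiction, a and b turn out to have
# associates (x, y, z) who were invisible in the first data set.
augmented = {
    "H": {"a", "b", "c"},
    "a": {"H", "x", "y"},
    "b": {"H", "z"},
    "c": {"H"},
    "x": {"a"}, "y": {"a"}, "z": {"b"},
}

print(links_gained_by_degree(base, augmented))  # {1: 1.0, 3: 0.0}
```

In this toy case the degree-1 nodes gain one link on average while the degree-3 node gains none, mirroring the pattern of low degree nodes attracting more new links when a jurisdiction's picture is incomplete.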
Conclusions • This study focused on topological properties of criminal activity networks and their link to law enforcement, border and transportation security. • Criminal networks are small world networks with scale free distributions. These topological characteristics have important implications for law enforcement and hence transportation security. • A single jurisdiction contains incomplete information on criminals and cross-jurisdictional data provides an increased number of higher quality investigative leads.
Outline • Homeland Security Centers of Excellence • Information Retrieval using COPLINK • Social Network Analysis of Criminal Networks • Criminal Network Visualizer: System Demo • Mutual Information Analysis to Identify High-risk Vehicles
Mutual Information Analysis to Identify High-risk Vehicles • Vehicles involved in illegal activities (especially smuggling) may operate in groups. • If the criminal links of one vehicle in a group are known, then its border crossing patterns can be used to identify other partner vehicles. • CBP agents also suggest that criminal vehicles may cross at certain times of the day to try to evade inspection. • The concept of mutual information (MI) can be used to include these heuristics and identify high-risk vehicles.
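One common form of the measure is pointwise mutual information, log2(P(a,b) / (P(a) P(b))), which is high when two vehicles cross together more often than their individual crossing frequencies would predict. The sketch below assumes crossings are bucketed into numbered time windows. The window scheme, vehicle data, and function name are hypothetical, and the slides do not specify the exact MI formulation used.

```python
import math

def pointwise_mutual_information(crossings_a, crossings_b, n_windows):
    """PMI of two vehicles crossing in the same time window.

    crossings_a and crossings_b are sets of window indices in which each
    vehicle crossed; n_windows is the total number of windows observed.
    Returns -inf when the vehicles never cross in the same window.
    """
    p_a = len(crossings_a) / n_windows
    p_b = len(crossings_b) / n_windows
    p_ab = len(crossings_a & crossings_b) / n_windows
    if p_ab == 0:
        return float("-inf")
    return math.log2(p_ab / (p_a * p_b))

# Hypothetical crossing records over 8 time windows.
known_vehicle = {0, 1, 2, 3}   # vehicle with known criminal links
candidate = {0, 1, 2, 5}       # often crosses in the same windows
unrelated = {4, 6, 7}          # never crosses in the same windows

print(round(pointwise_mutual_information(known_vehicle, candidate, 8), 3))
print(pointwise_mutual_information(known_vehicle, unrelated, 8))
```

A positive score for the candidate flags it for closer inspection. A time-of-day heuristic could be added by restricting the windows to particular hours, which is left out of this sketch.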
Literature Review • Association rule mining • Mutual information • Applications of mutual information
Association Rule Mining • Inferring associations between items in the database was motivated by decision support problems faced by retail organizations (Stonebraker 1993). • An association rule (AR) is a relationship of the form A ⇒ B • A is the antecedent item-set and B is the consequent item-set. • The antecedent and consequent item-sets can contain multiple items. • A ⇒ B holds in a transaction set D with • confidence ‘c’ if c% of transactions in D that contain A also contain B, • support ‘s’ if s% of transactions in D contain both A and B. • Association mining identifies all the rules that have support and confidence greater than user-specified thresholds.
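The support and confidence definitions above translate directly into counts over a transaction set. The following sketch uses a made-up market-basket example; the function names are illustrative.

```python
def support(transactions, antecedent, consequent):
    """Fraction of transactions that contain both A and B."""
    both = antecedent | consequent
    return sum(1 for t in transactions if both <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing A that also contain B."""
    with_a = [t for t in transactions if antecedent <= t]
    if not with_a:
        return 0.0
    return sum(1 for t in with_a if consequent <= t) / len(with_a)

# Hypothetical transaction set D.
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

A, B = {"bread"}, {"milk"}
print(support(D, A, B))     # 0.6
print(confidence(D, A, B))  # 0.75
```

Three of the five transactions contain both bread and milk (support 60%), and three of the four transactions containing bread also contain milk (confidence 75%). A mining algorithm would keep the rule only if both values exceed the user-specified thresholds.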
Applications of AR Mining • AR mining has been applied in many domains including • ‘market basket’ data (Agrawal 1993, 1994), • web log analysis (Mobasher 1996), • network intrusion detection (Lee 1998), • recommender systems (Lin 2002), and • gene regulatory network extraction (Berrar 2001). • Work has also been done to include domain heuristics in AR mining with • market basket analysis (Hilderman 1998), and • gene regulatory networks (Huang forthcoming).