CS5525 Data Analytics Crime Data Mining

Wei Wang, Yi Xiao, ZhenHe Pan, Fang Jin

Outline
• Project Motivation
• Challenges
• Our Approach
• Visualization
• Conclusion

Project Motivation
• What are the associations among crime categories?
• What drives the crime rate lower or higher?
• How is crime distributed spatially and temporally?
• How are crime events correlated with one another?
• Which criminal gangs are active around DC?

Challenges
• Dataset problems:
  • Dates in the wrong format
  • Incomplete descriptions
  • Inconsistently encoded data sources
  • No detailed timestamps; records are resolved only to the day
• The large number of records requires an efficient algorithm for detecting association rules
• How can valuable information be extracted from the free-text descriptions?
• How can similarity be detected effectively across a large and varied set of records? For example, robbery, larceny, homicide, and arson descriptions each follow a different format.

Our Approach
• Time Series Analysis
• Text Mining
• Association Rule
• Similarity Analysis

1) Time Series Analysis
• By month
• Weekdays versus weekends
• By season
• Outside factors, such as unemployment
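The breakdowns listed above can be sketched as simple counts. This is a minimal illustration rather than the project's code: it assumes each crime record has already been reduced to a `datetime.date`, and the `SEASONS` mapping (meteorological seasons) and the function name `aggregate` are assumptions made for the example. The comparison with outside factors such as unemployment is not covered.

```python
from collections import Counter
from datetime import date

# Assumed season mapping (meteorological seasons, Northern Hemisphere).
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "fall", 10: "fall", 11: "fall"}


def aggregate(dates):
    """Count crime events by month, by weekday/weekend, and by season."""
    by_month = Counter(d.month for d in dates)
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    by_day_type = Counter(
        "weekend" if d.weekday() >= 5 else "weekday" for d in dates
    )
    by_season = Counter(SEASONS[d.month] for d in dates)
    return by_month, by_day_type, by_season
```

Each returned `Counter` can be plotted or compared across crime categories by calling `aggregate` once per category.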


2) Text Mining
1. Delete stopwords, such as "a", "the", "and", and "on".
2. Count word frequencies and keep the most frequent words, those whose support count is higher than 50.
3. Define seed categories for a training set, for example:
   • Time: AM, PM, day of the week, month
   • Weapon: gun, knife
   • Clothes: hoodie, T-shirt, cap
   • Color: black, red
   • Age: teenager, old
   • Car brand: Toyota, BMW
   • Wounded:
   • Action:
4. Starting from these seeds, expand the feature list by extracting nearby words.
5. Extend the feature list by analyzing crime news from websites.
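Steps 1 to 3 above can be sketched as follows. This is a hedged illustration, not the project's implementation: the stopword set holds only the four examples from the slide, the seed dictionary holds only the seed words listed there, and the names `tokenize`, `frequent_words`, and `categorize` are hypothetical.

```python
import re
from collections import Counter

# Only the examples from the slide; a real stopword list would be much longer.
STOPWORDS = {"a", "the", "and", "on"}

# Seed words per category, lowercased. Categories the slide leaves empty
# (Wounded, Action) are omitted here.
SEED_FEATURES = {
    "time": {"am", "pm"},
    "weapon": {"gun", "knife"},
    "clothes": {"hoodie", "t-shirt", "cap"},
    "color": {"black", "red"},
    "age": {"teenager", "old"},
    "car brand": {"toyota", "bmw"},
}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z][a-z\-]*", text.lower())
    return [w for w in words if w not in STOPWORDS]


def frequent_words(descriptions, min_support=50):
    """Return the words whose count across all descriptions exceeds min_support."""
    counts = Counter()
    for text in descriptions:
        counts.update(tokenize(text))
    return {w: c for w, c in counts.items() if c > min_support}


def categorize(tokens):
    """Map each token that matches a seed word to its category."""
    return {
        w: category
        for w in tokens
        for category, seeds in SEED_FEATURES.items()
        if w in seeds
    }
```

For instance, `tokenize("The suspect wore a black hoodie and a cap")` keeps `suspect`, `wore`, `black`, `hoodie`, and `cap`, and `categorize` then labels `black` as a color and `hoodie` and `cap` as clothes.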

2) Text Mining
6. Filter each crime description through the feature library to keep only the effective words.
7. Set a length threshold for the description text, e.g. 30 effective words. A text below this threshold carries only very general or very little information about the crime event, so the event is ignored completely.
8. Compare the effective words of any two crime descriptions whose length ratio is no greater than 20, and find the words they share. If they share more than 5 words, compute their similarity. Otherwise, discard the pair and treat the two events as independent.
9. Use the computed similarity of each pair of crime events as our confidence.
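The filtering rules in steps 6 to 9 can be sketched as below. The thresholds (30 effective words, length ratio 20, more than 5 shared words) come from the slide. The slide does not name the similarity measure used at this step, so Jaccard overlap stands in here purely as an assumption, and the function names are hypothetical.

```python
def effective_words(tokens, feature_library):
    """Step 6: keep only the tokens that appear in the feature library."""
    return [w for w in tokens if w in feature_library]


def compare(words_a, words_b, min_len=30, max_ratio=20, min_shared=5):
    """Steps 7 to 9: return a similarity score, or None if the pair is skipped.

    None means the pair is treated as two independent events.
    """
    # Step 7: ignore descriptions with too few effective words.
    if len(words_a) < min_len or len(words_b) < min_len:
        return None
    shorter = min(len(words_a), len(words_b))
    longer = max(len(words_a), len(words_b))
    # Step 8: skip pairs whose lengths are too different.
    if shorter == 0 or longer / shorter > max_ratio:
        return None
    set_a, set_b = set(words_a), set(words_b)
    shared = set_a & set_b
    # Step 8: require more than min_shared words in common.
    if len(shared) <= min_shared:
        return None
    # Step 9 (assumed measure): Jaccard overlap of the two word sets.
    return len(shared) / len(set_a | set_b)
```

Returning `None` rather than 0 keeps skipped pairs distinguishable from pairs that were compared and found dissimilar.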


3) Association Rule
• Goal: Explore the association rules among different crime types.
• Algorithm:
  • Apply the Apriori algorithm with support threshold = 0.5
  • Normalize the transactions: treat each day as a basket of crimes and eliminate low-support crime events
• Results: Burglary has a strong relationship with assault offenses and robbery. In the mined rules, whenever an assault offense occurs on a given day, a burglary also occurs that day.
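The day-as-basket setup above can be sketched with a compact Apriori implementation. This is an illustrative version under stated assumptions, not the code used in the project: each basket is the set of crime types observed on one day, and the function names `apriori` and `confidence` are hypothetical.

```python
from itertools import combinations


def apriori(baskets, min_support=0.5):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets (days) that contain every item in the itemset.
        return sum(itemset <= b for b in baskets) / n

    current = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    k = 1
    while current:
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into size-k candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of antecedent -> consequent; both itemsets must be frequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]
```

With four made-up baskets such as `{"assault", "burglary", "robbery"}`, `{"assault", "burglary"}`, `{"burglary", "robbery"}`, and `{"arson"}`, the pair assault and burglary is frequent at the 0.5 threshold, while arson is eliminated as a low-support event.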


4) Similarity Analysis
Similarity analysis in different dimensions
1. Record normalization based on properties
• Temporal: day of week, day of month, AM or PM
• Spatial: latitude, longitude; the Haversine formula gives the distance between two locations
• Category: URC_category, sub_category
• Textual: TF-IDF
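The spatial and textual measures named above can be sketched as follows. This is a minimal illustration under assumptions: the Haversine function returns kilometres using a mean Earth radius of 6371 km (the slides do not state the unit), the TF-IDF weighting is the plain `tf * log(N / df)` variant, and all function names are hypothetical.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the unit is an assumption


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def tfidf_vectors(docs):
    """docs is a list of token lists; return one {term: weight} dict per doc."""
    n = len(docs)
    # Document frequency: number of docs in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Note that with this weighting a term that appears in every description gets weight 0, so it contributes nothing to the cosine score.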

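4) Similarity Analysis
Similarity analysis in different dimensions
2. Similarity computation
\[ \text{Similarity} = W_t S_t + W_s S_s + W_c S_c + W_d S_d \]
\[ S_t = \tfrac{1}{3}\,[\Delta_{\text{day of week}} \le 2] + \tfrac{1}{3}\,[\Delta_{\text{day of month}} \le 3] + \tfrac{1}{3}\,[\text{same phase}] \]
\[ S_s = \frac{25 - \text{Haversine}(Lon_1, Lat_1, Lon_2, Lat_2)}{25} \]
\[ S_c = \tfrac{1}{2}\,[\text{same Urc}] + \tfrac{1}{2}\,[\text{same sub\_category}] \]
\[ S_d = \text{cosine}(D_1, D_2) \]
The subscripts t, s, c, and d denote the temporal, spatial, category, and textual (description) dimensions. A bracket [·] equals 1 when its condition holds and 0 otherwise; the diff terms for phase, Urc, and sub_category are read here as match indicators. Phase is AM or PM, and D_1 and D_2 are the TF-IDF vectors of the two descriptions.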
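The weighted combination above can be sketched in code. Several points are assumptions, since the slides do not settle them: the weights are set equal (0.25 each), the bracketed diff terms are read as match indicators, differences are plain absolute differences (no wrap-around across weeks or months), and the distance is expected in the same unit as the constant 25. The `Crime` record and the function name `similarity` are hypothetical; the Haversine distance and the TF-IDF cosine score are passed in precomputed.

```python
from dataclasses import dataclass


@dataclass
class Crime:
    day_of_week: int    # 0 to 6
    day_of_month: int   # 1 to 31
    phase: str          # "AM" or "PM"
    urc_category: str
    sub_category: str


# Assumed equal weights (W_t, W_s, W_c, W_d); the slides give no values.
DEFAULT_WEIGHTS = (0.25, 0.25, 0.25, 0.25)


def similarity(a, b, distance, s_d, weights=DEFAULT_WEIGHTS):
    """Combine temporal, spatial, category, and textual similarity.

    distance: Haversine distance between the two events.
    s_d: cosine similarity of the two TF-IDF description vectors.
    """
    w_t, w_s, w_c, w_d = weights
    # Temporal: one third for each of the three agreement checks.
    s_t = (
        (abs(a.day_of_week - b.day_of_week) <= 2)
        + (abs(a.day_of_month - b.day_of_month) <= 3)
        + (a.phase == b.phase)
    ) / 3
    # Spatial: as written on the slide, so it turns negative beyond 25.
    s_s = (25 - distance) / 25
    # Category: half for the main category, half for the sub-category.
    s_c = 0.5 * (a.urc_category == b.urc_category) \
        + 0.5 * (a.sub_category == b.sub_category)
    return w_t * s_t + w_s * s_s + w_c * s_c + w_d * s_d
```

For two made-up robberies one day apart, both in the PM phase, 5 distance units apart, with different sub-categories and a text score of 0.5, the components are 1.0, 0.8, 0.5, and 0.5, giving a combined score of 0.7 under equal weights.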

Visualization
• Crime distribution revealed on a map
• Crime listing and searching
• Similarity of crimes

Spatial Marker Clustering
• Why cluster? Individual markers are too crowded on the map.

Spatial Marker Clustering
Map legend (marker types):
• Largest cluster size
• Second largest cluster size
• Third largest cluster size
• Fourth largest cluster size
• Fifth largest cluster size
• Single crime event



Conclusion
• What are the associations among crime categories?
  Some rules between crime categories have high confidence, e.g. assault offense → burglary.
• What drives the crime rate lower or higher?
  Individual crime categories follow their own patterns. For example, arson is more likely to happen in October. Burglary is higher on Mondays, while arson is higher on Wednesdays. Homicide is more likely during the summer, and especially on Saturdays.

Conclusion
• How is crime distributed spatially and temporally?
  DC has the most crime events, accounting for 76%; Fairfax is second with 14%, so these two areas should have more police. Public order in Alexandria is getting better, while in Arlington it is getting worse. DC keeps the same distribution.
• How are crime events correlated with one another? Which criminal gangs are active around DC?
  From high-confidence crime similarity, we can find hints of the same criminal gang.
