Investigating Network Access Permissions in Android Apps By: Ed Zulkoski
Permissions • Android apps must request permissions for certain features on the device. • Permissions are published with the app, and users may choose not to install “suspicious” apps. • Many apps require network and internet access, but do not always say why.
Why does this app need internet? • Facebook
Why does this app need internet? • Words With Friends
Why does this app need internet? • Super Mario Live Wallpaper
Security Risk of Apps • > 800,000 apps in Google Play Store. • > 7 billion app downloads in 2009, • $4.1 billion. • In a 2012 study¹ of over 400,000 Android apps, over 100,000 were classified as potential security risks. • “26 percent of apps access private information such as email and contacts, with only 2 percent of apps being from highly trusted publishers.” ¹ Bit9: "Pausing Google Play: More than 100,000 Android Apps May Pose Security Risk"
Goal • Determine why an app needs network and internet permissions. • Ideal – learn exactly why an app needs internet access • Probably unrealistic • Subgoal – detect if an app uses an ad library
Dataset • 281,079 free apps from the Google Play Store from 2011. • Contains multiple versions of some apps. • Does not have labels indicating if an app uses an ad service (unsupervised learning).
Features • The .apk file. • The Android manifest file
What we don’t have • It would be useful to have paid apps that have a corresponding free app. • Many paid apps remove ads.
Feature Construction • Simple keyword search in app description • Ad, Advertisement, AdMob, etc. • Ivy Leaf Wallpaper Summary: • Problem: “This app is provided to you for free and without AdMob ads.” “FAQ: 1. Why is there "Internet access" permission? It is for Google ads on setting screen only, nothing else. Pro version is adsfree with more features.”
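The keyword search above can be sketched in a few lines of Python. This is only an illustration, not the author's implementation: the keyword list extends the three terms named on the slide with plural forms, and the function name `ad_keyword_count` is made up for this example.

```python
import re

# Assumed keyword list: the slide names only "Ad", "Advertisement", "AdMob";
# the plural forms are added here for illustration.
AD_KEYWORDS = ("ad", "ads", "advertisement", "advertisements", "admob")

# Whole-word, case-insensitive match, so "ad" does not fire inside "download".
_AD_PATTERN = re.compile(r"\b(?:" + "|".join(AD_KEYWORDS) + r")\b", re.IGNORECASE)


def ad_keyword_count(description: str) -> int:
    """Return how many ad-related keywords occur in an app description."""
    return len(_AD_PATTERN.findall(description))


# The first quote from the slide: the keywords match even though
# the sentence says the app has no ads.
negated = "This app is provided to you for free and without AdMob ads."
print(ad_keyword_count(negated))
```

A plain count like this cannot tell "contains AdMob ads" from "without AdMob ads", which is the problem the slide points out, so it is better treated as one noisy feature than as a label.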
Feature Construction • Similar keyword search in Android manifest. • Intents and Activities <activity android:name="com.google.ads.AdActivity" …>
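A minimal sketch of the manifest search, assuming the manifest has already been decoded from the binary form stored in the .apk into plain-text XML. The function name `ad_activities`, the prefix list, and the sample manifest are hypothetical; only `com.google.ads.AdActivity` comes from the slide.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace that the "android:" attribute prefix resolves to in a manifest.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Assumed prefix list: only com.google.ads appears on the slide.
AD_PACKAGE_PREFIXES = ("com.google.ads",)


def ad_activities(manifest_xml: str) -> List[str]:
    """Return declared activity names that start with a known ad-library prefix."""
    root = ET.fromstring(manifest_xml)
    name_attr = "{" + ANDROID_NS + "}name"
    names = [activity.get(name_attr, "") for activity in root.iter("activity")]
    return [name for name in names if name.startswith(AD_PACKAGE_PREFIXES)]


# Made-up manifest for illustration.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.wallpaper">
  <application>
    <activity android:name=".MainActivity" />
    <activity android:name="com.google.ads.AdActivity" />
  </application>
</manifest>"""

print(ad_activities(SAMPLE_MANIFEST))
```

The result can be used as a binary feature (list empty or not) or as a count per ad library.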
Preprocessing • Remove duplicate app “snapshots.” • Take the latest version (from Alex’s brainstorming session). • Remove any apps without INTERNET and ACCESS_NETWORK_STATE permissions.
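The preprocessing steps might look like the sketch below. The record layout (`package`, `version_code`, `permissions`) is an assumption about how the snapshots are stored, and the permission filter takes one reading of the slide: an app is kept only if it requests both permissions.

```python
from typing import Dict, Iterable, List

INTERNET = "android.permission.INTERNET"
NETWORK_STATE = "android.permission.ACCESS_NETWORK_STATE"
REQUIRED_PERMISSIONS = {INTERNET, NETWORK_STATE}


def preprocess(snapshots: Iterable[Dict]) -> List[Dict]:
    """Keep the latest snapshot per app, then drop apps missing a required permission."""
    latest: Dict[str, Dict] = {}
    for snap in snapshots:
        current = latest.get(snap["package"])
        # Assumption: a larger version_code means a newer snapshot.
        if current is None or snap["version_code"] > current["version_code"]:
            latest[snap["package"]] = snap
    return [
        snap for snap in latest.values()
        if REQUIRED_PERMISSIONS <= set(snap["permissions"])
    ]
```

If the intent is instead to keep apps that request either permission, the subset test `<=` would be replaced by a non-empty intersection check.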
Why use ML? • Why not just search for specific activity names in the manifest file? • AdMob: <activity android:name="com.google.ads.AdActivity" … > • Many ad services with different requirements: • AdMob, Millennial Media, MobClix, Tapjoy, AdWhirl, Greystripe, InMobi, Airpush, Startapp, Leadbolt, Pontiflex, MobFox, Komli Mobile, MoPub, MdotM, inneractive, Adlantis, Smaato, Daum, AppLift, Mediba, Cauly, YouMi, AdMarvel, madvertise, Sellaring, etc.
Approach 1: Clustering • Cluster the apps. • Perform supervised learning on a subset of apps, using each app's cluster as a feature. • This still requires knowing whether the apps in the subset use ad services, and labeling them is time-consuming.
Approach 2: Active Learning • We don’t want to hand label 200,000+ apps. • A machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns. • The learner selects a data instance that it wants to know the value of and queries an oracle (e.g. a human annotator).
Pool-based Active Learning Does this app use ads? Burr Settles: "Active Learning Literature Survey"
Query Strategy • A method for measuring the informativeness of an unlabeled instance. • Uncertainty Sampling – choose the instance the current model is least certain about. • Example: in binary classification, choose the instance whose predicted probability is closest to 0.5 under the current model.
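For the binary case, the selection rule reduces to a one-line argmin. The sketch below assumes the current model has already produced one predicted probability per unlabeled app; the function name `most_uncertain` is made up for this example.

```python
from typing import Sequence


def most_uncertain(probabilities: Sequence[float]) -> int:
    """Return the index of the predicted probability closest to 0.5."""
    return min(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))


# Predicted P(app uses ads) for four unlabeled apps (made-up values).
pool_probabilities = [0.9, 0.45, 0.1, 0.7]
query_index = most_uncertain(pool_probabilities)
```

Here the second app (index 1) would be sent to the human annotator, since 0.45 is the value closest to 0.5.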
Pool-Based Uncertainty Sampling • Seed the learner with known instances.
Pool-Based Uncertainty Sampling • Query the oracle for the label of the instance the learner is least certain about.
Toy Pool-Based Example Burr Settles: "Active Learning Literature Survey"
Text Classification Example Burr Settles: "Active Learning Literature Survey"
Approach 2 • Start with a small set of labeled apps. • Use pool-based active learning (with an underlying logistic regression model) to select new apps to query. • Tell the learner the correct label for the query. • Repeat until I am tired? (discussion point) • Or the model has stabilized.
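The loop above can be sketched end to end in plain Python. This is a toy version under stated assumptions, not the author's code: the logistic regression is a small gradient-descent implementation written for this example, features are assumed to lie in [0, 1], the `oracle` callback stands in for the human annotator, and a fixed query `budget` replaces the open question of when to stop.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights: List[float], x: Vector) -> float:
    """P(label = 1 | x); weights[0] is the bias term."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit(X: List[Vector], y: List[int], epochs: int = 500, lr: float = 0.5) -> List[float]:
    """Fit logistic regression by batch gradient descent on the log loss."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, label in zip(X, y):
            err = predict(weights, x) - label
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights


def active_learn(
    pool: List[Vector],
    seed_labels: Dict[int, int],
    oracle: Callable[[int], int],
    budget: int,
) -> Tuple[List[float], Dict[int, int]]:
    """Pool-based active learning with uncertainty sampling."""
    labels = dict(seed_labels)  # pool index -> label (1 = uses ads)

    def train() -> List[float]:
        idx = sorted(labels)
        return fit([pool[i] for i in idx], [labels[i] for i in idx])

    for _ in range(budget):
        unlabeled = [i for i in range(len(pool)) if i not in labels]
        if not unlabeled:
            break
        weights = train()
        # Query the app whose predicted probability is closest to 0.5.
        query = min(unlabeled, key=lambda i: abs(predict(weights, pool[i]) - 0.5))
        labels[query] = oracle(query)
    return train(), labels
```

In use, `pool` would hold one feature vector per app (for example, keyword and manifest indicators), `seed_labels` the handful of apps labeled up front, and `oracle` a function that shows the selected app to the annotator and returns 0 or 1. A stabilization check, such as stopping once the weights change little between rounds, could replace the fixed budget.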
Conclusion • Many apps require internet access, but the app’s true intentions may be unknown. • It would be useful to determine why apps require these permissions. • Use pool-based active learning to approach this problem.
Discussion • What is the best way to evaluate performance? • How should “skewed” data be handled? • Possibly many more free apps with ads than without. • High number of apps using AdMob. • When do I stop labeling new instances?