How is our beloved language doing? Analyzing all of GitHub, StackOverflow and Hacker News via the publicly available Google BigQuery datasets (some 40TB of data) this presentation aims to give humorous and ingenious insights into our TIOBE Index Top 10 language. If you’ve ever wondered which PHP versions are still in use, which packages are most widely used or if you’re the only ‘Full StackOverflow Developer’ this is the presentation for you. Similarly we’ll look into things like PSR adoption (who’s still using tabs), framework popularity and community participation with a view to how we are doing and where we should be going. Trolls are welcome – I have the data…
A state of PHP 2018 Backed by Data PHP South Africa 2018 Brad Mostert /bsinkwa /mostertb
What? • Analyse Big Data on GitHub • Give some insight on our Community and Craft • Be Interesting (or at least funny) • Influenced heavily by the work of Felipe Hoffa (@felipehoffa) from Google • Lies, Damn Lies and Statistics
WHO AM I Senior developer at Afrihost Server Shepherd PHP Joburg Organizer Crazy! Gave the Advanced Composer workshop and Design Patterns in PHP talk last year
Data Sources: GitHub Contents *Statistics compiled by me this week • BigQuery Public Dataset • cloud.google.com/bigquery/public-data/github* • 3.5TB+* • 3.4 Million Projects • 222 Million Commits • 2.3 Billion Unique File Paths • Latest Revision of 245 Million Files (RegEx Searchable) • Updated ~Weekly
Data Sources: Github Contents • To Be Included • Public Repo • Clear OpenSource License • Detected by GitHub API • developer.github.com/v3/licenses/ • ASCII Files Less than 10MB • Mostly non-forked • Excludes “Un-notable” projects
Data Sources: GHTORRENT ghtorrent.org • Watch GitHub Public Event Timeline • api.github.com/events • Exhaustively Retrieve Related Information from GitHub Knowledge Graph • Data Since 2012 • MongoDB • Raw JSON Representations • 10TB+ • MySQL • Links Dependencies between Data
Data Sources: GHTORRENT – MySQL data • Updated Monthly • Most recent version in BigQuery from April 2018 • Manually Imported 2018-09-01 dump • 291GB in CSV • 98+ Million Repos • 89 Million Excluding deleted Repos and Users • 1+ Billion Commits
Data Sources: GH Archive gharchive.org • Also queries the GitHub Events API • Stores only events in BigQuery Tables • bigquery.cloud.google.com/table/githubarchive:day.yesterday • Records contain both common fields (like Repo Name) and the full JSON Payloads • Broken up into tables Per Day, Month, Year • Updated Hourly • Raw JSON also available for download • Size • Day: 2.7k tables Total 3.636TB • Times 3 for Month and Day
Tools: Google BigQuery • Highly scalable, fully managed data warehouse and analytics platform • Part of Google Cloud Platform • Distributed Columnar Database (Dremel) • Billed on ‘amount of data processed’ and storage • $5 per 1TB processing • 1TB processing free per month + 10GB storage • $300 Free Tier Trial (cloud.google.com/free) • github.com/mostertb/state-of-php-2018-scratch
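To make the billing bullets concrete, here is a minimal sketch of a monthly cost estimate. It assumes only the two figures quoted on the slide ($5 per TB processed, 1TB free per month) and ignores storage charges; the function and constant names are made up for illustration.

```python
# Figures taken from the slide; actual pricing may differ.
USD_PER_TB = 5.0          # on-demand price per TB processed
FREE_TB_PER_MONTH = 1.0   # free processing allowance per month


def estimate_query_cost(tb_processed_this_month: float) -> float:
    """Estimate the USD processing cost for one month (storage not included)."""
    billable_tb = max(0.0, tb_processed_this_month - FREE_TB_PER_MONTH)
    return billable_tb * USD_PER_TB


if __name__ == "__main__":
    # One full scan of a 3.5TB table in a month with no other queries.
    print(estimate_query_cost(3.5))
```

Under these assumptions a single full scan of a 3.5TB table costs about $12.50, which is why selecting only the columns you need matters on a columnar store.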
Tools: HomeLab • Personal Server: pre-processing the GHTorrent Data • Dell R720 • 128GB RAM • Dual Xeon E5-2670 @ 2.60GHz • 3x 300GB 15k SAS + 480GB SSD • I’ll give you access to my GHTorrent Datasets • MyISAM actually works well for this application
Number of Projects: GHTORRENT • Total in GHTorrent: 98 Million • Less Deleted Repos and Users: 89 Million • Octoverse 2017 Report has 67 Million • Empty Repos? Forks without changes? • With any PHP: 2.5 Million • Non-forked with any PHP: 1.2 Million • >10KB Code and ‘PHP Bytes’ > 0: 939,895 • Reported by github/linguist (No vendor, docs or generated) • Same criteria over all GHTorrent Repos: 7.9 Million
11.91% of unique, non-trivial projects on GitHub involve PHP 939,895 / 7,892,367 = 11.91 % • Based on GHTorrent Data • Not Deleted • Not Forked • More than 10KB non-vendor / non-generated Lies, Damn Lies and Statistics…
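The headline figure is plain division, and can be re-checked with a few lines. The two counts are the ones on the slide; the helper name is hypothetical.

```python
def share_percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


php_projects = 939_895        # non-forked, non-deleted, >10KB, PHP bytes > 0
all_projects = 7_892_367      # same criteria over all GHTorrent repos

if __name__ == "__main__":
    print(share_percent(php_projects, all_projects))
```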
PHP Repo Events over past year All branches. Includes pushing tags ‘Starring’ a repo. Doesn’t include ‘un-starring’ Anything to do with a PR (assigned, unassigned, labeled, unlabeled, opened, edited, closed, reopened) Create repository, branch, or tag • GH Archive • Between September 2017 and August 2018 • Non-forked Repos • Repo Size >10KB New Release published Private Repo becomes Public
PHP Projects Active over Last Year 161,896 Projects out of 939,895 • Events: • PushEvent • WatchEvent • PullRequestEvent • CreateEvent • ReleaseEvent • Non-forked, non-deleted • > 10KB non-vendor/non-generated code 17.22 % compared to 20.96% over all repos
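A rough sketch of how "active" could be derived from event records: keep only the five event types listed above and intersect with the candidate PHP repos. The record shape (a dict with `type` and `repo` keys) is a simplification assumed for illustration, not the actual GH Archive schema, and the repo names in the usage example are invented.

```python
# Event types named on the slide as counting towards "active".
ACTIVITY_EVENTS = {
    "PushEvent",
    "WatchEvent",
    "PullRequestEvent",
    "CreateEvent",
    "ReleaseEvent",
}


def active_repos(events, candidate_repos):
    """Return the candidate repos that have at least one qualifying event.

    `events` is an iterable of dicts with hypothetical 'type' and 'repo' keys.
    """
    return {
        event["repo"]
        for event in events
        if event["type"] in ACTIVITY_EVENTS and event["repo"] in candidate_repos
    }


def active_share_percent(active_count: int, total_count: int) -> float:
    """Share of active projects as a percentage, rounded to two decimals."""
    return round(100 * active_count / total_count, 2)


if __name__ == "__main__":
    sample_events = [
        {"type": "PushEvent", "repo": "acme/blog"},
        {"type": "IssuesEvent", "repo": "acme/shop"},   # not a counted type
        {"type": "WatchEvent", "repo": "other/tool"},   # not a candidate repo
    ]
    print(active_repos(sample_events, {"acme/blog", "acme/shop"}))
    print(active_share_percent(161_896, 939_895))
```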
Languages used with PHP • C and C++ in 11th and 12th in both cases • Smarty is still at ~1% • Vue gains 2.57% • Hack gains 0.64% putting it in 16th on Active • Dockerfile ranks 61st at 0.13% • Java sees no percentage change • Perl drops down to 14th All Active in Last Year
Languages where PHP is Primary Active in Last Year >=90% Bytes PHP
PHP Project Owner location 37.84% of owners provide a geo-codable location
GitHub Contents • 3,353,813 Projects Total • 344,215 Projects with any PHP • 290,206 PHP Projects >= 10KB Detected Code • To Be Included • Clear OpenSource License • ASCII Files Less than 10MB • Mostly non-forked • Excludes “Un-notable” projects
GitHub contents: composer files • Only in the root directory • Only on master branch • All PHP projects included in the BigQuery Public Dataset Total Files: 152,188 In Root Path: 150,899 Master Branch: 144,044
Github contents: composer Packages 34,510 Distinct Packages
Github contents: composer Packages Other Favorites Rather use ‘require-dev’ Easy pull request?
Composer Packages: Framework Ranking • Only considering BigQuery GitHub Public Dataset • Not taking into account activity or size • Not taking into account versions
Composer Packages by Vendor 12,213 Unique Vendors ~35% Unique
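The package and vendor counts can be approximated by parsing each composer.json and tallying the keys of its `require` section. This is only a sketch of that idea, not the queries used for the talk: it assumes the file contents are already available as strings, skips files that are not valid JSON, and treats any name without a `/` (such as `php` or `ext-json`) as a platform requirement rather than a package.

```python
import json
from collections import Counter


def package_names(composer_json_text):
    """Return vendor/package names from the 'require' section of one file."""
    try:
        data = json.loads(composer_json_text)
    except json.JSONDecodeError:
        return []  # unparseable files are ignored in this sketch
    if not isinstance(data, dict):
        return []
    require = data.get("require", {})
    if not isinstance(require, dict):
        return []
    # Platform requirements like "php" or "ext-json" contain no slash.
    return [name.lower() for name in require if "/" in name]


def count_packages_and_vendors(files):
    """Tally how often each package and each vendor is required."""
    packages = Counter()
    vendors = Counter()
    for text in files:
        for name in package_names(text):
            packages[name] += 1
            vendors[name.split("/", 1)[0]] += 1
    return packages, vendors


if __name__ == "__main__":
    sample = [
        '{"require": {"php": ">=7.1", "symfony/console": "^4.0"}}',
        '{"require": {"symfony/yaml": "^4.0", "monolog/monolog": "^1.0"}}',
    ]
    packages, vendors = count_packages_and_vendors(sample)
    print(len(packages), "distinct packages,", len(vendors), "distinct vendors")
```

`len(packages)` and `len(vendors)` then correspond to the "distinct packages" and "unique vendors" style of figure, while `most_common()` on either counter gives a ranking.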
Composer Packages: Require-Dev Top 10 Notable
Composer Packages: MINIMUM PHP VERSION • Naively matched: LIKE ‘%<version>%’ • 840 unmatched values • 402 only provide major version 7 • 105,948 Composer Files provide a PHP version
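The naive matching can be mimicked with a substring test, which is roughly what `LIKE '%<version>%'` does. The version list below is an assumption for illustration (the slide does not say which versions were tested), and the lowest matching version is taken as the minimum.

```python
# Assumed list of minor versions to look for, lowest first.
KNOWN_VERSIONS = ["5.3", "5.4", "5.5", "5.6", "7.0", "7.1", "7.2", "7.3"]


def naive_min_php_version(constraint: str):
    """Return the lowest known version that appears as a substring, else None."""
    for version in KNOWN_VERSIONS:
        if version in constraint:
            return version
    return None


if __name__ == "__main__":
    for constraint in ["^7.1.3", "~5.5|~7.0", ">=7", ">=5.5.3"]:
        print(constraint, "->", naive_min_php_version(constraint))
```

The last two inputs show why this is only naive: `>=7` names a major version only and stays unmatched, and `>=5.5.3` is reported as 5.3 because the substring `5.3` happens to occur inside it.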
Contents: PHP Projects • 344,215 with any PHP • 290,206 >= 10KB • 28,778 In the List of GHTorrent Active
Contents • 2,896,713 PHP Files (<=10MB) • 16.66 GB
TABS vs Spaces Credit to Felipe Hoffa • Files at least 10 lines • 413,175,973 Analysed • 649,150 Tab Files • 1,748,235 Space Files
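One way to classify a file as a "tab file" or a "space file" is to compare how many of its lines start with each character. The sketch below assumes that majority rule and the 10-line minimum from the slide; the exact rule used in the original analysis is not stated here, so treat the tie handling as a guess.

```python
def classify_indentation(source: str, min_lines: int = 10):
    """Return 'tabs', 'spaces', or None for short or undecided files."""
    lines = source.splitlines()
    if len(lines) < min_lines:
        return None  # too short to say anything useful
    tab_lines = sum(1 for line in lines if line.startswith("\t"))
    space_lines = sum(1 for line in lines if line.startswith(" "))
    if tab_lines > space_lines:
        return "tabs"
    if space_lines > tab_lines:
        return "spaces"
    return None  # no indented lines, or an exact tie


if __name__ == "__main__":
    tabbed = "<?php\n" + "\tfoo();\n" * 10
    spaced = "<?php\n" + "    foo();\n" * 10
    print(classify_indentation(tabbed), classify_indentation(spaced))
```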
Full Stack overflow developers “Stack Overflow” (and variations) occurs on 4694 lines in 3892 in PHP projects in the dataset updated in the last year These are just the examples with attribution
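Counting such lines comes down to a case-insensitive regular expression applied per line. The pattern below is an assumption about what "variations" covers (an optional space, underscore or hyphen between the two words); it will also match unrelated uses of the phrase, such as comments about an actual stack overflow error.

```python
import re

# Assumed set of variations: "Stack Overflow", "stackoverflow", "stack-overflow", ...
STACK_OVERFLOW = re.compile(r"stack[\s_-]?overflow", re.IGNORECASE)


def count_mention_lines(source: str) -> int:
    """Count the lines of a file that mention Stack Overflow in some form."""
    return sum(1 for line in source.splitlines() if STACK_OVERFLOW.search(line))


if __name__ == "__main__":
    sample = (
        "// Taken from Stack Overflow\n"
        "$total = array_sum($values);\n"
        "// see stackoverflow.com for the discussion\n"
    )
    print(count_mention_lines(sample))
```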
PHP 7 Language Features • Files with: • Spaceship Operator (<=>): 1,869 • “yield”: 3951 • (something,); : 110556
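Feature counts like these can be approximated with one regular expression per feature. The patterns below are guesses at what was searched for, in particular the reading of "(something,);" as a trailing comma before a closing parenthesis and semicolon. They work on raw text, so matches inside comments and strings are counted too.

```python
import re

# Hypothetical patterns, one per feature listed on the slide.
FEATURE_PATTERNS = {
    "spaceship": re.compile(r"<=>"),
    "yield": re.compile(r"\byield\b"),
    "trailing_comma_call": re.compile(r",\s*\)\s*;"),
}


def detect_features(php_source: str) -> set:
    """Return the names of all features whose pattern occurs in the source."""
    return {
        name
        for name, pattern in FEATURE_PATTERNS.items()
        if pattern.search(php_source)
    }


if __name__ == "__main__":
    code = "<?php usort($a, function ($x, $y) { return $x <=> $y; });"
    print(detect_features(code))
```

Counting files per feature is then a matter of running `detect_features` over each file and tallying the returned names.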
Questions? PHP South Africa 2018 Brad Mostert /bsinkwa /mostertb
github.com/mostertb/phpsa-2018-profiles /bsinkwa /mostertb