Filter Bot Traffic #13

Open
1 task done
amiryselim opened this issue Mar 26, 2023 · 0 comments

Contact Details

amir@tradeblock.us

Language

Python

Category

Data Security & Governance

Description

About

Context

  • "over 40% of all Internet traffic is comprised of bot traffic" - Cloudflare
  • Filtering bot traffic improves data quality for popular destinations like Amplitude that don't filter it themselves
  • The inability to filter bot traffic has been raised in several forum posts and even prompted one user to build an external application that whitelists and forwards traffic to Amplitude

Code Block

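def transformEvent(event, metadata):
    # Lowercase the user agent so it can be compared against the lowercase bot names below.
    # A missing context or userAgent is treated as an empty string instead of raising KeyError.
    lower_user_agent = ((event.get("context") or {}).get("userAgent") or "").lower()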
    known_bot_filter_list = ['bot', 'crawler', 'spider', 'feedfetcher-google', 'mediapartners-google', 'apis-google', 'slurp', 'python-urllib', 'python-requests', 'aiohttp', 'httpx', 'libwww-perl', 'httpunit', 'nutch', 'go-http-client', 'biglotron', 'teoma', 'convera', 'gigablast', 'ia_archiver', 'webmon ', 'httrack', 'grub.org', 'netresearchserver', 'speedy', 'fluffy', 'findlink', 'panscient', 'ips-agent', 'yanga', 'yandex', 'yadirectfetcher', 'cyberpatrol', 'postrank', 'page2rss', 'linkdex', 'ezooms', 'heritrix', 'findthatfile', 'europarchive.org', 'mappydata', 'eright', 'apercite', 'aboundex', 'summify', 'ec2linkfinder', 'facebookexternalhit', 'yeti', 'retrevopageanalyzer', 'sogou', 'wotbox', 'ichiro', 'drupact', 'coccoc', 'integromedb', 'siteexplorer.info', 'proximic', 'changedetection', 'cc metadata scaper', 'g00g1e.net', 'binlar', 'a6-indexer', 'admantx', 'megaindex', 'ltx71', 'bubing', 'qwantify', 'lipperhey', 'addthis', 'metauri', 'scrapy', 'capsulechecker', 'sonic', 'sysomos', 'trove', 'deadlinkchecker', 'slack-imgproxy', 'embedly', 'iskanie', 'skypeuripreview', 'google-adwords-instant', 'whatsapp', 'electricmonk', 'yahoo link preview', 'xenu link sleuth', 'pcore-http', 'appinsights', 'phantomjs', 'jetslide', 'newsharecounts', 'tineye', 'linkarchiver', 'digg deeper', 'snacktory', 'okhttp', 'nuzzel', 'omgili', 'pocketparser', 'um-ln', 'muckrack', 'netcraftsurveyagent', 'appengine-google', 'jetty', 'upflow', 'thinklab', 'traackr.com', 'twurly', 'mastodon', 'http_get', 'brandverity', 'check_http', 'ezid', 'genieo', 'meltwaternews', 'moreover', 'scoutjet', 'seoscanners', 'hatena', 'google web preview', 'adscanner', 'netvibes', 'baidu-yunguance', 'btwebclient', 'disqus', 'feedly', 'fever', 'flamingo_searchengine', 'flipboardproxy', 'g2 web services', 'vkshare', 'siteimprove.com', 'dareboost', 'feedspot', 'seokicks', 'tracemyfile', 'zgrab', 'pr-cy.ru', 'datafeedwatch', 'zabbix', 'google-xrawler', 'axios', 'amazon cloudfront', 'pulsepoint', 
'cloudflare-alwaysonline', 'google-structured-data-testing-tool', 'wordupinfosearch', 'webdatastats', 'httpurlconnection', 'outbrain', 'w3c_validator', 'w3c-checklink', 'w3c-mobileok', 'w3c_i18n-checker', 'feedvalidator', 'w3c_css_validator', 'w3c_unicorn', 'google-physicalweb', 'blackboard', 'bazqux', 'twingly', 'rivva', 'dataprovider.com', 'theoldreader.com', 'anyevent', 'nmap scripting engine', '2ip.ru', 'clickagy', 'google favicon', 'hubspot', 'chrome-lighthouse', 'headlesschrome', 'simplescraper', 'fedoraplanet', 'friendica', 'nextcloud', 'tiny tiny rss', 'datanyze', 'google-site-verification', 'trendsmapresolver', 'tweetedtimes', 'gwene', 'simplepie', 'searchatlas', 'superfeedr', 'freewebmonitoring sitechecker', 'pandalytics', 'seewithkids', 'cincraw', 'freshrss', 'google-certificates-bridge', 'viber', 'evc-batch', 'virustotal', 'uptime-kuma', 'feedbin', 'snap url preview service', 'ruxitsynthetic', 'google-read-aloud', 'mediapartners', 'wget', 'wget', 'ahrefsgot', 'ahrefssiteaudit', 'wesee:search', 'y!j', 'collection@infegy.com', 'deusu', 'bingpreview', 'daum', 'pingdom', 'barkrowler', 'yak', 'ning', 'ahc', 'apache-httpclient', 'buck', 'newspaper', 'sentry', 'fetch', 'miniflux', 'validator.nu', 'grouphigh', 'checkmarknetwork', 'www.uptime.com', 'mixnodecache', 'domains project', 'pagepeeker', 'vigil', 'php-curl-class', 'ptst', 'seostar.co']
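    # Drop the event (return None) when any known bot name appears in the user agent.
    if any(bot_name in lower_user_agent for bot_name in known_bot_filter_list):
        return None
    # Otherwise pass the event through unchanged.
    return event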
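
The matching step above can be tried in isolation. This is a minimal sketch, not the transformation itself: `SAMPLE_BOT_NAMES` is a hypothetical, trimmed-down list and `is_bot_user_agent` is a helper name introduced only for illustration. Because every entry in the list is lowercase, the sketch lowercases the user agent before the substring check.

```python
# Hypothetical, trimmed-down list; the full list lives in the transformation.
SAMPLE_BOT_NAMES = ["bot", "crawler", "spider", "yandex", "python-requests"]


def is_bot_user_agent(user_agent, bot_names=SAMPLE_BOT_NAMES):
    """Return True when any bot name occurs in the lowercased user agent."""
    lowered = (user_agent or "").lower()
    return any(name in lowered for name in bot_names)


# A crawler user agent and an ordinary mobile browser user agent (illustrative values).
bot_ua = "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
browser_ua = (
    "Mozilla/5.0 (Linux; Android 9) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/81.0.4044.138 Mobile Safari/537.36"
)

print(is_bot_user_agent(bot_ua))      # True
print(is_bot_user_agent(browser_ua))  # False
```

Since this is plain substring matching, short entries can also match unrelated user agents, so it may be worth reviewing the list against real traffic before relying on it.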

Input Payload for testing

[
  {
    "anonymousId": "8d872292709c6fbe",
    "channel": "mobile",
    "context": {
      "app": {
        "build": "1",
        "name": "AMTestProject",
        "namespace": "com.rudderstack.android.rudderstack.sampleAndroidApp",
        "version": "1.0"
      },
      "device": {
        "id": "8d872292709c6fbe",
        "manufacturer": "Google",
        "model": "AOSPonIAEmulator",
        "name": "generic_x86_arm",
        "type": "android"
      },
      "library": {
        "name": "com.rudderstack.android.sdk.core",
        "version": "1.0.2"
      },
      "locale": "en-US",
      "network": {
        "carrier": "Android",
        "bluetooth": false,
        "cellular": true,
        "wifi": true
      },
      "os": {
        "name": "Android",
        "version": "9"
      },
      "screen": {
        "density": 420,
        "height": 1794,
        "width": 1080
      },
      "timezone": "Asia/Kolkata",
      "traits": {
        "address": {
          "city": "Kolkata",
          "country": "India",
          "postalcode": "700096",
          "state": "West bengal",
          "street": "Park Street"
        },
        "age": "30",
        "anonymousId": "8d872292709c6fbe",
        "birthday": "2020-05-26",
        "createdat": "18th March 2020",
        "description": "Premium User for 3 years",
        "email": "identify@test.com",
        "firstname": "John",
        "userId": "sample_user_id",
        "lastname": "Sparrow",
        "name": "John Sparrow",
        "id": "sample_user_id",
        "phone": "9112340345",
        "username": "john_sparrow"
      },
      "userAgent": "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.268 [ip:5.45.207.144]"
    },
    "event": "Product Clicked",
    "integrations": {
      "All": true
    },
    "messageId": "1590431830915-73bed370-5889-436d-9a9e-0c0e0c809d06",
    "properties": {
      "revenue": "30",
      "currency": "USD",
      "quantity": "5",
      "test_key_2": {
        "test_child_key_1": "test_child_value_1"
      },
      "price": "58.0"
    },
    "originalTimestamp": "2020-05-25T18:37:10.917Z",
    "type": "track",
    "userId": "sample_user_id"
  }
]
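
A payload shaped like the one above can be run through the filter locally with a small harness. This is a sketch under stated assumptions: `transform_event` is a stand-in with a hypothetical two-entry bot list, `run_transformation` is a helper name introduced here, and the second event (an ordinary browser user agent) is made up so that one event survives.

```python
import json


# Stand-in for the transformation, with a hypothetical two-entry bot list.
def transform_event(event, metadata=None):
    user_agent = ((event.get("context") or {}).get("userAgent") or "").lower()
    if any(name in user_agent for name in ("bot", "yandex")):
        return None  # drop bot events
    return event


def run_transformation(payload_json, transform=transform_event):
    """Apply the transformation to each event and keep the ones it returns."""
    events = json.loads(payload_json)
    results = (transform(event, None) for event in events)
    return [event for event in results if event is not None]


# Trimmed payload: only the fields the filter reads are included.
payload = json.dumps([
    {"type": "track", "event": "Product Clicked",
     "context": {"userAgent": "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"}},
    {"type": "track", "event": "Product Clicked",
     "context": {"userAgent": "Mozilla/5.0 (Linux; Android 9) Chrome/81.0.4044.138 Mobile Safari/537.36"}},
])

kept = run_transformation(payload)
print(len(kept))  # 1
```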

License

  • I understand that my code will be licensed under the MIT license (a copy of the license is available in this repo)