#data #smb #_2022 #bizdev_utils 'Dorking' is the term used for using lesser-known features on Google and Bing to mine data. In this case, it would be used to mine for businesses whose sites contained words of interest and, if possible, the contact info for those businesses. Rather than discuss 'dorking' on an abstract level, however, let's pick a specific use case and walk through it. In this example, let's assume we want to 'dork' a list of email addresses which meet some basic criteria: - Has profiles on: `[Instagram, TikTok]` - Email domains: `[gmail.com, yahoo.com]` - Locations: `[Charlotte, Raleigh]` - Contains Key Words: `[Dentist, Chiropractor]` The Google search phrase we'd use to get one of these combos would be: ``` site:instagram.com "@gmail.com" AND "Charlotte" AND "Dentist" ``` You could do this manually on the Google website, but it would take a long time. The results are limited to 10 per page, so you'd have to click 'next' a zillion times. Even then you'd have to browse the text for each of the results, look for email addresses, and copy and paste them into a file. So we'll instead write a program to do this for us. The program will run through all possible search combinations we provide it and: - 'Click' through _all_ results (usually 300-600 links) - Extract the text from those links - Extract the email addresses from that text You'd end up with something like this (with many, many more rows): ``` id, url, email, text 0, https://www.instagram.com/road_to_rainadee_dentistry/, Hellorainadee@_gmail.com, Hellorainadee@_gmail.com. 108 posts. 5,417 followers. 659 following ... Chelsea | Student _Dentist_ ... _Charlotte_ | Dental Student. 1, https://www.instagram.com/opera.dental/, opera.dental.insta@_gmail.com, Professional opera singer turned dental student Ithaca College Harvard Penn Dental opera.dental.insta@_gmail.com_ ``` From here, we'd simply expand the inputs provided to the program and re-run. 
So perhaps something like: - Has profiles on: `[Instagram, TikTok, Facebook, LinkedIn, Google Business, etc.]` - Email domains: `[gmail.com, yahoo.com, hotmail.com, *.com, etc.]` - Locations: `[Charlotte, Raleigh, Wilmington, Greenville, etc.]` - Contains Key Words: `[Dentist, Chiropractor, Massage Therapist, Acupuncture, etc.]` _Note: `*` is the 'wildcard' meaning 'any'_ This approach yields actionable information in and of itself (in the form of a mass email list). You could always expand upon this corpus of data, however. A few top-of-mind thoughts on how to do so would be: - Scraping their websites and parsing the text for phone numbers, additional names, key words, etc. - Checking company and owner names against review sites for more info (Yelp, Google, Angie's List, etc.) - Checking company and owner names against government databases to see if they have performed government contracts before (excellent way to infer company size and revenue) 'Dorking' is a fairly common practice for businesses, specifically data and sales teams within businesses. Usually, I'm over-the-top against things like this, because I'm a strong privacy advocate. In this case, however, I'm split. I would have to imagine that if I were a Dentist in this example who desired a strong internet presence, I'd be fine with someone scraping my sites for marketing purposes. On the other hand, if I were just someone caught up in this because I mentioned the word 'dentist' in a random Instagram post, I'd feel surveilled. But frankly, I'm also slightly unsympathetic towards persons in the latter case. After all, why would anyone use social media at this point? Is there anyone left in the country who doesn't recognize how skeezy and low-brow social media companies are? It doesn't seem like one could reasonably expect to use social media and have privacy. Side note: below is a helpful script for 'dorking' a term across the USA — useful for market-size estimates, among other things. 
Uses the SERP API and requires a list of locations and key words. Just loops through them both and saves the results to SQLite. ```python # dork.py import os import sqlite3 import pandas as pd from dotenv import load_dotenv from serpapi import GoogleSearch # ------------------------------------------------- # # Get .env variables and set global variables # ------------------------------------------------- # # load env load_dotenv() # get serp_api key SERP_API = os.getenv('SERP_API') # ------------------------------------------------- # # Read-in and Prepare Lists of Terms and Locations # ------------------------------------------------- # # read in df and prep search term lists from it terms_df = pd.read_csv('dorking_terms.csv') # get locations search_city_list = list(terms_df['city']) search_state_list = list(terms_df['state']) search_country_list = list(terms_df['country']) search_term_location_list = [] for i in range(len(search_city_list)): search_term = search_city_list[i] + ', ' + search_state_list[i] + ', ' + search_country_list[i] search_term_location_list.append(search_term) # get terms all_search_term_list = list(terms_df['search_term_list']) search_term_list = [item for item in all_search_term_list if not (pd.isnull(item)) == True] # ------------------------------------------------- # # Function to scrape the results into DB # ------------------------------------------------- # def dork(business_search_terms:list, location_search_terms:list): # ------------------------------------ # # Create a variable to track progress # ------------------------------------ # current_count = 0 total_num_of_searches = len(business_search_terms)*len(location_search_terms) # ------------------------------------ # # Make the db # ------------------------------------ # db = sqlite3.connect('dorking.db') cursor = db.cursor() cursor.close() db.close() # ------------------------------------ # # Make dataframes and insert them to db # ------------------------------------ # for 
term in business_search_terms: for location in location_search_terms: try: # print check print(f'{term} | {location}') # define lists for the data we want query_list = [] search_location_list = [] title_list = [] place_id_list = [] address_list = [] phone_list = [] latitude_list = [] longitude_list = [] website_list = [] # assign params params = { "q": term, "location": location, "tbm": "lcl", "api_key": SERP_API } # conduct search search = GoogleSearch(params) results = search.get_dict() local_results = results['local_results'] # parse the local results and add desired info to our lists for result in local_results: query_list.append(params['q']) search_location_list.append(params['location']) # title try: title_list.append(result['title']) except: title_list.append('null') # place try: place_id_list.append(result['place_id']) except: place_id_list.append('null') # address try: address_list.append(result['address']) except: address_list.append('null') # phone try: phone_list.append(result['phone']) except: phone_list.append('null') # latitude try: latitude_list.append(result['gps_coordinates']['latitude']) except: latitude_list.append('null') # longitude try: longitude_list.append(result['gps_coordinates']['longitude']) except: longitude_list.append('null') try: website_list.append(result['links']['website']) except: website_list.append('null') # make a df from the lists with the info we want zipped = list(zip(query_list, search_location_list, title_list, place_id_list, address_list, phone_list, latitude_list, longitude_list, website_list )) df = pd.DataFrame(zipped, columns=['query', 'search_location', 'business_name', 'business_id', 'address', 'phone', 'latitude', 'longitude', 'website' ]) # upload df to db conn = sqlite3.connect('dorking.db', check_same_thread=False) df.to_sql(name='dorking', con=conn, if_exists='append', index=False) # print a progress statement print(f'Executed {current_count} of {total_num_of_searches} searches') current_count += 1 except: # 
print a failure statement and continue print(' ') print(f'FAILURE ON {term} | {location}') print(' ') current_count += 1 # ------------------------------------------------- # # Execute the dorking function # ------------------------------------------------- # dork(search_term_list,search_term_location_list) ```