• Excel Tidbits: Extracting Domains from URLs Function

    Date: 2012.04.02 | Category: Programming & Functions | Tags: ,,,

    Excel LogoiWorks Numbers LogoI rarely show how much of a techie I am, but thought this equation was valuable to share. In some of the work I do for my company, Anthem Optimization, I mash and extract data daily using Excel. I needed a way to pull out the domain from a web address and compare it with domains from other sheets and activities. This then gives me insight into how often we do work on websites across all our projects.

    Version Note: The formula below only works in newer versions of Excel–typically versions that can read the xlsx format. Older versions of Excel (pre-2004) are limited to seven nested formulas. This equation has eight. If you need this to work in an older version, you can remove the over arching check for information in the $A6 cell at the very front (if($A6=””,…)). The iserror formulas in the if statements checks for errors and displays whatever is in the A6 cell if an error occurs. If the cell is blank, it should stay blank. I simply used the blank check upfront to make the formula more efficient in processing.

    Here is the formula:

    =IF($A6=””,””,SUBSTITUTE(IF(ISERROR(FIND(“https://”,$A6)),SUBSTITUTE(IF(ISERROR(FIND(“/”,SUBSTITUTE($A6,”http://”,””))),$A6,REPLACE($A6,FIND(“/”,SUBSTITUTE($A6,”http://”,””))+7,LEN($A6)+1,””)),”http://”,””),SUBSTITUTE(IF(ISERROR(FIND(“/”,SUBSTITUTE($A6,”https://”,””))),$A6,REPLACE($A6,FIND(“/”,SUBSTITUTE($A6,”https://”,””))+8,LEN($A6)+1,””)),”https://”,””)),”www.”,””)

    A bit complex to explain, but it considers situations where a web address uses either the secure “https” protocol or just the regular “http” and always removes the “www” from the address. The rest of it is looking for the first slash after the “http://” or “https://” and deletes everything from that slash to the end.


    Here are the details for each stage:

    =IF($A6=””,””,  // Makes sure nothing happens if nothing is in the cell

    SUBSTITUTE(  // Used to delete www from the result

    IF(ISERROR(FIND(“https://”,$A6)), // If it doesn’t find https, then execute the function to remove everything after the domain for sites with http:// — this is needed since the s adds characters prior to the slash (important for the replace function later)

    SUBSTITUTE( // If https:// was not found, this begins the formula used to delete everything after the first slash after the http://

    IF(ISERROR(FIND(“/”, SUBSTITUTE($A6,”http://”,””))), // If it doesn’t find a slash after the domain, then do nothing and display what’s in the cell

    $A6,      // Just display what was in the cell–which should be just a domain address since there is no slash after the http

    REPLACE($A6, // This is the core function that removes everything after the domain

    FIND(“/”, SUBSTITUTE($A6,”http://”,””))     // Finds the spot/character number where the first slash is after the http://

    + 7,        // Adds the number of spaces for http so that the starting point of the replacement is correct

    LEN($A6)+1,       //This finds the end of the space where text is supposed to be replaced/deleted. Added 1 to cover bases and potential spacing issues.

    “”          // this is what it will be replaced with…nothing.

    )          // Ends the Replace function.

    ),        // Ends the IsError function.

    “http://”,””    //  Once the result is determined (either after the end of the url is cut off or if it’s just a domain), delete the http:// so that just the domain shows

    ),      // Ends the substitute function if http:// is found.

    SUBSTITUTE( // If https:// was not found, this begins the formula used to delete everything after the first slash after the http://

    IF(ISERROR(FIND(“/”, SUBSTITUTE($A6,”https://”,””))), // If it doesn’t find a slash after the secure domain, then do nothing and display what’s in the cell

    $A6,      // Just display what was in the cell–which should be just a domain address since there is no slash after the https://

    REPLACE($A6, // This is the core function that removes everything after the secure domain

    FIND(“/”, SUBSTITUTE($A6,”https://”,””))     // Finds the spot/character number where the first slash is after the https://

    + 8,        // Adds the number of spaces for https so that the starting point of the replacement is correct

    LEN($A6)+1,       //This finds the end of the space where text is supposed to be replaced/deleted. Added 1 to cover bases and potential spacing issues.

    “”          // this is what it will be replaced with…nothing.

    )          // Ends the Replace function.

    ),        // Ends the IsError function.

    “https://”,””    //  Once the result is determined (either after the end of the url is cut off or if it’s just a domain), delete the https:// so that just the domain shows

    ),      // Ends the substitute function if https:// is found.

    ),

    “www.”,””   // so whatever the result is, it should now just be a domain, but sometimes links have www for a site and sometimes not. So, we’ll delete the www if it’s there.

    )


    There are many great applications for this in search engine optimization (SEO), online marketing, publicity, analytics, web development, and many more areas where reporting and comparing data is helpful! Combine this with a VLOOKUP or SUMIF formula and you will be able to total up visits, backlinks, status, errors, and much more.