I recently began using DeepCrawl after putting it off for a while, and I quickly saw the advantage of being able to crawl much larger sites (desktop applications start to eat up memory once a crawl exceeds a certain number of URLs). I was happy with my decision: I soon found opportunities for on-site improvements I had previously missed, and I became a believer. Beyond crawling a larger number of URLs, DeepCrawl also integrates with Google Analytics and Google Search Console, which lets the tool give you far deeper insights than a typical site crawler.
Upgrading to DeepCrawl 2.0
I had finally gotten my head around the existing version when I started seeing prompts to switch to DeepCrawl 2.0. Skeptical at first, I quickly saw the advantage of switching. Version 1.9 (and, no doubt, earlier versions) was about as organized and intuitive as an Uncle Sam tax form (and I don't mean a simple 1099). It felt like it had been built by a developer (or developers) who already knew where everything was located, and it took a few sessions to get used to the UX.
Fortunately, 2.0 has a cleaner, more intuitive layout. WordPress junkies (like myself) will feel right at home with the WP admin-style left navigation. My original hesitation to make the switch (after all, I had just become accustomed to 1.9) was unfounded: the learning curve was surprisingly gentle, and I was running quality audits in no time.
Cosmetic/UX upgrades aside, the new version boasts other improvements, including more URLs crawled per second, more powerful filters and reports, and new metrics for individual URLs.
Finding out that half of your pages are non-indexable might be a big deal (depending on the reason). Diving into this report lets you unpack why those URLs are non-indexable: whether the pages are canonicalized elsewhere; carry noindex, nofollow, disallow, or other robots directives; or are paginated.
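To make that triage concrete, here is a minimal sketch in R of the kind of logic involved. The field names (`meta_robots`, `canonical_url`) are my own illustrative assumptions, not DeepCrawl's actual export columns:

```r
# Sketch: classify why a URL is non-indexable from basic crawl fields.
# Field names are illustrative assumptions, not DeepCrawl's schema.
noindex_reason <- function(page_url, meta_robots = "", canonical_url = "") {
  if (grepl("noindex", meta_robots, fixed = TRUE)) {
    return("noindex directive")
  }
  if (nzchar(canonical_url) && canonical_url != page_url) {
    return("canonicalized to another URL")
  }
  "indexable"
}

noindex_reason("https://example.com/a", meta_robots = "noindex,nofollow")
# "noindex directive"
noindex_reason("https://example.com/b", canonical_url = "https://example.com/a")
# "canonicalized to another URL"
```

A real crawl report checks more signals (robots.txt disallows, pagination, redirects), but the report essentially walks this decision tree for every URL.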
In the age of Panda, you don't want to be the site with countless thin-content pages. The "Min Content/HTML Ratio" section shows you which pages have a relatively low volume of content compared to their markup. This is a great metric for finding low-content pages to address, and you can use the DeepRank and Level metrics to prioritize.
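For intuition, a crude version of a content-to-HTML ratio can be computed like this. This is a rough sketch only; real crawlers parse the HTML properly rather than regex-stripping tags:

```r
# Sketch: a crude content-to-HTML ratio, in the spirit of the
# "Min Content/HTML Ratio" report. Regex tag-stripping is for
# illustration only.
content_html_ratio <- function(html) {
  text_only <- gsub("<[^>]+>", "", html)              # strip tags (crudely)
  text_only <- gsub("\\s+", " ", trimws(text_only))   # collapse whitespace
  nchar(text_only) / nchar(html)
}

thin <- "<html><head><title>T</title></head><body><p>Hi</p></body></html>"
content_html_ratio(thin)  # very low ratio -> likely thin content
```

Pages at the bottom of this distribution are your thin-content candidates.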
Broken Pages Driving Traffic
This is a great source of quick wins. Here you can find broken pages that people are still landing on. Since these pages are driving traffic, you'll want to fix them ASAP (and there is a share-link function that makes it easy to hand them off to developers or stakeholders). Fix these URLs before the referring sites remove their links and/or the URLs slide out of the search index.
Like the “Broken Pages Driving Traffic” report, this leverages data from Google Analytics to find areas of opportunity. Here you can find pages that are receiving Google organic traffic but aren't internally linked. These pages represent quick wins and ideal candidates for internal linking.
Crawl Source Gap Analysis
This section shows pages that are being visited yet aren't internally linked, as well as which pages are driving traffic. It's another great place to find ideal internal linking opportunities. It's worth noting that in version 2 you can include up to 5 sources in any one crawl:
- Website crawl
- Sitemaps
- Analytics
- Backlinks
- URL lists
To get a really thorough gap analysis and suss out any missed opportunities, I suggest incorporating sitemaps, analytics, and backlinks into your website crawl. You can sync your Google Analytics account(s) and include 7 or 30 days' worth of data. Or you can manually upload up to 6 months of data from GA, Omniture, or whatever analytics platform you use, to dive even deeper and factor in your historical data.
While DeepCrawl isn't the only crawler that lets you pull custom values, it does give you 10 fields of custom extractions. You can use these to find specific instances not included in the default crawl report, so DeepCrawl can essentially double as a scraping tool, letting you drill down and find exactly what you're looking for. Examples include Google Analytics (or Webtrends, Omniture, Nielsen, Yahoo! Web Analytics, StatCounter, Woopra, Hitbox, etc.) tracking code, missing ALT attributes, and schema.org markup (and other classes of rich snippets).
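To illustrate the kind of patterns a custom extraction might target, here are plain R regexes run against a made-up HTML snippet (the tracking ID and markup below are invented for the example):

```r
# Sketch: patterns of the sort a custom extraction could look for.
# The HTML snippet and UA-style tracking ID are fabricated examples.
html <- '<script>ga("create", "UA-12345678-1", "auto");</script><img src="a.png">'

# Does the page carry a UA-style Google Analytics tracking ID?
grepl("UA-\\d{4,10}-\\d{1,4}", html)                     # TRUE

# Are there <img> tags with no alt attribute?
grepl("<img(?![^>]*\\balt=)[^>]*>", html, perl = TRUE)   # TRUE
```

Run across a whole crawl, checks like these surface every page missing its analytics tag or image ALT text.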
Huge Export! Perfect for RStudio!
Perhaps my favorite aspect of DeepCrawl is the huge data set you can export. It might seem overwhelming at first, but with the right filters you can glean some super valuable data. While you can certainly do this in Excel (and all DeepCrawl data can be downloaded to Excel, alongside PDFs/PNGs, etc.), the huge data set makes a perfect excuse to do this in RStudio (especially if you're exporting hundreds of thousands or even millions of URLs).
# Read your DeepCrawl export into a data frame (the file name here is a placeholder)
clientsiteurls <- read.csv("deepcrawl_export.csv")
# Subset the rows with a certain number of reported Google Analytics visits
# but low internal linking on your site
siteopps <- subset(clientsiteurls, ga_visits >= 1000 & deeprank <= 5)
# View your new "siteopps" data frame in a rows-and-columns format
View(siteopps)
# Write your new, filtered data frame to a CSV file named "clientseowins.csv"
write.csv(siteopps, file = "clientseowins.csv", row.names = FALSE)
This is just a small sample of the sort of insights you can get from DeepCrawl.