WebHarvy 5.3 (Parallel Mining, Chrome Developer Tools)

‘How to increase mining speed?’ was one of the questions our users asked most often. In previous versions, the main limitation was that when links had to be followed from the starting page to get the details of each listing, the miner took longer to scrape a page full of listings. This is because WebHarvy loaded the links sequentially, one after the other, to scrape data.

Parallel Mining

Instead of processing the links to be followed one after the other, the latest update of WebHarvy processes them in bulk, in parallel, using multiple mining threads. You can set the maximum number of parallel mining threads that WebHarvy uses in the Advanced Miner Options window.

Providing a higher value for the ‘Maximum number of parallel mining threads’ option in the Advanced Miner Options window will increase mining speed. However, to run more threads in parallel, WebHarvy requires more memory, processing power and internet bandwidth. So we recommend that you increase this setting only as far as your system’s CPU, installed physical memory (RAM) and internet speed allow.
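
The difference between following links one after the other and following them in parallel can be sketched in Python as below. This is only an illustration of the general technique (a bounded pool of worker threads), not WebHarvy's actual implementation. The names `scrape_listing` and `MAX_PARALLEL_THREADS` and the example.com URLs are hypothetical, and a short sleep stands in for loading a page.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the 'Maximum number of parallel mining threads' setting.
MAX_PARALLEL_THREADS = 4


def scrape_listing(url: str) -> dict:
    """Pretend to load one followed link and extract its details."""
    time.sleep(0.05)  # simulated page load; a real miner would fetch and parse here
    return {"url": url, "title": "Listing " + url.rsplit("/", 1)[-1]}


def mine_sequentially(urls):
    """Older behaviour: follow links one after the other."""
    return [scrape_listing(u) for u in urls]


def mine_in_parallel(urls, max_threads=MAX_PARALLEL_THREADS):
    """Newer behaviour: follow links in bulk with a bounded pool of threads."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() returns results in the same order as the input links
        return list(pool.map(scrape_listing, urls))


if __name__ == "__main__":
    links = [f"https://example.com/listing/{i}" for i in range(8)]
    for miner in (mine_sequentially, mine_in_parallel):
        start = time.perf_counter()
        rows = miner(links)
        elapsed = time.perf_counter() - start
        print(f"{miner.__name__}: {len(rows)} rows in {elapsed:.2f}s")
```

In this sketch, raising `max_threads` shortens the total time only while the machine and the connection can keep up, which is the same trade-off described above for the real setting.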

Chrome Developer Tools

This feature is for power users who are familiar with web page internals like HTML, DOM structure and JavaScript. We use this tool extensively while supporting our customers with less straightforward scraping scenarios and complex websites.

Chrome Developer Tools allow you to easily inspect the internal structure of a web page, see how the page is organised, view the HTML and the data hidden in the HTML source, and devise methods to extract it. You can also find the JavaScript code that runs when buttons or links are clicked and call it directly.

More Accurate Automatic Sub-Text Selection

To scrape only a portion of the text displayed in the Capture window, you can highlight the required portion with the mouse. We have improved the accuracy of this method, especially when the selected text lies between delimiter characters such as currency symbols, punctuation or special characters, new lines and spaces.
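
One plausible way to think about delimiter-aware selection is to snap a rough highlight outwards to the nearest delimiters on each side. The Python sketch below illustrates that idea only; it is not WebHarvy's actual algorithm, and the `DELIMITERS` set and the function name `expand_to_delimiters` are assumptions made for the example.

```python
import re

# Assumed delimiter set: whitespace/new lines, currency symbols and some punctuation.
# The full stop is deliberately left out so that decimal numbers stay intact.
DELIMITERS = r"[\s$€£¥₹,;:|()\[\]]"


def expand_to_delimiters(text: str, start: int, end: int) -> str:
    """Grow a rough selection text[start:end] outwards until delimiters are hit."""
    while start > 0 and not re.match(DELIMITERS, text[start - 1]):
        start -= 1
    while end < len(text) and not re.match(DELIMITERS, text[end]):
        end += 1
    return text[start:end]


if __name__ == "__main__":
    captured = "Price: $1299.99 (incl. tax)"
    rough = "299.9"  # an imprecise highlight made with the mouse
    i = captured.index(rough)
    print(expand_to_delimiters(captured, i, i + len(rough)))  # prints 1299.99
```

Here the imprecise highlight is extended left up to the currency symbol and right up to the space, so the whole price is captured.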

Improvements And Bug Fixes

  1. Improved the select dropdown option. The option now reflects the selection change on the page. Earlier, the user had to run separate JavaScript code to make the page reflect a dropdown list selection.
  2. The miner now scrolls the page before clicking on Load More links. This is done to make sure that the ‘load more’ link is visible and loaded before the miner tries to click it.
  3. When text scaling in Windows was not set to 100% (which is the recommended setting on most systems), it was not possible to click and correctly select the required data items during configuration. This issue is fixed in this version. Data selection during configuration now works irrespective of text scaling.
  4. Fixed issue related to downloading images behind SSL.
  5. Fixed issue where the miner window was not visible on multi-monitor systems after the monitor configuration changed.
  6. Earlier, the Capture window would become unresponsive for a second or two after applying a Regular Expression on HTML. This unresponsive state has been removed.
  7. Added browser zoom level and number of parallel mining threads to the status bar of the configuration browser.
  8. Fixed issue with loading and displaying the upgrade purchase page in cases where the user’s license has expired.
  9. Disabled the ‘Mine all pages/Number of pages to mine’ controls while mining is in progress.
  10. Updated the internal browser to a more recent version of Chromium.
Posted in Release update, WebHarvy

WebHarvy’s new blog at blog.webharvy.com

We are moving all posts related to WebHarvy from our company blog to WebHarvy’s own dedicated blog at www.webharvy.com/whblog. All new articles, release updates, tips and tricks, and case studies related to web scraping using WebHarvy will be published on the WebHarvy Blog. So please make sure that you subscribe to and bookmark the new blog.

http://webharvy.com/whblog/

http://webharvy.com/whblog/feed/

Posted in WebHarvy

Follow us to stay updated!

Follow us on YouTube, Twitter and Facebook to stay up to date on the latest WebHarvy features and web scraping techniques.

We will be regularly posting screen-cast videos, how-tos and release updates on these channels. So subscribe, follow and like 😊︎

https://www.facebook.com/webharvy/

https://twitter.com/webharvy

https://www.youtube.com/user/sysnucleus

Posted in Uncategorized

WebHarvy’s new user interface

We have significantly updated the user interface of WebHarvy in the latest version available on our website, and the following video explains how the features and options are laid out in the new UI. Users of older versions will find this video useful for learning where to look for specific features and options.

Posted in WebHarvy, WebHarvy Feature

WebHarvy 5.2 | UI revamp + Oracle db support

Changes in 5.2 are mainly related to user interface and experience. The most visible change is the introduction of the ribbon menu system for providing easy access to most software features.

In addition to the main interface, other windows such as Scheduler and Export have also been updated. An ongoing export to a file or database can now be cancelled by the user.

As with every release, the internal Chrome browser has been updated as well. Issues related to the URL update (in the address bar) while navigating links in some websites have been fixed with this update.

An important non-UI addition in this release is support for exporting data to an Oracle database. The default file export option has been changed from CSV to Excel format.

All main settings are now displayed in snippet format in the browser view’s status bar.

Help (videos, articles) related to the website loaded in the configuration browser is automatically loaded and displayed as a smart tip.

Miner Settings can now be opened and changed directly from the Miner window.

JavaScript can now be typed in multi-line code format.

Browser settings now include a new option to share the user’s location with the loaded page.

In addition to the above, this release also contains minor bug fixes and improvements, as always. You may download and try the latest version from https://www.webharvy.com/download.html

Posted in Release update, WebHarvy, WebHarvy Feature

WebHarvy 5.1 released (Includes direct Excel Export)

The following are the changes in 5.1.0.152:

New Features:

  1. Excel export – supports directly saving mined data as an Excel file
  2. Handles page numbers in JavaScript code to load next page data
  3. Updated Chromium engine from V54 to V62

Minor changes:

  1. Default values of ‘Enable Plugins’ and ‘Enable Browser Security’ in Browser Settings set to false
  2. Browser address bar can be used for Google search

Bug fixes:

  1. Fixed issues related to handling headers and post data for HTTP requests
  2. Fixed issue in selecting data using the mouse when the zoom level of the browser is not equal to 1 (zoomed in or zoomed out)
  3. Fixed text formatting issues (line breaks, spaces) in the Capture window
  4. Fixed issue where the order of applying capture-html and capture-more-content was relevant (for applying regex to follow links or to capture images)
  5. Fixed bug in editing keywords. In the previous version, it was not possible to change the first keyword.
  6. Reduced memory usage in the mining thread by limiting the number of browser instances created

As always, the latest version may be downloaded and installed from the following page:

https://www.webharvy.com/download.html

Posted in Release update, WebHarvy

WebHarvy based on Google Chrome Released (version 5.0.1.148)

This release comes with few bells and whistles, since we have not added features or changed the look of the software. Still, this is a major upgrade. The change is all internal.

WebHarvy has used Microsoft’s Internet Explorer (IE) as its internal browser since its inception. Microsoft stopped actively developing IE a few years back, when it introduced the Edge browser.

So WebHarvy had to switch to another solution to power its internal browser, and we believe using Google’s Chrome Browser Project is the way forward. This makes WebHarvy more stable, faster and more secure. Switching to Chrome also opens up the possibility of porting the software to other platforms like Mac and Linux.

You may download and install the latest version which is based on Chrome browser from the following link.

http://www.webharvy.com/webharvysetup.exe

As mentioned before, the change from IE to Chrome is internal to the software and does not affect the user interface. So the configuration process and user interface of WebHarvy remain the same.

Minor Changes

  1. The steps for scraping data from sites which require login have been simplified. You no longer need to log in to the website separately from IE. See https://www.webharvy.com/articles/sites-requiring-login.html

  2. The ‘Internet Options’ menu option under the Edit menu has been removed. Instead, a new Browser options tab has been added in the Settings window.

Running configuration files created with the older IE-based version on the new Chrome-based version

Configuration files created using the old version should normally work fine with the new version, which is based on Chrome, but there will be exceptions. In such cases we recommend that you create a new configuration using the latest version.

As always, in case you have any questions or need assistance, you may contact our support at https://www.webharvy.com/support.html

Posted in Release update, WebHarvy