---
title: Use Apache Hadoop Hive with Curl in HDInsight - Azure
description: Learn how to remotely submit Apache Hive queries to Azure HDInsight using Curl.
ms.service: azure-hdinsight
ms.topic: how-to
author: hareshg
ms.author: hgowrisankar
ms.reviewer: nijelsf
ms.date: 09/06/2024
ms.custom:
  - hdinsightactive
  - sfi-ropc-nochange
---

# Run Apache Hive queries with Apache Hadoop in HDInsight using REST

[!INCLUDE [hive-selector](../includes/hdinsight-selector-use-hive.md)]

Learn how to use the WebHCat REST API to run Apache Hive queries with Apache Hadoop on an Azure HDInsight cluster.

## Prerequisites

* An Apache Hadoop cluster on HDInsight. See [Get Started with HDInsight on Linux](./apache-hadoop-linux-tutorial-get-started.md).

* A REST client. This document uses [Invoke-WebRequest](/powershell/module/microsoft.powershell.utility/invoke-webrequest) on Windows PowerShell and [Curl](https://curl.haxx.se/) on [Bash](/windows/wsl/install-win10).

* If you use Bash, you'll also need jq, a command-line JSON processor. See [https://stedolan.github.io/jq/](https://stedolan.github.io/jq/).

## Base URI for REST API

The base Uniform Resource Identifier (URI) for the REST API on HDInsight is `https://CLUSTERNAME.azurehdinsight.net/api/v1/clusters/CLUSTERNAME`, where `CLUSTERNAME` is the name of your cluster. Cluster names in URIs are **case-sensitive**. While the cluster name in the fully qualified domain name (FQDN) part of the URI (`CLUSTERNAME.azurehdinsight.net`) is case-insensitive, other occurrences in the URI are case-sensitive.

## Authentication

When using cURL or any other REST communication with WebHCat, you must authenticate the requests by providing the user name and password for the HDInsight cluster administrator. The REST API is secured via [basic authentication](https://en.wikipedia.org/wiki/Basic_access_authentication). To help ensure that your credentials are securely sent to the server, always make requests by using Secure HTTP (HTTPS).

### Setup (Preserve credentials)

Preserve your credentials to avoid reentering them for each example. The cluster name will be preserved in a separate step.

**A. Bash**

Edit the script below by replacing `PASSWORD` with your actual password. Then enter the command.

```bash
export PASSWORD='PASSWORD'
```

**B. PowerShell**

Execute the code below and enter your credentials at the pop-up window:

```powershell
$creds = Get-Credential -UserName "admin" -Message "Enter the HDInsight login"
```

### Identify correctly cased cluster name

The actual casing of the cluster name may be different than you expect, depending on how the cluster was created. The steps here show the actual casing, and then store it in a variable for all later examples.

Edit the scripts below to replace `CLUSTERNAME` with your cluster name. Then enter the command. (The cluster name for the FQDN isn't case-sensitive.)

```bash
export CLUSTER_NAME=$(curl -u admin:$PASSWORD -sS -G "https://CLUSTERNAME.azurehdinsight.net/api/v1/clusters" | jq -r '.items[].Clusters.cluster_name')
echo $CLUSTER_NAME
```

```powershell
# Identify properly cased cluster name
$resp = Invoke-WebRequest -Uri "https://CLUSTERNAME.azurehdinsight.net/api/v1/clusters" `
    -Credential $creds -UseBasicParsing
$clusterName = (ConvertFrom-Json $resp.Content).items.Clusters.cluster_name;

# Show cluster name
$clusterName
```
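Because the rest of the URI is case-sensitive, it can be useful to confirm that the stored name resolves before moving on. The following is a minimal sketch, not part of the required steps, assuming the `admin` account and the `$PASSWORD` and `$CLUSTER_NAME` variables set above. It requests the cluster resource from the base URI described earlier and prints the name that Ambari returns.

```bash
# Optional sanity check (assumes $PASSWORD and $CLUSTER_NAME are set as shown above):
# request the cluster resource by its correctly cased name. A casing mismatch
# typically returns an error document instead of the cluster name.
curl -u admin:$PASSWORD -sS -G \
    "https://$CLUSTER_NAME.azurehdinsight.net/api/v1/clusters/$CLUSTER_NAME" \
    | jq -r '.Clusters.cluster_name'
```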
## Run a Hive query

1. To verify that you can connect to your HDInsight cluster, use one of the following commands:

    ```bash
    curl -u admin:$PASSWORD -G https://$CLUSTER_NAME.azurehdinsight.net/templeton/v1/status
    ```

    ```powershell
    $resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.net/templeton/v1/status" `
        -Credential $creds `
        -UseBasicParsing
    $resp.Content
    ```

    You receive a response similar to the following text:

    ```json
    {"status":"ok","version":"v1"}
    ```

    The parameters used in this command are as follows:

    * `-u` - The user name and password used to authenticate the request.
    * `-G` - Indicates that this request is a GET operation.

1. The beginning of the URL, `https://$CLUSTER_NAME.azurehdinsight.net/templeton/v1`, is the same for all requests. The path, `/status`, indicates that the request is to return the status of WebHCat (also known as Templeton) for the server. You can also request the version of Hive by using the following command:

    ```bash
    curl -u admin:$PASSWORD -G https://$CLUSTER_NAME.azurehdinsight.net/templeton/v1/version/hive
    ```

    ```powershell
    $resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.net/templeton/v1/version/hive" `
        -Credential $creds `
        -UseBasicParsing
    $resp.Content
    ```

    This request returns a response similar to the following text:

    ```json
    {"module":"hive","version":"1.2.1000.2.6.5.3008-11"}
    ```

1. Use the following to create a table named **log4jLogs**:

    ```bash
    JOB_ID=$(curl -s -u admin:$PASSWORD -d user.name=admin -d execute="DROP+TABLE+log4jLogs;CREATE+EXTERNAL+TABLE+log4jLogs(t1+string,t2+string,t3+string,t4+string,t5+string,t6+string,t7+string)+ROW+FORMAT+DELIMITED+FIELDS+TERMINATED+BY+' '+STORED+AS+TEXTFILE+LOCATION+'/example/data/';SELECT+t4+AS+sev,COUNT(*)+AS+count+FROM+log4jLogs+WHERE+t4+=+'[ERROR]'+AND+INPUT__FILE__NAME+LIKE+'%25.log'+GROUP+BY+t4;" -d statusdir="/example/rest" https://$CLUSTER_NAME.azurehdinsight.net/templeton/v1/hive | jq -r .id)
    echo $JOB_ID
    ```

    ```powershell
    $reqParams = @{"user.name"="admin";"execute"="DROP TABLE log4jLogs;CREATE EXTERNAL TABLE log4jLogs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION '/example/data/';SELECT t4 AS sev,COUNT(*) AS count FROM log4jLogs WHERE t4 = '[ERROR]' GROUP BY t4;";"statusdir"="/example/rest"}
    $resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.net/templeton/v1/hive" `
        -Credential $creds `
        -Body $reqParams `
        -Method POST `
        -UseBasicParsing
    $jobID = (ConvertFrom-Json $resp.Content).id
    $jobID
    ```

    This request uses the POST method, which sends data as part of the request to the REST API. The following data values are sent with the request:

    * `user.name` - The user that is running the command.
    * `execute` - The HiveQL statements to execute.
    * `statusdir` - The directory that the status for this job is written to.

    These statements perform the following actions:

    * `DROP TABLE` - If the table already exists, it's deleted.
    * `CREATE EXTERNAL TABLE` - Creates a new 'external' table in Hive. External tables store only the table definition in Hive. The data is left in the original location.

        > [!NOTE]
        > External tables should be used when you expect the underlying data to be updated by an external source. For example, an automated data upload process or another MapReduce operation.
        >
        > Dropping an external table does **not** delete the data, only the table definition.

    * `ROW FORMAT` - How the data is formatted. The fields in each log are separated by a space.
    * `STORED AS TEXTFILE LOCATION` - Where the data is stored (the `example/data` directory) and that it's stored as text.
    * `SELECT` - Selects a count of all rows where column **t4** contains the value **[ERROR]**. This statement returns a value of **3** as there are three rows that contain this value.

        > [!NOTE]
        > Notice that the spaces between HiveQL statements are replaced by the `+` character when used with Curl. Quoted values that contain a space, such as the delimiter, should not be replaced by `+`.

    This command returns a job ID that can be used to check the status of the job.

1. To check the status of the job, use the following command:

    ```bash
    curl -u admin:$PASSWORD -d user.name=admin -G https://$CLUSTER_NAME.azurehdinsight.net/templeton/v1/jobs/$JOB_ID | jq .status.state
    ```

    ```powershell
    $reqParams=@{"user.name"="admin"}
    $resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.net/templeton/v1/jobs/$jobID" `
        -Credential $creds `
        -Body $reqParams `
        -UseBasicParsing
    # ConvertFrom-JSON can't handle duplicate names with different case
    # So change one to prevent the error
    $fixDup=$resp.Content.Replace("jobID","job_ID")
    (ConvertFrom-Json $fixDup).status.state
    ```

    If the job has finished, the state is **SUCCEEDED**.

1. Once the state of the job has changed to **SUCCEEDED**, you can retrieve the results of the job from Azure Blob storage. The `statusdir` parameter passed with the query contains the location of the output file; in this case, `/example/rest`. This address stores the output in the `example/rest` directory in the cluster's default storage.

    You can list and download these files by using the [Azure CLI](/cli/azure/install-azure-cli); a brief sketch follows this list. For more information, see [Use Azure CLI with Azure Storage](../../storage/blobs/storage-quickstart-blobs-cli.md).
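The exact commands depend on how your cluster's default storage is configured. As a minimal sketch, assuming the default storage is an Azure Blob storage account, that `mystorageaccount` and `mycontainer` are placeholder names for your own storage account and the cluster's default container, and that your signed-in identity has permission to read blob data (for example, the Storage Blob Data Reader role), listing and downloading the output might look like this:

```bash
# Placeholder names: replace with your storage account and the cluster's default container.
export STORAGE_ACCOUNT='mystorageaccount'
export CONTAINER='mycontainer'

# List the files WebHCat wrote under the statusdir path (/example/rest).
az storage blob list \
    --auth-mode login \
    --account-name $STORAGE_ACCOUNT \
    --container-name $CONTAINER \
    --prefix 'example/rest' \
    --output table

# Download the query output (WebHCat typically writes a 'stdout' file in the statusdir).
az storage blob download \
    --auth-mode login \
    --account-name $STORAGE_ACCOUNT \
    --container-name $CONTAINER \
    --name 'example/rest/stdout' \
    --file './stdout.txt'
```

If the cluster's default storage is Azure Data Lake Storage Gen2 instead, the equivalent `az storage fs file` commands apply.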
## Next steps

For information on other ways you can work with Hadoop on HDInsight:

* [Use Apache Hive with Apache Hadoop on HDInsight](hdinsight-use-hive.md)
* [Use MapReduce with Apache Hadoop on HDInsight](hdinsight-use-mapreduce.md)

For more information on the REST API used in this document, see the [WebHCat reference](https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference) document.