Proposed Pull Request Change

title description ms.service ms.topic ms.custom author ms.author ms.reviewer ms.date
Use C# with MapReduce on Hadoop in HDInsight - Azure Learn how to use C# to create MapReduce solutions with Apache Hadoop in Azure HDInsight. azure-hdinsight how-to hdinsightactive, devx-track-csharp, devx-track-dotnet, devx-track-azurepowershell hareshg hgowrisankar nijelsf 09/06/2024
📄 Document Links
GitHub View on GitHub Microsoft Learn View on Microsoft Learn
Raw New Markdown
Generating updated version of doc...
Rendered New Markdown
Generating updated version of doc...
+0 -0
+0 -0
--- title: Use C# with MapReduce on Hadoop in HDInsight - Azure description: Learn how to use C# to create MapReduce solutions with Apache Hadoop in Azure HDInsight. ms.service: azure-hdinsight ms.topic: how-to ms.custom: hdinsightactive, devx-track-csharp, devx-track-dotnet, devx-track-azurepowershell author: hareshg ms.author: hgowrisankar ms.reviewer: nijelsf ms.date: 09/06/2024 --- # Use C# with MapReduce streaming on Apache Hadoop in HDInsight Learn how to use C# to create a MapReduce solution on HDInsight. Apache Hadoop streaming allows you to run MapReduce jobs using a script or executable. Here, .NET is used to implement the mapper and reducer for a word count solution. ## .NET on HDInsight HDInsight clusters use [Mono (https://mono-project.com)](https://mono-project.com) to run .NET applications. Mono version 4.2.1 is included with HDInsight version 3.6. For more information on the version of Mono included with HDInsight, see [Apache Hadoop components available with HDInsight versions](../hdinsight-component-versioning.md). For more information on Mono compatibility with .NET Framework versions, see [Mono compatibility](https://www.mono-project.com/docs/about-mono/compatibility/). ## How Hadoop streaming works The basic process used for streaming in this document is as follows: 1. Hadoop passes data to the mapper (*mapper.exe* in this example) on STDIN. 2. The mapper processes the data, and emits tab-delimited key/value pairs to STDOUT. 3. The output is read by Hadoop, and then passed to the reducer (*reducer.exe* in this example) on STDIN. 4. The reducer reads the tab-delimited key/value pairs, processes the data, and then emits the result as tab-delimited key/value pairs on STDOUT. 5. The output is read by Hadoop and written to the output directory. For more information on streaming, see [Hadoop Streaming](https://hadoop.apache.org/docs/r2.7.1/hadoop-streaming/HadoopStreaming.html). ## Prerequisites * Visual Studio. * A familiarity with writing and building C# code that targets .NET Framework 4.5. * A way to upload .exe files to the cluster. The steps in this document use the Data Lake Tools for Visual Studio to upload the files to primary storage for the cluster. * If using PowerShell, you'll need the [Az Module](/powershell/azure/). * An Apache Hadoop cluster on HDInsight. See [Get Started with HDInsight on Linux](../hadoop/apache-hadoop-linux-tutorial-get-started.md). * The URI scheme for your clusters primary storage. This scheme would be `wasb://` for Azure Storage, `abfs://` for Azure Data Lake Storage Gen2 or `adl://` for Azure Data Lake Storage Gen1. If secure transfer is enabled for Azure Storage or Data Lake Storage Gen2, the URI would be `wasbs://` or `abfss://`, respectively. ## Create the mapper In Visual Studio, create a new .NET Framework console application named *mapper*. Use the following code for the application: ```csharp using System; using System.Text.RegularExpressions; namespace mapper { class Program { static void Main(string[] args) { string line; //Hadoop passes data to the mapper on STDIN while((line = Console.ReadLine()) != null) { // We only want words, so strip out punctuation, numbers, etc. var onlyText = Regex.Replace(line, @"\.|;|:|,|[0-9]|'", ""); // Split at whitespace. var words = Regex.Matches(onlyText, @"[\w]+"); // Loop over the words foreach(var word in words) { //Emit tab-delimited key/value pairs. //In this case, a word and a count of 1. Console.WriteLine("{0}\t1",word); } } } } } ``` After you create the application, build it to produce the `/bin/Debug/mapper.exe` file in the project directory. ## Create the reducer In Visual Studio, create a new .NET Framework console application named *reducer*. Use the following code for the application: ```csharp using System; using System.Collections.Generic; namespace reducer { class Program { static void Main(string[] args) { //Dictionary for holding a count of words Dictionary<string, int> words = new Dictionary<string, int>(); string line; //Read from STDIN while ((line = Console.ReadLine()) != null) { // Data from Hadoop is tab-delimited key/value pairs var sArr = line.Split('\t'); // Get the word string word = sArr[0]; // Get the count int count = Convert.ToInt32(sArr[1]); //Do we already have a count for the word? if(words.ContainsKey(word)) { //If so, increment the count words[word] += count; } else { //Add the key to the collection words.Add(word, count); } } //Finally, emit each word and count foreach (var word in words) { //Emit tab-delimited key/value pairs. //In this case, a word and a count of 1. Console.WriteLine("{0}\t{1}", word.Key, word.Value); } } } } ``` After you create the application, build it to produce the `/bin/Debug/reducer.exe` file in the project directory. ## Upload to storage Next, you need to upload the *mapper* and *reducer* applications to HDInsight storage. 1. In Visual Studio, select **View** > **Server Explorer**. 1. Right-click **Azure**, select **Connect to Microsoft Azure Subscription...**, and complete the sign-in process. 1. Expand the HDInsight cluster that you wish to deploy this application to. An entry with the text **(Default Storage Account)** is listed. :::image type="content" source="./media/apache-hadoop-dotnet-csharp-mapreduce-streaming/hdinsight-storage-account.png" alt-text="Storage account, HDInsight cluster, Server Explorer, Visual Studio." border="true"::: * If the **(Default Storage Account)** entry can be expanded, you're using an **Azure Storage Account** as default storage for the cluster. To view the files on the default storage for the cluster, expand the entry and then double-click **(Default Container)**. * If the **(Default Storage Account)** entry can't be expanded, you're using **Azure Data Lake Storage** as the default storage for the cluster. To view the files on the default storage for the cluster, double-click the **(Default Storage Account)** entry. 1. To upload the .exe files, use one of the following methods: * If you're using an **Azure Storage Account**, select the **Upload Blob** icon. :::image type="content" source="./media/apache-hadoop-dotnet-csharp-mapreduce-streaming/hdinsight-upload-icon.png" alt-text="HDInsight upload icon for mapper, Visual Studio." border="true"::: In the **Upload New File** dialog box, under **File name**, select **Browse**. In the **Upload Blob** dialog box, go to the *bin\debug* folder for the *mapper* project, and then choose the *mapper.exe* file. Finally, select **Open** and then **OK** to complete the upload. * For **Azure Data Lake Storage**, right-click an empty area in the file listing, and then select **Upload**. Finally, select the *mapper.exe* file and then select **Open**. Once the *mapper.exe* upload has finished, repeat the upload process for the *reducer.exe* file. ## Run a job: Using an SSH session The following procedure describes how to run a MapReduce job using an SSH session: 1. Use [ssh command](../hdinsight-hadoop-linux-use-ssh-unix.md) to connect to your cluster. Edit the command below by replacing CLUSTERNAME with the name of your cluster, and then enter the command: ```cmd ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net ``` 1. Use one of the following commands to start the MapReduce job: * If the default storage is **Azure Storage**: ```bash yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \ -files wasbs:///mapper.exe,wasbs:///reducer.exe \ -mapper mapper.exe \ -reducer reducer.exe \ -input /example/data/gutenberg/davinci.txt \ -output /example/wordcountout ``` * If the default storage is **Data Lake Storage Gen1**: ```bash yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \ -files adl:///mapper.exe,adl:///reducer.exe \ -mapper mapper.exe \ -reducer reducer.exe \ -input /example/data/gutenberg/davinci.txt \ -output /example/wordcountout ``` * If the default storage is **Data Lake Storage Gen2**: ```bash yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \ -files abfs:///mapper.exe,abfs:///reducer.exe \ -mapper mapper.exe \ -reducer reducer.exe \ -input /example/data/gutenberg/davinci.txt \ -output /example/wordcountout ``` The following list describes what each parameter and option represents: |Parameter | Description | |---|---| |hadoop-streaming.jar|Specifies the jar file that contains the streaming MapReduce functionality.| |-files|Specifies the *mapper.exe* and *reducer.exe* files for this job. The `wasbs:///`, `adl:///`, or `abfs:///` protocol declaration before each file is the path to the root of default storage for the cluster.| |-mapper|Specifies the file that implements the mapper.| |-reducer|Specifies the file that implements the reducer.| |-input|Specifies the input data.| |-output|Specifies the output directory.| 1. Once the MapReduce job completes, use the following command to view the results: ```bash hdfs dfs -text /example/wordcountout/part-00000 ``` The following text is an example of the data returned by this command: ```output you 1128 young 38 younger 1 youngest 1 your 338 yours 4 yourself 34 yourselves 3 youth 17 ``` ## Run a job: Using PowerShell Use the following PowerShell script to run a MapReduce job and download the results. [!code-powershell[main](../../../powershell_scripts/hdinsight/use-csharp-mapreduce/use-csharp-mapreduce.ps1?range=5-87)] This script prompts you for the cluster login account name and password, along with the HDInsight cluster name. Once the job completes, the output is downloaded to a file named *output.txt*. The following text is an example of the data in the `output.txt` file: ```output you 1128 young 38 younger 1 youngest 1 your 338 yours 4 yourself 34 yourselves 3 youth 17 ``` ## Next steps * [Use MapReduce in Apache Hadoop on HDInsight](hdinsight-use-mapreduce.md). * [Use a C# user-defined function with Apache Hive and Apache Pig](apache-hadoop-hive-pig-udf-dotnet-csharp.md). * [Develop Java MapReduce programs](apache-hadoop-develop-deploy-java-mapreduce-linux.md)
Success! Branch created successfully. Create Pull Request on GitHub
Error: